Posts about Technology

Book review: PHP and MongoDB Web Development

I've been interested in mongodb for quite some time now, so when a co-worker of mine asked if I was interested in reviewing a book about mongodb, I of course said yes! She put me in touch with the publisher of a book on MongoDB and web development entitled "PHP and MongoDB Web Development". I was given an electronic copy of the book to review, and so here are my thoughts after spending a few weeks reading it and playing around with mongodb independently.

This book is subtitled "Beginner's Guide", and I think it achieves its goal of being a good introduction to mongodb for beginners. That being said, my primary criticism of the book is that it should include more information on advanced features like sharding and replica sets. It's easy to create web applications that run at small scale, or that don't need to be up 99.99% of the time. It's much harder to design applications that are robust to bursts in load, and to various kinds of network or hardware failures. Without much discussion of these points, it's hard to form an opinion on whether mongodb would be a suitable choice for developing large scale web applications given the information in this book alone.

Other than that, I quite enjoyed the book and found it filled in quite a few gaps in my (limited) knowledge. Seeing full examples of working code on more complex topics like map reduce, GridFS and geospatial indexing is very helpful for understanding how these features of mongodb could be used in a real application. I found the examples to be a bit verbose at times, although I think that's more a fault of PHP than of the book, and the formatting in the examples was occasionally inconsistent. Fortunately all the examples can be downloaded from the publisher's web site, http://www.packtpub.com/support, saving you from having to type them all in!

The book also covers topics like integrating applications with a traditional RDBMS like MySQL, and offers some practical examples of how mongodb could be used to augment an application which is already using SQL. It also includes helpful real world examples of how mongodb is used for web analytics, or by foursquare for 2d geospatial indexing.

In summary, the book is a good introduction to mongodb, especially if you're familiar with php. If you're looking for more in-depth information about optimizing your queries, or scaling mongodb, or if your language of choice isn't php, this probably isn't a good book for you.

Investigating hg performance

(caveat lector: this is a long post with lots of shell snippets and output; it's mostly a brain dump of what I did to investigate performance issues on hg.mozilla.org. I hope you find it useful. Scroll to the bottom for the summary.)

Everybody knows that pushing to try can be slow. But why?

While waiting for my push to try to complete, I wondered what exactly was slow.

I started by cloning my own version of try:


$ hg clone http://hg.mozilla.org try

destination directory: try

requesting all changes

adding changesets

adding manifests

adding file changes

added 95917 changesets with 447521 changes to 89564 files (+2446 heads)

updating to branch default

53650 files updated, 0 files merged, 0 files removed, 0 files unresolved

Next I instrumented hg so I could get some profile information:


$ sudo vi /usr/local/bin/hg

python -m cProfile -o /tmp/hg.profile /usr/bin/hg "$@"

Then I timed how long it took me to check what would be pushed:


$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.57s user 0.04s system 54% cpu 1.114 total

That's not too bad. Let's check our profile:


import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:25:02 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 0.593 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    0.462    0.036    0.462    0.036 {method 'readline' of 'file' objects}
        1    0.039    0.039    0.039    0.039 {mercurial.parsers.parse_index2}
       40    0.031    0.001    0.031    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.009    0.000    0.019    0.000 {__import__}
     6326    0.004    0.000    0.006    0.000 cmdutil.py:15(parsealiases)
       13    0.003    0.000    0.003    0.000 {method 'read' of 'file' objects}
       93    0.002    0.000    0.008    0.000 cmdutil.py:18(findpossible)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}
  392/313    0.001    0.000    0.007    0.000 demandimport.py:92(_demandimport)

The top item is readline() on file objects? I wonder if that's socket operations. I'm ssh'ing to localhost, so it's really fast. Let's add 100ms latency:


$ sudo tc qdisc add dev lo root handle 1:0 netem delay 100ms

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.58s user 0.05s system 14% cpu 4.339 total


import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:42:09 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 2.728 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    2.583    0.199    2.583    0.199 {method 'readline' of 'file' objects}
        1    0.054    0.054    0.054    0.054 {mercurial.parsers.parse_index2}
       40    0.028    0.001    0.028    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.010    0.000    0.019    0.000 {__import__}
       13    0.006    0.000    0.006    0.000 {method 'read' of 'file' objects}
     6326    0.002    0.000    0.004    0.000 cmdutil.py:15(parsealiases)
       93    0.002    0.000    0.006    0.000 cmdutil.py:18(findpossible)
  392/313    0.002    0.000    0.008    0.000 demandimport.py:92(_demandimport)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}

Yep, definitely getting worse with more latency on the network connection.
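As a rough check on that, the two profiles can be compared directly. The sketch below uses only the readline totals reported above (13 calls in both runs); the function name is made up for illustration.

```python
def extra_seconds_per_call(baseline_total, delayed_total, ncalls):
    """Average extra wall-clock time per call after adding latency."""
    return (delayed_total - baseline_total) / ncalls

# tottime for readline: 0.462s without the added delay, 2.583s with it,
# 13 calls in each run (numbers taken from the two profiles above).
extra = extra_seconds_per_call(0.462, 2.583, 13)
print("%.3f extra seconds per readline call" % extra)
```

That comes out to roughly 0.16 seconds per call, which suggests that almost all of the added time is spent waiting on the connection rather than inside hg itself.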

Oh, and I'm using a recent version of hg:


$ hg --version

Mercurial Distributed SCM (version 2.0)



$ echo hello | ssh localhost hg -R /home/catlee/mozilla/try serve --stdio

145

capabilities: lookup changegroupsubset branchmap pushkey known getbundle unbundlehash batch stream unbundle=HG10GZ,HG10BZ,HG10UN httpheader=1024

This doesn't match what hg.mozilla.org is running:


$ echo hello | ssh hg.mozilla.org hg -R /mozilla-central serve --stdio  

67

capabilities: unbundle lookup changegroupsubset branchmap stream=1

So it must be using an older version. Let's see what mercurial 1.6 does:


$ mkvirtualenv hg16

New python executable in hg16/bin/python

Installing setuptools...



(hg16)$ pip install mercurial==1.6

Downloading/unpacking mercurial==1.6
  Downloading mercurial-1.6.tar.gz (2.2Mb): 2.2Mb downloaded
...



(hg16)$ hg --version

Mercurial Distributed SCM (version 1.6)



(hg16)$ echo hello | ssh localhost /home/catlee/.virtualenvs/hg16/bin/hg -R /home/catlee/mozilla/mozilla-central serve --stdio

75

capabilities: unbundle lookup changegroupsubset branchmap pushkey stream=1
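To compare the capability lists more directly, here is a minimal sketch that parses the three lines shown above into sets and prints the differences. The helper name is hypothetical, and it only looks at capability names, ignoring anything after an "=".

```python
def parse_capabilities(line):
    """Turn a 'capabilities: a b c=1' line into a set of capability names."""
    _, _, rest = line.partition(":")
    return set(token.split("=", 1)[0] for token in rest.split())

hg20 = parse_capabilities(
    "capabilities: lookup changegroupsubset branchmap pushkey known getbundle "
    "unbundlehash batch stream unbundle=HG10GZ,HG10BZ,HG10UN httpheader=1024")
server = parse_capabilities(
    "capabilities: unbundle lookup changegroupsubset branchmap stream=1")
hg16 = parse_capabilities(
    "capabilities: unbundle lookup changegroupsubset branchmap pushkey stream=1")

# Capabilities each local version advertises that hg.mozilla.org does not.
print(sorted(hg20 - server))
print(sorted(hg16 - server))
```

With these inputs, hg 2.0 advertises six capabilities the server lacks, while hg 1.6 differs only by pushkey.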

That looks pretty close to what hg.mozilla.org claims it supports, so let's time 'hg out' again:


(hg16)$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.73s user 0.04s system 3% cpu 24.278 total

tl;dr

Finding missing changesets between two local repositories is about 6x slower with hg 1.6 than with hg 2.0 (roughly 4 seconds with hg 2.0 versus 24 seconds with hg 1.6). Add a few hundred people and machines hitting the same repository at the same time, and I imagine things can get bad pretty quickly.

Some further searching reveals that mercurial does support a faster method of finding missing changesets in "newer" versions, although I can't figure out exactly when this change was introduced. There's already a bug on file for upgrading mercurial on hg.mozilla.org, so hopefully that improves the situation for pushes to try.

The tools we use every day aren't magical; they're subject to normal debugging and profiling techniques. If a tool you're using is holding you back, find out why!

Christmas tree preparations with an Arduino

We usually get a real Christmas tree if we're going to be in town for Christmas. A real tree needs watering though, which is something we've been less than...consistent with over the past years.

I decided to do something about this and build something to alert me when the water level gets too low. Two strips of aluminum foil taped to either side of a piece of plastic provide my water sensor. One strip is connected to an analog input on the arduino, and the other strip is connected to +3.3V. When the sensor is submerged I get a reading of around 300 "units" from the ADC. When it's removed from the water, a 10k pulldown resistor brings the reading down to 0.
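The decision logic can be sketched in a few lines. This is Python purely for illustration (the real thing would live in an Arduino sketch), and the threshold of 150 is an assumption: simply the midpoint between the roughly 300 reading when submerged and the 0 reading when dry.

```python
# Assumed threshold: halfway between the ~300 "wet" reading and the 0 "dry" one.
THRESHOLD = 150

def is_submerged(reading, threshold=THRESHOLD):
    """Return True when the ADC reading indicates the sensor is under water."""
    return reading >= threshold

# A few made-up sample readings to show both outcomes.
for reading in (0, 12, 290, 310):
    print(reading, "wet" if is_submerged(reading) else "dry")
```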

I've hooked up a tri-colour LED to indicate various states, and plan to have an audible alert as well.

I'm not sure if the aluminum will end up corroding, nor if I'll be able to power this off batteries for any length of time. Still, I'm pretty pleased with it so far!

Here you can see that the LED is green when the sensor is submerged, and changes colours (like a traffic light, as per Thomas' request) when the sensor is removed.

Linux on a new Thinkpad T510

I got a new Thinkpad T510 at work to replace my aging MacBook Pro. I asked for a Thinkpad instead of another MacBook because I wanted hardware that is better supported on linux, in particular the trackpad. I got into the habit of bringing a USB mouse everywhere I went because the trackpad on the MacBook was so unreliable on linux.

So when my new T510 arrived, I was pretty excited. And, except for one tiny problem (of the PEBKAC kind), transferring all my files from the old machine to the new one went flawlessly.

Here's how I set up the new machine:

  1. Boot the computer off a USB drive. Resize the existing NTFS partition to be really small. Add 2 new partitions in the newly freed space: one for the linux boot partition, and one to be encrypted and formatted with lvm.
  2. Format the boot partition as ext3. Set up the encrypted partition with 'cryptsetup luksFormat /dev/sda6; cryptsetup luksOpen /dev/sda6 crypt_sda6'. Set up LVM with 'pvcreate /dev/mapper/crypt_sda6'. Set up two volumes, one for swap, and one for the root partition.
  3. Connect a network cable between the old laptop and the new one. Configure the local network.
  4. Copy files from old /boot to new /boot.
  5. Copy files from old / to new /. Here's where I messed up. My command was: 'rsync -aPxX 192.168.2.1:/ /target/'.
  6. Install grub.
  7. Reboot!

At this point the machine came up ok, but wasn't prompting to decrypt my root drive, and so I had to do some manual steps to get the root drive to mount initially. Fixing up /etc/crypttab and the initramfs solved this.

However even after this I was having some problems. I couldn't connect to wireless networks with Network Manager. I couldn't run gnome-power-manager. Files in /var/lib/mysql were owned by ntp! Then I realized that my initial rsync had copied over files preserving the user/group names, not the uid/gid values. And since I wasn't booting off a Debian image, the id/name mappings were quite different. Re-running rsync with '--numeric-ids' got all the ownerships fixed up. After the next reboot things were working flawlessly.
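To illustrate what went wrong, here is a small model of how ownership gets mapped during the copy. By default rsync matches owners by name on the receiving side and falls back to the numeric id only when the name is unknown there; with '--numeric-ids' it keeps the numbers as they are. The uid values and the function name below are made up for the example.

```python
def uid_written_by_rsync(owner_name, source_uid, receiver_users, numeric_ids):
    """Return the uid a copied file ends up with on the receiving side."""
    if numeric_ids:
        return source_uid
    # Default behaviour: match by name, else fall back to the numeric id.
    return receiver_users.get(owner_name, source_uid)

# Hypothetical name -> uid tables for the old laptop and the USB boot system.
old_laptop = {"mysql": 104, "ntp": 110}
usb_system = {"mysql": 110, "ntp": 104}
# The copied system uses the old laptop's table to turn uids back into names.
old_names = dict((uid, name) for name, uid in old_laptop.items())

by_name = uid_written_by_rsync("mysql", old_laptop["mysql"], usb_system, numeric_ids=False)
by_number = uid_written_by_rsync("mysql", old_laptop["mysql"], usb_system, numeric_ids=True)
print(old_names[by_name])    # owner seen after booting the copied system
print(old_names[by_number])  # owner seen when the copy used --numeric-ids
```

In this made-up setup the name-based copy leaves mysql's files owned by ntp once the copied system boots, while the numeric copy keeps them owned by mysql.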

Now after a few weeks of using it, I'm enjoying it a lot more than my MacBook Pro. It boots up faster. It connects to wireless networks faster. It suspends/unsuspends faster. It's got real, live, page-up/page-down keys! The trackpad actually works!

poster 0.7.0 released!

I've just pushed poster 0.7.0 to the cheeseshop.

Thanks again to everybody who sent in bug reports, and for letting me know how you're using poster! It's really great to hear from users.

poster 0.7.0 fixes a few problems with 0.6.0, most notably:

  • Added callback parameters to MultipartParam and multipart_encode so you can add progress indicators to your applications. Thanks to Ludvig Ericson for the suggestion.
  • Fixed a bug where posting to a url that returned a 401 code would hang. Thanks to Patrick Guido and Andreas Loupasakis for the bug reports.
  • MultipartParam.from_params will now accept MultipartParam instances as the values of a dict object passed in. The parameter name must match the key corresponding to the parameter in the dict. Thanks to Matthew King for the suggestion.
  • poster now works under python2.7
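To give an idea of what a progress callback looks like while a body is being streamed, here is a generic sketch. It does not use poster and is not poster's actual signature; the function name, the callback arguments (bytes sent so far, total bytes) and the block size are all assumptions made for illustration.

```python
import io

def iter_body(fileobj, total, callback=None, blocksize=4096):
    """Yield fileobj in blocks, reporting progress after each block."""
    sent = 0
    while True:
        block = fileobj.read(blocksize)
        if not block:
            break
        sent += len(block)
        if callback is not None:
            callback(sent, total)
        yield block

# Record each progress report instead of drawing a progress bar.
progress = []
data = b"x" * 10000
body = b"".join(iter_body(io.BytesIO(data), len(data),
                          lambda sent, total: progress.append((sent, total))))
print(progress)
```

A real application would update a progress bar in the callback rather than appending to a list.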

poster 0.7.0 can be downloaded from the cheeseshop, or from my website. Documentation can be found at https://atlee.ca/software/poster/

I'm planning on looking at python 3 compatibility soon.

Also, if anybody knows a reliable way to test the streaming HTTP code, I'm open to suggestions! My current methods result in intermittent failures, which I suspect are caused by the test harness.

poster's code is now available on bitbucket.

A year in RelEng

Something prompted me to look at the size of our codebase here in RelEng, and how much it changes over time. This is the code that drives all the build, test and release automation for Firefox, project branches, and Try, as well as configuration management for the various build and test machines that we have.

Here are some simple stats:

2,193 changesets across 5 repositories...that's about 6 changes a day on average.

We grew from 43,294 lines of code last year to 73,549 lines of code as of today. That's 70% more code today than we had last year.

We added 88,154 lines to our code base, and removed 51,957. I'm not sure what this means, but it seems like a pretty high rate of change!
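The arithmetic behind these figures is easy to reproduce. The sketch below assumes the changesets are spread over a 365-day year; everything else comes from the numbers above.

```python
changesets = 2193
lines_last_year = 43294
lines_today = 73549
added, removed = 88154, 51957

# Assumes a 365-day year for the per-day average.
per_day = changesets / 365.0
growth = (lines_today - lines_last_year) / float(lines_last_year)

print("%.1f changesets per day" % per_day)
print("%.0f%% more code than last year" % (growth * 100))
print("net lines from added/removed: %d" % (added - removed))
print("net lines from the two totals: %d" % (lines_today - lines_last_year))
```

The two net figures do not match exactly (36,197 versus 30,255), presumably because they were counted in different ways.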

Getting free diskspace in python, on Windows

Amazingly, one of the most popular links on this site is the quick tip, Getting free diskspace in python.

One of the comments shows that this method doesn't work on Windows. Here's a version that does:

import win32file

def freespace(p):
    """
    Returns the number of free bytes on the drive that p is on
    """
    secsPerClus, bytesPerSec, nFreeClus, totClus = win32file.GetDiskFreeSpace(p)
    return secsPerClus * bytesPerSec * nFreeClus

The win32file module is part of the pywin32 extension module.
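If you are on Python 3.3 or later, the standard library's shutil.disk_usage works on both Windows and Unix, so no extension module is needed. A minimal sketch with the same function name as above:

```python
import shutil

def freespace(p):
    """Return the number of free bytes on the filesystem that p is on."""
    return shutil.disk_usage(p).free

# Example: free space on the filesystem holding the current directory.
print(freespace("."))
```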

poster 0.6.0 released

I've just pushed poster 0.6.0 to the cheeseshop.

Thanks again to everybody who sent in bug reports, and for letting me know how you're using poster! It's really great to hear from users.

poster 0.6.0 fixes a few problems with 0.5, most notably:

  • Documentation updates to clarify some common use cases.
  • Added a poster.version attribute. Thanks to JP!
  • Fix for unicode filenames. Thanks to Zed Shaw.
  • Handle StringIO file objects. Thanks to Christophe Combelles.

poster 0.6.0 can be downloaded from the cheeseshop, or from my website. Documentation can be found at https://atlee.ca/software/poster/

What do you want to know about builds?

Mozilla has been quite involved in recent buildbot development, in particular, helping to make it scale across multiple machines. More on this in another post!

Once deployed, these changes will give us the ability to give real time access to various information about our build queue: the list of jobs waiting to start, and which jobs are in progress. This should help other tools like Tinderboxpushlog show more accurate information. One limitation of the upstream work so far is that it only captures a very coarse level of detail about builds: start/end time, and result code is pretty much it. No further detail about the build is captured, like which slave it executed on, what properties it generated (which could include useful information like the URL to the generated binaries), etc.

We've also been exporting a json dump of our build status for many months now. It's been useful for some analysis, but it also has limitations: the data is always at least 5 minutes old by the time you look, and in-progress builds are not represented at all.

We're starting to look at ways of exporting all this detail in a way that's useful to more people. You want to get notified when your try builds are done? You want to look at which test suites are taking the most time? You want to determine how our build times change over time? You want to find out what the last all-green revision was on trunk? We want to make this data available, so anybody can write these tools.

Just how big is that firehose?

I think we have one of the largest buildbot setups out there and we generate a non-trivial amount of data:

  • 6-10 buildbot master processes generating updates, on different machines in 2 or 3 data centers
  • around 130 jobs per hour, composed of 4,773 individual steps in total per hour. That works out to about 1.4 updates generated per second.

How you can help

This is where you come in.

I can think of two main classes of interfaces we could set up: a query-type interface where you poll for information that you are interested in, and a notification system where you register a listener for certain types of (or all!) events.
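To make the difference between the two styles concrete, here is a toy in-memory sketch. Nothing here reflects an existing or planned API: the class, method names, event types and payload fields are all hypothetical.

```python
class BuildEvents(object):
    """Toy event store offering both a polling and a notification interface."""

    def __init__(self):
        self._events = []
        self._listeners = []

    def publish(self, event_type, payload):
        """Record an event and notify every listener interested in its type."""
        event = {"id": len(self._events), "type": event_type, "payload": payload}
        self._events.append(event)
        for wanted, callback in self._listeners:
            if wanted is None or wanted == event_type:
                callback(event)

    def query(self, event_type=None, since_id=0):
        """Polling style: the caller asks for matching events when it wants them."""
        return [e for e in self._events
                if e["id"] >= since_id
                and (event_type is None or e["type"] == event_type)]

    def subscribe(self, callback, event_type=None):
        """Notification style: callback is invoked as matching events arrive."""
        self._listeners.append((event_type, callback))

events = BuildEvents()
seen = []
events.subscribe(seen.append, event_type="build_finished")
events.publish("build_started", {"builder": "example-builder"})
events.publish("build_finished", {"builder": "example-builder", "result": 0})
print(len(seen))            # events delivered to the listener
print(len(events.query()))  # events visible to a poller
```

The listener only hears about the event type it registered for, while a poller can ask for everything, or for everything since the last id it saw.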

What would be the best way for us to make this data available to you? Some kind of REST API? A message or event brokering system? pubsubhubbub?

Is there some type of data or filtering that would be super helpful to you?