Skip to main content

When do builds happen?

As part of our RelEng Blogging Blitz, I'll give a quick overview of when and how builds get triggered on our main build infrastructure.

There are three ways builds can be triggered.

The first, and most common way, is when a developer pushes his or her changes to hg.mozilla.org. Our systems check for new changes every minute or so, and put new changes into a queue. Once the tree has been quiet for 3 minutes (i.e. no changes for 3 minutes), a new build request is triggered with all queued changes. If there is a free slave available, then a new build starts immediately, otherwise the build request is put in a queue.

The second way builds are triggered is via a nightly scheduler. We start triggering builds on branches at 3:02am pacific local time (some branches are triggered at 3:32am or 4:02 am). We run at 3:02am to avoid problems with daylight savings <-> standard time transitions. In the fall there are two 2:59am's when we go back to standard time, and in the spring transition there is no 2:59am. The start times are staggered to avoid slamming hg.mozilla.org, or other shared resources.

The last way builds can be triggered is manually. The Release Engineering Sheriff can trigger builds on specific revisions, or rebuild past builds pretty easily, so if you need a build triggered, contact your friendly neighbourhood RelEng Sheriff!

RelEng Blogging Blitz is coming soon!

Several members of the Release Engineering team are going to be blogging next week about various bits of the build, test, and release automation infrastructure for Firefox.

If there's something about our infrastructure you've always wondered about, give us a shout and we'll do our best to explain it!

Upcoming changes to unittests

Since May, we've been running our unittest suite twice for each checkin: once for the current reference counting build + test job, and once for the packaged unittests. Current way of running tests

Our end goal is to run unittests on our optimized and debug builds. We would stop doing our current reference counting builds, since the debug builds also have reference counting enabled.

New way of running tests

To get there requires a few intermediate steps:

  • Turn off tests for the current build+tests unittest job. We're hoping to do this soon. The tests run on the packaged builds would still be active at this point (first phase of bug 507540).
  • Turn on tests for debug builds (bug 372581). There's a bit of work left to do here to make sure that hung processes don't kill the buildbot master with multi-gigabyte log files.
  • Turn on tests for optimized builds (bug 486783). This would include nightly and release builds eventually as well, and will give us test coverage on our windows PGO builds.
  • Turn off original reference counting builds completely, along with the tests done on these builds (second phase of bug 507540).

At some point we're also going to be changing how we run mochitests, by splitting the tests up according to which directory they're in, and then running each of the subset of tests in parallel on different machines (bug 452861).

Welcome Martin!

Martin Joseph AtLee was born by caesarean section on Wednesday morning and weighed in at 7 pounds, 15 ounces. The surgery went well for both Melissa and Martin. Mel had quite a bit of blood loss later in the day on Wednesday, giving us a bit of a scare as dozens of nurses and doctors rushed into her room! She recovered nicely the next day though.

Martin had a bit of jaundice, so he got to spend Friday night in a tanning salon!

Thomas has been a wonderful big brother, always wanting to hold and kiss his baby brother.

Faster signing

Once upon a time, when Mozilla released a new version of Firefox, signing all of the .exes and .dlls for all of the locales took a few hours. As more locales were added, and the file sizes increased, this time has increased to over 8 hours to sign all the locales in a Firefox 3.5 release. This has been a huge bottleneck for getting new releases out the door.

For our most recent Firefox releases, we've started using some new signing infrastructure that I've been working on over the past few months. There are quite a few reasons why the new infrastructure is faster:

  • Faster hardware. We've moved from an aging single core system to a new quad-core 2.5 GHz machine with 4 GB of RAM.
  • <li><strong>Concurrent signing of locales.</strong>  Since we've got more cores on the system, we should take advantage of them!  The new signing scripts spawn off 4 child processes, each one grabs one locale at a time to process.</li>
    
    <li><strong>In-process compression/decompression.</strong>  Our complete.mar files use bz2 compression for every file in the archive.  The old scripts would call the 'bzip2' and 'bunzip2' binaries to do compression and decompression.  It's significantly faster to do these operations in-process.</li>
    
    <li><strong>Better caching of signed files.</strong>  The biggest win came from the simple observation that after you sign a file, and re-compress it to include in a .mar file, you should be caching the compressed version to use later.  The old scripts did cache signed files, but only the decompressed versions.  So to sign the contents of our mar files, the contents would have to be completely unpacked and decompressed, then the cache was checked, files were signed or pulled from cache as necessary, and then re-compressed again.
    

    Now, we unpack the mar file, check the cache, and only decompress / sign / re-compress any files that need signing but aren't in the cache. We don't even bother decompressing files that don't need signing, another difference from the original scripts.

    Big thanks to Nick Thomas for having the idea for this at the Mozilla All-Hands in April.

As a result of all of this, signing all our locales can be done in less than 15 minutes now! See bug 470146 for the gory details.

The main bottleneck at this point is the time it takes to transfer all of the files to the signing machine and back again. For a 3.5 release, there's around 3.5 GB of data to transfer back and forth, which takes on the order of 10 minutes to pull down to the signing machine, and another few minutes to push back after signing is done.

Profiling Buildbot

Buildbot is a critical part of our build infrastructure at Mozilla. We use it to manage builds on 5 different platforms (Linux, Mac, Windows, Maemo and Windows Mobile), and 5 different branches (mozilla-1.9.1, mozilla-central, TraceMonkey, Electrolysis, and Places). All in all we have 80 machines doing builds across 150 different build types (not counting Talos; all the Talos test runs and slaves are managed by a different master).

And buildbot is at the center of it all.

The load on our machine running buildbot is normally fairly high, and occasionally spikes so that the buildbot process is unresponsive. It normally restores itself within a few minutes, but I'd really like to know why it's doing this!

Running our staging buildbot master with python's cProfile module for almost two hours yields the following profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   416377 4771.188    0.011 4796.749    0.012 {select.select}
       52  526.891   10.133  651.043   12.520 /tools/buildbot/lib/python2.5/site-packages/buildbot-0.7.10p1-py2.5.egg/buildbot/status/web/waterfall.py:834(phase2)
     6518  355.370    0.055  355.370    0.055 {posix.fsync}
   232582  238.943    0.001 1112.039    0.005 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:150(dataReceived)
 10089681  104.395    0.000  130.089    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:36(b1282int)
36798140/36797962   83.536    0.000   83.537    0.000 {len}
 29913653   70.458    0.000   70.458    0.000 {method 'append' of 'list' objects}
      311   63.775    0.205   63.775    0.205 {bz2.compress}
 10088987   56.581    0.000  665.982    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:141(gotItem)
4010792/1014652   56.079    0.000  176.693    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/jelly.py:639(unjelly)

2343910/512709   47.954    0.000  112.446    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:281(_encode)

Interpreting the results

select shows up in the profile because we're profiling wall clock time, not cpu time. So the more time we're spending in select, the better, since that means we're just waiting for data. The overall run time for this profile was 7,532 seconds, so select is taking around 63% of our total time. I believe the more time spent here, the better. Time spent inside select is idle time.

We already knew that the buildbot waterfall was slow (the second line in profile).

fsync isn't too surprising either. buildbot calls fsync after writing log files to disk. We've considered removing this call, and this profile lends support to our original guess.

The next entries really surprised me, twisted's dataReceived and a decoding function, b1282int. These are called when processing data received from the slaves. If I'm reading this correctly, this means that dataReceived and children account for around 40% of our total time after you remove the time spent in select. 1112 / (7532-4796) = 40%.

These results are from our staging buildbot master, which doesn't have anywhere near the same load as the production buildbot master. I would expect that the time spent waiting in select would go down on the production master (there's more data being received, more often), and that time spent in fsync and dataReceived would go up.

What to do about it?

A few ideas....

  • Use psyco to get some JIT goodness applied to some of the slower python functions.
  • Remove the fsync call after saving logs.
  • Use the cpu-time to profile rather than wallclock time. This will give a different perspective on the performance of buildbot, which should give better information about where we're spending time processing data.
  • Implement slow pieces in C (or cython). Twisted's Banana library looks do-able in C, and also is high up in the profile.
  • Send less data from the slaves. We're currently logging all stdout/stderr produced by the slaves. All of this data is processed by the master process and then saved to disk.
  • Rearchitect buildbot to handle this kind of load.
  • Have more than one buildbot master, each one handling fewer slaves. We're actively looking into this approach, since it also allows us to have some redundancy for this critical piece of our infrastructure.

Gardening update

The garden is continuing to grow. There's plenty of lettuce ready to be eaten, and we've harvested enough swiss chard to make two quiches!

I also picked a few of the carrots because I was curious how big they had gotten. They look pretty tasty!

Fresh carrots!

I planted two strawberry plants as well. They sent out tons of runners, but only one has bloomed. I'm going to see what happens, and if I get any berries off of it. Next year I think I'll set up a separate area for the strawberries, otherwise they'd easily take over my 4x4 block with their runners.

Strawberry blooms

There are lots of little tomatoes ripening on the vines as well. I can't wait for them to be ready to eat, there's nothing like fresh tomatoes from your own garden.

Baby tomato

I'm going to start a worm composting bin so I can return some of the waste products from the garden (like carrot and onion tops, tomato vines) back to the garden. I'm a bit nervous about this...but the investment is pretty small. There's a local company, wormcomposting.ca, that sells the red wiggler worms, so I'll probably be ordering a quarter pound of worms from them shortly.