Profiling Buildbot

chris

2009-07-28 11:13

Buildbot is a critical part of our build infrastructure at Mozilla. We use it to manage builds on 5 different platforms (Linux, Mac, Windows, Maemo and Windows Mobile), and 5 different branches (mozilla-1.9.1, mozilla-central, TraceMonkey, Electrolysis, and Places). All in all we have 80 machines doing builds across 150 different build types (not counting Talos; all the Talos test runs and slaves are managed by a different master). And buildbot is at the center of it all. The load on our machine running buildbot is normally fairly high, and occasionally spikes so that the buildbot process is unresponsive. It normally restores itself within a few minutes, but I'd really like to know why it's doing this! Running our staging buildbot master with python's cProfile module for almost two hours yields the following profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   416377 4771.188    0.011 4796.749    0.012 {select.select}
       52  526.891   10.133  651.043   12.520 /tools/buildbot/lib/python2.5/site-packages/buildbot-0.7.10p1-py2.5.egg/buildbot/status/web/waterfall.py:834(phase2)
     6518  355.370    0.055  355.370    0.055 {posix.fsync}
   232582  238.943    0.001 1112.039    0.005 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:150(dataReceived)
 10089681  104.395    0.000  130.089    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:36(b1282int)
36798140/36797962   83.536    0.000   83.537    0.000 {len}
 29913653   70.458    0.000   70.458    0.000 {method 'append' of 'list' objects}
      311   63.775    0.205   63.775    0.205 {bz2.compress}
 10088987   56.581    0.000  665.982    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:141(gotItem)
4010792/1014652   56.079    0.000  176.693    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/jelly.py:639(unjelly)

2343910/512709   47.954    0.000  112.446    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:281(_encode)

Interpreting the results

select shows up in the profile because we're profiling wall clock time, not cpu time. So the more time we're spending in select, the better, since that means we're just waiting for data. The overall run time for this profile was 7,532 seconds, so select is taking around 63% of our total time. I believe the more time spent here, the better. Time spent inside select is idle time. We already knew that the buildbot waterfall was slow (the second line in profile). fsync isn't too surprising either. buildbot calls fsync after writing log files to disk. We've considered removing this call, and this profile lends support to our original guess. The next entries really surprised me, twisted's dataReceived and a decoding function, b1282int. These are called when processing data received from the slaves. If I'm reading this correctly, this means that dataReceived and children account for around 40% of our total time after you remove the time spent in select. 1112 / (7532-4796) = 40%. These results are from our staging buildbot master, which doesn't have anywhere near the same load as the production buildbot master. I would expect that the time spent waiting in select would go down on the production master (there's more data being received, more often), and that time spent in fsync and dataReceived would go up.

What to do about it?

A few ideas....

Use psyco to get some JIT goodness applied to some of the slower python functions.
Remove the fsync call after saving logs.
Use the cpu-time to profile rather than wallclock time. This will give a different perspective on the performance of buildbot, which should give better information about where we're spending time processing data.
Implement slow pieces in C (or cython). Twisted's Banana library looks do-able in C, and also is high up in the profile.
Send less data from the slaves. We're currently logging all stdout/stderr produced by the slaves. All of this data is processed by the master process and then saved to disk.
Rearchitect buildbot to handle this kind of load.
Have more than one buildbot master, each one handling fewer slaves. We're actively looking into this approach, since it also allows us to have some redundancy for this critical piece of our infrastructure.

Pooling the Talos slaves

chris

2009-06-03 00:00

Comments

One of the big projects for me this quarter was getting our Talos slaves configured as a pool of machines shared across branches. The details are being tracked in bug 488367 for those interested in the details. This is a continuation of our work on pooling our slaves, like we've done over the past year with our build, unittest, and l10n slaves. Up until now each branch has had a dedicated set of Mac Minis to run performance tests for just that branch, on five different operating systems. For example, the Firefox 3.0 branch used to have 19 Mac Minis doing regular Talos tests: 4 of each platform (except for Leopard, which had 3). Across our 4 active branches (Firefox 3.0, 3.5, 3.next, and TraceMonkey), we have around 80 minis in total! That's a lot of minis! What we've been working towards is to put all the Talos slaves into one pool that is shared between all our active branches. Slaves will be given builds to test in FIFO order, regardless of which branch the build is produced on. This new pool will be....

Faster

With more slaves available to all branches, the time to wait for a free slave will go down, so testing can start more quickly...which means you get your results sooner!

Smarter

It will be able to handle varying load between branches. If there's a lot of activity on one branch, like on the Firefox 3.5 branch before a release, then more slaves will be available to test those builds and won't be sitting idle waiting for builds from low activity branches.

Scalable

We will be able to scale our infrastructure much better using a pooled system. Similar to how moving to pooled build and unittest slaves has allowed us to scale based on number of checkins rather than number of branches, having pooled Talos slaves will allow us to scale our capacity based on number of builds produced rather than the number of branches. In the current setup, each new release or project branch required an allocation of at least 15 minis to dedicate to the branch. Once all our Talos slaves are pooled, we will be able to add Talos support for new project or release branches with a few configuration changes instead of waiting for new minis to be provisioned. This means we can get up and running with new project branches much more quickly!

More Robust

We'll also be in a much better position in terms of maintenance of the machines. When a slave goes offline, the test coverage for any one branch won't be jeopardized since we'll still have the rest of the slaves that can test builds from that branch. In the current setup, if one or two machines of the same platform needs maintenance on one branch, then our performance test coverage of that branch is significantly impacted. With only one or two machines remaining to run tests on that platform, it can be difficult to determine if a performance regression is caused by a code change, or is caused by some machine issue. Losing two or three machines in this scenario is enough to close the tree, since we no longer have reliable performance data. With pooled slaves we would see a much more gradual decrease in coverage when machines go offline. It's the difference between losing one third of the machines on your branch, and losing one tenth.

When is all this going to happen?

Some of it has started already! We have a small pool of slaves testing builds from our four branches right now. If you know how to coerce Tinderbox to show you hidden columns, you can take a look for yourself. They're also reporting to the new graph server using machines names starting with 'talos-rev2'. We have some new minis waiting to be added to the pool. Together with existing slaves, this will give us around 25 machines in total to start off the new pool. This isn't enough yet to be able to test every build from each branch without skipping any, so for the moment the pool will be skipping to the most recent build per branch if there's any backlog. It's worth pointing out that our current Talos system also skips builds if there's any backlog. However, our goal is to turn off skipping once we have enough slaves in the pool to handle our peak loads comfortably. After this initial batch is up and running, we'll be waiting for a suitable time to start moving the existing Talos slaves into the pool. All in all, this should be a big win for everyone!

Parallelizing Unit Tests

chris

2009-05-15 17:59

Comments

Last week we flipped the switch and turned on running unit tests on packaged builds for our mozilla-1.9.1, mozilla-central, and tracemonkey branches. What this means is that our current unit test builds are uploaded to a web server along with all their unit tests. Another machine will then download the build and tests, and run various test suites on them. Splitting up the tests this way allows us to run the test suites in parallel, so the mochitest suite will run on one machine, and all the other suites will be run on another machine (this group of tests is creatively named 'everythingelse' on Tinderbox). paralleltests Splitting up the tests is a critical step towards reducing our end-to-end time, which is the total time elapsed between when a change is pushed into one of the source repositories, and when all of the results from that build are available. Up until now, you had to wait for all the test suites to be completed in sequence, which could take over an hour in total. Now that we can split the tests up, the wait time is determined by the longest test suite. The mochitest suite is currently the biggest chunk here, taking somewhere around 35 minutes to complete, and all of the other tests combined take around 20 minutes. One of the next steps for us to do is to look at splitting up the mochitests into smaller pieces. For the time being, we will continue to run the existing unit tests on the same machine that is creating the build. This is so that we can make sure that running tests on the packaged builds is giving us the same results (there are already some known differences: bug 491675, bug 475383) Parallelizing the unit tests, and the infrastructure required to run them, is the first step towards achieving a few important goals. - Reducing end-to-end time. - Running unit tests on debug, as well as on optimized builds. Once we've got both of these going, we can turn off the builds that are currently done solely to be able to run tests on them. - Running unit tests on the same build multiple times, to help isolate intermittent test failures. All of the gory details can be found in bug 383136.

Upcoming Identity Management with Weave

chris

2009-05-12 12:46

Comments

I was really excited to read a recent post about upcoming identity support with Weave on Mozilla Labs' blog. Why is this so cool? Weave lets you securely synchronize parts of your browser profile between different machines. All your bookmarks, AwesomeBar history, saved passwords can be synchronized between your laptop, desktop and mobile phone. Your data is always encrypted with a private key that only you have access to. Combine this with intelligent form-filling, automatic detection of OpenID-enabled sites, and you've got what is essentially single sign-on onto all your websites from all your browsers. Now you'll be able to sign into Firefox, and Firefox will know how to sign into all your websites. Keep up the great work Labs!

poster 0.4 released

chris

2009-04-05 14:23

Comments

I'm happy to announce the release of poster version 0.4. This is a bug fix release, which fixes problems when trying to use poster over a secure connection (with https). I've also reworked some of the code so that it can hopefully work with python 2.4. It passes all the unit tests that I have under python 2.4 now, but since I don't normally use python 2.4, I'd be interested to hear other people's experience using it. One of the things that I love about working on poster, and about open source software in general, is hearing from users all over the world who have found it helpful in some way. It's always encouraging to hear about how poster is being used, so thank you to all who have e-mailed me! poster can be downloaded from my website, or from the cheeseshop. As always, bug reports, comments, and questions are always welcome.

Exporting MQ patches

chris

2009-03-23 12:30

Comments

I've been trying to use Mercurial Queues to manage my work on different tasks in several repositories. I try to name all my patches with the name of the bug it's related to; so for my recent work on getting Talos not skipping builds, I would call my patch 'bug468731'. I noticed that I was running this series of steps a lot: cd ~/mozilla/buildbot-configs hg qdiff > ~/patches/bug468731-buildbot-configs.patch cd ~/mozilla/buildbotcustom hg qdiff > ~/patches/bug468731-buildbotcustom.patch ...and then uploading the resulting patch files as attachments to the bug. There's a lot of repetition and extra mental work in those steps:

I have to type the bug number manually twice. This is annoying, and error-prone. I've made a typo on more than one occasion and then wasted a few minutes trying to track down where the file went.
I have to type the correct repository name for each patch. Again, I've managed to screw this up in the past. Often I have several terminals open, one for each repository, and I can get mixed up as to which repository I've currently got active.
mercurial already knows the bug number, since I've used it in the name of my patch.
mercurial already knows which repository I'm in.

I wrote the mercurial extension below to help with this. It will take the current patch name, and the basename of the current repository, and save a patch in ~/patches called [patch_name]-[repo_name].patch. It will also compare the current patch to any previous ones in the patches directory, and save a new file if the patches are different, or tell you that you've already saved this patch. To enable this extension, save the code below somewhere like ~/.hgext/mkpatch.py, and then add "mkpatch = ~/.hgext/mkpatch.py" to your .hgrc's extensions section. Then you can run 'hg mkpatch' to automatically create a patch for you in your ~/patches directory!


import os, hashlib



from mercurial import commands, util

from hgext import mq



def mkpatch(ui, repo, *pats, **opts):
    """Saves the current patch to a file called -.patch
    in your patch directory (defaults to ~/patches)
    """
    repo_name = os.path.basename(ui.config('paths', 'default'))
    if opts.get('patchdir'):
        patch_dir = opts.get('patchdir')
        del opts['patchdir']
    else:
        patch_dir = os.path.expanduser(ui.config('mkpatch', 'patchdir', "~/patches"))

    ui.pushbuffer()
    mq.top(ui, repo)
    patch_name = ui.popbuffer().strip()

    if not os.path.exists(patch_dir):
        os.makedirs(patch_dir)
    elif not os.path.isdir(patch_dir):
        raise util.Abort("%s is not a directory" % patch_dir)

    ui.pushbuffer()
    mq.diff(ui, repo, *pats, **opts)
    patch_data = ui.popbuffer()
    patch_hash = hashlib.new('sha1', patch_data).digest()

    full_name = os.path.join(patch_dir, "%s-%s.patch" % (patch_name, repo_name))
    i = 0
    while os.path.exists(full_name):
        file_hash = hashlib.new('sha1', open(full_name).read()).digest()
        if file_hash == patch_hash:
            ui.status("Patch is identical to ", full_name, "; not saving")
            return
        full_name = os.path.join(patch_dir, "%s-%s.patch.%i" % (patch_name, repo_name, i))
        i += 1

    open(full_name, "w").write(patch_data)
    ui.status("Patch saved to ", full_name)


mkpatch_options = [
        ("", "patchdir", '', "patch directory"),
        ]
cmdtable = {
    "mkpatch": (mkpatch, mkpatch_options + mq.cmdtable['^qdiff'][1], "hg mkpatch [OPTION]... [FILE]...")
}

Give Firefox 3.1 Beta 3 a try!

chris

2009-03-13 08:58

Comments

We just released the third beta of Firefox 3.1. Read more about it over at Mozilla Developer Center, or if you're impatient, go download it now!

Upgraded to Wordpress 2.7

chris

2009-02-08 02:40

Comments

I just spent a few minutes upgrading my blog to wordpress 2.7. Looks like everything went smoothly! I did this upgrade with mercurial queues again. Wordpress 2.7 is supposed to have better upgrade support built in, so I may not need mercurial for future upgrades. Please let me know if you notice anything strange or missing since the upgrade.

ssh on-the-fly port forwarding

chris

2008-12-09 22:35

Comments

Check out this great tip from nion's blog: ssh on-the-fly port forwarding. I've often wanted to open up new port forwards, but haven't wanted to shut down my existing session.

If you follow this by # character (and thus type ~#) you get a list of all forwarded connections. Using ~C you can open an internal ssh shell that enables you to add and remove local/remote port forwardings ssh> help Commands: -L[bind_address:]port:host:hostport Request local forward -R[bind_address:]port:host:hostport Request remote forward -KR[bind_address:]port Cancel remote forward ssh> -L 8080:localhost:8080

I'm on The Twitter

chris

2008-12-08 23:04

Comments

Well, it seems like all the cool kids are tweetering these days, so to keep up with the times, I've signed up on the twitter. You can follow my twittering here: http://twitter.com/chrisatlee

chris' random ramblings

Posts about technology (old posts, page 7)

Profiling Buildbot

Interpreting the results

What to do about it?

Pooling the Talos slaves

Faster

Smarter

Scalable

More Robust

When is all this going to happen?

Parallelizing Unit Tests

Upcoming Identity Management with Weave

poster 0.4 released

Exporting MQ patches

Give Firefox 3.1 Beta 3 a try!

Upgraded to Wordpress 2.7

ssh on-the-fly port forwarding

I'm on The Twitter