
Posts about buildbot (old posts, page 1)

BYO Build Dashboard

I'm happy to announce that we've started publishing some build data we've been collecting for the past several months. If you're interested in build data, like what jobs get run on which machines, and how long they take, then the files at http://build.mozilla.org/builds will be of interest to you. These are JSON dumps of build data collected for most of our systems since October. There is one file per 24-hour period, as well as one file for the past 4 hours of data. If you want to look at the format, the 4 hour file is easier to read; the other files don't use any extra white space. All times should be interpreted as unix timestamps (seconds since Jan 1, 1970 00:00:00 UTC). The "requesttime" is a best-effort calculation of when a build was requested based on when the revision in question was pushed to hg, or when the unit test or talos run was triggered. That's it! This should be enough data to get started writing some dashboards and other visualizations. I'd love to hear how people are using this data, and if there's anything missing from the data provided that would be useful.
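
If you want to poke at the data programmatically, here's a minimal sketch (the "4hours.json" filename and the top-level "builds" key are guesses on my part for illustration; check the directory listing at http://build.mozilla.org/builds for the real filenames and layout):

# Sketch: fetch the 4-hour dump and print request times in a readable form.
# NOTE: the filename and the top-level "builds" key below are assumptions.
import time, urllib2
try:
    import json                    # Python 2.6+
except ImportError:
    import simplejson as json      # older Pythons

data = json.load(urllib2.urlopen("http://build.mozilla.org/builds/4hours.json"))

for build in data.get("builds", []):
    # All times are unix timestamps (seconds since Jan 1, 1970 00:00:00 UTC).
    requested = build.get("requesttime")
    if requested:
        print time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(requested))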

Profiling Buildbot

Buildbot is a critical part of our build infrastructure at Mozilla. We use it to manage builds on 5 different platforms (Linux, Mac, Windows, Maemo and Windows Mobile), and 5 different branches (mozilla-1.9.1, mozilla-central, TraceMonkey, Electrolysis, and Places). All in all we have 80 machines doing builds across 150 different build types (not counting Talos; all the Talos test runs and slaves are managed by a different master). And buildbot is at the center of it all. The load on our machine running buildbot is normally fairly high, and occasionally spikes so that the buildbot process is unresponsive. It normally restores itself within a few minutes, but I'd really like to know why it's doing this! Running our staging buildbot master with python's cProfile module for almost two hours yields the following profile:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   416377 4771.188    0.011 4796.749    0.012 {select.select}
       52  526.891   10.133  651.043   12.520 /tools/buildbot/lib/python2.5/site-packages/buildbot-0.7.10p1-py2.5.egg/buildbot/status/web/waterfall.py:834(phase2)
     6518  355.370    0.055  355.370    0.055 {posix.fsync}
   232582  238.943    0.001 1112.039    0.005 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:150(dataReceived)
 10089681  104.395    0.000  130.089    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:36(b1282int)
36798140/36797962   83.536    0.000   83.537    0.000 {len}
 29913653   70.458    0.000   70.458    0.000 {method 'append' of 'list' objects}
      311   63.775    0.205   63.775    0.205 {bz2.compress}
 10088987   56.581    0.000  665.982    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:141(gotItem)
4010792/1014652   56.079    0.000  176.693    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/jelly.py:639(unjelly)
2343910/512709   47.954    0.000  112.446    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:281(_encode)

Interpreting the results

select shows up in the profile because we're profiling wall clock time, not CPU time. Time spent inside select is idle time, so the more time we spend there, the better: it means buildbot is simply waiting for data to arrive. The overall run time for this profile was 7,532 seconds, so select accounts for around 63% of our total time. We already knew that the buildbot waterfall was slow (the second line in the profile). fsync isn't too surprising either; buildbot calls fsync after writing log files to disk. We've considered removing this call, and this profile lends support to our original guess. The next entries really surprised me: twisted's dataReceived and a decoding function, b1282int. These are called when processing data received from the slaves. If I'm reading this correctly, dataReceived and its children account for around 40% of our total time once you remove the time spent in select: 1112 / (7532 - 4796) = 40%. These results are from our staging buildbot master, which doesn't have anywhere near the same load as the production buildbot master. I would expect the time spent waiting in select to go down on the production master (there's more data being received, more often), and the time spent in fsync and dataReceived to go up.
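
For reference, the profiling harness is nothing exotic; a sketch like the one below (not the exact invocation we used on the staging master, and run_master() here is just a hypothetical stand-in for starting the reactor) is enough to produce a table like the one above. cProfile's default timer measures wall-clock time, which is why select dominates:

# Sketch: wrap the master's main loop in cProfile and dump stats on shutdown.
# run_master() is a hypothetical stand-in for however the reactor gets started.
import cProfile, pstats

def run_master():
    # ... start the buildbot/twisted reactor here ...
    pass

profiler = cProfile.Profile()          # default timer: wall-clock time
try:
    profiler.runcall(run_master)
finally:
    profiler.dump_stats("master.prof")

# Afterwards, sort the results the same way as the table above:
pstats.Stats("master.prof").sort_stats("time").print_stats(10)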

What to do about it?

A few ideas....
  • Use psyco to get some JIT goodness applied to some of the slower python functions (see the sketch after this list).
  • Remove the fsync call after saving logs.
  • Use the cpu-time to profile rather than wallclock time. This will give a different perspective on the performance of buildbot, which should give better information about where we're spending time processing data.
  • Implement slow pieces in C (or cython). Twisted's Banana library looks do-able in C, and also is high up in the profile.
  • Send less data from the slaves. We're currently logging all stdout/stderr produced by the slaves. All of this data is processed by the master process and then saved to disk.
  • Rearchitect buildbot to handle this kind of load.
  • Have more than one buildbot master, each one handling fewer slaves. We're actively looking into this approach, since it also allows us to have some redundancy for this critical piece of our infrastructure.
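
To give an idea of what the psyco option might look like, here's a sketch that binds just the hot decoder function from the profile rather than the whole program. This assumes psyco is installed and that the master runs on 32-bit x86, the only architecture psyco supports; whether it actually helps is exactly what we'd need to measure:

# Sketch: let psyco specialize the banana decoder that's high up in the profile.
try:
    import psyco
    from twisted.spread import banana
    psyco.bind(banana.b1282int)    # the decoding function from the profile
except ImportError:
    pass                           # psyco not available; run unmodified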

Pooling the Talos slaves

One of the big projects for me this quarter was getting our Talos slaves configured as a pool of machines shared across branches. The work is being tracked in bug 488367 for those interested in the details. This is a continuation of our work on pooling our slaves, as we've done over the past year with our build, unittest, and l10n slaves. Up until now, each branch has had a dedicated set of Mac Minis to run performance tests for just that branch, on five different operating systems. For example, the Firefox 3.0 branch used to have 19 Mac Minis doing regular Talos tests: 4 of each platform (except for Leopard, which had 3). Across our 4 active branches (Firefox 3.0, 3.5, 3.next, and TraceMonkey), we have around 80 minis in total. That's a lot of minis! What we've been working towards is putting all the Talos slaves into one pool that is shared between all our active branches. Slaves will be given builds to test in FIFO order, regardless of which branch the build is produced on. This new pool will be....

Faster

With more slaves available to all branches, the time to wait for a free slave will go down, so testing can start more quickly...which means you get your results sooner!

Smarter

It will be able to handle varying load between branches. If there's a lot of activity on one branch, like on the Firefox 3.5 branch before a release, then more slaves will be available to test those builds and won't be sitting idle waiting for builds from low activity branches.

Scalable

We will be able to scale our infrastructure much better using a pooled system. Similar to how moving to pooled build and unittest slaves has allowed us to scale based on number of checkins rather than number of branches, having pooled Talos slaves will allow us to scale our capacity based on number of builds produced rather than the number of branches. In the current setup, each new release or project branch required an allocation of at least 15 minis to dedicate to the branch. Once all our Talos slaves are pooled, we will be able to add Talos support for new project or release branches with a few configuration changes instead of waiting for new minis to be provisioned. This means we can get up and running with new project branches much more quickly!

More Robust

We'll also be in a much better position in terms of maintenance of the machines. When a slave goes offline, the test coverage for any one branch won't be jeopardized, since we'll still have the rest of the slaves to test builds from that branch. In the current setup, if one or two machines of the same platform need maintenance on one branch, then our performance test coverage of that branch is significantly impacted. With only one or two machines remaining to run tests on that platform, it can be difficult to determine whether a performance regression is caused by a code change or by some machine issue. Losing two or three machines in this scenario is enough to close the tree, since we no longer have reliable performance data. With pooled slaves we would see a much more gradual decrease in coverage when machines go offline. It's the difference between losing one third of the machines on your branch and losing one tenth.

When is all this going to happen?

Some of it has started already! We have a small pool of slaves testing builds from our four branches right now. If you know how to coerce Tinderbox to show you hidden columns, you can take a look for yourself. They're also reporting to the new graph server using machine names starting with 'talos-rev2'. We have some new minis waiting to be added to the pool. Together with existing slaves, this will give us around 25 machines in total to start off the new pool. This isn't enough yet to be able to test every build from each branch without skipping any, so for the moment the pool will skip to the most recent build per branch if there's any backlog. It's worth pointing out that our current Talos system also skips builds if there's any backlog. However, our goal is to turn off skipping once we have enough slaves in the pool to handle our peak loads comfortably. After this initial batch is up and running, we'll be waiting for a suitable time to start moving the existing Talos slaves into the pool. All in all, this should be a big win for everyone!
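
To make the skipping behaviour concrete, here's a rough illustration of the dispatch policy described above. This isn't buildbot's actual scheduler code, just a sketch of the idea:

# Illustration only: FIFO dispatch across branches, collapsing any backlog
# down to the most recent build per branch while skipping is enabled.
def next_builds(pending, skipping_enabled=True):
    # pending: list of (request_time, branch, build) tuples in arrival order
    if not skipping_enabled:
        return [b for _, _, b in sorted(pending, key=lambda p: p[0])]
    newest = {}
    for req_time, branch, build in pending:
        # Keep only the most recent request per branch when there's a backlog.
        if branch not in newest or req_time > newest[branch][0]:
            newest[branch] = (req_time, build)
    # Hand the survivors out in FIFO order of their request times.
    return [b for _, b in sorted(newest.values(), key=lambda n: n[0])]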

XULRunner Nightlies now available

As Mossop mentioned, I've been working for the past week on getting XULRunner nightly builds up and running. I'm happy to announce that they're now available! The first builds of XULRunner for Linux (i686 and x86_64), Windows and OS X (i386 and ppc) for both mozilla-1.9.1 and mozilla-central (1.9.2) are finishing up, and are available (or will be soon!) at http://ftp.mozilla.org/pub/mozilla.org/xulrunner/nightly/. Fresh builds will be available every night. Enjoy!

Clobbering the trees

Today we landed some changes that will give developers self-serve clobber ability on our Mozilla Central / Mozilla 1.9.1 / TraceMonkey infrastructure. In our current infrastructure, we have a large pool of slave machines for each platform, and each machine builds all the various branches. This makes it nice and easy to spin up new project and release branches, and automatically distributes jobs across branches. However, it can sometimes be confusing when tracking down a build or test failure. Sometimes a particular machine needs to have its build directory cleaned out, and sometimes all the machines for one branch or build type need to be cleaned up. Until now, this could only be done by RelEng by accessing the build machines directly. But now you can do it too! If you've got a valid LDAP account, head on over to http://build.mozilla.org/clobberer. You'll see a giant table with lots of checkboxes on it. If you check the box next to one of the slaves on a particular branch / builder, then the next time that slave runs a build on that branch, it will first delete the entire build directory, then do a fresh checkout and continue on with the rest of the build. Selecting a builder-level checkbox simply selects all the slaves for that builder, and similarly, selecting the branch-level checkbox selects all the slaves for all the builders in that branch. In addition, if a slave has not been clobbered within a configurable time period (currently set to 1 week), it will clobber on its next run. Slaves are added to the database as they report in to ask for their clobber data, so it could take a little while for all the slave / builder / branch combinations to show up. See bug 432236 for more information.
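
Under the hood, the slave-side check amounts to something like the sketch below. The URL parameters, response format, and helper names here are hypothetical, just to show the flow; bug 432236 has the real details:

# Sketch of the clobber check a slave might perform before each build.
# The query parameters and the server's response format are assumptions.
import os, shutil, time, urllib, urllib2

CLOBBER_URL = "http://build.mozilla.org/clobberer"
MAX_AGE = 7 * 24 * 3600                  # periodic clobber: currently 1 week

def maybe_clobber(builddir, branch, builder, slave, last_clobber):
    query = urllib.urlencode({"branch": branch, "builder": builder, "slave": slave})
    # Hypothetical: assume the server replies with the unix time of the most
    # recently requested clobber for this slave / builder / branch combination.
    requested_at = int(urllib2.urlopen(CLOBBER_URL + "?" + query).read() or 0)

    too_old = time.time() - last_clobber > MAX_AGE
    if (requested_at > last_clobber or too_old) and os.path.exists(builddir):
        shutil.rmtree(builddir)          # delete the entire build directory
        return time.time()               # a fresh checkout follows as the build continues
    return last_clobber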

python reload: danger, here be dragons

At Mozilla, we use buildbot to coordinate performing builds, unit tests, performance tests, and l10n repacks across all of our build slaves. There is a lot of activity on a project the size of Firefox, which means that the build slaves are kept pretty busy most of the time. Unfortunately, like most software out there, our buildbot code has bugs in it. buildbot provides two ways of picking up new changes to code and configuration: 'buildbot restart' and 'buildbot reconfig'. Restarting buildbot is the cleanest thing to do: it shuts down the existing buildbot process, and starts a new one once the original has shut down cleanly. The problem with restarting is that it interrupts any builds that are currently active. The second option, 'reconfig', is usually a great way to pick up changes to buildbot code without interrupting existing builds. 'reconfig' is implemented by sending SIGHUP to the buildbot process, which triggers a python reload() of certain files. This is where the problem starts. Reloading a module basically re-initializes the module, including redefining any classes that are in the module...which is what you want, right? The whole reason you're reloading is to pick up changes to the code you have in the module! So let's say you have a module, foo.py, with these classes:


class Foo(object):
    def foo(self):
        print "Foo.foo"


class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        Foo.foo(self)
and you're using it like this:

>>> import foo
>>> b = foo.Bar()
>>> b.foo()
Bar.foo
Foo.foo

Looks good! Now, let's do a reload, which is what buildbot does on a 'reconfig':

>>> reload(foo)
<module 'foo' from '/Users/catlee/test/foo.py'>
>>> b.foo()
Bar.foo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/catlee/test/foo.py", line 13, in foo
    Foo.foo(self)
TypeError: unbound method foo() must be called with Foo instance as first argument (got Bar instance instead)

Whoops! What happened? The TypeError exception is complaining that Foo.foo must be called with an instance of Foo as the first argument. (NB: we're calling the unbound method on the class here, not a bound method on the instance, which is why we need to pass in 'self' as the first argument. This is the typical pattern for calling your parent class's implementation.) But wait! Isn't Bar a subclass of Foo? And why did this work before? Let's try this again, but let's watch what happens to Foo and Bar this time, using the id() function:

>>> import foo
>>> b = foo.Bar()
>>> id(foo.Bar)
3217664
>>> reload(foo)
<module 'foo' from '/Users/catlee/test/foo.py'>
>>> id(foo.Bar)
3218592

(The id() function returns a unique identifier for objects in python; if two objects have the same id, then they refer to the same object.) The ids are different, which means that we get a new Bar class after we reload...I guess that makes sense. Take a look at our b object, which was created before the reload:

>>> b.__class__
<class 'foo.Bar'>
>>> id(b.__class__)
3217664

So b is an instance of the old Bar class, not the new one. Let's look deeper:

>>> b.__class__.__bases__
(<class 'foo.Foo'>,)
>>> id(b.__class__.__bases__[0])
3216336
>>> id(foo.Foo)
3218128

Aha! The old Bar's base class (Foo) is different from what's currently defined in the module. After we reloaded the foo module, the Foo class was redefined, which is presumably what we want. The unfortunate side effect is that any reference by name to the class 'Foo' will pick up the new Foo class, including the code in methods of subclasses. There are probably other places where this has unexpected results, but for us, this is the biggest problem. Reloading essentially breaks class inheritance for objects whose lifetime spans the reload. Using super() in the normal way doesn't even work, since you usually refer to your instance's class by name:

class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        super(Bar, self).foo()
If you're using new-style classes, it looks like you can get around this by looking at your __class__ attribute:

class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        super(self.__class__, self).foo()
Buildbot isn't using new-style classes...yet...so we can't use super(). Another workaround I'm playing around with is to use the inspect module to get at the class hierarchy:

def get_parent(obj, n=1):
    # Walk obj's method resolution order: index 0 is obj's own class,
    # so n=1 gives its immediate parent -- whichever class the instance
    # was actually built from, not whatever the module currently defines.
    import inspect
    return inspect.getmro(obj.__class__)[n]


class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        get_parent(self).foo(self)
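
As a quick check that this actually helps, here's the kind of session I'd expect with the get_parent() version of foo.py (same module as above). Because inspect.getmro() walks the old instance's own class hierarchy instead of looking Foo up by name in the reloaded module, the pre-reload object keeps working:

>>> import foo
>>> b = foo.Bar()
>>> b.foo()
Bar.foo
Foo.foo
>>> reload(foo)
<module 'foo' from '/Users/catlee/test/foo.py'>
>>> b.foo()
Bar.foo
Foo.foo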