Posts about buildbot

So long Buildbot, and thanks for all the fish

Last week, without a lot of fanfare, we shut off the last of the Buildbot infrastructure here at Mozilla.

Our primary release branches have been switched over to taskcluster for some time now. We needed to keep buildbot running to support the old ESR52 branch. With the release of Firefox 60.2.0esr earlier this month, ESR52 is now officially end-of-life, and therefore so is buildbot here at Mozilla.

Looking back in time, the first commit to our buildbot-configs repository was made over 10 years ago, on April 27, 2008, by Ben Hearsum: "Basic Mozilla2 configs". Buildbot usage at Mozilla actually predates that by at least two years; Ben was already working on some patches in 2006.

Earlier in my career here at Mozilla, I was doing a lot of work with Buildbot, and blogged quite a bit about our experiences with it.

Buildbot served us well, especially in the early days. There really were no other CI systems at the time that could operate at Mozilla's scale.

Unfortunately, as we kept increasing the scale of our CI and release infrastructure, even buildbot started showing some problems. The main architectural limitations of buildbot we encountered were:

  1. Long-lived TCP sessions had to stay connected to specific server processes. If the network blipped, or you needed to restart a server, then any jobs running on workers were interrupted.

  2. Its monolithic design meant that small components of the project were hard to develop independently from each other.

  3. The database schema used to implement the job queue became a bottleneck once we started doing hundreds of thousands of jobs a day.

On top of that, our configuration for all the various branches and platforms had grown over the years to a complex set of inheritance rules, defaults, and overrides. Only a few brave souls outside of RelEng managed to effectively make changes to these configs.

Today, much much more of the CI and release configuration lives in tree. This has many benefits including:

  1. Changes are local to the branches they land on. They ride the trains naturally. No need for ugly looooooops.

  2. Developers can self-service most of their own requests. Adding new types of tests, or even changing the compiler are possible without any involvement from RelEng!

Buildbot is dead! Long live taskcluster!

A tale of python profiling and optimizing

Release Engineering at Mozilla relies heavily on buildbot for much of our infrastructure. For various reasons we're running an older version based on buildbot 0.8.2, which we maintain in our own mercurial repository. We have many different masters with all sorts of different configurations.

To make sure that we don't break things too often, we wrote a tool called test-masters.sh that creates local versions of each unique master configuration and then runs a configuration check on it. Currently there are 20 unique master configurations to test, and running test-masters.sh on all of them takes 4 minutes on my machine. Recently sfink landed some changes to test all the masters in parallel, which brought the time down from a previously interminable 11 minutes.

Four minutes is a long time to wait! What's taking so much time?

My go-to tool for profiling python code is cProfile. I ended up writing a small script to do the equivalent of 'buildbot checkconfig':

import cProfile
import sys

from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    # Constructing a ConfigLoader parses and validates the master config,
    # which is the same work 'buildbot checkconfig' does.
    ConfigLoader(configFileName=filename)

cProfile.run("loadMaster(sys.argv[1])", "master.prof")

Running this against my buildbot master's master.cfg file will produce a master.prof file I can load to look at the profile.

>>> import pstats
>>> pstats.Stats("master.prof").sort_stats('time').print_stats(5)
Thu Nov  7 21:42:25 2013    master.prof

         26601516 function calls (24716688 primitive calls) in 439.756 seconds

   Ordered by: internal time
   List reduced from 1997 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  409.381  409.381  438.936  438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig)
   170046    3.907    0.000   10.706    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__)
   222809    3.745    0.000    4.124    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__)
        1    3.025    3.025   29.352   29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>)
   170046    2.385    0.000    6.033    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)

Looks like buildbot's loadConfig is taking a long time! Unfortunately we don't get any more detail than that from cProfile. To get line-by-line profiling info, I've started using kernprof. This requires a few changes to our setup. First, we don't want to run cProfile any more, so modify our test script like this:

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

#cProfile.run("loadMaster(sys.argv[1])", "master.prof")
loadMaster(sys.argv[1])

And since we want to get line-by-line profiling of buildbot's loadConfig, we need to annotate buildbot's code with the @profile decorator. In buildbot's master.py, I added @profile to loadConfig so it looks like this now:

@profile
def loadConfig(self, f, check_synchronously_only=False):
    """Internal function to load a specific configuration file. Any
    errors in the file will be signalled by raising an exception.
    <snip>
    """

Now run kernprof.py to get our profile:

kernprof.py -l -v ../profile_master.py master.cfg

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           17     17.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1            7      7.0      0.0          try:
   636         1     68783020 68783020.0     12.0              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        86781      6.5      0.0          for b in builders:
   786     13302        92920      7.0      0.0              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     42782768      6.2      7.5              for n in b.get('slavenames', []):
   790   6922612    449928915     65.0     78.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      2478517    186.3      0.4              if b['name'] in buildernames:

Wow! Line 790 is taking 78% of our runtime! What's going on?

If I look at my config, I see that I have 13,302 builders and 13,988 slaves configured. Each builder has a subset of slaves attached, but we're still doing 6,922,612 checks to see if each slave that's configured for the builder is configured as a top-level slave. slavenames happens to be a list, so each check does a full scan of the list to see if the slave exists or not!
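To get a feel for the cost, here's a small stand-alone micro-benchmark (my own sketch, not from the original patch; the names and sizes are illustrative) comparing membership tests against a list and a set of roughly our slave count:

```python
# Micro-benchmark: membership tests against ~14,000 names.
# A list scans every element on a miss; a set hashes straight to a bucket.
import timeit

names = ["slave%05d" % i for i in range(14000)]  # roughly our 13,988 slaves
as_list = list(names)
as_set = set(names)

# A name that isn't present forces the list to scan all 14,000 entries.
list_time = timeit.timeit(lambda: "missing" in as_list, number=1000)
set_time = timeit.timeit(lambda: "missing" in as_set, number=1000)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

On any machine the set lookups come out orders of magnitude faster, and we do nearly seven million of these lookups per config check.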

A very simple patch fixes this:

diff --git a/master/buildbot/master.py b/master/buildbot/master.py
--- a/master/buildbot/master.py
+++ b/master/buildbot/master.py
@@ -761,19 +761,19 @@ class BuildMaster(service.MultiService):
         errmsg = "c['schedulers'] must be a list of Scheduler instances"
         assert isinstance(schedulers, (list, tuple)), errmsg
         for s in schedulers:
             assert interfaces.IScheduler(s, None), errmsg
         assert isinstance(status, (list, tuple))
         for s in status:
             assert interfaces.IStatusReceiver(s, None)

-        slavenames = [s.slavename for s in slaves]
+        slavenames = set(s.slavename for s in slaves)

         # convert builders from objects to config dictionaries
         builders_dicts = []
         for b in builders:
             if isinstance(b, BuilderConfig):
                 builders_dicts.append(b.getConfigDict())
             elif type(b) is dict:
                 builders_dicts.append(b)

Now our checks are against a set instead of a list, which is an average O(1) operation instead of O(n). Let's re-run our profile with this patch:

File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py
Function: loadConfig at line 621
Total time: 109.885 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           30     30.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1           13     13.0      0.0          try:
   636         1     46268025 46268025.0     42.1              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        56199      4.2      0.1          for b in builders:
   786     13302        60438      4.5      0.1              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     27430456      4.0     25.0              for n in b.get('slavenames', []):
   790   6922612     29028880      4.2     26.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      1286628     96.7      1.2              if b['name'] in buildernames:
   794                                                           raise ValueError("duplicate builder name %s"
   795                                                                            % b['name'])

We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!

I've landed the resulting patch into our repo. I encourage you to upgrade!

self-serve builds!

Do you want to be able to cancel your own try server builds?

Do you want to be able to re-trigger a failed nightly build before the RelEng sheriff wakes up?

Do you want to be able to get additional test runs on your build?

If you answered an enthusiastic YES to any or all of these questions, then self-serve is for you.

self-serve was created to provide an API to allow developers to interact with our build infrastructure, with the goal being that others would then create tools against it. It's still early days for this self-serve API, so just a few caveats:

  • This is very much pre-alpha and may cause your computer to explode, your keg to run dry, or may simply hang.
  • It's slower than I want. I've spent a bit of time optimizing and caching, but I think it can be much better. Just look at shaver's bugzilla search to see what's possible for speed. Part of the problem here is that it's currently running on a VM that's doing a few dozen other things. We're working on getting faster hardware, but didn't want to block this pre-alpha-rollout on that.
  • You need to log in with your LDAP credentials to work with it.
  • The HTML interface is teh suck. Good thing I'm not paid to be a front-end webdev! Really, the goal here wasn't to create a fully functional web interface, but rather to provide a functional programmatic interface.
  • Changing build priorities may run afoul of bug 555664...haven't had a chance to test out exactly what happens right now if a high priority job gets merged with a lower priority one.

That being said, I'm proud to be able to finally make this public. Documentation for the REST API is available as part of the web interface itself, and the code is available as part of the buildapi repository on hg.mozilla.org.

https://build.mozilla.org/buildapi/self-serve

Please be gentle!

Any questions, problems or feedback can be left here, or filed in bugzilla.

Faster try builds!

When we run a try build, we wipe out the build directory between each job; we want to make sure that every user's build has a fresh environment to build in.

Unfortunately this means that we also wipe out the clone of the try repo, and so we have to re-clone try every time.

On Linux and OSX we were spending an average of 30 minutes to re-clone try, and on Windows 40 minutes. The majority of that is simply 'hg clone' time, but a good portion is due to locks: we need to limit how many simultaneous build slaves are cloning from try at once, otherwise the hg server blows up.

Way back in September, Steve Fink suggested using hg's share extension to make cloning faster.

Then in November, Ben Hearsum landed some changes that paved the way to actually turning this on.

Today we've enabled the share extension for Linux (both 32 and 64-bit) and OSX 10.6 builds on try. Windows and OSX 10.5 are coming too; we need to upgrade hg on the build machines first.
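For the curious, the mechanics look roughly like this. This is an illustrative sketch only: the paths and the $REV variable are placeholders, not our actual slave configuration. The share extension ships with hg and just needs to be enabled in hgrc:

```shell
# Enable the bundled share extension in hgrc:
#   [extensions]
#   share =

# One persistent clone per machine, done once:
hg clone https://hg.mozilla.org/try /builds/hg-shared/try

# Per build, instead of a full re-clone, create a lightweight working
# copy that shares the persistent clone's history store:
hg share /builds/hg-shared/try /builds/slave/try-work

# New changesets pulled into the shared store are immediately visible
# to every share, so updating the working copy is cheap:
hg -R /builds/hg-shared/try pull https://hg.mozilla.org/try
hg -R /builds/slave/try-work update -C $REV
```

Since only the working copy is created per build, the expensive part (transferring and storing history) happens once per machine instead of once per job.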

Average times for the 'clone' step are down to less than 5 minutes now.

This means you get your builds 25 minutes faster! It also means we're not hammering the try repo so badly, and so hopefully won't have to reset it for a long long time.

We're planning on rolling this out across the board, so nightly builds get faster, release builds get faster, clobber builds get faster, etc...

Enjoy!

Better nightly builds

On November 24th we landed some changes we think are a big improvement to how we've been doing nightly builds.

We've started doing nightly builds on the same revision across platforms, and where possible, on revisions that we already know compile. In addition, all of the nightly builds will share the same buildid.

We pick the revision to build by looking at the past 24 hours of builds, and finding the latest one that built successfully on all platforms. If there is no such revision, we'll build the latest revision on the branch and hope for the best. We also do some extra checking to make sure we don't build a revision that's older than the previous nightly.
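In simplified form, the selection logic described above looks something like this (a sketch with invented names, not the actual buildbot code; it omits the "not older than the previous nightly" check):

```python
def pick_nightly_revision(revisions, green_platforms, all_platforms):
    """Pick the revision to build nightlies from.

    revisions: candidate revisions from the last 24 hours, newest first.
    green_platforms: maps revision -> set of platforms that built it
    successfully.
    all_platforms: the set of platforms we ship nightlies for.
    """
    for rev in revisions:
        # Take the newest revision that compiled everywhere.
        if green_platforms.get(rev, set()) >= all_platforms:
            return rev
    # No all-green revision in the window: build tip and hope for the best.
    return revisions[0]

platforms = {"linux", "mac", "win"}
green = {"r2": {"linux", "mac", "win"}, "r3": {"linux"}}
print(pick_nightly_revision(["r3", "r2", "r1"], green, platforms))
```

Here "r3" is newest but only built on Linux, so "r2", the newest all-green revision, gets picked.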

Prior to this, we would trigger the nightly builds at the same time every day, but which revision and buildid each platform got was completely dependent on what time the builds actually started. There were many instances of nightly builds having different revisions between the various platforms.

These changes are a big deal because they mean that we have a much better chance of getting a working nightly build every day (or at least one that compiles!), and all the platforms will be based on the same revision, regardless of when they are run. No more guessing if today's Windows build includes the same fixes as the Mac build!

If you're interested in the implementation, which includes some pretty fun buildbot hacking, the details are in bug 570814. Background discussion as to how to choose the revision to use can be found in this thread.

3 days of fun: a journey into the bowels of buildbot

I've just spent 3 days trying to debug some new code in buildbot.

The code in question is to implement a change to how we do nightly builds such that they use the same revision for all platforms.

I was hitting a KeyError exception inside buildbot's util.loop code, specifically at a line where it is trying to delete a key from a dictionary. In simple form, the loop is doing this:

for k in d.keys():
    if condition:
        del d[k] # Raises KeyError....sometimes...

Tricky bit was, it didn't happen every time. I'd have to wait at least 3 minutes between attempts.

So I added a bunch of debugging code:

print d
print d.keys()

for k in d.keys():
    print k
    if condition:
        try:
            del d[k] # Raises KeyError....sometimes...
        except KeyError:
            print k in d        # sanity check 1
            print k in d.keys() # sanity check 2

Can you guess what the results of sanity checks 1 and 2 were?

'k in d' is False, but 'k in d.keys()' is True.

whhhaaaaa? Much head scratching and hair pulling ensued. I tried many different variations of iterating through the loop, all with the same result.

In the end, I posted a question on Stack Overflow.

At the same time, Bear and Dustin were zeroing in on a solution. The crucial bit here is that the keys of d are (follow me here...) methods of instances of my new scheduler classes, which inherit from buildbot.util.ComparableMixin...which implements __cmp__ and __hash__. __cmp__ is used in the 'k in d.keys()' test, but __hash__ is used in the 'k in d' test.

Some further digging revealed that my scheduler was modifying state that ComparableMixin.__hash__ was referring to, resulting in the scheduler instances not having stable hashes over time.
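A minimal way to reproduce the effect (my own sketch, not the actual scheduler or ComparableMixin code) is a key whose hash depends on mutable state. Note that in Python 2, d.keys() returned a plain list, which is what made the two membership tests disagree; the Python 3 equivalent of that test is list(d):

```python
class UnstableKey(object):
    """A deliberately broken key: its hash changes when its state does."""
    def __init__(self, state):
        self.state = state

    def __hash__(self):
        return hash(self.state)

    def __eq__(self, other):
        return self.state == other.state

k = UnstableKey(1)
d = {k: "value"}      # stored in the bucket for hash(1)
k.state = 2           # mutated after insertion: hash(k) is now hash(2)

print(k in d)         # False: probes the wrong hash bucket
print(k in list(d))   # True: list membership only needs identity/equality
# del d[k] here raises KeyError -- exactly the symptom above.
```

The dictionary remembers where the key was stored at insertion time; once the hash drifts, lookups probe the wrong bucket even though the key is plainly "in" the dict when you iterate over it.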

Meanwhile, on stackoverflow, adw came up with an answer that confirmed what Dustin and Bear were saying, and katrielalex came up with a simple example to reproduce the problem.

In the end, the fix was simple, just a few extra lines of python code. Too bad it took me 3 days to figure out!

The long road to victory: build & test logs on FTP

One of the biggest pieces of infrastructure keeping us attached to Tinderbox is the issue of build logs. Tinderbox has been (and continues to be) the canonical source of a build or test's output.

A while ago we made some changes that should help pry us loose from this close dependency on Tinderbox.

Build & test logs for all regular depend & try builds are now available on FTP. For example, today's nightly build log for windows is available at http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2010-10-27-06-mozilla-central/mozilla-central-win32-nightly-build26.txt.gz. It's also in the same directory as the build itself!

Test logs are available for depend builds, e.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1288215864/mozilla-central_leopard_test-crashtest-build1283.txt.gz. Again, all the builds, logs and other files (test archives, symbols, etc.) are in the same directory.

Logs for your try builds and tests are also available, e.g. http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/jdrew@mozilla.com-525f51996a52/tryserver-macosx64/. Here too all the logs and binaries are in the same directory. In addition you should be getting links to those logs in your try server notification emails.

There's room for improvement here, and a few places where logs aren't getting put in all the right places, but this is still a major step forward.

Enjoy!

Note to future readers: some of the links to the logs above will become invalid a few weeks after this post, as old directories are purged.

A year in RelEng

Something prompted me to look at the size of our codebase here in RelEng, and how much it changes over time. This is the code that drives all the build, test and release automation for Firefox, project branches, and Try, as well as configuration management for the various build and test machines that we have.

Here are some simple stats:

2,193 changesets across 5 repositories...that's about 6 changes a day on average.

We grew from 43,294 lines of code last year to 73,549 lines of code as of today. That's 70% more code today than we had last year.

We added 88,154 lines to our code base, and removed 51,957. I'm not sure what this means, but it seems like a pretty high rate of change!

What do you want to know about builds?

Mozilla has been quite involved in recent buildbot development, in particular, helping to make it scale across multiple machines. More on this in another post!

Once deployed, these changes will give us the ability to give real time access to various information about our build queue: the list of jobs waiting to start, and which jobs are in progress. This should help other tools like Tinderboxpushlog show more accurate information. One limitation of the upstream work so far is that it only captures a very coarse level of detail about builds: start/end time, and result code is pretty much it. No further detail about the build is captured, like which slave it executed on, what properties it generated (which could include useful information like the URL to the generated binaries), etc.

We've also been exporting a json dump of our build status for many months now. It's been useful for some analysis, but it also has limitations: the data is always at least 5 minutes old by the time you look, and in-progress builds are not represented at all.

We're starting to look at ways of exporting all this detail in a way that's useful to more people. You want to get notified when your try builds are done? You want to look at which test suites are taking the most time? You want to determine how our build times change over time? You want to find out what the last all-green revision was on trunk? We want to make this data available, so anybody can write these tools.

Just how big is that firehose?

I think we have one of the largest buildbot setups out there and we generate a non-trivial amount of data:

  • 6-10 buildbot master processes generating updates, on different machines in 2 or 3 data centers
  • around 130 jobs per hour, composed of 4,773 individual steps per hour in total. That works out to roughly 1.4 updates generated per second

How you can help

This is where you come in.

I can think of two main classes of interfaces we could set up: a query-type interface where you poll for information that you are interested in, and a notification system where you register a listener for certain types of (or all!) events.

What would be the best way for us to make this data available to you? Some kind of REST API? A message or event brokering system? pubsubhubbub?

Is there some type of data or filtering that would be super helpful to you?

Buildbot performance and scaling

It seems like it was ages ago when I posted about profiling buildbot.

One of the hot spots identified there was the dataReceived call. This has been sped up a little bit in recent versions of twisted, but our buildbot masters were still severely overloaded.

It turns out that the buildbot slaves make a lot of RPC calls when sending log data, which results in tens of thousands of dataReceived calls. Multiply that by several dozen build slaves sending log data peaking at a combined throughput of 10 megabits/s and you've got an awful lot of data to handle.

By adding a small slave-side buffer, the number of RPC calls made to send log data is reduced drastically (by an order of magnitude in some tests), resulting in a much better load situation on the master. This is good for us, because it means the masters are much more responsive, and it's good for everybody else because it means we have fewer failures and less wasted time due to the master being too busy to handle everything. It also means we can throw more build slaves onto the masters!
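The idea, in rough form, looks like this (a stand-alone sketch with invented names, not the actual twisted/buildbot code): coalesce many small log writes into a single RPC.

```python
class BufferedLogSender(object):
    """Coalesce small log chunks into fewer, larger RPC calls.

    send_rpc stands in for the slave -> master remote call; in the real
    system this would be a twisted perspective-broker call.
    """
    def __init__(self, send_rpc, max_bytes=8192):
        self.send_rpc = send_rpc
        self.max_bytes = max_bytes
        self.chunks = []
        self.size = 0

    def write(self, data):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self):
        # One RPC for everything buffered so far.
        if self.chunks:
            self.send_rpc("".join(self.chunks))
            self.chunks = []
            self.size = 0
```

With 100-byte log lines and an 8 KB buffer, roughly 80 writes collapse into a single call, which is where the order-of-magnitude reduction in RPC traffic comes from.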

The new code was deployed towards the end of the day on the 26th, or the end of the 12th week.