Two great, completely unrelated links

Yesterday was a bit of an overwhelming day. After getting home at 1am from a long bus ride, I unwound by catching up on some news and email. I came across these two links, both of which really lifted my mood.

The first, Grokking the Zen of the Vi Wu-Wei, talks about a programmer’s journey from emacs to BBEdit to vim. This post is a great read in and of itself, but what’s really worth it is the link around the middle of the post to http://stackoverflow.com/questions/1218390/what-is-your-most-productive-shortcut-with-vim/1220118#1220118. This was truly a joy to read. Definitely the best answer I’ve ever seen on Stack Overflow, and quite possibly the best discussion of vi I’ve ever read. It taught me a lot, but I enjoyed reading it for more than that. It was almost like being on a little adventure, discovering all these little hidden secrets about the neighbourhood you’ve been living in for years. Like I said, it was 1am.

The second, The Pope, the judge, the paedophile priest and The New York Times, gave me some reassurance that things aren’t always as they seem as reported by the media. Regardless of how you feel about the Church or the Pope, it seems that journalistic integrity has fallen by the wayside here. From the article:

Fr Thomas Brundage, the former Archdiocese of Milwaukee Judicial Vicar who presided over the canonical criminal case of the Wisconsin child abuser Fr Lawrence Murphy, has broken his silence to give a devastating account of the scandal – and of the behaviour of The New York Times, which resurrected the story.

It looks as if the media were in such a hurry to blame the Pope for this wretched business that not one news organisation contacted Fr Brundage. As a result, crucial details were unreported.

The entire article is worth a read.

Buildbot performance and scaling

It seems like it was ages ago when I posted about profiling buildbot.

One of the hot spots identified there was the dataReceived call. This has been sped up a little bit in recent versions of twisted, but our buildbot masters were still severely overloaded.

It turns out that the buildbot slaves make a lot of RPC calls when sending log data, which results in tens of thousands of dataReceived calls. Multiply that by several dozen build slaves sending log data peaking at a combined throughput of 10 megabits/s and you’ve got an awful lot of data to handle.

By adding a small slave-side buffer, the number of RPC calls needed to send log data drops by an order of magnitude in some tests, which puts much less load on the master. This is good for us, because it means the masters are much more responsive, and it’s good for everybody else because it means we have fewer failures and less wasted time due to the master being too busy to handle everything. It also means we can throw more build slaves onto the masters!
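To illustrate the idea, here is a minimal sketch of a buffer that coalesces many small log writes into fewer, larger sends. It is not buildbot's actual code: the `BufferedLogSender` class, its `send` callable, and the `max_bytes` threshold are all hypothetical names chosen for this example.

```python
class BufferedLogSender:
    """Coalesce small log chunks into fewer, larger remote calls.

    Hypothetical sketch: `send` stands in for whatever performs one RPC.
    """

    def __init__(self, send, max_bytes=64 * 1024):
        self._send = send
        self._max_bytes = max_bytes
        self._chunks = []
        self._size = 0

    def write(self, data):
        # Hold on to the chunk; only go over the wire once enough has piled up.
        self._chunks.append(data)
        self._size += len(data)
        if self._size >= self._max_bytes:
            self.flush()

    def flush(self):
        # Send everything buffered so far as a single call.
        if self._chunks:
            self._send(b"".join(self._chunks))
            self._chunks = []
            self._size = 0


calls = []  # records each simulated RPC payload
sender = BufferedLogSender(calls.append, max_bytes=1024)
for _ in range(100):
    sender.write(b"x" * 100)  # 100 small writes of 100 bytes each
sender.flush()  # push out whatever is left at the end of the step
```

In this toy run the 100 writes collapse into far fewer sends while the same bytes arrive. A real implementation would presumably also flush on a timer so that slow, quiet logs still show up promptly on the master.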

The new code was deployed towards the end of the day on the 26th, or the end of the 12th week.

BYO Build Dashboard

I’m happy to announce that we’ve started publishing some build data we’ve been collecting for the past several months.

If you’re interested in build data, like what jobs get run on which machines, and how long they take, then the files at http://build.mozilla.org/builds will be of interest to you.

These are JSON dumps of build data collected for most of our systems since October. There is one file per 24-hour period, as well as one file for the past 4 hours of data. If you want to look at the format, the 4 hour file is easier to read; the other files don’t use any extra white space. All times should be interpreted as unix timestamps (seconds since Jan 1, 1970 00:00:00 UTC). The “requesttime” is a best-effort calculation of when a build was requested based on when the revision in question was pushed to hg, or when the unit test or talos run was triggered.

That’s it!

This should be enough data to get started writing some dashboards and other visualizations. I’d love to hear how people are using this data, and if there’s anything missing from the data provided that would be useful.
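As a starting point, here is a small sketch of handling the timestamps in these files. Only the “requesttime” field is named above; the `starttime` field and the inline sample record are assumptions made for illustration, so check the 4 hour file for the real structure before relying on them.

```python
import json
from datetime import datetime, timezone


def to_utc(ts):
    """Convert a unix timestamp (seconds since the epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Hypothetical record: "requesttime" is described in the post, "starttime" is assumed.
sample = json.loads('{"requesttime": 1267000000, "starttime": 1267000600}')

requested_at = to_utc(sample["requesttime"])
# How long the job waited between being requested and starting, in seconds.
wait_seconds = sample["starttime"] - sample["requesttime"]
```

Keeping everything in UTC avoids surprises when comparing records across the daily files.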

Faster build machines, faster end-to-end time

One metric Release Engineering focuses on a lot is the time between a commit to the source tree and when all the builds and tests for that revision are completed. We call this our end-to-end time. Our approach to improving this time has been to identify the longest bits in the total end-to-end time, and to make them faster.

One big chunk of the end-to-end time used to be wait times: how long a build or test waited for a machine to become free. We’ve addressed this by adding more build and test slaves into the pool.

We’ve also focused on parallelizing the entire process, so instead of doing this:

[Figure: before]

we now do this:

[Figure: after]

(not exactly to scale, but you get the idea)

Faster builds please!

After splitting up tests and reducing wait times, the longest part of the entire process was now the actual build time.

Last week we added 25 new machines to the pool of build slaves for Windows. These new machines are real hardware (not VMs), each with a quad-core 2.4 GHz processor, 4 GB RAM, and dedicated hard drives.

We had anticipated that the new build machines would crank out builds much faster.

And they can.

(the break in the graph is the time when mozilla-central was closed due to the PGO link failure bug)

We started rolling the new builders into production on February 22nd (where the orange diamonds start on the graph). You can see the end-to-end time immediately start to drop. From February 1st to the 22nd, our average end-to-end time for win32 on mozilla-central was 4h09. Since the 22nd the average time has dropped down to 3h02. That’s over an hour better on average, a 26% improvement.
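The arithmetic behind that comparison can be checked with a few lines. This is only a sketch: the `to_minutes` and `improvement` helpers are made up for this example, and they work from the rounded averages quoted above, so the percentage lands between 26% and 27%.

```python
def to_minutes(hhmm):
    """Parse a duration written like '4h09' into whole minutes."""
    hours, minutes = hhmm.split("h")
    return int(hours) * 60 + int(minutes)


def improvement(before, after):
    """Return (minutes saved, percent improvement relative to `before`)."""
    b, a = to_minutes(before), to_minutes(after)
    saved = b - a
    return saved, saved / b * 100


saved, percent = improvement("4h09", "3h02")
print(f"{saved} minutes saved, {percent:.1f}% better")
```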

Faster builds mean tests can start sooner, which means tests can finish sooner, which means a better end-to-end time. It also means that build machines become free to do another build sooner, and so we’re hoping that these faster build machines will also improve our wait time situation (but see below).

Currently we’re limited to running the build on these machines with -j1 because of random hangs in the build when using -j2 or higher (bug 524149). Once we fix that, or move to pymake, we should see even better improvements.

What about OSX?

In preparation for deploying these new hardware build machines, we also implemented some prioritization algorithms to choose fast build machines over slow ones, and also to try and choose a machine that has a recent object directory (to make our incremental builds faster). This has helped us out quite a bit on our OSX end-to-end times as well, where we have a mixed pool of xserves and minis doing builds and tests.

[Figure: OSX end-to-end time]

Simply selecting the right machine for the job reduced our end-to-end time from 3h12 to 2h13, again almost an hour improvement, or 30% better.
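The machine selection described above might look something like the sketch below. It is not the code we run: the `Slave` class, its `speed` rating, and the choice to rank speed ahead of object directory freshness are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slave:
    name: str
    speed: int  # hypothetical rating, higher means faster hardware
    last_build_time: Optional[float]  # unix time of the last build in its objdir, None if never


def pick_slave(available):
    """Prefer the fastest machine; break ties with the most recent object directory."""

    def key(slave):
        # A machine that has never built this branch sorts behind any that has.
        recency = slave.last_build_time if slave.last_build_time is not None else float("-inf")
        return (slave.speed, recency)

    return max(available, key=key)


pool = [
    Slave("mini-01", speed=1, last_build_time=1267000000.0),
    Slave("xserve-01", speed=2, last_build_time=None),
    Slave("xserve-02", speed=2, last_build_time=1266990000.0),
]
chosen = pick_slave(pool)
print(chosen.name)
```

Here both xserves outrank the mini on speed, and the one with an existing object directory wins the tie, since it can do an incremental build.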

What’s next?

We have 25 Linux machines that should be coming into our production pools this week.

We’ll continue to monitor the end-to-end and wait times over the next weeks to see how everything is behaving. One thing I’m watching out for is that having faster builds means we can produce more builds in less time…which means more builds to test! Without enough machines running tests, we may end up making wait times and therefore our end-to-end times worse!

We’ve already begun work on handling this. Our plan is to start doing the unittest runs on the talos hardware... but that’s another post!