The amazing RelEng Jacuzzis Posted:
RelEng has jacuzzis???
On Tuesday, we enabled the first of our "jacuzzis" in production, and initial results look great.
A few weeks ago, Ben blogged about some initial experiments with running builds on smaller pools of machines ("hot tubs", get it? we've since renamed them as "jacuzzis"). His results confirmed glandium's findings on the effectiveness (or lack thereof!) of incremental builds on mozilla-inbound.
The basic idea behind smaller pools of workers is that by restricting which machines are eligible to run jobs, you get much faster incremental builds since you have more recent checkouts, object dirs, etc.
Additionally, we've made some improvements to how we initialize mock environments. We don't reset and re-install packages into the mock chroot if the previous package list is the same as the new package list. This works especially well with jacuzzis, as we can arrange for machines to run jobs with the same package lists.
On Tuesday we enabled jacuzzis for some build types on b2g-inbound: hamachi device builds, and opt/debug ICS emulator builds.
We've dropped build times by more than 50%!
The spikes earlier this morning look like they're caused by running on fresh spot instances. When spot nodes first come online, they have no previous state, and so their first builds will always be slower. The machines should stay up most of the day, so you really only have to pay this cost once per day.
For the B2G emulator builds, this means we're getting tests started earlier and therefore get much faster feedback as to the quality of recent patches.
I'm super happy with these results. What's next? Well, turning on MOAR JACUZZIS for MOAR THINGS! Additionally, having fewer build types per machine means our disk footprint is much lower, and we should be able to use local SSDs for builds on AWS.
As usual, all our work has a tracking bug: bug 970738
There are three major pieces involved in pulling this together: the jacuzzi allocations themselves, buildbot integration, and finally AWS integration.
Ben has published the allocations here: http://jacuzzi-allocator.pub.build.mozilla.org/v1/
Each builder (or job type) has a specific list of workers associated with it. Ben has been working on ways of automatically managing these allocations so we don't need to tune them by hand too much.
The bulk of the work required was in buildbot. Previous to jacuzzis, we had several large pools of workers, each capable doing any one of hundreds of different job types. Each builder in buildbot has each of the workers in a pool listed as able to do that job. We wanted to avoid having to reconfigure buildbot every time we needed to change jacuzzi allocations, which is why we decided to put the allocations in an external service.
There are two places where buildbot fetches the allocation data: nextSlave functions and our builder prioritizing function. The first is straightforward, and was the only place I was expecting to make changes. These nextSlave functions are called with a list of available machines, and the function's job is to pick one of the connected machines to do the job. It was relatively simple to add support for this to buildbot. The need to handle latter case, modifying our builder prioritization, didn't become apparent until testing...The reasoning is a bit convoluted, so I'll explain below for those interested.
Now that we had buildbot using the right workers, we needed to make sure that we were starting those workers in Amazon!
The gory details of build prioritizations
We have a function in buildbot which handles a lot of the prioritization of the job queue. For example, pending jobs for mozilla-central will get priority over jobs for any of the twigs, like ash or birch. Older jobs also tend to get priority over newer jobs. The function needs to return the list of builders in priority sorted order. Buildbot then processes each builder in turn, trying to assign pending jobs to any idle workers.
There are two factors which make this function complicated: each buildbot master is doing this prioritization independently of the others, and workers are becoming idle while buildbot is still processing the sorted list of builders. This second point caused prioritization to be broken (bug 659222) for a long time.
Imagine a case where you have 3 pending jobs (A, B, C), all for the same set of workers (W1, W2, W3). Job A is the most important, job C is the least. All the workers are busy. prioritizeBuilders has sorted our list of builders, and buildbot looks at A first. No workers are available, so it moves onto B next. Still no free workers. But now worker W1 connects, and buildbot examines job C, and finds there are available workers (W1). So job C buds in line and gets run before jobs A or B.
Our fix for this was to maintain a list of pending jobs for each set of available workers, and then discard all but the most important pending job for each worker set. In our example, we would see that all 3 pending jobs have the same worker set (W1, W2, W3), and so would temporarily ignore pending jobs B and C. This leaves buildbot only job A to consider. In our example above, it would find no available workers and stop processing. When W1 becomes available, it triggers another prioritization run, and again job A is the sole job under consideration and gets the worker.
Unfortunately, this behaviour conflicted with what we were trying to do with jacuzzis. Imagine in the same examble above we have jacuzzis allocated so that W1 is allocated to only do jobs of type C. If W1 is the only available worker, and C is getting discarded each time the prioritization is done, we're not making any forward progress. In fact, we've triggered a bit of an infinite loop, since currently we trigger another round of prioritizing/job assignments if we have available workers and have temporarily discarded lower priority jobs.
The fix was to integrate the jacuzzi allocations into this prioritization logic as well. I'm a little concerned about the runtime impact of this, since we need to query the allocated workers for every pending job type. One thing we're considering is to change the allocator to return the allocations as a single monolithic blob, rather than having per-job-type requests. Or, we could support both types.