Digging into Firefox update sizes


Update sizes up 20-37% since last year!

Mozilla relies on our automatic update infrastructure to make sure that our users are kept up to date with the latest, most secure and fastest browser.

Having smaller updates means users are able to get the latest version of Firefox more quickly.

Since Firefox 19.0 (released just over a year ago - February 16th, 2013) our complete update size for Windows has grown from 25.6MB to 30.9MB for Firefox 28.0 (released March 15th, 2014). That's a 20% increase in just one year. On OSX it's grown from 37.8MB to 47.8MB, a 26% increase.

Partial updates have similarly grown. For Windows, a user coming from 27.0.1 to 28.0 would receive a 14.3MB update compared to a 10.4MB update going from 18.0.2 to 19.0, an increase of 37.5%.

This means users are spending 20-37% longer downloading updates than they did last year. Many of our users don't have fast, reliable internet, so an increase in update size makes it even harder for them to stay up to date. In addition, this size increase translates directly into bandwidth costs for Mozilla. All else being equal, we're now serving 20-37% more data from our CDNs for Firefox updates this year compared to last year.

update-sizes.png

Reducing update size

How can we reduce the update size? There are a few ways:

  1. Make sure we're serving partial updates to as many users as possible, and that these updates are applied properly. More analysis is needed, but it appears that only roughly half of our users are updating via partial updates.
  2. Reduce the amount of code we ship in the update. This could mean more features and content are distributed at runtime as needed. This is always a tricky trade-off to make between having features available for all users out of the box, and introducing a delay the first time the user wants to use a feature that requires remote content. It also adds complexity to the product.
  3. Change how we generate updates. This is going to be the focus of the rest of my post.

Smaller updates are more better

There are a few techniques I know of to reduce our update sizes:

  • Use xz compression instead of bzip2 compression inside the MAR files (bug 641212). xz generally gets better compression ratios at the cost of using more memory.
  • Use courgette instead of bsdiff for generating the binary diff between .exe and .dll files (bug 504624). Courgette is designed specifically for diffing executables, and generates much smaller patches than bsdiff.
  • Handle omni.ja files more effectively (bug 772868). omni.ja files are essentially zip files and use zip-style compression, which means each member of the archive is compressed individually. This is problematic for two reasons: it makes generating binary diffs between omni.ja files much less effective, and it makes compressing the omni.ja file with bzip2 or xz less effective. You get far better results packing files into a zip file with 0 compression and compressing the result with xz afterwards (a rough sketch of this comparison follows this list). Similarly for diffs: the diff between two omni.ja files using no zip compression is much smaller than the diff between two omni.ja files using the default zip -9 compression.
  • Don't use per-file compression inside the MAR file at all, rather compress the entire archive with xz. I simulated this by xz-compressing tar files of the MAR contents.
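Here's the rough sketch mentioned above. It's illustrative only, not the release tooling: it packs the same tree of files into a zip with deflated members and into a zip with stored members, then xz-compresses each and compares sizes. The directory name is made up, and it assumes the xz command is installed.

import os
import subprocess
import zipfile

def build_zip(src_dir, out_path, compression):
    # pack every file under src_dir into a zip, preserving relative paths
    zf = zipfile.ZipFile(out_path, "w", compression)
    for root, _, files in os.walk(src_dir):
        for name in files:
            full = os.path.join(root, name)
            zf.write(full, os.path.relpath(full, src_dir))
    zf.close()

def xz_size(path):
    # size of the file after running it through "xz -9e"
    return len(subprocess.check_output(["xz", "-9e", "--stdout", path]))

src = "omni-unpacked"  # hypothetical directory holding the unpacked omni.ja contents
build_zip(src, "omni-deflate.ja", zipfile.ZIP_DEFLATED)  # zip-style member compression
build_zip(src, "omni-store.ja", zipfile.ZIP_STORED)      # members stored uncompressed

print("deflated members, then xz: %d bytes" % xz_size("omni-deflate.ja"))
print("stored members, then xz:   %d bytes" % xz_size("omni-store.ja"))

The stored-members archive compresses much better because xz gets to find redundancy across the whole file set instead of working on already-compressed members.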

27% smaller complete updates

complete-updates.png

We can see that using xz alone saves about 10.9%. There's not a big difference between xz -6 and xz -9e, only a few kb generally. ("xz -6" and "xz -9e" series in the chart)

Disabling zip compression in the omni.ja files and using the standard bzip2 compression saves about 9.7%. ("zip0 .ja" in the chart)

Combining xz compression with the above yields a 24.8% saving, which is 7.6MB. ("zip0 .ja, xz -9e" in the chart)

Finally, disabling zip compression for omni.ja, dropping per-file compression inside the MAR, and compressing the entire archive at once yields a 27.2% saving, or 8.4MB.

66% smaller partial updates

partial-updates.png

Very similar results here for xz: an 8.4% saving with xz -9e.

Disabling zip compression in the omni.ja files has a much bigger impact for partial updates because we're able to produce a much smaller binary diff. This method alone saves 42%, or 6.0MB.

Using courgette for executable files yields a 19.1% savings. ("courgette" in the chart)

Combining courgette for executable files, no zip level compression, and per-file xz compression reduces the partial update by 61%. ("courgette, zip0 .ja, xz -9e" in the chart)

And if we compress the entire archive at once rather than per-file, we can reduce the update by 65.9%. ("courgette, zip0 .ja, tar, xz -9e" in the chart)

Some notes on my methodology: I'm focusing on 32-bit Windows, en-US only. Most of the optimizations, with the exception of courgette, are applicable to other platforms. I'm measuring the total size of the files inside the MAR file, rather than the size of the MAR file itself. The MAR file format overhead is small, only a few kilobytes, so this doesn't significantly impact the results, and significantly simplifies the analysis.

Finally, the code I used for analysis is here.


The amazing RelEng Jacuzzis


RelEng has jacuzzis???

On Tuesday, we enabled the first of our "jacuzzis" in production, and initial results look great.

A few weeks ago, Ben blogged about some initial experiments with running builds on smaller pools of machines ("hot tubs", get it? we've since renamed them as "jacuzzis"). His results confirmed glandium's findings on the effectiveness (or lack thereof!) of incremental builds on mozilla-inbound.

The basic idea behind smaller pools of workers is that by restricting which machines are eligible to run jobs, you get much faster incremental builds since you have more recent checkouts, object dirs, etc.

Additionally, we've made some improvements to how we initialize mock environments. We don't reset and re-install packages into the mock chroot if the previous package list is the same as the new package list. This works especially well with jacuzzis, as we can arrange for machines to run jobs with the same package lists.
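The idea is roughly this - a minimal sketch, not our actual code, and the state file location and function name are made up:

import json
import os
import subprocess

STATE_FILE = "/builds/mock-last-packages.json"  # hypothetical location

def ensure_mock_packages(chroot, packages):
    wanted = sorted(packages)
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            if json.load(f) == wanted:
                # same package list as the previous job: skip the expensive reset
                return
    subprocess.check_call(["mock", "-r", chroot, "--init"])
    subprocess.check_call(["mock", "-r", chroot, "--install"] + wanted)
    with open(STATE_FILE, "w") as f:
        json.dump(wanted, f)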

On Tuesday we enabled jacuzzis for some build types on b2g-inbound: hamachi device builds, and opt/debug ICS emulator builds.

jacuzzi-results.png

We've dropped build times by more than 50%!

The spikes earlier this morning look like they're caused by running on fresh spot instances. When spot nodes first come online, they have no previous state, and so their first builds will always be slower. The machines should stay up most of the day, so you really only have to pay this cost once per day.

For the B2G emulator builds, this means we're getting tests started earlier and therefore get much faster feedback as to the quality of recent patches.

I'm super happy with these results. What's next? Well, turning on MOAR JACUZZIS for MOAR THINGS! Additionally, having fewer build types per machine means our disk footprint is much lower, and we should be able to use local SSDs for builds on AWS.

As usual, all our work has a tracking bug: bug 970738

Implementation details

There are three major pieces involved in pulling this together: the jacuzzi allocations themselves, buildbot integration, and finally AWS integration.

Jacuzzis

Ben has published the allocations here: http://jacuzzi-allocator.pub.build.mozilla.org/v1/

Each builder (or job type) has a specific list of workers associated with it. Ben has been working on ways of automatically managing these allocations so we don't need to tune them by hand too much.
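For illustration, fetching an allocation might look something like this. The exact URL layout and JSON shape here are assumptions, not a documented API:

import json
import urllib2

ALLOCATOR = "http://jacuzzi-allocator.pub.build.mozilla.org/v1"

def allocated_machines(builder_name):
    # returns the machines allocated to this builder, or [] if it has no jacuzzi
    url = "%s/builders/%s" % (ALLOCATOR, builder_name)
    data = json.load(urllib2.urlopen(url, timeout=10))
    return data.get("machines", [])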

Buildbot

The bulk of the work required was in buildbot. Prior to jacuzzis, we had several large pools of workers, each capable of doing any one of hundreds of different job types. Each builder in buildbot lists every worker in the pool as able to do that job. We wanted to avoid having to reconfigure buildbot every time we needed to change jacuzzi allocations, which is why we decided to put the allocations in an external service.

There are two places where buildbot fetches the allocation data: nextSlave functions and our builder prioritization function. The first is straightforward, and was the only place I was expecting to make changes. These nextSlave functions are called with a list of available machines, and the function's job is to pick one of the connected machines to do the job. It was relatively simple to add support for this to buildbot. The need to handle the latter case, modifying our builder prioritization, didn't become apparent until testing. The reasoning is a bit convoluted, so I'll explain below for those interested.
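To give a flavour of the nextSlave piece, here's a simplified sketch, not the actual patch. It reuses the hypothetical allocated_machines() helper from above, the attribute access on the slave objects is approximate for buildbot 0.8, and it falls back to the full pool if the allocator can't be reached or the builder has no allocation:

import random

def make_next_slave(builder_name):
    def next_slave(builder, available_slaves):
        try:
            allowed = set(allocated_machines(builder_name))
        except Exception:
            allowed = None  # allocator unreachable: behave as if unallocated
        if not allowed:
            # no jacuzzi for this builder: pick from the whole pool as before
            return random.choice(available_slaves) if available_slaves else None
        candidates = [s for s in available_slaves if s.slave.slavename in allowed]
        # if none of the allocated machines are free, return None and wait
        return random.choice(candidates) if candidates else None
    return next_slave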

AWS Support

Now that we had buildbot using the right workers, we needed to make sure that we were starting those workers in Amazon!

We adjusted our tools that monitor the job queue and start new EC2 instances to also check the jacuzzi allocations, and start the correct instances.

The gory details of build prioritizations

We have a function in buildbot which handles a lot of the prioritization of the job queue. For example, pending jobs for mozilla-central will get priority over jobs for any of the twigs, like ash or birch. Older jobs also tend to get priority over newer jobs. The function needs to return the list of builders in priority sorted order. Buildbot then processes each builder in turn, trying to assign pending jobs to any idle workers.

There are two factors which make this function complicated: each buildbot master is doing this prioritization independently of the others, and workers are becoming idle while buildbot is still processing the sorted list of builders. This second point caused prioritization to be broken (bug 659222) for a long time.

Imagine a case where you have 3 pending jobs (A, B, C), all for the same set of workers (W1, W2, W3). Job A is the most important, job C is the least. All the workers are busy. prioritizeBuilders has sorted our list of builders, and buildbot looks at A first. No workers are available, so it moves on to B next. Still no free workers. But now worker W1 connects, and buildbot examines job C, and finds there is an available worker (W1). So job C cuts in line and gets run before jobs A or B.

Our fix for this was to maintain a list of pending jobs for each set of available workers, and then discard all but the most important pending job for each worker set. In our example, we would see that all 3 pending jobs have the same worker set (W1, W2, W3), and so would temporarily ignore pending jobs B and C. This leaves buildbot only job A to consider. In our example above, it would find no available workers and stop processing. When W1 becomes available, it triggers another prioritization run, and again job A is the sole job under consideration and gets the worker.
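A toy version of that filtering step might look like this. The data structures are invented for the example; lower number means higher priority:

def filter_by_worker_set(pending_builders):
    # pending_builders: list of (priority, builder_name, worker_names) tuples
    best = {}
    for priority, name, workers in pending_builders:
        key = frozenset(workers)
        if key not in best or priority < best[key][0]:
            best[key] = (priority, name, workers)
    # only the winner for each worker set is considered this round; the rest
    # get another chance on the next prioritization pass
    return sorted(best.values())

pending = [
    (0, "A", ["W1", "W2", "W3"]),
    (1, "B", ["W1", "W2", "W3"]),
    (2, "C", ["W1", "W2", "W3"]),
]
print(filter_by_worker_set(pending))  # only job A survives this round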

Unfortunately, this behaviour conflicted with what we were trying to do with jacuzzis. Imagine that in the same example above, W1 is allocated to only run jobs of type C. If W1 is the only available worker, and C is getting discarded each time the prioritization is done, we're not making any forward progress. In fact, we've triggered a bit of an infinite loop, since currently we trigger another round of prioritizing/job assignments if we have available workers and have temporarily discarded lower priority jobs.

The fix was to integrate the jacuzzi allocations into this prioritization logic as well. I'm a little concerned about the runtime impact of this, since we need to query the allocated workers for every pending job type. One thing we're considering is to change the allocator to return the allocations as a single monolithic blob, rather than having per-job-type requests. Or, we could support both types.


AWS, networks, and burning trees


Help! The trees are on fire!

You may have noticed that the trees were closed quite a bit last week due to infrastructure issues that all stem from network flakiness. We're really sorry about that! RelEng and IT have been working very hard to stabilize the network.

Symptoms of the problem have been pretty varied, but the two most visible ones are:

  • bug 957502 - All trees closed due to timeouts as usw2 slaves download or clone
  • bug 964804 - jobs not being scheduled

We've been having more and more problems with our VPN tunnel to our Amazon infrastructure, particularly in the us-west-2 region. Prior to any changes we put in place last week, ALL traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center.

While debugging the scheduling bug early last week, we discovered that latency to our mysql database used for scheduling was nearly 500ms. No wonder scheduling jobs was slow!

Digging deeper - the network is in bad shape

Investigation of this latency revealed that one of the core firewalls deployed for RelEng traffic was overloaded, and that a major contributor to the firewall load was traffic to/from AWS over the VPN tunnels. We were consistently pushing around 1 gigabit/s to our private network inside Amazon. The extra load on the firewall required for the VPN encryption caused the latency to go up for all traffic passing through the firewall, even for traffic not bound for AWS!

Our next step was to do a traffic analysis to see how we could reduce the amount of traffic going over the VPN tunnel.

Michal was able to get the traffic analysis done, and his report indicated that the worst traffic offender was traffic coming from ftp.m.o. All of our test jobs pull the builds, tests and crash symbols from ftp. With the increase in push load, more types of jobs, and more tests, our traffic from ftp has really exploded in the past months. Because all of our traffic was going over the VPN tunnel, this put a huge load on the VPN component of the firewall.

Firefighting

Since all of the content on ftp is public, we decided we could route traffic to/from ftp over the public internet rather than our VPN tunnel, provided we could ensure file integrity at the same time. This required a few changes to our infrastructure:

  • Rail re-created all of our EC2 instances to have public IP addresses, in addition to the private IP addresses they already had. Without a public IP, Amazon can't route traffic to the public internet. You can set up extra instances to act as NAT gateways, but that's much more complicated and introduces another point of failure. (bug 965001)
  • We needed a new IP endpoint for ftp so that we could be sure that only SSL traffic was going over the public routes. Chris Turra set up ftp-ssl.m.o, and then I changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel (bug 965289).
  • I landed a change to mozharness to download files from ftp-ssl instead of ftp.

In addition, we also looked at other sources of traffic. The next highest source of traffic was hg.m.o, followed by pvtbuilds.m.o.

Ben quickly rolled out a fix to our test slaves to allow them to cache the gaia repository locally rather than re-cloning gaia each time (bug 964411). We were surprised to discover the gaia repository has grown to 1.2 GB, so this was a big win!

It was clear we would need to divert traffic to hg in a similar way to what we did for ftp. Unfortunately, adding a DNS/IP endpoint for hg isn't as simple as it was for ftp, so aki has been going through our code changing all our references from http://hg.mozilla.org to https://hg.mozilla.org (bug 960571). Once we've found and fixed all usages of unsecured access to hg, we can change the routing for hg traffic like we did for FTP.

Aki also patched up some of our FirefoxOS build configs to limit which device builds upload per-checkin (bug 966412), and reduce the amount of data we're sending back to pvtbuilds.m.o over the VPN tunnel.

Ted tracked down a regression in how we were packaging some files in our test zips which cut the size by about 100MB (bug 963651).

On Monday, Adam added some more capacity to the firewall, which should allow us to cope with the remaining load better.

State of the network

This week we're in much better shape than we were last week. If you look at traffic this week vs last week, you can see that traffic is down significantly (see far right of graph):

aws-traffic.png

and latency has been kept under control:

aws-ping.png

We're not out of the woods yet though - we're still seeing occasional traffic spikes close to our maximum. The good news is there's still more we can do to reduce or divert the traffic:

  • we're not done transitioning all FTP/HG traffic to public routes
  • there's still plenty of room to reduce test zip size by splitting them up and changing the compression format used (bug 917999)
  • we can use caching in S3 from the test machines to avoid having to download identical builds/tests from FTP multiple times

Extra special thanks also to Hal who has been keeping all of us focused and on track on this issue.


Blobber is live - upload ALL the things!


Last week without any fanfare, we closed bug 749421 - Allow test slaves to save and upload files somewhere. This has actually been working well for a few months now, it's just taken a while to close it out properly, and I completely failed to announce it anywhere. mea culpa!

This was a really important project, and deserves some fanfare! cue trumpets, parades and skywriters

The goal of this project was to make it easier for developers to get important data out of the test machines reporting to TBPL. Previously the only real output from a test job was the textual log. That meant if you wanted a screen shot from a failing test, or the dump from a crashing process, you needed to format it somehow into the log. For screen shots we would base64 encode a png image and print it to the log as a data URL!

With blobber running successfully for the last few months, it's now possible to upload extra files from your test runs on TBPL. Things like screen shots, minidump logs and zip files are already supported.

Getting new files uploaded is pretty straightforward. If the environment variable MOZ_UPLOAD_DIR is set in your test's environment, you can simply copy files there and they will be uploaded after the test run is complete. Links to the files are output in the log. e.g.

15:21:18     INFO -  (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
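From the test harness side, using it can be as simple as something like this - the screenshot filename is made up for the example:

import os
import shutil

upload_dir = os.environ.get("MOZ_UPLOAD_DIR")
if upload_dir:
    # anything copied here gets uploaded by blobber after the test run finishes,
    # and a link is printed in the log
    shutil.copy("failure-screenshot.png", upload_dir)
else:
    print("MOZ_UPLOAD_DIR not set; skipping artifact upload")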

Your thanks and praise should go to our awesome intern, Mihai Tabara, who did most of the work here.

Most test jobs are already supported; if you're unsure whether the job type you're interested in is supported, just search for MOZ_UPLOAD_DIR in the log on tbpl. If it's not there and you need it, please file a bug!


Valgrind now running per-push


This week we started running valgrind jobs per push (bug 946002) on mozilla-central and project branches (bug 801955).

We've been running valgrind jobs nightly on mozilla-central for years, but they were very rarely ever green. Few people looked at their results, and so they were simply ignored.

Recently Nicholas Nethercote has taken up the torch and put in a lot of hard work to get these valgrind jobs working again. They're now running successfully per-push on mozilla-central and project branches and on Try.

Thanks Nicholas! Happy valgrinding all!


Now using AWS Spot instances for tests


Release Engineering makes heavy use of Amazon's EC2 infrastructure. The vast majority of our Firefox builds for Linux and Android, as well as our Firefox OS builds, happen in EC2. We also do a ton of unit testing inside EC2.

Amazon offers a service inside EC2 called spot instances. This is a way for Amazon to sell off unused capacity by auction. You can place a bid for how much you want to pay by the hour for a particular type of VM, and if your price is more than the current market price, you get a cheap VM! For example, we're able to run tests on machines for $0.025/hr instead of the regular $0.12/hr on-demand price. We started experimenting with spot instances back in November.
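As a rough illustration of the bidding model (this is not our actual tooling, and the AMI, instance type and key name below are placeholders), placing a spot bid with the boto library looks something like:

import boto.ec2

conn = boto.ec2.connect_to_region("us-west-2")
requests = conn.request_spot_instances(
    price="0.025",              # our maximum bid, in $/hour
    image_id="ami-12345678",    # placeholder AMI id for a test worker
    count=1,
    type="one-time",
    instance_type="m3.medium",  # placeholder instance type
    key_name="releng-test",     # placeholder key pair
)
print("spot request id: %s" % requests[0].id)

As long as the market price stays below the bid, the instance keeps running and we pay the market price, not the bid.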

There are a few downsides to using spot instances however. One is that your VM can be killed at any moment if somebody else bids a higher price than yours. The second is that your instances can't (easily) use extra EBS volumes for persistent storage. Once the VM powers off, the storage is gone.

These issues posed different challenges for us. In the first case, we were worried about the impact that interrupted jobs would have on the tree. We wanted to avoid the case where a job runs on a spot instance, is interrupted because the market price changed, and is then retried a second time on another spot instance subject to the same termination. This required changes to two systems:

  • aws_watch_pending needed to know to start regular on-demand EC2 instances in the case of retried jobs. This has been landed and has been working well, but really needs the next piece to be complete.
  • buildbot needed to know to not pick a spot instance to run retried jobs. This work is being tracked in bug 936222. It turns out that we're not seeing too many spot instances being killed off due to market price [1], so this work is less urgent.

The second issue, the VM storage, turned out to be much more complicated to fix. We rely on puppet to make sure that VMs have consistent software packages and configuration files. Puppet requires per-host SSL certificates to be generated, and at Mozilla these certificates need to be signed by a central certificate authority. In our previous usage of EC2 we worked around this by puppetizing new instances on first boot, and saving the disk image for later use.

With spot instances, we essentially need to re-puppetize every time we create a new VM.

Having fresh storage on boot also impacts the type of jobs we can run. We're starting with running test jobs on spot instances, since there's no state from previous tests that is valuable for the next test.

Builds are more complicated, since we depend on the state of previous builds to have faster incremental builds. In addition, the cost of having to retry a build is much higher than it is for a test. It could be that the spot instances stay up long enough or that we end up clobbering frequently enough that this won't have a major impact on build times. Try builds are always clobbers though, so we'll be running try builds on spot instances shortly.

All this work is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=935683

Big props to Rail for getting this done so quickly. With all this great work, we should be able to scale better while reducing our costs.

[1]https://bugzilla.mozilla.org/show_bug.cgi?id=935533#c21


Next generation job scheduling


As coop mentioned, we had a really great brainstorming session on Tuesday about the kinds of things we'd like to do with job scheduling in the RelEng infrastructure.

Our idea is to implement a "job graph", which will be a representation of a set of jobs to run, and the dependencies between them. For example, right now we have a set of tests that are dependent on builds finishing, or l10n repacks that are dependent on the en-US nightly build finishing. These graphs are implicit right now in our buildbot configs, and are pretty inflexible, opaque and hard to test.

One of our design goals for any new system or improvements is to make this job graph explicit, and to have it checked into the tree. This has a few really nice features:

  • Makes it easier for developers to modify the set of jobs that run on their branch or push.
  • Other tools like try chooser and self-serve can use this information to control what jobs get run.
  • The sets of builds and tests running on branches follow merges. This is really helpful for our 6-week uplifts.
  • It will be possible to predict the set of builds and tests that would happen for a push in advance. This isn't possible right now without horrible hacks.

Our plan is to implement the graph parser and generator first so we can validate some of our assumptions, and make sure we can generate equivalent job graphs to what exists now. After we have that working, we can focus on integrating the new job graphs with the existing infrastructure.
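As a toy illustration of what an explicit job graph could look like - the job names and format here are invented for the example, not a proposed schema:

# jobs keyed by name, each listing the jobs it depends on
job_graph = {
    "build-linux64-opt": [],
    "build-linux64-debug": [],
    "mochitest-1": ["build-linux64-opt"],
    "xpcshell": ["build-linux64-opt"],
    "l10n-repack": ["build-linux64-opt"],
}

def schedulable(graph, finished):
    # jobs whose dependencies have all finished and which haven't run yet
    return sorted(job for job, deps in graph.items()
                  if job not in finished and all(d in finished for d in deps))

print(schedulable(job_graph, finished=set()))
# -> only the builds are runnable at first
print(schedulable(job_graph, finished={"build-linux64-opt"}))
# -> the tests and the l10n repack become runnable once the opt build finishes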


A tale of python profiling and optimizing


The Release Engineering infrastructure at Mozilla relies heavily on buildbot for much of our infrastructure. For various reasons we're running an older version based on buildbot 0.8.2, which we maintain in our own mercurial repository. We have many different masters with all sorts of different configurations.

To make sure that we don't break things too often, we wrote a tool called test-masters.sh that creates local versions of each unique master configuration and then runs a configuration check on it. Currently there are 20 unique master configurations to test, and it takes 4 minutes to run test-masters.sh on all of them on my machine. Recently sfink landed some changes to test all the masters in parallel, which brought the time down from a previously interminable 11 minutes.

Four minutes is a long time to wait! What's taking so much time?

My go-to tool for profiling python code is cProfile. I ended up writing a small script to do the equivalent of 'buildbot checkconfig':

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

cProfile.run("loadMaster(sys.argv[1])", "master.prof")

Running this against my buildbot master's master.cfg file will produce a master.prof file I can load to look at the profile.

>>> import pstats
>>> pstats.Stats("master.prof").sort_stats('time').print_stats(5)
Thu Nov  7 21:42:25 2013    master.prof

         26601516 function calls (24716688 primitive calls) in 439.756 seconds

   Ordered by: internal time
   List reduced from 1997 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  409.381  409.381  438.936  438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig)
   170046    3.907    0.000   10.706    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__)
   222809    3.745    0.000    4.124    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__)
        1    3.025    3.025   29.352   29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>)
   170046    2.385    0.000    6.033    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_
hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)

Looks like buildbot's loadConfig is taking a long time! Unfortunately we don't get any more detail than that from cProfile. To get line-by-line profiling info, I've started using kernprof. This requires a few changes to our setup. First, we don't want to run cProfile any more, so modify our test script like this:

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

#cProfile.run("loadMaster(sys.argv[1])", "master.prof")
loadMaster(sys.argv[1])

And since we want to get line-by-line profiling of buildbot's loadConfig, we need to annotate buildbot's code with the @profile decorator. In buildbot's master.py, I added @profile to loadConfig so it looks like this now:

@profile
def loadConfig(self, f, check_synchronously_only=False):
    """Internal function to load a specific configuration file. Any
    errors in the file will be signalled by raising an exception.
    <snip>
    """

Now run kernprof.py to get our profile:

kernprof.py -l -v ../profile_master.py master.cfg

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           17     17.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1            7      7.0      0.0          try:
   636         1     68783020 68783020.0     12.0              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        86781      6.5      0.0          for b in builders:
   786     13302        92920      7.0      0.0              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     42782768      6.2      7.5              for n in b.get('slavenames', []):
   790   6922612    449928915     65.0     78.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      2478517    186.3      0.4              if b['name'] in buildernames:

Wow! Line 790 is taking 78% of our runtime! What's going on?

If I look at my config, I see that I have 13,302 builders and 13,988 slaves configured. Each builder has a subset of slaves attached, but we're still doing 6,922,612 checks to see if each slave that's configured for the builder is configured as a top-level slave. slavenames happens to be a list, so each check does a full scan of the list to see if the slave exists or not!
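To put some rough numbers on that, here's a quick illustrative comparison of a worst-case membership check against a list versus a set of about the same size as slavenames:

import timeit

setup = 'names = ["slave%05d" % i for i in range(14000)]'

# 1000 worst-case lookups against the list: scans all 14,000 entries each time
print(timeit.timeit('"slave13999" in names', setup=setup, number=1000))

# the same 1000 lookups against a set: a hash probe per lookup
print(timeit.timeit('"slave13999" in names', setup=setup + "; names = set(names)",
                    number=1000))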

A very simple patch fixes this:

diff --git a/master/buildbot/master.py b/master/buildbot/master.py
--- a/master/buildbot/master.py
+++ b/master/buildbot/master.py
@@ -761,19 +761,19 @@ class BuildMaster(service.MultiService):
         errmsg = "c['schedulers'] must be a list of Scheduler instances"
         assert isinstance(schedulers, (list, tuple)), errmsg
         for s in schedulers:
             assert interfaces.IScheduler(s, None), errmsg
         assert isinstance(status, (list, tuple))
         for s in status:
             assert interfaces.IStatusReceiver(s, None)

-        slavenames = [s.slavename for s in slaves]
+        slavenames = set(s.slavename for s in slaves)

         # convert builders from objects to config dictionaries
         builders_dicts = []
         for b in builders:
             if isinstance(b, BuilderConfig):
                 builders_dicts.append(b.getConfigDict())
             elif type(b) is dict:
                 builders_dicts.append(b)

Now our membership checks are against a set instead of a list - an average O(1) operation instead of O(n). Let's re-run our profile with this patch:

File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py
Function: loadConfig at line 621
Total time: 109.885 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           30     30.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1           13     13.0      0.0          try:
   636         1     46268025 46268025.0     42.1              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        56199      4.2      0.1          for b in builders:
   786     13302        60438      4.5      0.1              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     27430456      4.0     25.0              for n in b.get('slavenames', []):
   790   6922612     29028880      4.2     26.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      1286628     96.7      1.2              if b['name'] in buildernames:
   794                                                           raise ValueError("duplicate builder name %s"
   795                                                                            % b['name'])

We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!

I've landed the resulting patch in our repo. I encourage you to upgrade!


Archive of mozilla-inbound builds for regression hunting


Nick Thomas, of recent Balrog fame, has also been hard at work improving life for our intrepid regression hunters.

Sometimes we don't detect problems in Firefox until many weeks or months after code has landed on mozilla-inbound. We do builds for almost every checkin that happens, but end up having to delete them after a month to keep disk usage under control. Unfortunately that means that if a problem is detected while in the Beta cycle or even after release, we're only left with nightly builds for regression hunting, and so the regression window can be up to 24 hours. A lot can change in 24 hours!

Nick's hard work means we are keeping a full release cycle's worth of mozilla-inbound builds archived on S3. We're focused on mozilla-inbound for now, but I imagine we'll be adding other integration branches shortly.

The archive is currently available at http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/

Please send Nick your thanks, and if you run into any issues, please file bugs blocking bug 765258.


Dealing with bursty load


John has been doing a regular series of posts about load on our infrastructure, going back years. Recently, GPS also posted about infrastructure efficiency here. What I've always found interesting is the bursty nature of the load at different times throughout the day, so I thought people might find the following data useful.

Every time a developer checks in code, we schedule a huge number of build and test jobs. Right now on mozilla-central, every checkin generates just over 10 CPU days' worth of work. How many jobs do we end up running over the course of a day? How much time do they take to run?

I created a few graphs to explore these questions - these get refreshed hourly, so should serve as a good dashboard to monitor recent load on an ongoing basis.

# of jobs running over time

running jobs: This shows how many jobs are running at any given hour over the past 7 days.

# of cpu hours run over time

compute times: This shows how many CPU hours were spent on jobs that started in any given hour over the same time period.

A few observations about the range in the load:

  • Weekends can be really quiet! Our peak weekday load is about 20x the lowest weekend load.
  • Weekday load varies by about 2x within any given day.

Over the years RelEng have scaled our capacity to meet peak demand. After all, we're trying to give the best experience to developers; making people wait for builds or tests to start is frustrating and can also really impact the productivity of our development teams, and ultimately the quality of Firefox/Fennec/B2G.

Scaling physical capacity takes a lot of time and advance planning. Unfortunately this means that many of our physical machines are idle for non-peak times, since there isn't any work to do. We've been able to make use of some of this idle time by doing things like run fuzzing jobs for our security team.

Scaling quickly for peak load is something that Amazon is great for, and we've been taking advantage of scaling build/test capacity in EC2 for nearly a year. We've been able to move a significant amount of our Linux build and test infrastructure to run inside Amazon's EC2 infrastructure. This means we can start machines in response to increasing load, and shut them down when they're not needed. (Aside - not all of our reporting tools know about this, which can cause some reporting confusion because it appears as if the slaves are broken or not taking work when in reality they're shut off because of low demand - we're working on fixing that up!)

Hopefully this data and these dashboards help give some context about why our infrastructure isn't always 100% busy 100% of the time.
