RelEng 2015 (Part 2)

This is the second part of my report on the RelEng 2015 conference that I attended a few weeks ago. Don't forget to read part 1!

I've put up links to the talks whenever I've been able to find them.

Defect Prediction

Davide Giacomo Cavezza gave a presentation about his research into defect prediction in rapidly evolving software. The question he was trying to answer was if it's possible to predict if any given commit is defective. He compared static models against dynamic models that adapt over the lifetime of a project. The models look at various metrics about the commit such as:

  • Number of files the commit is changing
  • Entropy of the changes
  • The amount of time since the files being changed were last modified
  • Experience of the developer

His simulations showed that a dynamic model yields superior results to a static model, but that even a dynamic model doesn't give sufficiently accurate results.
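To make the idea concrete, here's a rough sketch of how per-commit features like those listed above might be computed and fed to a model. The field names and the model choice are my own illustration, not Cavezza's actual implementation.

import math
from dataclasses import dataclass

@dataclass
class Commit:
    files_changed: int            # number of files the commit touches
    lines_per_file: list          # lines changed in each file
    hours_since_last_touch: float # time since the files were last modified
    author_prior_commits: int     # crude proxy for developer experience

def change_entropy(lines_per_file):
    """Shannon entropy of how the change is spread across files."""
    total = sum(lines_per_file)
    if total == 0:
        return 0.0
    probs = [n / total for n in lines_per_file if n > 0]
    return -sum(p * math.log2(p) for p in probs)

def features(c):
    return [c.files_changed,
            change_entropy(c.lines_per_file),
            c.hours_since_last_touch,
            c.author_prior_commits]

# A static model is trained once on historical commits; a dynamic model is
# periodically retrained (or updated online) as new commits and their defect
# labels arrive over the life of the project.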

I wonder about adopting something like this at Mozilla. Would a warning that a given commit looks particularly risky be a useful signal from automation?

Continuous Deployment and Schema Evolution

Michael de Jong spoke about a technique for coping with SQL schema evolution in continuously deployed applications. There are a few major issues with changing SQL schemas in production, including:

  • schema changes typically block other reads/writes from occurring
  • application logic needs to be synchronized with the schema change. If the schema change takes non-trivial time, which version of the application should be running in the meanwhile?

Michael's solution was to essentially create forks of all tables whenever the schema changes, and to identify each by a unique version. Applications need to specify which version of the schema they're working with. There's a special DB layer that manages the double writes to both versions of the schema that are live at the same time.
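Here's a toy sketch of the dual-write idea, with hypothetical table and version names; the real system does this at the SQL layer rather than in application code.

# Two live versions of a "users" table; the DB layer fans every write out
# to both until the old version is retired. Names are hypothetical.
TABLES = {
    "users_v1": {},   # id -> {"name": ...}
    "users_v2": {},   # id -> {"name": ..., "email": ...}
}

def upgrade_row(row):
    # v1 -> v2: new column gets a default value
    return {"name": row["name"], "email": row.get("email", "")}

def downgrade_row(row):
    # v2 -> v1: drop the new column
    return {"name": row["name"]}

def write_user(user_id, row, schema_version):
    """The application writes against the schema version it knows about;
    the layer keeps every live version in sync."""
    for table in TABLES:
        if table == schema_version:
            TABLES[table][user_id] = dict(row)
        elif table == "users_v2":
            TABLES[table][user_id] = upgrade_row(row)
        else:
            TABLES[table][user_id] = downgrade_row(row)

# Old application code still writes v1-shaped rows, new code writes v2 rows,
# and both tables stay consistent while the schema change is in flight.
write_user(1, {"name": "alice"}, "users_v1")
write_user(2, {"name": "bob", "email": "bob@example.com"}, "users_v2")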

Securing a Deployment Pipeline

Paul Rimba spoke about securing deployment pipelines.

One of the areas of focus was on the "build server". He started with the observation that a naive implementation of a Jenkins infrastructure has the master and worker on the same machine, and so the workers have the opportunity to corrupt the state of the master, or of other job types' workspaces. His approach to hardening this part of the pipeline was to isolate each job type into its own docker worker. He also mentioned using a microservices architecture to isolate each task into its own discrete component - easier to secure and reason about.

We've been moving this direction at Mozilla for our build infrastructure as well, so it's good to see our decision corroborated!

Makefile analysis

Shurui Zhou presented some data about extracting configuration data from build systems. Her goal was to determine if all possible configurations of a system were valid.

The major problem here is make (of course!). Expressions such as $(shell uname -r) make static analysis impossible.

We had a very good discussion about build systems following this, and how various organizations have moved themselves away from using make. Somebody described these projects as requiring a large amount of "activation energy". I thought this was a very apt description: lots of upfront cost for little visible change.

However, people unanimously agreed that make was not optimal, and that the benefits of moving to a more modern system were worth the effort.

RelEng 2015 (part 1)

Last week, I had the opportunity to attend RelEng 2015 - the 3rd International Workshop of Release Engineering. This was a fantastic conference, and I came away with lots of new ideas for things to try here at Mozilla.

I'd like to share some of my thoughts and notes I took about some of the sessions. As of yet, the speakers' slides aren't collected or linked to from the conference website. Hopefully they'll get them up soon! The program and abstracts are available here.

For your sake (and mine!) I've split up my notes into a few separate posts. This post covers the introduction and keynote.


"Continuous deployment" of web applications is basically a solved problem today. What remains is for organizations to adopt best practices. Mobile/desktop applications remain a challenge.

Cisco relies heavily on collecting and analyzing metrics to inform their approach to software development. Statistically speaking, quality is the best driver of customer satisfaction. There are many aspects to product quality, but new lines of code introduced per release gives a good predictor of how many new bugs will be introduced. It's always challenging to find enough resources to focus on software quality; being able to correlate quality to customer satisfaction (and therefore market share, $$$) is one technique for getting organizational support for shipping high quality software. Common release criteria such as bugs found during testing, or bug fix rate, are used to inform stakeholders as to the quality of the release.

Introductory Session

Bram Adams and Foutse Khomh kicked things off with an overview of "continuous deployment" over the last 5 years. Back in 2009 we were already talking about systems where pushing to version control would trigger tens of thousands of tests, and do canary deployments up to 50 times a day.

Today we see companies like Facebook demonstrating that continuous deployment of web applications is basically a solved problem. Many organizations are still trying to implement these techniques. Mobile [and desktop!] applications still present a challenge.


Keynote

Pete Rotella from Cisco discussed how he and his team measured and predicted release quality for various projects at Cisco. His team is quite focused on data and analytics.

Cisco has relatively long release cycles compared to what we at Mozilla are used to now. They release 2-3 times per year, with each release representing approximately 500kloc of new code. Their customers really like predictable release cycles, and also don't like releases that are too frequent. Many of their customers have their own testing / validation cycles for releases, and so are only willing to update for something they deem critical.

Pete described how he thought software projects had four degrees of freedom in which to operate, and how quality ends up being the one sacrificed most often in order to compensate for constraints in the others:

  • resources (people / money): It's generally hard to hire more people or find room in the budget to meet the increasing demands of customers. You also run into the mythical man month problem by trying to throw more people at a problem.

  • schedule (time): Having standard release cycles means organizations don't usually have a lot of room to push out the schedule so that features can be completed properly.

    I feel that at Mozilla, the rapid release cycle has helped us out to some extent here. The theory is that if your feature isn't ready for the current version, it can wait for the next release which is only 6 weeks behind. However, I do worry that we have too many features trying to get finished off in aurora or even in beta.

  • content (features): Another way to get more room to operate is to cut features. However, it's generally hard to cut content or features, because those are what customers are most interested in.

  • quality: Pete believes this is where most organizations steal resources from to make up for people/schedule/content constraints. It's a poor long-term play, and despite "quality is our top priority" being the Official Party Line, most organizations don't invest enough here. What's working against quality?

    • plethora of releases: lots of projects / products / special requests for releases. Attempts to reduce the # of releases have failed on most occasions.
    • monetization of quality is difficult. Pete suggests framing it as the cost of a poor quality release: how many customers will we lose with a buggy release?
    • having RelEng and QA embedded in Engineering teams is a problem; they should be independent organizations so that their recommendations can have more weight.
    • "control point exceptions" are common. e.g. VP overrides recommendations of QA / RelEng and ships the release.

Why should we focus on quality? Pete's metrics show that it's the strongest driver of customer satisfaction. Your product's customer satisfaction needs to be more than 4.3/5 to get more than marginal market share.

How can RelEng improve metrics?

  • simple dashboards
  • actionable metrics - people need to know how to move the needle
  • passive - use existing data. everybody's stretched thin, so requiring other teams to add more metadata for your metrics isn't going to work.
  • standardized quality metrics across the company
  • informing engineering teams about risk
  • correlation with customer experience.

Interestingly, handling the backlog of bugs has minimal impact on customer satisfaction. In addition, there's substantial risk introduced whenever bugs are fixed late in a release cycle. There's an exponential relationship between new lines of code added and # of defects introduced, and therefore customer satisfaction.

Another good indicator of customer satisfaction is the number of "Customer found defects" - i.e. the number of bugs found and reported by their customers vs. bugs found internally.

Pete's data shows that if they can find more than 80% of the bugs in a release prior to it being shipped, then the remaining bugs are very unlikely to impact customers. He uses lines of code added for previous releases, and historical bug counts per version to estimate number of bugs introduced in the current version given the new lines of code added. This 80% figure represents one of their "Release Criteria". If less than 80% of predicted bugs have been found, then the release is considered risky.
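Here's a worked (and deliberately simplified) version of that criterion with made-up numbers; Pete's actual model uses historical per-release data, and the real relationship between new code and defects is exponential rather than linear.

# Made-up numbers, linear for simplicity.
historical_bugs_per_kloc = 2.4   # from previous releases (illustrative)
new_kloc_this_release = 500      # ~500kloc of new code per release

predicted_bugs = historical_bugs_per_kloc * new_kloc_this_release   # 1200
bugs_found_in_testing = 900

fraction_found = bugs_found_in_testing / predicted_bugs             # 0.75

# Release criterion: find at least 80% of the predicted bugs before shipping.
if fraction_found < 0.80:
    print("risky release: only %.0f%% of predicted bugs found" % (fraction_found * 100))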

Another "Release Criteria" Pete discussed was the weekly rate of fixing bugs. Data shows that good quality releases have the weekly bug fix rate drop to 43% of the maximum rate at the end of the testing cycle. This data demonstrates that changes late in the cycle have a negative impact on software quality. You really want to be fixing fewer and fewer bugs as you get closer to release.

I really enjoyed Pete's talk! There are definitely a lot of things to think about, and how we might apply them at Mozilla.

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates



All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!


Funsize

When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.
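In rough pseudocode, the flow looks something like the sketch below. Every helper here is a hypothetical stand-in for the Pulse consumer, the TaskCluster task, and the Balrog submission; it's not funsize's real API.

def on_nightly_build(message):
    """Called for each Pulse notification about a finished nightly build."""
    # Hypothetical helpers throughout; shape only, not real code.
    from_mar = previous_nightly_complete_mar(message)    # yesterday's complete MAR
    to_mar = complete_mar_url(message)                   # today's complete MAR

    # Schedule a TaskCluster task that downloads both complete MARs,
    # generates the partial MAR, and uploads the result.
    task = schedule_partial_generation(from_mar, to_mar)

    # Once the partial exists, submit it to Balrog so update queries
    # can start serving it.
    submit_to_balrog(task)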

The current status of funsize is that we're using it to generate partial updates for nightly builds, but these partials aren't yet published to the regular nightly update channel.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... FTP is going away... in its current incarnation, at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind FTP will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the TaskCluster index, in addition to doing parallel uploads to FTP. We're aiming to shut off most uploads to FTP this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks, partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.


In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for Windows Firefox builds. We'll be working on making these virtualized instances more performant and capable of large-scale automation before we roll them out into production.

Puppet on windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different tool chain, Active Directory and Group Policy Object, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.
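The "content-addressable" part just means files are keyed by their digest rather than by name. A minimal sketch of that idea (the digest algorithm and manifest layout here are assumptions):

import hashlib

def tooltool_key(path, algorithm="sha512"):
    """Key a file by the digest of its contents; a consumer can fetch the
    bytes from any store or cache and still verify what it received."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# A manifest entry then records the filename, size, and digest, and the
# build/test job downloads by digest and checks it before using the file.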

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.


SETA

Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to determine which tests have been catching the most regressions, and prioritize running those. Other tests are run less frequently.
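Roughly, the idea looks like this (the suite names and numbers are made up):

# Made-up data: how many regressions each suite has caught historically.
regressions_caught = {
    "mochitest-1": 42,
    "xpcshell": 17,
    "reftest": 9,
    "marionette": 2,
}

ALWAYS_RUN = 2    # top suites that run on every push (illustrative)
EVERY_NTH = 10    # the rest run only on every Nth push

ranked = sorted(regressions_caught, key=regressions_caught.get, reverse=True)

def suites_for_push(push_number):
    """High-value suites run every time; low-value suites run periodically
    so the regressions they'd catch still surface eventually."""
    suites = list(ranked[:ALWAYS_RUN])
    if push_number % EVERY_NTH == 0:
        suites += ranked[ALWAYS_RUN:]
    return suites

print(suites_for_push(7))    # ['mochitest-1', 'xpcshell']
print(suites_for_push(10))   # all four suites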

Diving into python logging

Python has a very rich logging system. It's very easy to add structured or unstructured log output to your python code, and have it written to a file, or output to the console, or sent to syslog, or to customize the output format.

We're in the middle of re-examining how logging works in mozharness to make it easier to factor-out code and have fewer mixins.

Here are a few tips and tricks that have really helped me with python logging:

There can be only one

Well, there can be only one logger with a given name. There is a special "root" logger with no name. Multiple getLogger(name) calls with the same name will return the same logger object. This is an important property because it means you don't need to explicitly pass logger objects around in your code. You can retrieve them by name if you wish. The logging module maintains a global registry of logger objects.

You can have multiple loggers active, each specific to its own module or even class or instance.

Each logger has a name, typically the name of the module it's being used from. A common pattern you see in python modules is this:

# in module foo.py
import logging
log = logging.getLogger(__name__)

This works because inside foo.py, __name__ is equal to "foo". So inside this module the log object is specific to this module.

Loggers are hierarchical

The names of the loggers form their own namespace, with "." separating levels. This means that if you have loggers called foo.bar and foo.baz, you can do things on logger foo that will impact both of the children. In particular, you can set the logging level of foo to show or ignore debug messages for both submodules.

# Let's enable all the debug logging for all the foo modules
import logging

logging.getLogger("foo").setLevel(logging.DEBUG)

Log messages are like events that flow up through the hierarchy

Let's say we have a module foo.widgets:

import logging
log = logging.getLogger(__name__)  # __name__ is "foo.widgets" here

def make_widget():
    log.debug("made a widget!")

When we call make_widget(), the code generates a debug log message. Each logger in the hierarchy has a chance to output something for the message, ignore it, or pass the message along to its parent.

The default configuration for loggers is to have their levels unset (or set to NOTSET). This means the logger will just pass the message on up to its parent. Rinse & repeat until you get up to the root logger.

So if the foo.widgets logger hasn't specified a level, the message will continue up to the foo logger. If the foo logger hasn't specified a level, the message will continue up to the root logger.

This is why you typically configure the logging output on the root logger; it typically gets ALL THE MESSAGES!!! Because this is so common, there's a dedicated method for configuring the root logger: logging.basicConfig()

This also allows us to use mixed levels of log output depending on where the messages are coming from:

import logging

# Enable debug logging for all the foo modules
logging.getLogger("foo").setLevel(logging.DEBUG)

# Configure the root logger to log only INFO calls, and output to the console
# (the default)
logging.basicConfig(level=logging.INFO)

# This will output the debug message (sent from the foo.widgets logger)
logging.getLogger("foo.widgets").debug("made a widget!")

If you comment out the setLevel(logging.DEBUG) call, you won't see the message at all.

exc_info is teh awesome

All the built-in logging calls support a keyword called exc_info, which, if it isn't false, causes the current exception information to be logged in addition to the log message. e.g.:

import logging

log = logging.getLogger(__name__)

try:
    assert False
except AssertionError:
    log.error("surprise! got an exception!", exc_info=True)

There's a special case for this, log.exception(), which is equivalent to log.error(..., exc_info=True)

Python 3.2 introduced a new keyword, stack_info, which will output the current stack along with the log message. Very handy for figuring out how you got to a certain point in the code, even if no exceptions have occurred!
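For example, a minimal, self-contained snippet like this prints the call stack that led to the log call:

import logging
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

def inner():
    # No exception here; stack_info=True still records how we got here.
    log.debug("checkpoint reached", stack_info=True)

def outer():
    inner()

outer()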

"No handlers found..."

You've probably come across this message, especially when working with 3rd party modules. What this means is that you don't have any logging handlers configured, and something is trying to log a message. The message has gone all the way up the logging hierarchy and fallen off of the chain (maybe I need a better metaphor).

import logging
log = logging.getLogger()
log.error("no log for you!")


will output:

No handlers could be found for logger "root"

There are two things that can be done here:

  1. Configure logging in your module with basicConfig() or similar

  2. Library authors should add a NullHandler at the root of their module to prevent this, as shown below. See the cookbook and this blog post for more details.
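For library authors, the NullHandler fix is just a couple of lines, typically at the top of the package (the package name here is just an example):

# e.g. in mylib/__init__.py
import logging
logging.getLogger(__name__).addHandler(logging.NullHandler())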

Want more?

I really recommend that you read the logging documentation and cookbook which have a lot more great information (and are also very well written!) There's a lot more you can do, with custom log handlers, different output formats, outputting to many locations at once, etc. Have fun!

Upcoming hotness from RelEng

To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.


Balrog

First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/


Funsize

Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

We're aiming to have nightly builds use funsize for partial update generation this quarter.

I'd really like to see us get away from the model where the "nightly build" job is responsible not only for the builds, but also for generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

S3 uploads from automation

This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

No reboots

Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

Release build promotion

Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

... and much more!

This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!

Gotta Cache 'Em All


Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.
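From the consumer side, the pattern is roughly "try the in-region cache first, fall back to the original file server". A hedged sketch, with made-up hostnames rather than the real proxxy configuration:

import urllib.request

CACHE_BASE = "http://cache.use1.example.internal"   # hypothetical in-region cache
ORIGIN_BASE = "https://ftp.example.org"             # hypothetical origin server

def fetch(path):
    """Try the regional HTTP cache first; fall back to the origin on any error."""
    last_error = None
    for base in (CACHE_BASE, ORIGIN_BASE):
        try:
            with urllib.request.urlopen(base + path, timeout=30) as resp:
                return resp.read()
        except Exception as e:
            last_error = e
    raise RuntimeError("could not fetch %s: %s" % (path, last_error))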

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There are a few more low hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is doing signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages between different platforms and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in the proxxy repository, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

B2G now building using unified sources

Last week, with the help of Ehsan and John, we finally enabled unified source builds for B2G devices.

As a result we're building device builds approximately 40% faster than before.

Between June 12th and June 17th, 50% of our successful JB emulator builds on mozilla-inbound finished in 97 minutes or less. Using unified sources for these builds reduced the 50th percentile of build times down to 60 minutes (from June 19th to June 24th).

To mitigate the risks of changes landing that break non-unified builds, we're also doing periodic non-unified builds for these devices.

As usual, all our work here was done in the open. If you're interested, read along in bug 950676, and bug 942167.

Do you enjoy building, debugging and optimizing build, test & release pipelines? Great, because we're hiring!

Limiting coalescing on the build/test farm

tl;dr - as of yesterday we've limited coalescing on all builds/tests to merge at most 3 pending jobs together

Coalescing (aka queue collapsing aka merging) has been part of Mozilla's build/test CI for a long, long time. Back in the days of Tinderbox, a single machine would do a checkout/build/upload loop. If there were more checkins while the build was taking place, well, they would get built on the next iteration through the loop.

Fast forward a few years later to our move to buildbot, and having pools of machines all able to do the same builds. Now we create separate jobs in the queue for each build for each push. However, we didn't always have capacity to do all these builds in a reasonable amount of time, so we left buildbot's default behaviour (merging all pending jobs together) enabled for the majority of jobs. This means that if there are pending jobs for a particular build type, the first free machine skips all but the most recent item on the queue. The skipped jobs are "merged" into the job that was actually run.

In the case that all builds and tests are green, coalescing is actually a good thing most of the time. It saves you from doing a bunch of extra useless work.

However, not all pushes are perfect (just see how often the tree is closed due to build/test failures), and coalescing makes bisecting the failure very painful and time consuming, especially in the case that we've coalesced away intermediate build jobs.

To try and find a balance between capacity and sane results, we've recently added a limit to how many jobs can be coalesced at once.

By rigorous statistical analysis:

@catlee     so it's easiest to pick a single upper bound for coalescing and go with that at first
@catlee     did you have any ideas for what that should be?
@catlee     I was thinking 3
edmorley|sheriffduty        catlee: that sounds good to me as a first go :-)
mshal       chosen by fair dice roll? :)
@catlee     1d4
bhearsum    Saving throw failed. You are dead.
philor      wfm

we've chosen 3 as the upper bound on the number of jobs we'll coalesce, and we can tweak this as necessary.

I hope this makes the trees a bit more manageable! Please let us know what you think!

As always, all our work is done in the open. See the bug with the patch here:

Digging into Firefox update sizes

Update sizes up 20-37% since last year!

Mozilla relies on our automatic update infrastructure to make sure that our users are kept up to date with the latest, most secure and fastest browser.

Having smaller updates means users are able to get the latest version of Firefox more quickly.

Since Firefox 19.0 (released just over a year ago - February 16th, 2013) our complete update size for Windows has grown from 25.6MB to 30.9MB for Firefox 28.0 (released March 15th, 2014). That's a 20% increase in just one year. On OSX it's grown from 37.8MB to 47.8MB, a 26% increase.

Partial updates have similarly grown. For Windows, a user coming from 27.0.1 to 28.0 would receive a 14.3MB update compared to a 10.4MB update going from 18.0.2 to 19.0, an increase of 37.5%.

This means it's taking users 20-37% longer to download updates than it did last year. Many of our users don't have fast reliable internet, so an increase in update size makes it even harder for them to stay up to date. In addition, this size increase translates directly into bandwidth costs to Mozilla. All else being equal, we're now serving 20-37% more data from our CDNs for Firefox updates this year compared to last year.


Reducing update size

How can we reduce the update size? There are a few ways:

  1. Make sure we're serving partial updates to as many users as possible, and that these updates are applied properly. More analysis is needed, but it appears that only roughly half of our users are updating via partial updates.
  2. Reduce the amount of code we ship in the update. This could mean more features and content are distributed at runtime as needed. This is always a tricky trade-off to make between having features available for all users out of the box, and introducing a delay the first time the user wants to use a feature that requires remote content. It also adds complexity to the product.
  3. Change how we generate updates. This is going to be the focus of the rest of my post.

Smaller updates are more better

There are a few techniques I know of to reduce our update sizes:

  • Use xz compression instead of bzip2 compression inside the MAR files (bug 641212). xz generally gets better compression ratios at the cost of using more memory. (A quick way to compare the two compressors yourself is sketched after this list.)
  • Use courgette instead of bsdiff for generating the binary diff between .exe and .dll files (bug 504624). Courgette is designed specifically for diffing executables, and generates much smaller patches than bsdiff.
  • Handle omni.ja files more effectively (bug 772868). omni.ja files are essentially zip files, and are using zip style compression. That means each member of the zip archive is individually compressed. This is problematic for two reasons: it makes generating binary diffs between omni.ja files much less effective, and it makes compressing the omni.ja file with bzip2 or xz less effective. You get far better results packing files into a zip file with 0 compression and using xz to compress it afterwards. Similarly for generating diffs, the diff between two omni.ja files using no zip compression is much smaller than the diff between two omni.ja files using the default zip -9 compression.
  • Don't use per-file compression inside the MAR file at all, rather compress the entire archive with xz. I simulated this by xz-compressing tar files of the MAR contents.
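If you want a feel for the bzip2-vs-xz difference on your own files, here's a quick comparison sketch using Python's standard library (this isn't the analysis script used for the numbers below):

import bz2
import lzma
import sys

def compare(path):
    data = open(path, "rb").read()
    original = len(data)
    bz = len(bz2.compress(data, 9))
    # preset 9 plus the "extreme" flag approximates "xz -9e"
    xz = len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME))
    print("original: %d bytes" % original)
    print("bzip2 -9: %d bytes (%.1f%% of original)" % (bz, 100.0 * bz / original))
    print("xz -9e:   %d bytes (%.1f%% of original)" % (xz, 100.0 * xz / original))

if __name__ == "__main__":
    compare(sys.argv[1])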

27% smaller complete updates


We can see that using xz alone saves about 10.9%. There's not a big difference between xz -6 and xz -9e, only a few kb generally. ("xz -6" and "xz -9e" series in the chart)

Disabling zip compression in the omni.ja files and using the standard bzip2 compression saves about 9.7%. ("zip0 .ja" in the chart)

Combining xz compression with the above yields a 24.8% saving, which is 7.6MB. ("zip0 .ja, xz -9e" in the chart)

Finally, disabling zip compression for omni.ja and per-file compression and compressing the entire archive at once yields a 27.2% saving, or 8.4MB.

66% smaller partial updates


Very similar results here for xz, 8.4% savings with xz -9e.

Disabling zip compression in the omni.ja files has a much bigger impact for partial updates because we're able to produce a much smaller binary diff. This method alone saves 42%, or 6.0MB.

Using courgette for executable files yields a 19.1% savings. ("courgette" in the chart)

Combining courgette for executable files, no zip level compression, and per-file xz compression reduces the partial update by 61%. ("courgette, zip0 .ja, xz -9e" in the chart)

And if we compress the entire archive at once rather than per-file, we can reduce the update by 65.9%. ("courgette, zip0 .ja, tar, xz -9e" in the chart)

Some notes on my methodology: I'm focusing on 32-bit Windows, en-US only. Most of the optimizations, with the exception of courgette, are applicable to other platforms. I'm measuring the total size of the files inside the MAR file, rather than the size of the MAR file itself. The MAR file format overhead is small, only a few kilobytes, so this doesn't significantly impact the results, and significantly simplifies the analysis.

Finally, the code I used for analysis is here.

The amazing RelEng Jacuzzis

RelEng has jacuzzis???

On Tuesday, we enabled the first of our "jacuzzis" in production, and initial results look great.

A few weeks ago, Ben blogged about some initial experiments with running builds on smaller pools of machines ("hot tubs", get it? we've since renamed them as "jacuzzis"). His results confirmed glandium's findings on the effectiveness (or lack thereof!) of incremental builds on mozilla-inbound.

The basic idea behind smaller pools of workers is that by restricting which machines are eligible to run jobs, you get much faster incremental builds since you have more recent checkouts, object dirs, etc.

Additionally, we've made some improvements to how we initialize mock environments. We don't reset and re-install packages into the mock chroot if the previous package list is the same as the new package list. This works especially well with jacuzzis, as we can arrange for machines to run jobs with the same package lists.

On Tuesday we enabled jacuzzis for some build types on b2g-inbound: hamachi device builds, and opt/debug ICS emulator builds.


We've dropped build times by more than 50%!

The spikes earlier this morning look like they're caused by running on fresh spot instances. When spot nodes first come online, they have no previous state, and so their first builds will always be slower. The machines should stay up most of the day, so you really only have to pay this cost once per day.

For the B2G emulator builds, this means we're getting tests started earlier and therefore get much faster feedback as to the quality of recent patches.

I'm super happy with these results. What's next? Well, turning on MOAR JACUZZIS for MOAR THINGS! Additionally, having fewer build types per machine means our disk footprint is much lower, and we should be able to use local SSDs for builds on AWS.

As usual, all our work has a tracking bug: bug 970738

Implementation details

There are three major pieces involved in pulling this together: the jacuzzi allocations themselves, buildbot integration, and finally AWS integration.


Jacuzzi allocations

Ben has published the allocations here:

Each builder (or job type) has a specific list of workers associated with it. Ben has been working on ways of automatically managing these allocations so we don't need to tune them by hand too much.
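The allocation data is essentially a mapping from builder (job type) to the hostnames allowed to run it. Something like this, with made-up builder and host names:

# Hypothetical shape of the jacuzzi allocation data.
ALLOCATIONS = {
    "b2g_b2g-inbound_hamachi_dep": ["bld-linux64-ec2-001", "bld-linux64-ec2-002"],
    "b2g_b2g-inbound_emulator_dep": ["bld-linux64-ec2-003", "bld-linux64-ec2-004"],
}

def allowed_workers(builder_name, connected_workers):
    """Workers a builder may use: its jacuzzi allocation if one exists,
    otherwise any connected worker (non-jacuzzi'd builders are unrestricted)."""
    allocation = ALLOCATIONS.get(builder_name)
    if allocation is None:
        return list(connected_workers)
    return [w for w in connected_workers if w in allocation]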


Buildbot integration

The bulk of the work required was in buildbot. Previous to jacuzzis, we had several large pools of workers, each capable of doing any one of hundreds of different job types. Each builder in buildbot has each of the workers in a pool listed as able to do that job. We wanted to avoid having to reconfigure buildbot every time we needed to change jacuzzi allocations, which is why we decided to put the allocations in an external service.

There are two places where buildbot fetches the allocation data: nextSlave functions and our builder prioritizing function. The first is straightforward, and was the only place I was expecting to make changes. These nextSlave functions are called with a list of available machines, and the function's job is to pick one of the connected machines to do the job. It was relatively simple to add support for this to buildbot. The need to handle the latter case, modifying our builder prioritization, didn't become apparent until testing... The reasoning is a bit convoluted, so I'll explain below for those interested.
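Here's a hedged sketch of what a jacuzzi-aware nextSlave function looks like in spirit; the real one deals with buildbot's slave objects and fetches allocations from the external service over HTTP, so treat this as shape, not source.

import random

JACUZZI_ALLOCATIONS = {
    # builder name -> list of eligible worker names (hypothetical)
    "b2g_b2g-inbound_hamachi_dep": ["bld-linux64-ec2-001", "bld-linux64-ec2-002"],
}

def next_slave(builder_name, available_workers):
    """Pick a worker for this builder, restricted to its jacuzzi when one exists."""
    allocation = JACUZZI_ALLOCATIONS.get(builder_name)
    if allocation:
        candidates = [w for w in available_workers if w in allocation]
    else:
        candidates = list(available_workers)
    if not candidates:
        return None   # nothing eligible right now; leave the job pending
    return random.choice(candidates)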

AWS Support

Now that we had buildbot using the right workers, we needed to make sure that we were starting those workers in Amazon!

We adjusted our tools that monitor the job queue and start new EC2 instances to also check the jacuzzi allocations, and start the correct instances.

The gory details of build prioritizations

We have a function in buildbot which handles a lot of the prioritization of the job queue. For example, pending jobs for mozilla-central will get priority over jobs for any of the twigs, like ash or birch. Older jobs also tend to get priority over newer jobs. The function needs to return the list of builders in priority sorted order. Buildbot then processes each builder in turn, trying to assign pending jobs to any idle workers.

There are two factors which make this function complicated: each buildbot master is doing this prioritization independently of the others, and workers are becoming idle while buildbot is still processing the sorted list of builders. This second point caused prioritization to be broken (bug 659222) for a long time.

Imagine a case where you have 3 pending jobs (A, B, C), all for the same set of workers (W1, W2, W3). Job A is the most important, job C is the least. All the workers are busy. prioritizeBuilders has sorted our list of builders, and buildbot looks at A first. No workers are available, so it moves onto B next. Still no free workers. But now worker W1 connects, and buildbot examines job C, and finds there are available workers (W1). So job C butts in line and gets run before jobs A or B.

Our fix for this was to maintain a list of pending jobs for each set of available workers, and then discard all but the most important pending job for each worker set. In our example, we would see that all 3 pending jobs have the same worker set (W1, W2, W3), and so would temporarily ignore pending jobs B and C. This leaves buildbot only job A to consider. In our example above, it would find no available workers and stop processing. When W1 becomes available, it triggers another prioritization run, and again job A is the sole job under consideration and gets the worker.
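In code, the fix is roughly "group pending jobs by their worker set and keep only the most important one per group". A simplified sketch using the A/B/C example (not buildbot's actual implementation):

# Pending jobs as (priority, job name, eligible worker set); lower number
# means more important.
pending = [
    (0, "A", frozenset({"W1", "W2", "W3"})),
    (1, "B", frozenset({"W1", "W2", "W3"})),
    (2, "C", frozenset({"W1", "W2", "W3"})),
]

def prioritize(pending_jobs):
    best = {}
    for priority, job, workers in sorted(pending_jobs):
        # The first (highest-priority) job seen for each worker set wins;
        # lower-priority jobs for the same set are temporarily discarded.
        best.setdefault(workers, (priority, job))
    return [job for priority, job in sorted(best.values())]

print(prioritize(pending))   # ['A']; B and C wait until A has been handled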

Unfortunately, this behaviour conflicted with what we were trying to do with jacuzzis. Imagine in the same example above we have jacuzzis allocated so that W1 is allocated to only do jobs of type C. If W1 is the only available worker, and C is getting discarded each time the prioritization is done, we're not making any forward progress. In fact, we've triggered a bit of an infinite loop, since currently we trigger another round of prioritizing/job assignments if we have available workers and have temporarily discarded lower priority jobs.

The fix was to integrate the jacuzzi allocations into this prioritization logic as well. I'm a little concerned about the runtime impact of this, since we need to query the allocated workers for every pending job type. One thing we're considering is to change the allocator to return the allocations as a single monolithic blob, rather than having per-job-type requests. Or, we could support both types.