Posts about firefox

Firefox release speed wins

Sylvestre wrote about how we were able to ship new releases for Nightly, Beta, Release and ESR versions of Firefox for Desktop and Android in less than a day in response to the pwn2own contest.

People commented on how much faster the Beta and Release releases were compared to the ESR release, so I wanted to dive into the releases on the different branches to understand if this really was the case, and if so, why?

Chemspill timings

                    | Firefox ESR 52.7.2 | Firefox 59.0.1  | Firefox 60.0b4
 ------------------ | ------------------ | --------------- | --------------
 Fix landed in HG   | 23:33:06           | 23:31:28        | 23:29:54
 en-US builds ready | 03:19:03 +3h45m    | 01:16:41 +1h45m | 01:16:47 +1h46m
 Updates ready      | 08:43:03 +5h24m    | 04:21:17 +3h04m | 04:41:02 +3h25m
 Total              | 9h09m              | 4h49m           | 5h11m

(All times UTC from 2018-03-15 -> 2018-03-16)
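The totals in the table span midnight, so they can't be read off by simple subtraction. Here is a small sketch of how they can be recomputed from the timestamps above. The helper names `elapsed` and `hm` are made up for this example, and it assumes at most one midnight rollover between two readings.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM:SS clock readings, allowing one midnight rollover."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        # The end reading is on the following day.
        delta += timedelta(days=1)
    return delta


def hm(delta: timedelta) -> str:
    """Format a duration as e.g. '4h49m', dropping leftover seconds."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


# Fix landed -> updates ready, using the timestamps from the table.
esr_total = hm(elapsed("23:33:06", "08:43:03"))
release_total = hm(elapsed("23:31:28", "04:21:17"))
beta_total = hm(elapsed("23:29:54", "04:41:02"))
print(esr_total, release_total, beta_total)
```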

Summary

We can see that Firefox 59 and 60.0b4 were significantly faster to run than ESR 52 was! What's behind this speedup?

Release Engineering have been busy migrating release automation from buildbot to taskcluster. Much of ESR52 still runs on buildbot, while Firefox 59 is mostly done in Taskcluster, and Firefox 60 is entirely done in Taskcluster.

In ESR52 the initial builds are still done in buildbot, which has been missing out on many performance gains from the build system and AWS side. Update testing is done via buildbot on slower mac minis or windows hardware.

The Firefox 59 release had much faster builds, and update verification is done in Taskcluster on fast linux machines instead of the old mac minis or windows hardware.

The Firefox 60.0b4 release also had much faster builds, and ended up running in about the same time as Firefox 59. It turns out that we hit several intermittent infrastructure failures in 60.0b4 that caused this release to be slower than it could have been. Also, because we had multiple releases running simultaneously, we did see some resource contention for tasks like signing.

For comparison, here's what 60.0b11 looks like:

                    | Firefox 60.0b11
 ------------------ | --------------- 
 Fix landed in HG   | 18:45:45
 en-US builds ready | 20:41:53 +1h56m
 Updates ready      | 22:19:30 +1h37m
 Total              | 3h33m

Wow, down to 3.5 hours!

In addition to the faster builds and faster update tests, we're seeing a lot of wins from increased parallelization that we can do now using taskcluster's much more flexible scheduling engine. There's still more we can do to speed up certain types of tasks, fix up intermittent failures, and increase parallelization. I'm curious just how fast this pipeline can be :)

Taskcluster migration update: we're finished!

We're done!

Over the past few weeks we've hit a few major milestones in our project to migrate all of Firefox's CI and release automation to taskcluster.

Firefox 60 and higher are now 100% on taskcluster!

Tests

At the end of March, our Release Operations and Project Integrity teams finished migrating Windows tests onto new hardware machines, all running taskcluster. That work was later uplifted to beta so that CI automation on beta would also be completely done using taskcluster.

This marked the last usage of buildbot for Firefox CI.

Periodic updates of blocklist and pinning data

Last week we switched off the buildbot versions of the periodic update jobs. These jobs keep the in-tree versions of blocklist, HSTS and HPKP lists up to date.

These were the last buildbot jobs running on trunk branches.

Partner repacks

And to wrap things up, yesterday the final patches landed to migrate partner repacks to taskcluster. Firefox 60.0b14 was built yesterday and shipped today 100% using taskcluster.

A massive amount of work went into migrating partner repacks from buildbot to taskcluster, and I'm really proud of the whole team for pulling this off.

So, starting today, Firefox 60 and higher will be completely on taskcluster and will not rely on buildbot.

It feels really good to write that :)

We've been working on migrating Firefox to taskcluster for over three years! Code archaeology is hard, but I think the first Firefox jobs to start running in Taskcluster were the Linux64 builds, done by Morgan in bug 1155749.

Into the glorious future

It's great to have migrated everything off of buildbot and onto taskcluster, and we have endless ideas for how to improve things now that we're there. First we need to spend some time cleaning up after ourselves and paying down some technical debt we've accumulated. It's a good time to start ripping out buildbot code from the tree as well.

We've got other plans to make release automation easier for other people to work with, including doing staging releases on try(!!), making the nightly release process more similar to the beta/release process, and exposing different parts of the release process to release management so that releng doesn't have to be directly involved with the day-to-day release mechanics.

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates

Balrog

All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server, aus3.mozilla.org, to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!

Funsize

When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but those partials are not yet published to the regular nightly update channel.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp.mozilla.org is going away...

...in its current incarnation at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind ftp.mozilla.org will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the TaskCluster index, in addition to doing parallel uploads to ftp.mozilla.org. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks and partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.

Infrastructure

In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for Windows Firefox builds. We'll be working on making these virtualized instances more performant and on supporting large-scale automation before we roll them out into production.

Puppet on windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different tool chain, Active Directory and Group Policy Object, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.
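To illustrate what "content-addressable" means here, the following is a toy sketch, not tooltool's actual code. It assumes SHA-512 as the digest, and it uses an in-memory dict where the real service uses S3; the `ContentStore` class and its methods are hypothetical names for this example.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: each file is keyed by the digest of its bytes."""

    def __init__(self):
        # Stand-in for the real backing store (S3 in the deployed service).
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return the digest that now identifies them."""
        digest = hashlib.sha512(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch bytes by digest, verifying that they still match it."""
        data = self._blobs[digest]
        if hashlib.sha512(data).hexdigest() != digest:
            raise ValueError("stored content does not match its digest")
        return data


store = ContentStore()
key = store.put(b"large toolchain archive")
same_key = store.put(b"large toolchain archive")
```

Because the key is derived from the content, uploading the same file twice yields the same key, and a consumer can always check that what it downloaded is what the build asked for.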

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.

Testing

Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to work out which tests have been catching the most regressions, and we prioritize running those. Other tests are run less frequently.
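As a rough sketch of the idea (not SETA's actual algorithm), suppose we have a log with one entry per regression, naming the test job that caught it. The function names, the `keep` cutoff and the `every_nth` interval below are all assumptions made for illustration.

```python
from collections import Counter


def pick_frequent_jobs(regression_log, keep):
    """Return the `keep` jobs that caught the most regressions, most useful first."""
    counts = Counter(regression_log)
    # Sort by descending count, breaking ties by name so the result is stable.
    ranked = sorted(counts, key=lambda job: (-counts[job], job))
    return ranked[:keep]


def should_run(job, frequent, push_number, every_nth=5):
    """Run high-value jobs on every push, everything else only on every Nth push."""
    return job in frequent or push_number % every_nth == 0


log = ["mochitest-1", "xpcshell", "mochitest-1", "reftest", "xpcshell", "mochitest-1"]
frequent = pick_frequent_jobs(log, keep=2)
print(frequent)
```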

Behind the clouds: how RelEng do Firefox builds on AWS

RelEng have been expanding our usage of Amazon's AWS over the past few months as the development pace of the B2G project increases. In October we began moving builds off of Mozilla-only infrastructure and into a hybrid model where some jobs are done in Mozilla's infra, and others are done in Amazon. Since October we've expanded into 3 amazon regions, and now have nearly 300 build machines in Amazon. Within each AWS region we've distributed our load across 3 availability zones.

That's great! But how does it work?

Behind the scenes, we've written quite a bit of code to manage our new AWS infrastructure. This code is in our cloud-tools repo (github|hg.m.o) and uses the excellent boto library extensively.

The two workhorses in there are aws_watch_pending and aws_stop_idle. aws_stop_idle's job is pretty easy: it goes around looking at EC2 instances that are idle and shuts them off safely. If an EC2 slave hasn't done any work in more than 10 minutes, it is shut down.
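The idle check itself boils down to something like the sketch below. This is not the code from cloud-tools: the function name and the shape of the input (a mapping from instance name to the time its last job finished) are assumptions, and the safe shutdown step is left out entirely.

```python
from datetime import datetime, timedelta

# The 10 minute threshold comes from the description above.
IDLE_LIMIT = timedelta(minutes=10)


def instances_to_stop(last_job_finished, now):
    """Return names of instances that have been idle for longer than IDLE_LIMIT."""
    return sorted(
        name
        for name, finished in last_job_finished.items()
        if now - finished > IDLE_LIMIT
    )


now = datetime(2013, 1, 1, 12, 0, 0)
last_job_finished = {
    "ec2-001": now - timedelta(minutes=25),
    "ec2-002": now - timedelta(minutes=3),
    "ec2-003": now - timedelta(minutes=11),
}
idle = instances_to_stop(last_job_finished, now)
print(idle)
```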

aws_watch_pending is a little more involved. Its job is to notice when there are pending jobs (like your build waiting to start!) and to resume EC2 instances. We take a few factors into account when starting up instances:

  • We wait until a pending job is more than a minute old before starting anything. This allows in-house capacity to grab the job if possible, and other EC2 slaves that are online but idle also have a chance to take it.
  • Use any reserved instances first. As our AWS load stabilizes, we've been able to purchase some reserved instances to reduce our cost. Obviously, to reduce our cost, we have to use those reservations wherever possible! The code to do this is a bit more complicated than I'd like it to be since AWS reservations are specific to individual availability zones rather than whole regions.
  • Some regions are cheaper than others, so we prefer to start instances in the cheaper regions first.
  • Start instances that were most recently running. This should both give better depend-build times and help slightly with billing. Amazon bills for complete hours. So if you start one instance twice in an hour, you're charged for a single hour. If you start two instances once in the hour, you're charged for two hours.
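Taken together, the factors above amount to a filter on pending jobs followed by a ranking of stopped instances. Here is a minimal sketch of that logic. It is not the real aws_watch_pending code: the `Stopped` record, its fields, and the strict ordering (reservations, then price, then recency) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Stopped:
    name: str
    region_price: float  # hypothetical relative hourly price of the instance's region
    reserved: bool       # True if an unused reservation exists in its availability zone
    last_ran: float      # timestamp (seconds) of when the instance was last running


def instances_to_start(pending_ages, candidates, min_age=60):
    """Pick which stopped instances to resume for the given pending jobs.

    pending_ages: age in seconds of each pending job.
    """
    # Only jobs older than a minute count, so in-house and idle slaves get first shot.
    needed = sum(1 for age in pending_ages if age > min_age)
    # Reserved capacity first, then cheaper regions, then most recently running.
    ranked = sorted(
        candidates,
        key=lambda i: (not i.reserved, i.region_price, -i.last_ran),
    )
    return [i.name for i in ranked[:needed]]


candidates = [
    Stopped("a", region_price=0.10, reserved=False, last_ran=100.0),
    Stopped("b", region_price=0.12, reserved=True, last_ran=50.0),
    Stopped("c", region_price=0.10, reserved=False, last_ran=500.0),
]
chosen = instances_to_start([30, 90, 200], candidates)
print(chosen)
```

In this example only two of the three pending jobs are old enough to count, so two instances are resumed: the one covered by a reservation, then the more recently used of the two cheaper ones.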

Overall we're really happy with Amazon's services. Having APIs for nearly everything has made development really easy.

What's next?

Seeing as how test capacity is always woefully behind, we're hoping to be able to run a large number of our linux-based unittests on EC2, particularly those that don't require an accelerated graphics context.

After that? Maybe windows builds? Maybe automated regression hunting? What do you want to see?

self-serve builds!

Do you want to be able to cancel your own try server builds?

Do you want to be able to re-trigger a failed nightly build before the RelEng sheriff wakes up?

Do you want to be able to get additional test runs on your build?

If you answered an enthusiastic YES to any or all of these questions, then self-serve is for you.

self-serve was created to provide an API to allow developers to interact with our build infrastructure, with the goal being that others would then create tools against it. It's still early days for this self-serve API, so just a few caveats:

  • This is very much pre-alpha and may cause your computer to explode, your keg to run dry, or may simply hang.
  • It's slower than I want. I've spent a bit of time optimizing and caching, but I think it can be much better. Just look at shaver's bugzilla search to see what's possible for speed. Part of the problem here is that it's currently running on a VM that's doing a few dozen other things. We're working on getting faster hardware, but didn't want to block this pre-alpha-rollout on that.
  • You need to log in with your LDAP credentials to work with it.
  • The HTML interface is teh suck. Good thing I'm not paid to be a front-end webdev! Really, the goal here wasn't to create a fully functional web interface, but rather to provide a functional programmatic interface.
  • Changing build priorities may run afoul of bug 555664...haven't had a chance to test out exactly what happens right now if a high priority job gets merged with a lower priority one.

That being said, I'm proud to be able to finally make this public. Documentation for the REST API is available as part of the web interface itself, and the code is available as part of the buildapi repository on hg.mozilla.org.

https://build.mozilla.org/buildapi/self-serve

Please be gentle!

Any questions, problems or feedback can be left here, or filed in bugzilla.

Pooling the Talos slaves

One of the big projects for me this quarter was getting our Talos slaves configured as a pool of machines shared across branches. For those interested, the details are being tracked in bug 488367.

This is a continuation of our work on pooling our slaves, like we've done over the past year with our build, unittest, and l10n slaves.

Up until now each branch has had a dedicated set of Mac Minis to run performance tests for just that branch, on five different operating systems. For example, the Firefox 3.0 branch used to have 19 Mac Minis doing regular Talos tests: 4 of each platform (except for Leopard, which had 3). Across our 4 active branches (Firefox 3.0, 3.5, 3.next, and TraceMonkey), we have around 80 minis in total! That's a lot of minis!

What we've been working towards is to put all the Talos slaves into one pool that is shared between all our active branches. Slaves will be given builds to test in FIFO order, regardless of which branch the build is produced on.
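A shared FIFO pool can be sketched in a few lines. This is an illustration only, not the buildbot configuration we actually use: the `TalosPool` class is hypothetical, and it ignores the fact that a build has to be matched to a slave running the right operating system.

```python
from collections import deque


class TalosPool:
    """Toy shared pool: builds from every branch wait in one first-in, first-out queue."""

    def __init__(self, slaves):
        self.idle = deque(slaves)
        self.queue = deque()

    def submit(self, branch, build_id):
        """Queue a build for testing, regardless of which branch produced it."""
        self.queue.append((branch, build_id))

    def assign(self):
        """Hand the oldest waiting builds to idle slaves; return (slave, branch, build)."""
        assignments = []
        while self.idle and self.queue:
            slave = self.idle.popleft()
            branch, build_id = self.queue.popleft()
            assignments.append((slave, branch, build_id))
        return assignments


pool = TalosPool(["talos-rev2-01", "talos-rev2-02"])
pool.submit("mozilla-1.9.1", "build-1")
pool.submit("tracemonkey", "build-2")
pool.submit("mozilla-central", "build-3")
assigned = pool.assign()
print(assigned)
```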

This new pool will be....

Faster

With more slaves available to all branches, the time to wait for a free slave will go down, so testing can start more quickly...which means you get your results sooner!

Smarter

It will be able to handle varying load between branches. If there's a lot of activity on one branch, like on the Firefox 3.5 branch before a release, then more slaves will be available to test those builds and won't be sitting idle waiting for builds from low activity branches.

Scalable

We will be able to scale our infrastructure much better using a pooled system. Similar to how moving to pooled build and unittest slaves has allowed us to scale based on number of checkins rather than number of branches, having pooled Talos slaves will allow us to scale our capacity based on number of builds produced rather than the number of branches.

In the current setup, each new release or project branch required an allocation of at least 15 minis to dedicate to the branch.

Once all our Talos slaves are pooled, we will be able to add Talos support for new project or release branches with a few configuration changes instead of waiting for new minis to be provisioned.

This means we can get up and running with new project branches much more quickly!

More Robust

We'll also be in a much better position in terms of maintenance of the machines. When a slave goes offline, the test coverage for any one branch won't be jeopardized since we'll still have the rest of the slaves that can test builds from that branch.

In the current setup, if one or two machines of the same platform need maintenance on one branch, then our performance test coverage of that branch is significantly impacted. With only one or two machines remaining to run tests on that platform, it can be difficult to determine if a performance regression is caused by a code change, or is caused by some machine issue. Losing two or three machines in this scenario is enough to close the tree, since we no longer have reliable performance data.

With pooled slaves we would see a much more gradual decrease in coverage when machines go offline. It's the difference between losing one third of the machines on your branch, and losing one tenth.

When is all this going to happen?

Some of it has started already! We have a small pool of slaves testing builds from our four branches right now. If you know how to coerce Tinderbox to show you hidden columns, you can take a look for yourself. They're also reporting to the new graph server using machine names starting with 'talos-rev2'.

We have some new minis waiting to be added to the pool. Together with existing slaves, this will give us around 25 machines in total to start off the new pool. This isn't enough yet to be able to test every build from each branch without skipping any, so for the moment the pool will be skipping to the most recent build per branch if there's any backlog.

It's worth pointing out that our current Talos system also skips builds if there's any backlog. However, our goal is to turn off skipping once we have enough slaves in the pool to handle our peak loads comfortably.
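The skipping behaviour described above, where a backlog is collapsed to the most recent build per branch, looks roughly like this. The function name and the `(branch, build_id)` tuples are made up for the example.

```python
def collapse_backlog(pending):
    """Keep only the newest pending build for each branch.

    pending: list of (branch, build_id) tuples in arrival order.
    Returns the surviving builds, ordered by when each one arrived.
    """
    latest = {}
    for branch, build_id in pending:
        # Drop any older build for this branch, then re-insert so that
        # dict ordering reflects the arrival time of the newest build.
        latest.pop(branch, None)
        latest[branch] = build_id
    return list(latest.items())


backlog = [
    ("mozilla-central", "build-1"),
    ("tracemonkey", "build-2"),
    ("mozilla-central", "build-3"),
]
remaining = collapse_backlog(backlog)
print(remaining)
```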

After this initial batch is up and running, we'll be waiting for a suitable time to start moving the existing Talos slaves into the pool.

All in all, this should be a big win for everyone!

Parallelizing Unit Tests

Last week we flipped the switch and turned on running unit tests on packaged builds for our mozilla-1.9.1, mozilla-central, and tracemonkey branches.

What this means is that our current unit test builds are uploaded to a web server along with all their unit tests. Another machine will then download the build and tests, and run various test suites on them.

Splitting up the tests this way allows us to run the test suites in parallel, so the mochitest suite will run on one machine, and all the other suites will be run on another machine (this group of tests is creatively named 'everythingelse' on Tinderbox).

Splitting up the tests is a critical step towards reducing our end-to-end time, which is the total time elapsed between when a change is pushed into one of the source repositories, and when all of the results from that build are available. Up until now, you had to wait for all the test suites to be completed in sequence, which could take over an hour in total. Now that we can split the tests up, the wait time is determined by the longest test suite. The mochitest suite is currently the biggest chunk here, taking somewhere around 35 minutes to complete, and all of the other tests combined take around 20 minutes. One of the next steps for us to do is to look at splitting up the mochitests into smaller pieces.
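The arithmetic behind that is simple enough to write down. The sketch below uses the rough suite timings quoted above (which vary from run to run) and only covers the testing portion of the end-to-end time. The two-way mochitest split at the end is purely hypothetical, since we haven't decided how to chunk it yet.

```python
# Approximate suite durations in minutes, from the figures quoted above.
suite_minutes = {"mochitest": 35, "everythingelse": 20}


def sequential_wait(suites):
    """Suites run one after another on a single machine: the times add up."""
    return sum(suites.values())


def parallel_wait(suites):
    """Each suite runs on its own machine: the slowest suite sets the wait."""
    return max(suites.values())


# Hypothetical split of mochitest into two chunks of roughly equal length.
split_minutes = {"mochitest-1": 18, "mochitest-2": 17, "everythingelse": 20}

print(sequential_wait(suite_minutes))
print(parallel_wait(suite_minutes))
print(parallel_wait(split_minutes))
```

With these numbers, running in parallel takes the wait for test results from about 55 minutes down to about 35, and splitting mochitest in two would make the 20 minute 'everythingelse' group the new bottleneck.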

For the time being, we will continue to run the existing unit tests on the same machine that is creating the build. This is so that we can make sure that running tests on the packaged builds is giving us the same results (there are already some known differences: bug 491675, bug 475383).

Parallelizing the unit tests, and the infrastructure required to run them, is the first step towards achieving a few important goals.

  • Reducing end-to-end time.

  • Running unit tests on debug, as well as on optimized builds. Once we've got both of these going, we can turn off the builds that are currently done solely to be able to run tests on them.

  • Running unit tests on the same build multiple times, to help isolate intermittent test failures.

All of the gory details can be found in bug 383136.