Posts about taskcluster

Taskcluster migration update: we're finished!

We're done!

Over the past few weeks we've hit a few major milestones in our project to migrate all of Firefox's CI and release automation to taskcluster.

Firefox 60 and higher are now 100% on taskcluster!

Tests

At the end of March, our Release Operations and Project Integrity teams finished migrating Windows tests onto new hardware machines, all running taskcluster. That work was later uplifted to beta so that CI automation on beta would also be completely done using taskcluster.

This marked the last usage of buildbot for Firefox CI.

Periodic updates of blocklist and pinning data

Last week we switched off the buildbot versions of the periodic update jobs. These jobs keep the in-tree versions of blocklist, HSTS and HPKP lists up to date.

These were the last buildbot jobs running on trunk branches.

Partner repacks

And to wrap things up, yesterday the final patches landed to migrate partner repacks to taskcluster. Firefox 60.0b14 was built yesterday and shipped today 100% using taskcluster.

A massive amount of work went into migrating partner repacks from buildbot to taskcluster, and I'm really proud of the whole team for pulling this off.

So, starting today, Firefox 60 and higher will be completely off taskcluster and not rely on buildbot.

It feels really good to write that :)

We've been working on migrating Firefox to taskcluster for over three years! Code archaeology is hard, but I think the first Firefox jobs to start running in Taskcluster were the Linux64 builds, done by Morgan in bug 1155749.

Into the glorious future

It's great to have migrated everything off of buildbot and onto taskcluster, and we have endless ideas for how to improve things now that we're there. First we need to spend some time cleaning up after ourselves and paying down some technical debt we've accumulated. It's a good time to start ripping out buildbot code from the tree as well.

We've got other plans to make release automation easier for other people to work with, including doing staging releases on try(!!), making the nightly release process more similar to the beta/release process, and for exposing different parts of the release process to release management so that releng doesn't have to be directly involved with the day-to-day release mechanics.

Taskcluster migration update, the sequel

Firefox, now 100% buildbot-free!

First, the good news - Developer Edition 60.0b1 will be the first release in nearly 10 years done without using buildbot. This is an amazing milestone, and I'm incredibly proud of everybody who has contributed to make this possible!

Long time, no update

How did we get here? It's been, uh, almost 6 months since I last posted an update about our migration to Taskcluster.

In my last update, I described our plans for the end of 2017...

We're on track to ship builds produced in Taskcluster as part of the
56.0 release scheduled for late September. After that the only Firefox
builds being produced by buildbot will be for ESR52.

Meanwhile, we've started tackling the remaining parts of release
automation. We prioritized getting nightly and CI builds migrated to
Taskcluster, however, there are still parts of the release process
still implemented in Buildbot.

We're aiming to have release automation completely migrated
off of buildbot by the end of the year. We've already seen many
benefits from migrating CI to Taskcluster, and migrating the release
process will realize many of those same benefits.

How'd we do?

We're past the end of 2017, so how are we doing?

Well, we successfully shipped 56.0 with builds produced in Taskcluster. Our big Firefox Quantum release (57.0), was also shipped with builds produced by Taskcluster.

(side note: 57 had the most complex update scenarios we've ever had to support for Firefox...a subject for another post!)

Release scheduling

Post-56.0, our release process was using Taskcluster exclusively for producing the initial builds, and all the release process scheduling. We were still using Buildbot for many of the post-build tasks, like l10n repacks, publishing updates, pushing files to S3, etc. Once again we relied on the buildbot bridge to allow us to integrate existing buildbot components with the newer taskcluster pipeline. I learned from Kim Moir that this is a great example of the strangler pattern.

In the fall of 2017, we decided to begin migrating all of the scheduling logic for release automation into taskcluster using the in-tree taskgraph scheduling system. We did this for a few reasons...

  1. Having the release scheduling logic ride the trains is much more maintainable. Previous to this we had an externally defined release pipeline in our releasetasks repo. It was hard to keep this repository in sync with changes required for beta/release and ESR branches.

  2. More importantly, having the release scheduling logic in-tree meant that we could then rely on chain-of-trust to verify artifacts produced by the release pipeline.

  3. We felt that having the complete release pipeline defined in taskcluster would make it easier for us to tackle the remaining buildbot bridge tasks in parallel.

We hit this milestone in the 58 cycle. Starting with 58.0b3, Firefox and Fennec releases were completely scheduled using the in-tree taskgraph generation. We also migrated over the l10n repacks at the same time, removing a longstanding source of problems where repacks would fail when we first got to beta due to environmental differences between taskcluster and buildbot.

No-BBB Releases

Still, as of 58, much of release automation still ran on buildbot, even if Taskcluster was doing all the scheduling.

Since December, we've been working on removing these last few pieces of buildbot from the release process. Progress was initially a bit slow, given Austin and Christmas, but we've been hard at work in the new year.

That brings us to today.

We've moved uptake monitoring, update verify (and made it 2x faster too!), update submission, final verify, bouncer submission, version bumping and tagging, balrog submission all to run in Taskcluster via various kinds of scriptworkers.

As I mentioned above, DevEdition 60.0b1 will be the first release in nearly 10 years done without using buildbot. The rest of the 60 release cycle will follow suit, and once 60 hits the release channel, only ESR52 will remain on buildbot!

Taskcluster migration update

All your nightlies are belong to Taskcluster

In January I announced that we had just migrated Linux nightly builds to Taskcluster.

We completed a huge milestone in July: starting in Firefox 56, we've been doing all our nightly Firefox builds in Taskcluster.

https://media.giphy.com/media/MOWPkhRAUbR7i/giphy.gif

This includes all Windows, macOS, Linux, and Android builds. You can see all the builds and repacks on Treeherder.

In August, after 56 merged to Beta, we've also been doing our Firefox Beta builds using Taskcluster. We're on track to be shipping Firefox 56, built from Taskcluster to release users at the end of September.

Windows and macOS each had their own challenges to get them ready to build and ship to our nightly users.

Windows signing

We've had Windows builds running in Taskcluster for quite a while now. The biggest missing piece stopping us from shipping these builds was signing. Windows builds end up being a bit complicated to sign.

First, each compiled .exe and .dll binary needs to be signed. Signing binaries in windows changes their contents, and so we need to regenerate some files that depend on the exact contents of binaries. Next, we need to create packages in various formats: a "setup.exe" for installing Firefox, and also MAR files for updates. Each of these package formats in turn need to be signed.

In buildbot, this process was monolithic. All of the binary generation and signing happened as part of the same build process. The same process would also publish symbols to the symbol server and publish updates to Balrog The downside of this monolithic process is that it adds additional dependencies to the build, which is already a really long process. If something goes wrong with signing, or publishing updates, you don't want to have to restart a 2 hour build!

As part of our migration to Taskcluster, we decided that builds should minimize their external dependencies. This means that the build task produces only unsigned binaries, and it is the responsibility of downstream tasks to sign them. We also wanted discrete tasks for symbol and update submission.

One wrinkle in this approach is that the logic that defines how to create a setup.exe package or a MAR file lives in tree. We didn't want to run that code in the same context as the code that generates signatures.

Our solution to this was to create a sequence of build -> signing -> repackage -> signing tasks. The signing tasks run in a restricted environment while the build and repackage tasks have access to the build system in order to produce the required artifacts. Using the chain of trust, we can demonstrate that the artifacts weren't tampered with between intermediate tasks.

Finally, we need to consider l10n repacks. We ship Firefox in over 90 locales. The repacking process downloads the en-US build and replaces the English strings with localized strings. Each of these repacks needs to be based on the signed en-US build. Each will also generate its own setup.exe and complete MAR for updates.

macOS performance (and why your build directory matters)

Like Windows, we've had macOS builds running on Taskcluster for a long time. Also like Windows, we had to solve signing for macOS.

However, the biggest blocker for the macOS build migration, was a performance bug. Builds produced on Taskcluster showed some serious performance regressions as compared to the builds produced on buildbot.

Many very smart people looked at this bug since it was first discovered in February. They compared library versions being used. They compared compiler versions and compiler flags. They even inspected the generated assembly code from both systems.

Mike Shal stumbled across the first clue to what was going on in June: if he stripped the Taskcluster binaries, then the performance problems disappeared! At this point we decided that we could go ahead and ship these builds to nightly users, knowing that the performance regression would disappear on beta and release.

Later on, Mike realized that it's not the presence or absence of symbols in the binary that cause the performance hit, it's what directory the builds are done in. On buildbot we build under /builds/..., and on Taskcluster we build under /home/...

https://media.giphy.com/media/zjQrmdlR9ZCM/giphy.gif

Read the bug for more gory details. This is definitely one of the strangest bugs I've seen.

Lessons learned

We learned quite a bit in the process of migrating Windows and macOS nightly builds to Taskcluster.

First, we gained a huge amount of experience with the in-tree scheduling system. There's a bit of a learning curve to climb, but it's an extremely powerful and flexible system. Many kudos to Dustin for his work creating the foundation of this system here. His blog post, "What's So Special About "In-Tree"?", is a great explanation of why having this code as part of Firefox's repository is so important.

One of the killer features of having all the scheduling logic live in-tree is that you can do quite a bit of work locally, without requiring any build infrastructure. This is extremely useful when working on the complex build / signing / repackage sequence of tasks described above. You can make your changes, generate a new task graph, and inspect the results.

Once you're happy with your local changes, you can push them to try to validate your local testing, get your patch reviewed, and then finally landed in gecko. Your scheduling changes will take effect as soon as they land into the repo. This made it possible for us to do a lot of testing on another project branch, and then merge the code to central once we were ready.

What's next?

We're on track to ship builds produced in Taskcluster as part of the 56.0 release scheduled for late September. After that the only Firefox builds being produced by buildbot will be for ESR52.

Meanwhile, we've started tackling the remaining parts of release automation. We prioritized getting nightly and CI builds migrated to Taskcluster, however, there are still parts of the release process still implemented in Buildbot.

We're aiming to have release automation completely migrated off of buildbot by the end of the year. We've already seen many benefits from migrating CI to Taskcluster, and migrating the release process will realize many of those same benefits.

Thanks!

Thank you for reading this far!

Members from the Release Engineering, Release Operations, Taskcluster, Build, and Product Integrity teams all were involved in finishing up this migration. Thanks to everyone involved (there are a lot of you!) to getting us across the finish line here.

In particular, if you come across one of these fine individuals at the office, or maybe on IRC, I'm sure they would appreciate a quick "thank you":

  • Aki Sasaki
  • Dustin Mitchell
  • Greg Arndt
  • Joel Maher
  • Johan Lorenzo
  • Justin Wood
  • Kim Moir
  • Mihai Tabara
  • Mike Shal
  • Nick Thomas
  • Rail Aliiev
  • Rob Thijssen
  • Simon Fraser
  • Wander Costa

Nightly builds from Taskcluster

Yesterday, for the very first time, we started shipping Linux Desktop and Android Firefox nightly builds from Taskcluster.

74851712.jpg

We now have a much more secure, resilient, and hackable nightly build and release process.

It's more secure, because we have developed a chain of trust that allows us to verify all generated artifacts back to the original decision task and docker image. Signing is no longer done as part of the build process, but is now split out into a discrete task after the build completes.

The new process is more resilient because we've split up the monolithic build process into smaller bits: build, signing, symbol upload, upload to CDN, and publishing updates are all done as separate tasks. If any one of these fail, they can be retried independently. We don't have to re-compile the entire build again just because an external service was temporarily unavailable.

Finally, it's more hackable - in a good way! All the configuration files for the nightly build and release process are contained in-tree. That means it's easier to inspect and change how nightly builds are done. Changes will automatically ride the trains to aurora, beta, etc.

Ideally you didn't even notice this change! We try and get these changes done quietly, smoothly, in the background.

This is a giant milestone for Mozilla's Release Engineering and Taskcluster teams, and is the result of many months of hard work, planning, coding, reviewing and debugging.

Big big thanks to jlund, Callek, mtabara, kmoir, aki, dustin, sfraser, jlorenzo, coop, jmaher, bstack, gbrown, and everybody else who made this possible!

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates

Balrog

balrog

All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server, aus3.mozilla.org, to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!

Funsize

When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but not published to the regular nightly update channel yet.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp.mozilla.org is going away...

brace yourselves...ftp is going away

...in its current incarnation at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind ftp.mozilla.org will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently we have most builds and tests doing their transfers to/from S3 via the task cluster index in addition to doing parallel uploads to ftp.mozilla.org. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. So we need to generate all the l10n repacks, partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.

Infrastructure

In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for windows Firefox builds. We'll be working on making these virtualized instances more performant and being able to do large-scale automation before we roll them out into production.

Puppet on windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different tool chain, Active Directory and Group Policy Object, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for task cluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure, to a newer, more purpose-built solution in taskcluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend taskcluster job support to non-Linux operating systems.

Testing

Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to determine which tests to run that have been catching the most regressions. Other tests are run less frequently.