Taskcluster migration update

All your nightlies are belong to Taskcluster

In January I announced that we had just migrated Linux nightly builds to Taskcluster.

We completed a huge milestone in July: starting in Firefox 56, we've been doing all our nightly Firefox builds in Taskcluster.

https://media.giphy.com/media/MOWPkhRAUbR7i/giphy.gif

This includes all Windows, macOS, Linux, and Android builds. You can see all the builds and repacks on Treeherder.

In August, after 56 merged to Beta, we've also been doing our Firefox Beta builds using Taskcluster. We're on track to be shipping Firefox 56, built from Taskcluster to release users at the end of September.

Windows and macOS each had their own challenges to get them ready to build and ship to our nightly users.

Windows signing

We've had Windows builds running in Taskcluster for quite a while now. The biggest missing piece stopping us from shipping these builds was signing. Windows builds end up being a bit complicated to sign.

First, each compiled .exe and .dll binary needs to be signed. Signing binaries in windows changes their contents, and so we need to regenerate some files that depend on the exact contents of binaries. Next, we need to create packages in various formats: a "setup.exe" for installing Firefox, and also MAR files for updates. Each of these package formats in turn need to be signed.

In buildbot, this process was monolithic. All of the binary generation and signing happened as part of the same build process. The same process would also publish symbols to the symbol server and publish updates to Balrog The downside of this monolithic process is that it adds additional dependencies to the build, which is already a really long process. If something goes wrong with signing, or publishing updates, you don't want to have to restart a 2 hour build!

As part of our migration to Taskcluster, we decided that builds should minimize their external dependencies. This means that the build task produces only unsigned binaries, and it is the responsibility of downstream tasks to sign them. We also wanted discrete tasks for symbol and update submission.

One wrinkle in this approach is that the logic that defines how to create a setup.exe package or a MAR file lives in tree. We didn't want to run that code in the same context as the code that generates signatures.

Our solution to this was to create a sequence of build -> signing -> repackage -> signing tasks. The signing tasks run in a restricted environment while the build and repackage tasks have access to the build system in order to produce the required artifacts. Using the chain of trust, we can demonstrate that the artifacts weren't tampered with between intermediate tasks.

Finally, we need to consider l10n repacks. We ship Firefox in over 90 locales. The repacking process downloads the en-US build and replaces the English strings with localized strings. Each of these repacks needs to be based on the signed en-US build. Each will also generate its own setup.exe and complete MAR for updates.

macOS performance (and why your build directory matters)

Like Windows, we've had macOS builds running on Taskcluster for a long time. Also like Windows, we had to solve signing for macOS.

However, the biggest blocker for the macOS build migration, was a performance bug. Builds produced on Taskcluster showed some serious performance regressions as compared to the builds produced on buildbot.

Many very smart people looked at this bug since it was first discovered in February. They compared library versions being used. They compared compiler versions and compiler flags. They even inspected the generated assembly code from both systems.

Mike Shal stumbled across the first clue to what was going on in June: if he stripped the Taskcluster binaries, then the performance problems disappeared! At this point we decided that we could go ahead and ship these builds to nightly users, knowing that the performance regression would disappear on beta and release.

Later on, Mike realized that it's not the presence or absence of symbols in the binary that cause the performance hit, it's what directory the builds are done in. On buildbot we build under /builds/..., and on Taskcluster we build under /home/...

https://media.giphy.com/media/zjQrmdlR9ZCM/giphy.gif

Read the bug for more gory details. This is definitely one of the strangest bugs I've seen.

Lessons learned

We learned quite a bit in the process of migrating Windows and macOS nightly builds to Taskcluster.

First, we gained a huge amount of experience with the in-tree scheduling system. There's a bit of a learning curve to climb, but it's an extremely powerful and flexible system. Many kudos to Dustin for his work creating the foundation of this system here. His blog post, "What's So Special About "In-Tree"?", is a great explanation of why having this code as part of Firefox's repository is so important.

One of the killer features of having all the scheduling logic live in-tree is that you can do quite a bit of work locally, without requiring any build infrastructure. This is extremely useful when working on the complex build / signing / repackage sequence of tasks described above. You can make your changes, generate a new task graph, and inspect the results.

Once you're happy with your local changes, you can push them to try to validate your local testing, get your patch reviewed, and then finally landed in gecko. Your scheduling changes will take effect as soon as they land into the repo. This made it possible for us to do a lot of testing on another project branch, and then merge the code to central once we were ready.

What's next?

We're on track to ship builds produced in Taskcluster as part of the 56.0 release scheduled for late September. After that the only Firefox builds being produced by buildbot will be for ESR52.

Meanwhile, we've started tackling the remaining parts of release automation. We prioritized getting nightly and CI builds migrated to Taskcluster, however, there are still parts of the release process still implemented in Buildbot.

We're aiming to have release automation completely migrated off of buildbot by the end of the year. We've already seen many benefits from migrating CI to Taskcluster, and migrating the release process will realize many of those same benefits.

Thanks!

Thank you for reading this far!

Members from the Release Engineering, Release Operations, Taskcluster, Build, and Product Integrity teams all were involved in finishing up this migration. Thanks to everyone involved (there are a lot of you!) to getting us across the finish line here.

In particular, if you come across one of these fine individuals at the office, or maybe on IRC, I'm sure they would appreciate a quick "thank you":

  • Aki Sasaki
  • Dustin Mitchell
  • Greg Arndt
  • Joel Maher
  • Johan Lorenzo
  • Justin Wood
  • Kim Moir
  • Mihai Tabara
  • Mike Shal
  • Nick Thomas
  • Rail Aliiev
  • Rob Thijssen
  • Simon Fraser
  • Wander Costa