Taskcluster migration update

All your nightlies are belong to Taskcluster

In January I announced that we had just migrated Linux nightly builds to Taskcluster.

We completed a huge milestone in July: starting in Firefox 56, we've been doing all our nightly Firefox builds in Taskcluster.


This includes all Windows, macOS, Linux, and Android builds. You can see all the builds and repacks on Treeherder.

In August, after 56 merged to Beta, we've also been doing our Firefox Beta builds using Taskcluster. We're on track to be shipping Firefox 56, built from Taskcluster to release users at the end of September.

Windows and macOS each had their own challenges to get them ready to build and ship to our nightly users.

Windows signing

We've had Windows builds running in Taskcluster for quite a while now. The biggest missing piece stopping us from shipping these builds was signing. Windows builds end up being a bit complicated to sign.

First, each compiled .exe and .dll binary needs to be signed. Signing binaries in windows changes their contents, and so we need to regenerate some files that depend on the exact contents of binaries. Next, we need to create packages in various formats: a "setup.exe" for installing Firefox, and also MAR files for updates. Each of these package formats in turn need to be signed.

In buildbot, this process was monolithic. All of the binary generation and signing happened as part of the same build process. The same process would also publish symbols to the symbol server and publish updates to Balrog The downside of this monolithic process is that it adds additional dependencies to the build, which is already a really long process. If something goes wrong with signing, or publishing updates, you don't want to have to restart a 2 hour build!

As part of our migration to Taskcluster, we decided that builds should minimize their external dependencies. This means that the build task produces only unsigned binaries, and it is the responsibility of downstream tasks to sign them. We also wanted discrete tasks for symbol and update submission.

One wrinkle in this approach is that the logic that defines how to create a setup.exe package or a MAR file lives in tree. We didn't want to run that code in the same context as the code that generates signatures.

Our solution to this was to create a sequence of build -> signing -> repackage -> signing tasks. The signing tasks run in a restricted environment while the build and repackage tasks have access to the build system in order to produce the required artifacts. Using the chain of trust, we can demonstrate that the artifacts weren't tampered with between intermediate tasks.

Finally, we need to consider l10n repacks. We ship Firefox in over 90 locales. The repacking process downloads the en-US build and replaces the English strings with localized strings. Each of these repacks needs to be based on the signed en-US build. Each will also generate its own setup.exe and complete MAR for updates.

macOS performance (and why your build directory matters)

Like Windows, we've had macOS builds running on Taskcluster for a long time. Also like Windows, we had to solve signing for macOS.

However, the biggest blocker for the macOS build migration, was a performance bug. Builds produced on Taskcluster showed some serious performance regressions as compared to the builds produced on buildbot.

Many very smart people looked at this bug since it was first discovered in February. They compared library versions being used. They compared compiler versions and compiler flags. They even inspected the generated assembly code from both systems.

Mike Shal stumbled across the first clue to what was going on in June: if he stripped the Taskcluster binaries, then the performance problems disappeared! At this point we decided that we could go ahead and ship these builds to nightly users, knowing that the performance regression would disappear on beta and release.

Later on, Mike realized that it's not the presence or absence of symbols in the binary that cause the performance hit, it's what directory the builds are done in. On buildbot we build under /builds/..., and on Taskcluster we build under /home/...


Read the bug for more gory details. This is definitely one of the strangest bugs I've seen.

Lessons learned

We learned quite a bit in the process of migrating Windows and macOS nightly builds to Taskcluster.

First, we gained a huge amount of experience with the in-tree scheduling system. There's a bit of a learning curve to climb, but it's an extremely powerful and flexible system. Many kudos to Dustin for his work creating the foundation of this system here. His blog post, "What's So Special About "In-Tree"?", is a great explanation of why having this code as part of Firefox's repository is so important.

One of the killer features of having all the scheduling logic live in-tree is that you can do quite a bit of work locally, without requiring any build infrastructure. This is extremely useful when working on the complex build / signing / repackage sequence of tasks described above. You can make your changes, generate a new task graph, and inspect the results.

Once you're happy with your local changes, you can push them to try to validate your local testing, get your patch reviewed, and then finally landed in gecko. Your scheduling changes will take effect as soon as they land into the repo. This made it possible for us to do a lot of testing on another project branch, and then merge the code to central once we were ready.

What's next?

We're on track to ship builds produced in Taskcluster as part of the 56.0 release scheduled for late September. After that the only Firefox builds being produced by buildbot will be for ESR52.

Meanwhile, we've started tackling the remaining parts of release automation. We prioritized getting nightly and CI builds migrated to Taskcluster, however, there are still parts of the release process still implemented in Buildbot.

We're aiming to have release automation completely migrated off of buildbot by the end of the year. We've already seen many benefits from migrating CI to Taskcluster, and migrating the release process will realize many of those same benefits.


Thank you for reading this far!

Members from the Release Engineering, Release Operations, Taskcluster, Build, and Product Integrity teams all were involved in finishing up this migration. Thanks to everyone involved (there are a lot of you!) to getting us across the finish line here.

In particular, if you come across one of these fine individuals at the office, or maybe on IRC, I'm sure they would appreciate a quick "thank you":

  • Aki Sasaki
  • Dustin Mitchell
  • Greg Arndt
  • Joel Maher
  • Johan Lorenzo
  • Justin Wood
  • Kim Moir
  • Mihai Tabara
  • Mike Shal
  • Nick Thomas
  • Rail Aliiev
  • Rob Thijssen
  • Simon Fraser
  • Wander Costa

Nightly builds from Taskcluster

Yesterday, for the very first time, we started shipping Linux Desktop and Android Firefox nightly builds from Taskcluster.


We now have a much more secure, resilient, and hackable nightly build and release process.

It's more secure, because we have developed a chain of trust that allows us to verify all generated artifacts back to the original decision task and docker image. Signing is no longer done as part of the build process, but is now split out into a discrete task after the build completes.

The new process is more resilient because we've split up the monolithic build process into smaller bits: build, signing, symbol upload, upload to CDN, and publishing updates are all done as separate tasks. If any one of these fail, they can be retried independently. We don't have to re-compile the entire build again just because an external service was temporarily unavailable.

Finally, it's more hackable - in a good way! All the configuration files for the nightly build and release process are contained in-tree. That means it's easier to inspect and change how nightly builds are done. Changes will automatically ride the trains to aurora, beta, etc.

Ideally you didn't even notice this change! We try and get these changes done quietly, smoothly, in the background.

This is a giant milestone for Mozilla's Release Engineering and Taskcluster teams, and is the result of many months of hard work, planning, coding, reviewing and debugging.

Big big thanks to jlund, Callek, mtabara, kmoir, aki, dustin, sfraser, jlorenzo, coop, jmaher, bstack, gbrown, and everybody else who made this possible!

2016 RelEng Retrospective

As 2016 winds down, I wanted to take some time to highlight all the work our Release Engineering team has done this year. Personally, I really enjoy writing these retrospective posts. I think it's good to spend some time remembering how far we've come in a year. It's really easy to forget what you did last month, and 6 months ago seems like ancient history!


We added four people to our team this year!

Aki (:aki) re-joined us in January and has been working hard on developing a security model for Taskcluster for sensitive tasks like signing and publishing binaries.

Rok (:garbas) started in February and has been working on modernizing our web application framework development and deployment processes.

Johan (:jlorenzo) started in August and has been improving our release automation, Balrog, and automatically publishing Android builds to the Google Play Store.

Simon (:sfraser) started in October and has been improving monitoring of our production systems, as well as getting his feet wet with our partial update generation system.


This year we released 104 desktop versions of Firefox, and 58 android versions (including Beta, Release and ESR branches).

5 of those releases were just in the week prior to our all hands meeting in Hawaii!

Several other releases this year were special for particular reasons, and required special efforts on our part. We continued to provide SHA-1 signed installers for Windows XP users. We also produced a special 47.0.2 release in order to try and rescue users stuck on 47. We've never shipped a point release for a previous release branch before! We've also generated partial updates to try and help users on 43.0.1 and 47.0.2 get faster updates to the latest version of Firefox.

Release promotion

We couldn't have shipped so many releases so quickly last week if it weren't for release promotion. Previous to Firefox 46, our release process would generate completely new builds after CI was finished. This wasted a lot of time, and also meant we weren't shipping the exact binaries we had tested. Today, we ship the same builds that CI has generated and tested. This saves a ton of time (up to 8 hours!), and gives us a lot more confidence in the quality of the release.

This is one of those major kinds of changes that really transforms how we approach doing releases. I can't really remember what it was like doing releases prior to release promotion!

We also added support in Shipit to allow starting a release before all the en-US builds are done. This lets our Release Management team kick off a release early, assuming all the builds pass. It saves a person having to wait around watching Treeherder for the coveted green builds.

Windows in AWS

This year we completed our migration to AWS for Windows builds. 100% of our Windows builds are now done in AWS. This means that we now have a much faster and more scalable Windows build platform.

In addition, we also migrated most of the Windows 7 unittests to run in AWS. Previously these were running on dedicated hardware in our datacentre. By moving these tests to AWS, we again get a much more scalable test platform, but we also freed up hardware capacity for other test platforms (e.g. Windows XP).


One of our major focus areas this year was migrating our infrastructure from Buildbot to Taskcluster. As of today, we have:

  • Fully migrated Linux64 and Android debug builds and tests
  • Builds for all other platforms operating as Tier2
  • Linux64 and Android nightly builds, l10n repacks and updates operating as Tier2
  • Tons of security design & implementation work


Scheduled Changes in Balrog means that now we can have machines set background update rate to 0% 24 hours after release, instead of having a human do it.

Balrog itself was migrated from our datacentre in SCL3 into AWS. We now have a much more flexible deployment pipeline.

Balrog has also been one of our best projects for getting volunteer contributions! Many of the work done this year was done by contributors!


Being able to shut off old, crufty and deprecated stuff is an important part of staying agile. This year we were finally able to develop an end of life plan for Windows XP. In addition, we discontinued support for OSX 10.6-10.8, systems without SSE2, and 32-bit OSX systems. Not having to support these old platforms simplifies managing our infrastructure, and also makes product development easier.

We also shut down all the panda mobile testing infrastructure and legacy vcs-sync.

What's next?

2017 is looking like it's going to be another interesting (and busy!) year for RelEng.

Our top priority is to finish the migration to Taskcluster. Hopefully by the end of 2017, the only thing left on buildbot will be the ESR52 branch. This will require some big changes to our release automation, especially for Fennec.

We're also planning to provide some automated processes to assist with the rest of the release process. Releases still involve a lot of human to human handoffs, and places where humans are responsible for triggering automation. We'd like to provide a platform to be able to manage these handoffs more reliably, and allow different pieces of automation to coordinate more effectively.

PyCon 2016 report

I had the opportunity to spend last week in Portland for PyCon 2016. I'd like to share some of my thoughts and some pointers to good talks I was able to attend. The full schedule can be found here and all the videos are here.


Brandon Rhodes' Welcome to PyCon was one of the best introductions to a conference I've ever seen. Unfortunately I can't find a link to a recording... What I liked about it was that he made everyone feel very welcome to PyCon and to Portland. He explained some of the simple (but important!) practical details like where to find the conference rooms, how to take transit, etc. He noted that for the first time, they have live transcriptions of the talks being done and put up on screens beside the speaker slides for the hearing impaired.

He also emphasized the importance of keeping questions short during Q&A after the regular sessions. "Please form your question in the form of a question." I've been to way too many Q&A sessions where the person asking the question took the opportunity to go off on a long, unrelated tangent. For the most part, this advice was followed at PyCon: I didn't see very many long winded questions or statements during Q&A sessions.

Machete-mode Debugging

(abstract; video)

Ned Batchelder gave this great talk about using python's language features to debug problematic code. He ran through several examples of tricky problems that could come up, and how to use things like monkey patching and the debug trace hook to find out where the problem is. One piece of advice I liked was when he said that it doesn't matter how ugly the code is, since it's only going to last 10 minutes. The point is the get the information you need out of the system the easiest way possible, and then you can undo your changes.

Refactoring Python

(abstract; video)

I found this session pretty interesting. We certainly have lots of code that needs refactoring!

Security with object-capabilities

(abstract; video; slides)

I found this interesting, but a little too theoretical. Object capabilities are a completely orthogonal way to access control lists as a way model security and permissions. It was hard for me to see how we could apply this to the systems we're building.

Awaken your home

(abstract; video)

A really cool intro to the Home Assistant project, which integrates all kinds of IoT type things in your home. E.g. Nest, Sonos, IFTTT, OpenWrt, light bulbs, switches, automatic sprinkler systems. I'm definitely going to give this a try once I free up my raspberry pi.

Finding closure with closures

(abstract; video)

A very entertaining session about closures in Python. Does Python even have closures? (yes!)

Life cycle of a Python class

(abstract; video)

Lots of good information about how classes work in Python, including some details about meta-classes. I think I understand meta-classes better after having attended this session. I still don't get descriptors though!

(I hope Mike learns soon that __new__ is pronounced "dunder new" and not "under under new"!)

Deep learning

(abstract; video)

Very good presentation about getting started with deep learning. There are lots of great libraries and pre-trained neural networks out there to get started with!

Building protocol libraries the right way

(abstract; video)

I really enjoyed this talk. Cory Benfield describes the importance of keeping a clean separation between your protocol parsing code, and your IO. It not only makes things more testable, but makes code more reusable. Nearly every HTTP library in the Python ecosystem needs to re-implement its own HTTP parsing code, since all the existing code is tightly coupled to the network IO calls.


Guido's Keynote


Some interesting notes in here about the history of Python, and a look at what's coming in 3.6.


(abstract; video)

An intro to the click module for creating beautiful command line interfaces.

I like that click helps you to build testable CLIs.

HTTP/2 and asynchronous APIs

(abstract; video)

A good introduction to what HTTP/2 can do, and why it's such an improvement over HTTP/1.x.

Remote calls != local calls

(abstract; video)

Really good talk about failing gracefully. He covered some familiar topics like adding timeouts and retries to things that can fail, but also introduced to me the concept of circuit breakers. The idea with a circuit breaker is to prevent talking to services you know are down. For example, if you have failed to get a response from service X the past 5 times due to timeouts or errors, then open the circuit breaker for a set amount of time. Future calls to service X from your application will be intercepted, and will fail early. This can avoid hammering a service while it's in an error state, and works well in combination with timeouts and retries of course.

I was thinking quite a bit about Ben's redo module during this talk. It's a great module for handling retries!

Diving into the wreck

(abstract; video)

A look into diagnosing performance problems in applications. Some neat tools and techniques introduced here, but I felt he blamed the DB a little too much :)


Magic Wormhole

(abstract; video; slides)

I didn't end up going to this talk, but I did have a chance to chat with Brian before. magic-wormhole is a tool to safely transfer files from one computer to another. Think scp, but without needing ssh keys set up already, or direct network flows. Very neat tool!

Computational Physics

(abstract; video)

How to do planetary orbit simulations in Python. Pretty interesting talk, he introduced me to Feynman, and some of the important characteristics of the simulation methods introduced.

Small batch artisinal bots

(abstract; video)

Hilarious talk about building bots with Python. Definitely worth watching, although unfortunately it's only a partial recording.


(abstract; video)

The infamous GIL is gone! And your Python programs only run 25x slower!

Larry describes why the GIL was introduced, what it does, and what's involved with removing it. He's actually got a fork of Python with the GIL removed, but performance suffers quite a bit when run without the GIL.

Lars' Keynote


If you watch only one video from PyCon, watch this. It's just incredible.

MozLando Survival Guide

MozLando is coming!

I thought I would share a few tips I've learned over the years of how to make the most of these company gatherings. These summits or workweeks are always full of awesomeness, but they can also be confusing and overwhelming.

#1 Seek out people

It's great to have a (short!) list of people you'd like to see in person. Maybe somebody you've only met on IRC / vidyo or bugzilla?

Having a list of people you want to say "thank you" in person to is a great way to approach this. Who doesn't like to hear a sincere "thank you" from someone they work with?

#2 Take advantage of increased bandwidth

I don't know about you, but I can find it pretty challenging at times to get my ideas across in IRC or on an etherpad. It's so much easier in person, with a pad of paper or whiteboard in front of you. You can share ideas with people, and have a latency/lag-free conversation! No more fighting AV issues!

#3 Don't burn yourself out

A week of full days of meetings, code sprints, and blue sky dreaming can be really draining. Don't feel bad if you need to take a breather. Go for a walk or a jog. Take a nap. Read a book. You'll come back refreshed, and ready to engage again.

That's it!

I look forward to seeing you all next week!

Firefox builds on the Taskcluster Index


You have have heard rumblings that FTP is going away...


Over the past few quarters we've been working to migrate our infrastructure off of the ageing "FTP" [1] system to Amazon S3.

We've maintained some backwards compatibility for the time being [2], so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably, archive.mozilla.org since we don't support the ftp protocol any more!

Our long term plan is to make the builds available via the Taskcluster Index, and stop uploading builds to archive.mozilla.org

How do I find my builds???


This is pretty big change, but we really think this will make it easier to find the builds you're looking for.

The Taskcluster Index allows us to attach multiple "routes" to a build job. Think of a route as a kind of hierarchical tag, or directory. Unlike regular directories, a build can be tagged with multiple routes, for example, according to the revision or buildid used.

A great tool for exploring the Taskcluster Index is the Indexed Artifact Browser

Here are some recent examples of nightly Firefox builds:

The latest win64 nightly Firefox build is available via the
gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt route

This same build (as of this writing) is also available via its revision:


Or the date:


The artifact browser is simply an interface on top of the index API. Using this API, you can also fetch files directly using wget, curl, python requests, etc.:

https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe [3]

Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I encourage you to explore the gecko.v2 namespace, and see if it makes things easier for you to find what you're looking for! [4]

Can't find what you want in the index? Please let us know!

[1] A historical name referring back to the time when we used the FTP prototol to serve these files. Today, the files are available only via HTTP(S)
[2] in fact, all Firefox builds right now are currently uploaded to S3. we've just had to implement some compatibility layers to make S3 appear in many ways like the old FTP service.
[3] yes, you need to know the version number...for now. we're considering stripping that from the filenames. if you have thoughts on this, please get in touch!
[4] ignore the warning on the right about "Task not found" - that just means there are no tasks with that exact route; kind of like an empty directory

MozFest 2015

I had the privilege of attending MozFest last week. Overall it was a really great experience. I met lots of really wonderful people, and learned about so many really interesting and inspiring projects.

My biggest takeaway from MozFest was how important it is to provide good APIs and data for your systems. You can't predict how somebody else will be able to make use of your data to create something new and wonderful. But if you're not making your data available in a convenient way, nobody can make use of it at all!

It was a really good reminder for me. We generate a lot of data in Release Engineering, but it's not always exposed in a way that's convenient for other people to make use of.

The rest of this post is a summary of various sessions I attended.


Friday night started with a Science Fair. Lots of really interesting stuff here. Some of the projects that stood out for me were:

  • naturebytes - a DIY wildlife camera based on the raspberry pi, with an added bonus of aiding conservation efforts.
  • histropedia - really cool visualizations of time lines, based on data in Wikipedia and Wikidata. This was the first time I'd heard of Wikidata, and the possibilities were very exciting to me! More on this later, as I attended a whole session on Wikidata.
  • Several projects related to the Internet-of-Things (IOT)


On Saturday, the festival started with some keynotes. Surman spoke about how MozFest was a bit chaotic, but this was by design. In a similar way that the web is an open platform that you can use as a platform for building your own ideas, MozFest should be an open platform so you can meet, brainstorm, and work on your ideas. This means it can seem a bit disorganized, but that's a good thing :) You get what you want out of it.

I attended several good sessions on Saturday as well:

  • Ending online tracking. We discussed various methods currently used to track users, such as cookies and fingerprinting, and what can be done to combat these. I learned, or re-learned, about a few interesting Firefox extensions as a result:

    • privacybadger. Similar to Firefox's tracking protection, except it doesn't rely on a central blacklist. Instead, it tries to automatically identify third party domains that are setting cookies, etc. across multiple websites. Once identified, these third party domains are blocked.
    • https everywhere. Makes it easier to use HTTPS by default everywhere.
  • Intro to D3JS. d3js is a JS data visualization library. It's quite powerful, but something I learned is that you're expected to do quite a bit of work up-front to make sure it's showing you the things you want. It's not great as a data exploration library, where you're not sure exactly what the data means, and want to look at it from different points of view. The nvd3 library may be more suitable for first time users.

  • 6 kitchen cases for IOT We discussed the proposed IOT design manifesto briefly, and then split up into small groups to try and design a product, using the principles outlined in the manifesto. Our group was tasked with designing some product that would help connect hospitals with amateur chefs in their local area, to provide meals for patients at the hospital. We ended up designing a "smart cutting board" with a built in display, that would show you your recipes as you prepared them, but also collect data on the frequency of your meal preparations, and what types of foods you were preparing.

    Going through the exercise of evaluating the product with each of the design principles was fun. You could be pretty evil going into this and try and collect all customer data :)


  • How to fight an internet shutdown - we role played how we would react if the internet was suddenly shut down during some political protests. What kind of communications would be effective? What kind of preparation can you have done ahead of time for such an event?

    This session was run by Deji from accessnow. It was really eye opening to see how internet shutdowns happen fairly regularly around the world.

  • Data is beaufitul Introduction to wikidata Wikidata is like Wikipedia, but for data. An open database of...stuff. Anybody can edit and query the database. One of the really interesting features of Wikidata is that localization is kind of built-in as part of the design. Each item in the database is assigned an id (prefixed by "Q"). E.g. Q42 is Douglas Adams. The description for each item is simply a table of locale -> localized description. There's no inherent bias towards English, or any other language. The beauty of this is that you can reference the same piece of data from multiple languages, only having to focus on localizing the various descriptions. You can imagine different translations of the same Wikipedia page right now being slightly inconsistent due to each one having to be updated separately. If they could instead reference the data in Wikidata, then there's only one place to update the data, and all the other places that reference that data would automatically benefit from it.

    The query language is quite powerful as well. A simple demonstration was "list all the works of art in the same room in the Louvre as the Mona Lisa."

    It really got me thinking about how powerful open data is. How can we in Release Engineering publish our data so others can build new, interesting and useful tools on top of it?

  • Local web Various options for purely local web / networks were discussed. There are some interesting mesh network options available commotion was demo'ed. These kind of distributions give you file exchange, messaging, email, etc. on a local network that's not necessarily connected to the internet.

Using oatmeal in homebrew

I'm planning an Oatmeal Stout as my next brew. I've used oatmeal in previous batches, but have never really been happy with the results. They've lacked the mouthfeel I've been going after.

I've been using just regular rolled oats from the local bulk food store. Advice I've found online isn't clear whether you need to boil oats, or can just throw them into the mash, so I decided to do a little experiment last night!

I prepared two batches of oats. Into each batch went 750mL water, and 90g oatmeal.

For batch 1, I heated the water to 70C, tossed in the oats, stirred it a bit, and then let it sit for 20 minutes.

For batch 2, I brought the water to a boil, tossed in the oats, returned to a simmer, and then let it simmer for 10 minutes.

I strained a few spoonfuls of each through a mesh strainer.

It was immediately obvious that batch 2 was thicker. It had actually absorbed most of the water. The strained liquid was more opaque, and had a thicker consistency. According to my refractometer, it had a SG of 1.006.

By contrast, batch 1 was still pretty watery. The strained liquid was fairly thin, although still opaque. There was no noticeable increase in SG vs. tap water.

tl;dr Boiling my oats resulted in a much thicker sample, will be trying this in my next oatmeal stout!

RelEng 2015 (Part 2)

This the second part of my report on the RelEng 2015 conference that I attended a few weeks ago. Don't forget to read part 1!

I've put up links to the talks whenever I've been able to find them.

Defect Prediction

Davide Giacomo Cavezza gave a presentation about his research into defect prediction in rapidly evolving software. The question he was trying to answer was if it's possible to predict if any given commit is defective. He compared static models against dynamic models that adapt over the lifetime of a project. The models look at various metrics about the commit such as:

  • Number of files the commit is changing
  • Entropy of the changes
  • The amount of time since the files being changed were last modified
  • Experience of the developer

His simulations showed that a dynamic model yields superior results to a static model, but that even a dynamic model doesn't give sufficiently accurate results.

I wonder about adopting something like this at Mozilla. Would a warning that a given commit looks particularly risky be a useful signal from automation?

Continuous Deployment and Schema Evolution

Michael de Jong spoke about a technique for coping with SQL schema evolution in a continuously deployed applications. There are a few major issues with changing SQL schemas in production including:

  • schema changes typically block other reads/writes from occurring
  • application logic needs to be synchronized with the schema change. If the schema change takes non-trivial time, which version of the application should be running in the meanwhile?

Michael's solution was to essentially create forks of all tables whenever the schema changes, and to identify each by a unique version. Applications need to specify which version of the schema they're working with. There's a special DB layer that manages the double writes to both versions of the schema that are live at the same time.

Securing a Deployment Pipeline

Paul Rimba spoke about security deployment pipelines.

One of the areas of focus was on the "build server". He started with the observation that a naive implementation of a Jenkins infrastructure has the master and worker on the same machine, and so the workers have the opportunity to corrupt the state of the master, or of other job types' workspaces. His approach to hardening this part of the pipeline was to isolate each job type into its own docker worker. He also mentioned using a microservices architecture to isolate each task into its own discrete component - easier to secure and reason about.

We've been moving this direction at Mozilla for our build infrastructure as well, so it's good to see our decision corroborated!

Makefile analysis

Shurui Zhou presented some data about extracting configuration data from build systems. Her goal was to determine if all possible configurations of a system were valid.

The major problem here is make (of course!). Expressions such as $(shell uname -r) make static analysis impossible.

We had a very good discussion about build systems following this, and how various organizations have moved themselves away from using make. Somebody described these projects as requiring a large amount of "activation energy". I thought this was a very apt description: lots of upfront cost for little visible change.

However, people unanimously agreed that make was not optimal, and that the benefits of moving to a more modern system were worth the effort.

Yak Shaving

Benjamin Smedberg wrote about shaving yaks a few weeks ago. This is an experience all programmers (and probably most other professions!) encounter regularly, and yet it still baffles and frustrates us!

Just last week I got tangled up in a herd of yaks.

I started the week wanting to deploy the new bundleclone extension that :gps has been working on. If you're not familiar with bundleclone, it's a combination of server and client side extensions to mercurial that allow the server to advertise an externally hosted static bundle to clients requesting full clones. hg.mozilla.org advertises bundles hosted on S3 for certain repositories. This can significantly reduce load on the mercurial servers, something that has caused us problems in the past!

With the initial puppet patch written and sanity-checked, I wanted to verify it worked on other platforms before deploying it more widely.

Linux...looks ok. I was able to make some minor tweaks to existing puppet modules to make testing a bit easier as well. Always good to try and leave things in better shape for the next person!

OSX...looks ok...er, wait! All our test jobs are re-cloning tools and mozharness for every job! What's going on? I'm worried that we could be paying for unnecessary S3 bandwidth if we don't fix this.


I find bug 1103700 where we started deploying and relying on machine-wide caches of mozharness and tools. Unfortunately this wasn't done on OSX yet. Ok, time to file a new bug and start a new puppet branch to get these shared repositories working on OSX. The patch is pretty straightforward, and I start testing it out.

Because these tasks run at boot as part of runner, I need to be watching logs on the test machines very early on in the bootup process. Hmmm, the machine seems really sluggish when I'm trying to interact with it right after startup. A quick glance at top shows something fishy:


There's a system daemon called fseventsd taking a substantial amount of system resources! I wonder if this is causing any impact to build times. After filing a new bug, I do a few web searches to see if this service can be disabled. Most of the articles I find suggest to write a no_log file into the volume's .fseventsd directory to disable fseventds from writing out event logs for the volume. Time for another puppet branch!.

And that's the current state of this project! I'm currently running testing the no_log fix to see if it impacts build times, or if it causes other problems on the machines. I'm hoping to deploy the new runner tasks this week, and switch our buildbot-configs and mozharness configs to make use of them.

And finally, the first yak I started shaving could be the last one I finish shaving; I'm hoping to deploy the bundleclone extension this week as well.