Skip to main content

Archive of mozilla-inbound builds for regression hunting

Nick Thomas, of recent Balrog fame, has also been hard at work improving life for our intrepid regression hunters.

Sometimes we don't detect problems in Firefox until many weeks or months after code has landed on mozilla-inbound. We do builds for almost every checkin that happens, but end up having to delete them after a month to keep disk usage under control. Unfortunately that means that if a problem is detected while in the Beta cycle or even after release, we're only left with nightly builds for regression hunting, and so the regression window can be up to 24 hours. A lot can change in 24 hours!

Nick's hard work means we are keeping a full release cycle's worth of mozilla-inbound builds archived on S3. We're focused on mozilla-inbound for now, but I imagine we'll be adding other integration branches shortly.

The archive is currently available at http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/

Please send Nick your thanks, and if you run into any issues, please file bugs blocking bug 765258

Dealing with bursty load

John has been doing a regular series of posts about load on our infrastructure, going back years. Recently, GPS also posted about infrastructure efficiency here. What I've always found interesting is the bursty nature of the load at different times throughout the day, so thought people might find the following data useful.

Every time a developer checks in code, we schedule a huge number of build and test jobs. Right now on mozilla-central every checkin generates just over 10 CPU days worth of work. How many jobs do we end up running over the course of the day? How much time to they take to run?

I created a few graphs to explore these questions - these get refreshed hourly, so should serve as a good dashboard to monitor recent load on an ongoing basis.

# of jobs running over time

running jobs This shows how many jobs are running at any given hour over the past 7 days

# of cpu hours run over time

compute times This shows how many CPU hours were spent for jobs that started for any given hour over the same time period.

A few observations about the range in the load:

  • Weekends can be really quiet! Our peak weekday load is about 20x the lowest weekend load.
  • Weekday load varies by about 2x within any given day.

Over the years RelEng have scaled our capacity to meet peak demand. After all, we're trying to give the best experience to developers; making people wait for builds or tests to start is frustrating and can also really impact the productivity of our development teams, and ultimately the quality of Firefox/Fennec/B2G.

Scaling physical capacity takes a lot of time and advance planning. Unfortunately this means that many of our physical machines are idle for non-peak times, since there isn't any work to do. We've been able to make use of some of this idle time by doing things like run fuzzing jobs for our security team.

Scaling quickly for peak load is something that Amazon is great for, and we've been taking advantage of scaling build/test capacity in EC2 for nearly a year. We've been able to move a significant amount of our Linux build and test infrastructure to run inside Amazon's EC2 infrastructure. This means we can start machines in response to increasing load, and shut them down when they're not needed. (Aside - not all of our reporting tools know about this, which can cause some reporting confusion because it appears as if the slaves are broken or not taking work when in reality they're shut off because of low demand - we're working on fixing that up!)

Hopefully this data and these dashboards help give some context about why our infrastructure isn't always 100% busy 100% of the time.

The Brendan Voyage

A few weeks ago I finished reading The Brendan Voyage by Tim Severin. I actually stayed up late into the night to finish, I really couldn't put the book down.

It's an amazing story of Tim Severin's recreation of the 6th century voyage of St. Brendan from Ireland to North America. Yes, 6th century! The tale of the original voyage is told in the Voyage of Saint Brendan, which dates back to around 900AD.

The book begins with the author's quest to re-create the original boat as closely as possible to what would have been available to a 6th century Irish monk, also following details recorded in the original text. It turns out that this means a boat made out of wood with a leather hull. This is followed by much discussion and research into how well a leather boat could possibly survive years at sea, if at all! I found it really fascinating how Severin finds over and over again how traditional methods and materials actually perform quite well in the harsh environment of the North Atlantic. In fact, over the course of his voyage much of his more modern equipment breaks down with the constant exposure to salt water, where the traditional materials fare quite well.

Severin's planned voyage begins from the coast of Ireland, then continues north past Scotland, passing by the Faroe Islands on the way to Iceland. I found the passages describing the life of people living on these remote islands fascinating. It seems impossible that 6th century Irish monks had sought out and established monasteries on many of these lonely pillars of stone in the sea.

Tim Severin makes a really convincing case that not only was Saint Brendan's voyage possible, but that he wasn't the first Irish sailor to visit Iceland, Greenland, and possibly even North America.

The book was recommended on an episode of the Catholic Stuff You Should Know podcast, The First Saint in North America

I gave it 5 stars on goodreads. Highly, highly recommended book.

Switched to Nikola

Static sites have the advantage of having a much smaller footprint to attack, and also much easier to understand and control, so I thought I'd give it a try. This blog is now being created with Nikola!

New on TBPL - hot pink

jcranmer noticed a funny new colour on TBPL last week - hot pink!

This colour indicates jobs that have been cancelled manually. It's an important distinction to make between these jobs and jobs that have failed due to build or test or infrastructure problems.

It took a long time, but I finally finished deploying bug 704006 - add a new status for "user cancelled" last week.

With the help of fabric, I was able to script the upgrade so that we didn't require a downtime. Each master was first disabled in slavealloc so that slaves would connect to different masters after a reboot. Next I gracefully shut down the master. Once the master was shut down cleanly, buildbot was updated to pick up the new change, and the master started back up again. Each master could take several hours to shut down depending on what jobs were running, and we have quite a few masters now, so this whole process took most of the week to complete. It's awesome to be able to do this kind of thing without a downtime though....it makes me feel a bit like a ninja :P

Big thanks to ewong for the initial patch, and making sure it got accepted upstream as well!

Experiments with shared git repositories

I've been working on git support for Mozilla's build infrastructure over the past while.

RelEng's current model for creating our build machines is to create a homogeneous set of machines, each one capable of doing any type of build. This means that build machines can be doing a debug mozilla-central linux64 build one minute, and then an opt mozilla-b2g18 panda build the next. This model has worked well for us for scalability and load balancing: machines run on branches where there's currently activity, and we can focus our efforts on overall capacity. It's a huge improvement over the Firefox 2.x and 3.0 days where we had only one or two dedicated machines per branch!

However, there are challenges when using this model. We need to be reasonably efficient with disk space since the same disk is shared between all types of builds across all branches. We also need to make build setup time as quick as possible.

For mercurial based builds we've developed a utility called hgtool that manages shared local copies of repositories that are then locally cloned into each specific build directory. By using hg's share extension, we are able to save quite a bit of disk space and also have efficient pulling/cloning operations. However, because of the way Mozilla treats branches in hg (where each branch is in a separate repo), we need to keep each local clone of gecko branches separate from each other. We have a separate local clone of mozilla-central and another of mozilla-inbound. (although, maybe we could use bookmarks to keep them distinct in the same local repo...hmmm...)

What about git?

git's branches are quite different from mercurial's. Everything depends on "refs", which are names that refer to specific commit ids. It's actually possible to pull in completely unrelated git repositories into the same local repository. As long as you keep the refs from different repositories separate, all the commits and metadata coexist happily. In contrast, mercurial uses metadata on each commit to distinguish branches from each other. Most commits in the various gecko repositories are on the "default" branch. This makes it harder to combine mozilla-central and mozilla-inbound into the same repository without using some other means to keep each repository's "default" branch separate.

This gave me a crazy idea: what if we kept one giant shared git repository on the build machines that contained ALL THE REPOS.

combine ALL THE REPOS!

We have a script called gittool that is the analog of hgtool. I modified it to support b2g-style xml manifests, and pointed it at a local shared repository and let it loose. It works!

But...

I noticed that the more repositories were cloned into the shared repository, the longer it took to clone new ones. Repositories that normally take 5 seconds to clone were now taking 5 minutes to fetch the first time into the shared repository. What was going on?

One clue was that there are over 144,000 refs across all the b2g repositories. The vast majority of these are tags...there are only 12k non-tag refs.

I created a test script to create new repositories and measure how long it took to fetch them into a shared repository. The most significant factor affecting fetch speed was the number of refs! The new repositories each had 40 commits, and I created 100 of them. If I created one tag per commit then fetching a new repository into the shared repo added 0.0028s. However if I created 10 tags per commit then fetching the new repositories into the shared repo added 0.0113s per repo.

For now I've given up on the idea to have giant local shared repositories, and have moved to a model that keeps each repository separate, like we do with hg. Hopefully we can revisit this later to avoid having multiple copies of related repositories cloned on the same machine.

Now testing Firefox in AWS

You may have noticed some new tests appearing on TBPL this week: new tests

The Ubuntu32/64 test platforms have been enabled on most branches this week. Wherever we've had consistent green results on the new test platforms, we've disabled running those tests on the older Fedora32/64 platforms. Currently these are crashtests, jsreftests and marionette tests. We're working closely with the awesome A*Team to migrate over any remaining test suites that make sense. As always, if you notice anything that looks like it's not working right, please let us know - filing a bug is the best way. We expect there to be differences between the test platforms and therefore in some test results. We're committed to tracking down what those differences are so we can make sure the new test machines continue to give us good test results.

A lot of what we do in RelEng flies under the radar. When we're doing our jobs well, most of the time you shouldn't notice changes we make!

I wanted to highlight this change in particular, because it's a HUGE win for test scalability. If needed, we're able to add more Ubuntu test machines in a matter of minutes. And the more tests we can move over to this new pool of test machines, the more we can improve turnaround time on the overloaded Fedora slaves.

Rail deserves most of the credit for this awesome work, so send kudos and/or beer his way :)

Behind the clouds: how RelEng do Firefox builds on AWS

RelEng have been expanding our usage of Amazon's AWS over the past few months as the development pace of the B2G project increases. In October we began moving builds off of Mozilla-only infrastructure and into a hybrid model where some jobs are done in Mozilla's infra, and others are done in Amazon. Since October we've expanded into 3 amazon regions, and now have nearly 300 build machines in Amazon. Within each AWS region we've distributed our load across 3 availability zones.

That's great! But how does it work?

Behind the scenes, we've written quite a bit of code to manage our new AWS infrastructure. This code is in our cloud-tools repo (github|hg.m.o) and uses the excellent boto library extensively.

The two work horses in there are aws_watch_pending and aws_stop_idle. aws_stop_idle's job is pretty easy, it goes around looking at EC2 instances that are idle and shuts them off safely. If an EC2 slave hasn't done any work in more than 10 minutes, it is shut down.

aws_watch_pending is a little more involved. Its job is to notice when there are pending jobs (like your build waiting to start!) and to resume EC2 instances. We take a few factors into account when starting up instances:

  • We wait until a pending job is more than a minute old before starting anything. This allows in-house capacity to grab the job if possible, and other EC2 slaves that are online but idle also have a chance to take it.
  • Use any reserved instances first. As our AWS load stabilizes, we've been able to purchase some reserved instances to reduce our cost. Obviously, to reduce our cost, we have to use those reservations wherever possible! The code to do this is a bit more complicated than I'd like it to be since AWS reservations are specific to individual availability zones rather than whole regions.
  • Some regions are cheaper than others, so we prefer to start instances in the cheaper regions first.
  • Start instances that were most recently running. This should give both better depend-build time, and also helps with billing slightly. Amazon bills for complete hours. So if you start one instance twice in an hour, you're charged for a single hour. If you start two instances once in the hour, you're charged for two hours.

Overall we're really happy with Amazon's services. Having APIs for nearly everything has made development really easy.

What's next?

Seeing as how test capacity is always woefully behind, we're hoping to be able to run a large number of our linux-based unittests on EC2, particularly those that don't require an accelerated graphics context.

After that? Maybe windows builds? Maybe automated regression hunting? What do you want to see?

RelEng - we do builds, releases and updates!

Mozilla's Release Engineering team is responsible for making sure that our products get built, released and updated properly.

If you're working on a project which changes how or what we ship, or requires changes in how updates work, please get in touch with us as early in your project as possible. We can work with you to find the best solutions to your release/update problems, and save everybody a lot of last minute panic.

Firefox builds in the cloud!

Did you know that since last Thursday we've been doing the majority of our Linux Firefox builds in Amazon?

John Hopkins posted about this last week, but I wanted to highlight how important this is for Release Engineering. We can now scale up the number of Linux build machines in proportion to load. If there are no builds happening, great! Shut off the machines! If there are builds pending, we can start up more machines within minutes.

Migrating to these new build systems means that we can now convert excess in-house Linux build capacity into additional Windows build capacity.

Very shortly we'll be looking at running certain unit test suites in this environment as well.

For all the gory details, follow along in bug 772446.