Blobber is live - upload ALL the things!

Last week, without any fanfare, we closed bug 749421 - Allow test slaves to save and upload files somewhere. This has actually been working well for a few months now; it's just taken a while to close it out properly, and I completely failed to announce it anywhere. Mea culpa!

This was a really important project, and it deserves some fanfare! (Cue trumpets, parades and skywriters.)

The goal of this project was to make it easier for developers to get important data out of the test machines reporting to TBPL. Previously the only real output from a test job was the textual log. That meant if you wanted a screenshot from a failing test, or the dump from a crashing process, you had to squeeze it into the log somehow. For screenshots we would base64-encode a PNG image and print it to the log as a data URL!
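For the curious, that old workaround amounted to something like this (a rough sketch of the idea, not the actual harness code):

    import base64

    # Old hack: embed a PNG in the textual log as a data URL so it could
    # be recovered from TBPL later.
    with open("screenshot.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    print("SCREENSHOT: data:image/png;base64," + encoded)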

With blobber up and running, it's now possible to upload extra files from your test runs on TBPL. Things like screenshots, minidumps and zip files are already supported.

Getting new files uploaded is pretty straightforward. If the environment variable MOZ_UPLOAD_DIR is set in your test's environment, you can simply copy files there and they will be uploaded after the test run is complete. Links to the files show up in the log, e.g.:

15:21:18     INFO -  (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
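In practice, from inside a test or harness, it's as simple as this (a minimal sketch; the file name and the existence check are just for illustration):

    import os
    import shutil

    # MOZ_UPLOAD_DIR is set by the harness when blobber uploads are enabled.
    upload_dir = os.environ.get("MOZ_UPLOAD_DIR")
    if upload_dir:
        # Anything copied here is uploaded after the test run completes,
        # and a link to it is printed in the log.
        shutil.copy("screenshot.png", upload_dir)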

Your thanks and praise should go to our awesome intern, Mihai Tabara, who did most of the work here.

Most test jobs are already supported; if you're unsure whether the job type you're interested in is supported, just search for MOZ_UPLOAD_DIR in the log on TBPL. If it's not there and you need it, please file a bug!

Behind the clouds: how RelEng does Firefox builds on AWS

RelEng has been expanding its use of Amazon's AWS over the past few months as the development pace of the B2G project increases. In October we began moving builds off of Mozilla-only infrastructure and into a hybrid model where some jobs run on Mozilla's own infra and others run in Amazon. Since then we've expanded into three Amazon regions, and now have nearly 300 build machines in Amazon. Within each AWS region we've distributed our load across three availability zones.

That's great! But how does it work?

Behind the scenes, we've written quite a bit of code to manage our new AWS infrastructure. This code is in our cloud-tools repo (github|hg.m.o) and uses the excellent boto library extensively.

The two workhorses in there are aws_watch_pending and aws_stop_idle. aws_stop_idle's job is pretty easy: it goes around looking at EC2 instances that are idle and shuts them down safely. If an EC2 slave hasn't done any work in more than 10 minutes, it gets shut down.
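The core of that loop looks roughly like this (a simplified sketch using the boto 2 EC2 API; the slave_is_idle() check against the buildbot masters is a hypothetical stand-in for the real logic):

    import boto.ec2

    def slave_is_idle(instance):
        # Hypothetical helper: ask buildbot whether this slave has done
        # any work in the last 10 minutes.
        raise NotImplementedError

    def stop_idle(region="us-east-1"):
        conn = boto.ec2.connect_to_region(region)
        for reservation in conn.get_all_instances():
            for instance in reservation.instances:
                if instance.state == "running" and slave_is_idle(instance):
                    # Graceful stop; aws_watch_pending can resume the
                    # instance later when jobs are pending.
                    instance.stop()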

aws_watch_pending is a little more involved. Its job is to notice when there are pending jobs (like your build waiting to start!) and to resume EC2 instances to handle them. We take a few factors into account when starting up instances (there's a rough sketch of the selection logic after the list):

  • We wait until a pending job is more than a minute old before starting anything. This gives in-house capacity a chance to grab the job, and also lets EC2 slaves that are already online but idle pick it up.
  • We use any reserved instances first. As our AWS load has stabilized, we've been able to purchase some reserved instances to reduce our costs - and of course those reservations only save money if we actually use them. The code for this is a bit more complicated than I'd like, since AWS reservations are tied to individual availability zones rather than whole regions.
  • Some regions are cheaper than others, so we prefer to start instances in the cheaper regions first.
  • We start the instances that were most recently running. This should give better depend-build times (the previous build's object directories are still on disk), and it also helps slightly with billing: Amazon bills for complete hours, so if you start one instance twice in an hour you're charged for a single hour, whereas starting two instances once each in that hour costs two hours.
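Putting those rules together, the instance-selection step boils down to a sort over the stopped candidates. Here's a hedged sketch of the idea; the price table, the reserved_zones bookkeeping and the choose_instances() name are all illustrative, not the actual cloud-tools code:

    # Illustrative per-hour prices only; cheaper regions sort first.
    REGION_PRICES = {
        "us-east-1": 0.45,
        "us-west-2": 0.45,
        "us-west-1": 0.50,
    }

    def choose_instances(candidates, reserved_zones, count):
        # candidates: stopped boto Instance objects.
        # reserved_zones: availability zone -> number of unused reservations.
        # Most recently running first (launch_time is an ISO-8601 string,
        # so lexical order matches chronological order).
        ordered = sorted(candidates,
                         key=lambda i: i.launch_time or "", reverse=True)
        # Stable sort: unused reservations first, then cheaper regions.
        ordered.sort(key=lambda i: (
            0 if reserved_zones.get(i.placement, 0) > 0 else 1,
            REGION_PRICES.get(i.region.name, 999),
        ))
        return ordered[:count]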

Overall we're really happy with Amazon's services. Having APIs for nearly everything has made development really easy.

What's next?

Seeing as how test capacity is always woefully behind, we're hoping to be able to run a large number of our Linux-based unittests on EC2, particularly those that don't require an accelerated graphics context.

After that? Maybe Windows builds? Maybe automated regression hunting? What do you want to see?