Posts about aws

RelEng Retrospective - Q1 2015

RelEng had a great start to 2015. We hit some major milestones on projects like Balrog and were able to turn off some old legacy systems, which is always an extremely satisfying thing to do!

We also made some exciting new changes to the underlying infrastructure, got some projects off the drawing board and into production, and drastically reduced our test load!

Firefox updates

Balrog

All Firefox update queries are now being served by Balrog! Earlier this year, we switched all Firefox update queries off of the old update server, aus3.mozilla.org, to the new update server, codenamed Balrog.

Already, Balrog has enabled us to be much more flexible in handling updates than the previous system. As an example, in bug 1150021, the About Firefox dialog was broken in the Beta version of Firefox 38 for users with RTL locales. Once the problem was discovered, we were able to quickly disable updates just for those users until a fix was ready. With the previous system it would have taken many hours of specialized manual work to disable the updates for just these locales, and to make sure they didn't get updates for subsequent Betas.

Once we were confident that Balrog was able to handle all previous traffic, we shut down the old update server (aus3). aus3 was also one of the last systems relying on CVS (!! I know, rite?). It's a great feeling to be one step closer to axing one more old system!

Funsize

When we started the quarter, we had an exciting new plan for generating partial updates for Firefox in a scalable way.

Then we threw out that plan and came up with an EVEN MOAR BETTER plan!

The new architecture for funsize relies on Pulse for notifications about new nightly builds that need partial updates, and uses TaskCluster for doing the generation of the partials and publishing to Balrog.

The current status of funsize is that we're using it to generate partial updates for nightly builds, but we're not yet publishing them to the regular nightly update channel.

There's lots more to say here...stay tuned!

FTP & S3

Brace yourselves... ftp.mozilla.org is going away...

...in its current incarnation at least.

Expect to hear MUCH more about this in the coming months.

tl;dr is that we're migrating as much of the Firefox build/test/release automation to S3 as possible.

The existing machinery behind ftp.mozilla.org will be going away near the end of Q3. We have some ideas of how we're going to handle migrating existing content, as well as handling new content. You should expect that you'll still be able to access nightly and CI Firefox builds, but you may need to adjust your scripts or links to do so.

Currently most builds and tests do their transfers to/from S3 via the TaskCluster index, in addition to doing parallel uploads to ftp.mozilla.org. We're aiming to shut off most uploads to ftp this quarter.

Please let us know if you have particular systems or use cases that rely on the current host or directory structure!

Release build promotion

Our new Firefox release pipeline got off the drawing board, and the initial proof-of-concept work is done.

The main idea here is to take an existing build based on a push to mozilla-beta, and to "promote" it to a release build. Promotion therefore needs to generate all the l10n repacks and partner repacks, generate partial updates, publish files to CDNs, etc.

The big win here is that it cuts our time-to-release nearly in half, and also simplifies our codebase quite a bit!

Again, expect to hear more about this in the coming months.

Infrastructure

In addition to all those projects in development, we also tackled quite a few important infrastructure projects.

OSX test platform

10.10 is now the most widely used Mac platform for Firefox, and it's important to test what our users are running. We performed a rolling upgrade of our OS X testing environment, migrating from 10.8 to 10.10 while spending nearly zero capital, and with no downtime. We worked jointly with the Sheriffs and A-Team to green up all the tests, and shut coverage off on the old platform as we brought it up on the new one. We have a few 10.8 machines left riding the trains that will join our 10.10 pool with the release of ESR 38.1.

Got Windows builds in AWS

We saw the first successful builds of Firefox for Windows in AWS this quarter as well! This paves the way for greater flexibility, on-demand burst capacity, faster developer prototyping, and disaster recovery and resiliency for Windows Firefox builds. We'll be working on making these virtualized instances more performant and on supporting large-scale automation before we roll them out into production.

Puppet on Windows

RelEng uses puppet to manage our Linux and OS X infrastructure. Presently, we use a very different tool chain, Active Directory and Group Policy Objects, to manage our Windows infrastructure. This quarter we deployed a prototype Windows build machine which is managed with puppet instead. Our goal here is to increase visibility and hackability of our Windows infrastructure. A common deployment tool will also make it easier for RelEng and community to deploy new tools to our Windows machines.

New Tooltool Features

We've redesigned and deployed a new version of tooltool, the content-addressable store for large binary files used in build and test jobs. Tooltool is now integrated with RelengAPI and uses S3 as a backing store. This gives us scalability and a more flexible permissioning model that, in addition to serving public files, will allow the same access outside the releng network as inside. That means that developers as well as external automation like TaskCluster can use the service just like Buildbot jobs. The new implementation also boasts a much simpler HTTP-based upload mechanism that will enable easier use of the service.
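To make "content-addressable" concrete, here is a minimal sketch of the idea, not tooltool's actual implementation: each file is stored under the digest of its own content, so identical files are stored once and every download can be verified. The use of SHA-512, the `LocalStore` class and the local-directory backing store are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def digest_of(data: bytes) -> str:
    """Return the hex digest that serves as the content address (SHA-512 assumed)."""
    return hashlib.sha512(data).hexdigest()


class LocalStore:
    """Toy content-addressable store: the file name is the digest of the content."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = digest_of(data)
        path = os.path.join(self.root, key)
        if not os.path.exists(path):  # identical content is stored only once
            with open(path, "wb") as fh:
                fh.write(data)
        return key

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as fh:
            data = fh.read()
        if digest_of(data) != key:  # the address doubles as an integrity check
            raise ValueError("content does not match its digest")
        return data


# Usage: store a blob, then fetch it back by its digest.
store = LocalStore(tempfile.mkdtemp())
key = store.put(b"example toolchain archive")
assert store.get(key) == b"example toolchain archive"
```

Because the key is derived from the content, a job only needs to record the digest to name exactly the file it wants, wherever the bytes are actually stored.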

Centralized POSIX System Logging

Using syslogd/rsyslogd and Papertrail, we've set up centralized system logging for all our POSIX infrastructure. Now that all our system logs are going to one location and we can see trends across multiple machines, we've been able to quickly identify and fix a number of previously hard-to-discover bugs. We're planning on adding additional logs (like Windows system logs) so we can do even greater correlation. We're also in the process of adding more automated detection and notification of some easily recognizable problems.

Security work

Q1 included some significant effort to avoid serious security exploits like GHOST, escalation of privilege bugs in the Linux kernel, etc. We manage 14 different operating systems, some of which are fairly esoteric and/or no longer supported by the vendor, and we worked to backport some code and patches to some platforms while upgrading others entirely. Because of the way our infrastructure is architected, we were able to do this with minimal downtime or impact to developers.

API to manage AWS workers

As part of our ongoing effort to automate the loaning of releng machines when required, we created an API layer to facilitate the creation and loan of AWS resources, which was previously, and perhaps ironically, one of the bigger time-sinks for buildduty when loaning machines.

Cross-platform worker for TaskCluster

Release engineering is in the process of migrating from our stalwart, buildbot-driven infrastructure to a newer, more purpose-built solution in TaskCluster. Many FirefoxOS jobs have already migrated, but those all conveniently run on Linux. In order to support the entire range of release engineering jobs, we need support for Mac and Windows as well. In Q1, we created what we call a "generic worker," essentially a base class that allows us to extend TaskCluster job support to non-Linux operating systems.

Testing

Last, but not least, we deployed initial support for SETA, the search for extraneous test automation!

This means we've stopped running all tests on all builds. Instead, we use historical data to determine which tests have been catching the most regressions, and we run those on every build. Other tests are run less frequently.
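One plausible way to turn that history into a list of tests is a greedy set cover: keep picking the job that caught the most not-yet-covered regressions until every past regression is caught by some picked job. This is only a sketch of the general idea and is not claimed to be SETA's actual algorithm; the function name, job names and history data are all hypothetical.

```python
from typing import Dict, List, Set


def pick_high_value_jobs(caught_by: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick jobs until every past regression is caught by a picked job."""
    uncovered = {reg for reg, jobs in caught_by.items() if jobs}
    all_jobs = sorted({job for jobs in caught_by.values() for job in jobs})
    picked: List[str] = []
    while uncovered:
        # Choose the job that catches the most still-uncovered regressions.
        best = max(
            all_jobs,
            key=lambda job: sum(1 for reg in uncovered if job in caught_by[reg]),
        )
        picked.append(best)
        uncovered = {reg for reg in uncovered if best not in caught_by[reg]}
    return picked


# Hypothetical history: which jobs failed for each past regression.
history = {
    "regression-1": {"mochitest-1", "xpcshell"},
    "regression-2": {"mochitest-1"},
    "regression-3": {"reftest", "xpcshell"},
    "regression-4": {"xpcshell"},
}
high_value = pick_high_value_jobs(history)
print(high_value)
```

In this made-up history, `reftest` never caught anything that another job did not also catch, so it is a candidate for running less frequently.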

Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.
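A toy model shows why this access pattern caches so well. It is not how proxxy is implemented; the `CachingFetcher` class, the fake file server and the URL are hypothetical stand-ins. Only the first request for a package crosses the link to the file server, and every later request is served from the regional cache.

```python
from typing import Callable, Dict


class CachingFetcher:
    """Serve repeated requests for the same URL from a regional cache."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        self.fetch_upstream = fetch_upstream
        self.cache: Dict[str, bytes] = {}
        self.upstream_bytes = 0  # bytes that actually crossed the VPN link

    def get(self, url: str) -> bytes:
        if url not in self.cache:
            body = self.fetch_upstream(url)  # cache miss: go to the file server
            self.upstream_bytes += len(body)
            self.cache[url] = body
        return self.cache[url]


def fake_file_server(url: str) -> bytes:
    """Stand-in for a file server returning a 1000-byte build package."""
    return b"x" * 1000


fetcher = CachingFetcher(fake_file_server)
for _ in range(20):  # twenty test machines request the same package
    fetcher.get("http://fileserver.example/build/package.zip")
print(fetcher.upstream_bytes)
```

Twenty downloads cost one upstream transfer here; without the cache the same requests would have moved twenty copies over the link.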

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There is still some low-hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is doing signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages across different platforms and even across different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

Blobber is live - upload ALL the things!

Last week, without any fanfare, we closed bug 749421 - Allow test slaves to save and upload files somewhere. This has actually been working well for a few months now; it's just taken a while to close it out properly, and I completely failed to announce it anywhere. Mea culpa!

This was a really important project, and deserves some fanfare! cue trumpets, parades and skywriters

The goal of this project was to make it easier for developers to get important data out of the test machines reporting to TBPL. Previously the only real output from a test job was the textual log. That meant if you wanted a screen shot from a failing test, or the dump from a crashing process, you needed to format it somehow into the log. For screen shots we would base64 encode a png image and print it to the log as a data URL!
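For a sense of what that workaround looked like, here is a minimal sketch of turning PNG bytes into a data URL suitable for printing to a log. The helper name is hypothetical, and the 8-byte PNG signature stands in for a real screenshot.

```python
import base64


def png_data_url(png_bytes: bytes) -> str:
    """Encode PNG bytes as a data URL that can be printed on a single log line."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded


# The PNG file signature stands in for a real screenshot here.
fake_png = b"\x89PNG\r\n\x1a\n"
url = png_data_url(fake_png)
print(url)
```

Base64 turns every 3 bytes into 4 characters, so each screenshot embedded this way made the log about a third larger than the image itself.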

With blobber successfully running, it's now possible to upload extra files from your test runs on TBPL. Things like screen shots, minidump logs and zip files are already supported.

Getting new files uploaded is pretty straightforward. If the environment variable MOZ_UPLOAD_DIR is set in your test's environment, you can simply copy files there and they will be uploaded after the test run is complete. Links to the files are output in the log. e.g.

15:21:18     INFO -  (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
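From a test harness, using this can be as small as the sketch below. The helper name `save_for_upload` is hypothetical; only the MOZ_UPLOAD_DIR environment variable comes from the description above. When the variable is not set (for example, on a local run) the helper does nothing.

```python
import os
import shutil
from typing import Optional


def save_for_upload(path: str) -> Optional[str]:
    """Copy `path` into MOZ_UPLOAD_DIR if it is set; return the copy's path."""
    upload_dir = os.environ.get("MOZ_UPLOAD_DIR")
    if not upload_dir:
        return None  # nothing will be uploaded in this environment
    os.makedirs(upload_dir, exist_ok=True)
    return shutil.copy(path, upload_dir)
```

A harness could call `save_for_upload("screenshot.png")` from its failure handler; the upload itself happens after the test run is complete.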

Your thanks and praise should go to our awesome intern, Mihai Tabara, who did most of the work here.

Most test jobs are already supported; if you're unsure whether the job type you're interested in is supported, just search for MOZ_UPLOAD_DIR in the log on TBPL. If it's not there and you need it, please file a bug!

Behind the clouds: how RelEng do Firefox builds on AWS

RelEng have been expanding our usage of Amazon's AWS over the past few months as the development pace of the B2G project increases. In October we began moving builds off of Mozilla-only infrastructure and into a hybrid model where some jobs are done in Mozilla's infra, and others are done in Amazon. Since October we've expanded into 3 Amazon regions, and now have nearly 300 build machines in Amazon. Within each AWS region we've distributed our load across 3 availability zones.

That's great! But how does it work?

Behind the scenes, we've written quite a bit of code to manage our new AWS infrastructure. This code is in our cloud-tools repo (github|hg.m.o) and uses the excellent boto library extensively. The two workhorses in there are aws_watch_pending and aws_stop_idle. aws_stop_idle's job is pretty easy: it goes around looking at EC2 instances that are idle and shuts them off safely. If an EC2 slave hasn't done any work in more than 10 minutes, it is shut down. aws_watch_pending is a little more involved. Its job is to notice when there are pending jobs (like your build waiting to start!) and to resume EC2 instances. We take a few factors into account when starting up instances:
  • We wait until a pending job is more than a minute old before starting anything. This allows in-house capacity to grab the job if possible, and other EC2 slaves that are online but idle also have a chance to take it.
  • Use any reserved instances first. As our AWS load stabilizes, we've been able to purchase some reserved instances to reduce our cost. Obviously, to reduce our cost, we have to use those reservations wherever possible! The code to do this is a bit more complicated than I'd like it to be since AWS reservations are specific to individual availability zones rather than whole regions.
  • Some regions are cheaper than others, so we prefer to start instances in the cheaper regions first.
  • Start instances that were most recently running. This should both give better depend-build times and help slightly with billing. Amazon bills for complete hours. So if you start one instance twice in an hour, you're charged for a single hour. If you start two instances once in the hour, you're charged for two hours.
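
The factors above can be sketched as two thresholds and a sort key. This is a simplified illustration rather than the actual cloud-tools code: the `Instance` fields, the prices and the choice to treat the factors as a strict priority order (reserved first, then cheaper region, then most recently running) are assumptions.

```python
from dataclasses import dataclass
from typing import List

MIN_PENDING_AGE_S = 60   # only react to jobs pending for more than a minute
MAX_IDLE_S = 10 * 60     # stop slaves that have been idle for more than 10 minutes


@dataclass
class Instance:
    name: str
    reserved: bool        # covered by a reservation in its availability zone
    region_price: float   # hypothetical hourly price for its region
    last_stopped: float   # timestamp of when it last stopped running


def should_start(oldest_pending_age_s: float) -> bool:
    """Give in-house and already-running slaves a minute to take the job first."""
    return oldest_pending_age_s > MIN_PENDING_AGE_S


def should_stop(idle_s: float) -> bool:
    return idle_s > MAX_IDLE_S


def start_order(stopped: List[Instance]) -> List[Instance]:
    """Order candidates: reserved, then cheaper region, then most recently running."""
    return sorted(
        stopped,
        key=lambda inst: (not inst.reserved, inst.region_price, -inst.last_stopped),
    )


pool = [
    Instance("a", reserved=False, region_price=0.10, last_stopped=100.0),
    Instance("b", reserved=True, region_price=0.12, last_stopped=50.0),
    Instance("c", reserved=False, region_price=0.10, last_stopped=200.0),
    Instance("d", reserved=False, region_price=0.12, last_stopped=300.0),
]
order = [inst.name for inst in start_order(pool)]
print(order)
```

Here the reserved instance `b` is started first even though its region costs more, and among the unreserved instances in the cheaper region, `c` goes before `a` because it was running more recently.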

Overall we're really happy with Amazon's services. Having APIs for nearly everything has made development really easy.

What's next?

Seeing as how test capacity is always woefully behind, we're hoping to be able to run a large number of our Linux-based unittests on EC2, particularly those that don't require an accelerated graphics context. After that? Maybe Windows builds? Maybe automated regression hunting? What do you want to see?