Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.
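
The caching win is easy to see in miniature. Here's a toy pull-through cache in Python (a sketch of the idea only, not proxxy's actual implementation): the first request for a URL crosses the VPN to the origin, and every later request is served from the in-region cache.

```python
# Toy pull-through cache: the first request for a URL goes to the origin
# (over the VPN); later requests are served from the in-region cache.
# A sketch of the idea only -- not proxxy's actual implementation.

class PullThroughCache:
    def __init__(self, fetch_from_origin):
        self.fetch = fetch_from_origin  # callable: url -> bytes
        self.store = {}                 # url -> cached response body
        self.origin_hits = 0

    def get(self, url):
        if url not in self.store:
            self.origin_hits += 1
            self.store[url] = self.fetch(url)
        return self.store[url]

cache = PullThroughCache(lambda url: b"package for " + url.encode())

# Ten test machines download the same build package:
for _ in range(10):
    data = cache.get("https://ftp.example.com/firefox-28.0.tests.zip")

print(cache.origin_hits)  # 1 -- only one trip over the VPN link
```

Ten downloads, one trip to the origin: that's the whole trick, and it's why the same-file-many-hosts pattern above was such a good fit.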

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There's still some low-hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as routing uploads from build machines outside the VPN tunnel, or perhaps uploading directly to S3. Additionally, a big source of network traffic is signing various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication in the build and test packages between different platforms and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

B2G now building using unified sources

Last week, with the help of Ehsan and John, we finally enabled unified source builds for B2G devices.

As a result we're building device builds approximately 40% faster than before.

Between June 12th and June 17th, 50% of our successful JB emulator builds on mozilla-inbound finished in 97 minutes or less. Using unified sources for these builds reduced the 50th percentile of build times to 60 minutes (from June 19th to June 24th).

To mitigate the risks of changes landing that break non-unified builds, we're also doing periodic non-unified builds for these devices.

As usual, all our work here was done in the open. If you're interested, read along in bug 950676, and bug 942167.

Do you enjoy building, debugging and optimizing build, test & release pipelines? Great, because we're hiring!

First brew day

This weekend I finally took the plunge into home brewing. I've been interested in home brewing for a while, lurking on /r/homebrewing and other lists, putting together kit lists, revising kit lists, and learning a lot as I went.

brulosopher's post on /r/homebrewing finally convinced me to jump straight into all-grain brewing with BIAB rather than the normal beginner's method of extract brewing. To get started, I decided on this beginner BIAB kit from ontariobeerkegs.com. It was a bit more than I was hoping to spend initially, but I was having a really hard time putting together a reasonable kit for anything less.

Prep

The kit finally arrived last Thursday! I was really excited to open it up, and see all the shiny new equipment. It was all somewhat familiar from my months of research, but still excitingly new and real now that it was actually here.

equipment.png

In preparation for brew day, I cleaned out the kettle and fermenter buckets. I discovered a small leak in the ball valve on the kettle, and tightened up a little bit on the nuts holding it together. That slowed the leak down a bit, but I think I'll need to put some new teflon tape on it to stop it completely.

I used an old wooden dowel as a way to measure how much liquid was in the kettle. I filled up the kettle 2 quarts at a time, and used a knife to make a notch in the dowel for each additional gallon of liquid. This took a lot longer than I expected. 6 gallons is a lot of water!

On Friday night I filled up the kettle again with water from the tap, to let it sit overnight. I've read this helps to get rid of some of the chlorine in the water, but later wondered if it also allowed dissolved oxygen to escape.

I also tried putting the lid on the fermenter just to see how it worked. It was a real struggle to snap it on, and I could not for the life of me get it off again! I ended up tearing the strip off the base of the lid in order to remove the lid. Hopefully this doesn't increase risk of infection...

Brew Day

(May 24th, 2014) I was planning on getting started on Saturday afternoon, which should have given me plenty of time to get everything mashed, boiled, cooled and cleaned up. We were having such a great time at our family pilgrimage / picnic that we didn't get home until 6pm or so. No problem, I should only need 4 hours to brew, right? Onwards!

The kit came with a 5gal ingredient kit included, and I chose their American IPA. It comes with:

  • 10lb 2-row malt
  • 1lb Carafoam
  • 1lb Crystal 60L
  • 2oz Centennial Hops
  • 2oz Columbus Hops
  • 1pkg US-05 dry yeast

I kept notes as I went, and here's a summary of how the night went.

18:54   6.25gal water on high heat
19:49   water temp hit 150°F.
19:54   put in bag, grains
        heat off
        trying to keep temp at 150°F
        front dial thermometer shoots up to 170-180°F, while the probe thermometers read 145°F

At this point I wasn't really sure what to do... I didn't want to over-heat the mash, and I trusted my two probe thermometers more than the dial thermometer on the kettle. I wonder if the heat from the burner is affecting the dial thermometer?

mash.png
20:54   Turn heat on to begin mash out. Target 168°F
        Heat sparge water. ~0.5gal
21:06   Temperature @ 164°F according to probe; remove bag & grains to strainer in bowl on counter
        Dial thermometer was reading 180°F. Removed bag early because unsure of actual temp.

21:13   Drain, squeeze, pour sparge water over grain bag
        Pour collected runnings from bowl back into kettle

        I forgot to measure how much runnings I poured back in :\

21:15   I had about 5.8gal in the kettle. Took a sample:
        62°C 1.044 gravity = 1.052 gravity at 20°C according to online
        calculator
        Turn heat on high for boil
        Tasted sample after - tastes great!
21:52   Water boiling
hotbreak.png
21:54   Add 1oz columbus

22:45   50 minutes into boil. Checked gravity because it didn't look like
        volume had reduced enough. Looks like I still have ~5.75gal, but I
        had added more of the runnings from the grain bag since my initial
        pre-boil measurement. Again, something to fix in my process next
        time.
        Hydrometer reads 1.044 again at 70°C. I was a little disappointed
        since it seemed the gravity hadn't changed. Adjusted for
        temperature this works out to 1.062, which is the target gravity,
        so feeling better.
22:52   Add 1oz columbus
23:02   Turn heat off
        Add 2oz centennial
        Didn't take a volume reading at this point either. I assumed I
        could measure the amount in the fermenter, but ended up not
        transferring all the wort over...So this volume would have been
        good to have!
        Begin cooling wort
        kettle in sink with ice bath
        3x 1L frozen sanitized water bottles in wort
23:53   wort at 43°C
00:11   wort at 41°C
00:48   wort at 38°C
01:23   wort at 35°C
02:00   wort at 29°C
        too tired to wait longer!
        transferred to fermenter by pouring. Left hop sludge in
        kettle...not sure how much
        take gravity, volume, temp reading
        4.25gal 1.062 @ 29°C = 1.064 OG. Target was 1.062.
        pitch yeast dry
        put in airlock
02:05   move to basement
02:20   done cleanup; go to bed!
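
For the curious, the temperature corrections in the log above exist because hydrometers are calibrated near 20°C, so warm wort reads low. Here's a sketch in Python using one widely circulated polynomial fit (an assumption on my part; online calculators use slightly different curves, so results won't match my log exactly).

```python
# Hydrometers are calibrated near 20°C (68°F), so warm wort reads low.
# This uses one widely circulated polynomial fit (an assumption -- online
# calculators use slightly different curves). Temperatures are in °F.

def correct_gravity(measured, temp_f, calib_f=68.0):
    def density(t):
        return (1.00130346
                - 1.34722124e-4 * t
                + 2.04052596e-6 * t ** 2
                - 2.32820948e-9 * t ** 3)
    return measured * density(temp_f) / density(calib_f)

# The pre-boil sample: 1.044 measured at 62°C (143.6°F)
print(round(correct_gravity(1.044, 143.6), 3))
```

Calculators disagree more and more as the sample gets hotter, which is a good argument for cooling samples down before measuring.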

Did it work?

I must have checked the bucket 20 times the next day. I didn't see any airlock activity on Sunday at all. I peeked inside the bucket by removing the airlock, and it looked like there was a nice foam developing on top of the wort. I don't think the bucket has a great seal around the lid after removing the tab strip at the bottom, and I also noticed that the rubber stopper has a tendency to pop out of the hole in the lid, leaving a small gap. So CO2 could be escaping from a few places other than the airlock! I think I'd really prefer to ferment in something like a better bottle so I can see what's going on, and have an easier time keeping the airlock in place.

Much to my relief, this morning the airlock was bubbling a bit, and it looked like there was a nice krausen forming. It smells great!

What's next?

I'm hoping to bottle in a few weeks. The recipe says to leave in primary for 21 days.

There are definitely some parts of my process that need improvement. Because I didn't take accurate volume readings post-boil, including how much extra water I added and how much crud I left behind in the kettle, I don't have a good idea of my overall efficiency. I did hit the target gravity, though, which is what I was most worried about.

I still wonder if I should have poured the entire contents of the kettle into the fermenter. On the one hand I thought I didn't want all that crud clouding up my beer. On the other hand, I figured that an IPA should be sitting on hops for a long time...

I also meant to aerate the wort more in the fermenter by shaking it around for a few minutes. Again, I was pretty tired at 2am, and starting to worry about the wort being infected, so decided against anything fancy.

If I brew again, I'll definitely be buying or making a wort chiller. Waiting 3 hours to cool down to 29°C is insane, and is probably a bit on the warm side to be pitching yeast. My basement is a pretty consistent 19°C right now, which I hope is ok for fermenting. The bucket reads about 20-21°C. I've also considered building a fermentation chamber to control temperature.

Other things for next time would be getting some kind of wine thief or long turkey baster for drawing samples out of the wort for measuring. I used a Pyrex measuring cup to take a sample from the kettle right after the mash, and I just dipped the hydrometer sample jar into the fermenter bucket to take the final sample. Both of these ended up being messy, and I was worried about infecting the wort with my hands while removing the sample jar from the fermenting bucket.

All in all, I did have a lot of fun, even though I ended much later than I was planning. The beer smells great so far, and tasted great at each point along the way. I'm really hopeful the final result will be as tasty!

Limiting coalescing on the build/test farm

tl;dr - as of yesterday we've limited coalescing on all builds/tests to merge at most 3 pending jobs together

Coalescing (aka queue collapsing aka merging) has been part of Mozilla's build/test CI for a long, long time. Back in the days of Tinderbox, a single machine would do a checkout/build/upload loop. If there were more checkins while the build was taking place, well, they would get built on the next iteration through the loop.

Fast forward a few years to our move to buildbot, and having pools of machines all able to do the same builds. Now we create separate jobs in the queue for each build for each push. However, we didn't always have capacity to do all these builds in a reasonable amount of time, so we left buildbot's default behaviour (merging all pending jobs together) enabled for the majority of jobs. This means that if there are pending jobs for a particular build type, the first free machine skips all but the most recent item on the queue. The skipped jobs are "merged" into the job that was actually run.

In the case that all builds and tests are green, coalescing is actually a good thing most of the time. It saves you from doing a bunch of extra useless work.

However, not all pushes are perfect (just see how often the tree is closed due to build/test failures), and coalescing makes bisecting the failure very painful and time consuming, especially in the case that we've coalesced away intermediate build jobs.

To try and find a balance between capacity and sane results, we've recently added a limit to how many jobs can be coalesced at once.
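
The behaviour is simple enough to sketch (illustrative code only, not the actual buildbot patch): a free machine grabs at most 3 of the oldest pending jobs, runs the newest of those, and reports the rest as merged into it.

```python
# Coalescing with an upper bound: merge at most `limit` pending jobs into
# one run. Illustrative only -- not the actual buildbot patch.

def take_coalesced(pending, limit=3):
    batch = pending[:limit]     # oldest jobs, up to the coalescing limit
    remaining = pending[limit:]
    to_run = batch[-1]          # run the most recent job in the batch
    merged = batch[:-1]         # these are reported as merged into it
    return to_run, merged, remaining

queue = ["push1", "push2", "push3", "push4", "push5"]
to_run, merged, queue = take_coalesced(queue)
print(to_run, merged, queue)  # push3 ['push1', 'push2'] ['push4', 'push5']
```

With unlimited coalescing, push5 would have swallowed all four earlier pushes; with the limit, push4 and push5 stay queued, so a failure can still be bisected to within 3 pushes.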

By rigorous statistical analysis:

@catlee     so it's easiest to pick a single upper bound for coalescing and go with that at first
@catlee     did you have any ideas for what that should be?
@catlee     I was thinking 3
edmorley|sheriffduty        catlee: that sounds good to me as a first go :-)
mshal       chosen by fair dice roll? :)
@catlee     1d4
bhearsum    Saving throw failed. You are dead.
philor      wfm

we've chosen 3 as the upper bound on the number of jobs we'll coalesce, and we can tweak this as necessary.

I hope this makes the trees a bit more manageable! Please let us know what you think!

As always, all our work is done in the open. See the bug with the patch here: https://bugzilla.mozilla.org/show_bug.cgi?id=1008213

Digging into Firefox update sizes

Update sizes up 20-37% since last year!

Mozilla relies on our automatic update infrastructure to make sure that our users are kept up to date with the latest, most secure and fastest browser.

Having smaller updates means users are able to get the latest version of Firefox more quickly.

Since Firefox 19.0 (released just over a year ago - February 16th, 2013) our complete update size for Windows has grown from 25.6MB to 30.9MB for Firefox 28.0 (released March 15th, 2014). That's a 20% increase in just one year. On OSX it's grown from 37.8MB to 47.8MB, a 26% increase.

Partial updates have similarly grown. For Windows, a user coming from 27.0.1 to 28.0 would receive a 14.3MB update, compared to a 10.4MB update going from 18.0.2 to 19.0, an increase of 37.5%.

This means updates take users 20-37% longer to download than they did last year. Many of our users don't have fast, reliable internet, so an increase in update size makes it even harder for them to stay up to date. In addition, this size increase translates directly into bandwidth costs for Mozilla. All else being equal, we're now serving 20-37% more data from our CDNs for Firefox updates this year compared to last year.

update-sizes.png

Reducing update size

How can we reduce the update size? There are a few ways:

  1. Make sure we're serving partial updates to as many users as possible, and that these updates are applied properly. More analysis is needed, but it appears that only roughly half of our users are updating via partial updates.
  2. Reduce the amount of code we ship in the update. This could mean more features and content are distributed at runtime as needed. This is always a tricky trade-off between having features available for all users out of the box, and introducing a delay the first time a user wants to use a feature that requires remote content. It also adds complexity to the product.
  3. Change how we generate updates. This is going to be the focus of the rest of my post.

Smaller updates are more better

There are a few techniques I know of to reduce our update sizes:

  • Use xz compression instead of bzip2 compression inside the MAR files (bug 641212). xz generally gets better compression ratios at the cost of using more memory.
  • Use courgette instead of bsdiff for generating the binary diff between .exe and .dll files (bug 504624). Courgette is designed specifically for diffing executables, and generates much smaller patches than bsdiff.
  • Handle omni.ja files more effectively (bug 772868). omni.ja files are essentially zip files, and are using zip style compression. That means each member of the zip archive is individually compressed. This is problematic for two reasons: it makes generating binary diffs between omni.ja files much less effective, and it makes compressing the omni.ja file with bzip2 or xz less effective. You get far better results packing files into a zip file with 0 compression and using xz to compress it afterwards. Similarly for generating diffs, the diff between two omni.ja files using no zip compression is much smaller than the diff between two omni.ja files using the default zip -9 compression.
  • Don't use per-file compression inside the MAR file at all, rather compress the entire archive with xz. I simulated this by xz-compressing tar files of the MAR contents.
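
You can demonstrate the omni.ja effect in a few lines of Python: per-member zip compression makes the archive look like random data to xz, while storing members uncompressed (zip -0) lets xz exploit redundancy across the whole archive. This uses toy data, not a real omni.ja.

```python
# Demonstration of the omni.ja insight: individually-deflated zip members
# look like random data to xz, while stored (zip -0) members let xz exploit
# redundancy across the whole archive. Toy data, not real omni.ja contents.
import io
import lzma
import zipfile

# 20 similar-but-distinct "source files":
members = {
    "modules/file%d.js" % i:
        "".join("function f%d_%d() { return %d; }\n" % (i, j, j)
                for j in range(200)).encode()
    for i in range(20)
}

def zipped(compression):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression) as zf:
        for name, data in sorted(members.items()):
            zf.writestr(name, data)
    return buf.getvalue()

deflated_then_xz = len(lzma.compress(zipped(zipfile.ZIP_DEFLATED)))
stored_then_xz = len(lzma.compress(zipped(zipfile.ZIP_STORED)))

print(stored_then_xz < deflated_then_xz)  # True: zip -0 + xz wins
```

The same reasoning applies to binary diffs: two stored archives differ only where their contents differ, while two deflated archives can differ almost everywhere.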

27% smaller complete updates

complete-updates.png

We can see that using xz alone saves about 10.9%. There's not a big difference between xz -6 and xz -9e, only a few kb generally. ("xz -6" and "xz -9e" series in the chart)

Disabling zip compression in the omni.ja files and using the standard bzip2 compression saves about 9.7%. ("zip0 .ja" in the chart)

Combining xz compression with the above yields a 24.8% saving, which is 7.6MB. ("zip0 .ja, xz -9e" in the chart)

Finally, disabling zip compression for omni.ja and per-file compression and compressing the entire archive at once yields a 27.2% saving, or 8.4MB.

66% smaller partial updates

partial-updates.png

Very similar results here for xz, 8.4% savings with xz -9e.

Disabling zip compression in the omni.ja files has a much bigger impact for partial updates because we're able to produce a much smaller binary diff. This method alone saves 42%, or 6.0MB.

Using courgette for executable files yields a 19.1% savings. ("courgette" in the chart)

Combining courgette for executable files, no zip level compression, and per-file xz compression reduces the partial update by 61%. ("courgette, zip0 .ja, xz -9e" in the chart)

And if we compress the entire archive at once rather than per-file, we can reduce the update by 65.9%. ("courgette, zip0 .ja, tar, xz -9e" in the chart)

Some notes on my methodology: I'm focusing on 32-bit Windows, en-US only. Most of the optimizations, with the exception of courgette, are applicable to other platforms. I'm measuring the total size of the files inside the MAR file, rather than the size of the MAR file itself. The MAR file format overhead is small, only a few kilobytes, so this doesn't significantly impact the results, and significantly simplifies the analysis.

Finally, the code I used for analysis is here.

The amazing RelEng Jacuzzis

RelEng has jacuzzis???

On Tuesday, we enabled the first of our "jacuzzis" in production, and initial results look great.

A few weeks ago, Ben blogged about some initial experiments with running builds on smaller pools of machines ("hot tubs", get it? we've since renamed them as "jacuzzis"). His results confirmed glandium's findings on the effectiveness (or lack thereof!) of incremental builds on mozilla-inbound.

The basic idea behind smaller pools of workers is that by restricting which machines are eligible to run jobs, you get much faster incremental builds since you have more recent checkouts, object dirs, etc.

Additionally, we've made some improvements to how we initialize mock environments. We don't reset and re-install packages into the mock chroot if the previous package list is the same as the new package list. This works especially well with jacuzzis, as we can arrange for machines to run jobs with the same package lists.
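
The package-list check is simple. A sketch (names like `reset_chroot` are made up for illustration, this isn't our actual mock tooling): skip the expensive chroot reset when the requested packages match what's already installed.

```python
# Skip the expensive mock chroot reset when the requested package list
# matches what's already installed. Illustrative only -- the helper names
# (reset_chroot, install) are made up.

def ensure_mock_env(requested, state, reset_chroot, install):
    if sorted(requested) == sorted(state.get("installed", [])):
        return "reused"            # same packages: keep the chroot as-is
    reset_chroot()
    install(requested)
    state["installed"] = list(requested)
    return "rebuilt"

state = {}
log = []
ensure_mock_env(["gcc", "glibc-devel"], state,
                lambda: log.append("reset"), lambda pkgs: log.append("install"))
result = ensure_mock_env(["glibc-devel", "gcc"], state,
                         lambda: log.append("reset"), lambda pkgs: log.append("install"))
print(result, log)  # reused ['reset', 'install'] -- second build skips the reset
```

Jacuzzis make the cheap path the common path: machines pinned to one build type keep seeing the same package list, so the chroot is almost always reusable.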

On Tuesday we enabled jacuzzis for some build types on b2g-inbound: hamachi device builds, and opt/debug ICS emulator builds.

jacuzzi-results.png

We've dropped build times by more than 50%!

The spikes earlier this morning look like they're caused by running on fresh spot instances. When spot nodes first come online, they have no previous state, and so their first builds will always be slower. The machines should stay up most of the day, so you really only have to pay this cost once per day.

For the B2G emulator builds, this means we're getting tests started earlier and therefore get much faster feedback as to the quality of recent patches.

I'm super happy with these results. What's next? Well, turning on MOAR JACUZZIS for MOAR THINGS! Additionally, having fewer build types per machine means our disk footprint is much lower, and we should be able to use local SSDs for builds on AWS.

As usual, all our work has a tracking bug: bug 970738

Implementation details

There are three major pieces involved in pulling this together: the jacuzzi allocations themselves, buildbot integration, and finally AWS integration.

Jacuzzis

Ben has published the allocations here: http://jacuzzi-allocator.pub.build.mozilla.org/v1/

Each builder (or job type) has a specific list of workers associated with it. Ben has been working on ways of automatically managing these allocations so we don't need to tune them by hand too much.

Buildbot

The bulk of the work required was in buildbot. Previous to jacuzzis, we had several large pools of workers, each capable of doing any one of hundreds of different job types. Each builder in buildbot has each of the workers in a pool listed as able to do that job. We wanted to avoid having to reconfigure buildbot every time we needed to change jacuzzi allocations, which is why we decided to put the allocations in an external service.

There are two places where buildbot fetches the allocation data: nextSlave functions and our builder prioritizing function. The first is straightforward, and was the only place I was expecting to make changes. These nextSlave functions are called with a list of available machines, and the function's job is to pick one of the connected machines to do the job. It was relatively simple to add support for this to buildbot. The need to handle the latter case, modifying our builder prioritization, didn't become apparent until testing... The reasoning is a bit convoluted, so I'll explain below for those interested.
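
A jacuzzi-aware nextSlave function is easy to sketch (names are illustrative, not buildbot's actual API): restrict the choice to the builder's allocated workers, and fall back to the whole pool for builders without a jacuzzi.

```python
# Sketch of a jacuzzi-aware nextSlave: only pick workers from the builder's
# allocated set; builders without an allocation can use any free worker.
# Builder and worker names here are made up for illustration.

allocations = {
    "b2g_hamachi_build": {"bld-linux64-ec2-301", "bld-linux64-ec2-302"},
}

def next_slave(builder, available):
    allocated = allocations.get(builder)
    if allocated is None:
        return available[0] if available else None  # no jacuzzi: any worker
    for worker in available:
        if worker in allocated:
            return worker
    return None  # allocated workers all busy: leave the job pending

free = ["bld-linux64-ec2-001", "bld-linux64-ec2-302"]
print(next_slave("b2g_hamachi_build", free))  # bld-linux64-ec2-302
print(next_slave("other_build", free))        # bld-linux64-ec2-001
```

Returning None when no allocated worker is free is the important bit: the job waits for its jacuzzi rather than leaking onto the general pool.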

AWS Support

Now that we had buildbot using the right workers, we needed to make sure that we were starting those workers in Amazon!

We adjusted our tools that monitor the job queue and start new EC2 instances to also check the jacuzzi allocations, and start the correct instances.

The gory details of build prioritizations

We have a function in buildbot which handles a lot of the prioritization of the job queue. For example, pending jobs for mozilla-central will get priority over jobs for any of the twigs, like ash or birch. Older jobs also tend to get priority over newer jobs. The function needs to return the list of builders in priority sorted order. Buildbot then processes each builder in turn, trying to assign pending jobs to any idle workers.

There are two factors which make this function complicated: each buildbot master is doing this prioritization independently of the others, and workers are becoming idle while buildbot is still processing the sorted list of builders. This second point caused prioritization to be broken (bug 659222) for a long time.

Imagine a case where you have 3 pending jobs (A, B, C), all for the same set of workers (W1, W2, W3). Job A is the most important, job C is the least. All the workers are busy. prioritizeBuilders has sorted our list of builders, and buildbot looks at A first. No workers are available, so it moves onto B next. Still no free workers. But now worker W1 connects, and buildbot examines job C, and finds there are available workers (W1). So job C cuts in line and gets run before jobs A or B.

Our fix for this was to maintain a list of pending jobs for each set of available workers, and then discard all but the most important pending job for each worker set. In our example, we would see that all 3 pending jobs have the same worker set (W1, W2, W3), and so would temporarily ignore pending jobs B and C. This leaves buildbot only job A to consider. In our example above, it would find no available workers and stop processing. When W1 becomes available, it triggers another prioritization run, and again job A is the sole job under consideration and gets the worker.
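
The "one job per worker set" trick looks something like this (illustrative code, not the actual patch). Note how a job type with its own jacuzzi worker set keeps its own entry, which is exactly the interplay that caused trouble.

```python
# Group pending jobs by the set of workers that can run them, keeping only
# the most important job per group. Illustrative, not the buildbot patch.

def prioritize(pending):
    """pending: (priority, job, workers) tuples; lower = more important."""
    best = {}
    for priority, job, workers in sorted(pending, key=lambda p: p[0]):
        best.setdefault(frozenset(workers), job)  # keep most important only
    return list(best.values())

# All three jobs share one worker set, so only A survives this round:
shared = {"W1", "W2", "W3"}
print(prioritize([(1, "A", shared), (2, "B", shared), (3, "C", shared)]))
# ['A']

# With jacuzzis, job D has its own worker set, so it stays in play:
print(prioritize([(1, "A", shared), (2, "B", shared), (4, "D", {"W4"})]))
# ['A', 'D']
```

Without the per-worker-set grouping, D would have been discarded as "lower priority" even though only its own workers could ever run it, which is the infinite loop described below.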

Unfortunately, this behaviour conflicted with what we were trying to do with jacuzzis. Imagine in the same example above we have jacuzzis allocated so that W1 is allocated to only do jobs of type C. If W1 is the only available worker, and C is getting discarded each time the prioritization is done, we're not making any forward progress. In fact, we've triggered a bit of an infinite loop, since currently we trigger another round of prioritizing/job assignments if we have available workers and have temporarily discarded lower priority jobs.

The fix was to integrate the jacuzzi allocations into this prioritization logic as well. I'm a little concerned about the runtime impact of this, since we need to query the allocated workers for every pending job type. One thing we're considering is to change the allocator to return the allocations as a single monolithic blob, rather than having per-job-type requests. Or, we could support both types.

AWS, networks, and burning trees

Help! The trees are on fire!

You may have noticed that the trees were closed quite a bit last week due to infrastructure issues that all stem from network flakiness. We're really sorry about that! RelEng and IT have been working very hard to stabilize the network.

Symptoms of the problem have been pretty varied, but the two most visible ones are:

  • bug 957502 - All trees closed due to timeouts as usw2 slaves download or clone
  • bug 964804 - jobs not being scheduled

We've been having more and more problems with our VPN tunnel to our Amazon infrastructure, particularly in the us-west-2 region. Prior to any changes we put in place last week, ALL traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center.

While debugging the scheduling bug early last week, we discovered that latency to our mysql database used for scheduling was nearly 500ms. No wonder scheduling jobs was slow!

Digging deeper - the network is in bad shape

Investigation of this latency revealed that one of the core firewalls deployed for RelEng traffic was overloaded, and that a major contributor to the firewall load was traffic to/from AWS over the VPN tunnels. We were consistently pushing around 1 gigabit/s to our private network inside Amazon. The extra load on the firewall required for the VPN encryption caused the latency to go up for all traffic passing through the firewall, even for traffic not bound for AWS!

Our next step was to do a traffic analysis to see how we could reduce the amount of traffic going over the VPN tunnel.

Michal was able to get the traffic analysis done, and his report indicated that the worst traffic offender was traffic coming from ftp.m.o. All of our test jobs pull the builds, tests and crash symbols from ftp. With the increase in push load, more types of jobs, and more tests, our traffic from ftp has really exploded in recent months. Because all of our traffic was going over the VPN tunnel, this put a huge load on the VPN component of the firewall.

Firefighting

Since all of the content on ftp is public, we decided we could route traffic to/from ftp over the public internet rather than our VPN tunnel, provided we could ensure file integrity at the same time. This required a few changes to our infrastructure:

  • Rail re-created all of our EC2 instances to have public IP addresses, in addition to the private IP addresses they already had. Without a public IP, Amazon can't route traffic to the public internet. You can set up extra instances to act as NAT gateways, but that's much more complicated and introduces another point of failure. (bug 965001)
  • We needed a new IP endpoint for ftp so that we could be sure that only SSL traffic was going over the public routes. Chris Turra set up ftp-ssl.m.o, and then I changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel (bug 965289).
  • I landed a change to mozharness to download files from ftp-ssl instead of ftp.

In addition, we also looked at other sources of traffic. The next highest source of traffic was hg.m.o, followed by pvtbuilds.m.o.

Ben quickly rolled out a fix to our test slaves to allow them to cache the gaia repository locally rather than re-cloning gaia each time (bug 964411). We were surprised to discover the gaia repository has grown to 1.2 GB, so this was a big win!

It was clear we would need to divert traffic to hg in a similar way we did for ftp. Unfortunately, adding a DNS/IP endpoint for hg isn't as simple as for ftp, so aki has been going through our code changing all our references of http://hg.mozilla.org to https://hg.mozilla.org (bug 960571). Once we've found and fixed all usages of unsecured access to hg, we can change the routing for hg traffic like we did for FTP.

Aki also patched up some of our FirefoxOS build configs to limit which device builds upload per-checkin (bug 966412), and reduce the amount of data we're sending back to pvtbuilds.m.o over the VPN tunnel.

Ted tracked down a regression in how we were packaging some files in our test zips which cut the size by about 100MB (bug 963651).

On Monday, Adam added some more capacity to the firewall, which should allow us to cope with the remaining load better.

State of the network

This week we're in much better shape than we were last week. If you look at traffic this week vs last week, you can see that traffic is down significantly (see far right of graph):

aws-traffic.png

and latency has been kept under control:

aws-ping.png

We're not out of the woods yet though - we're still seeing occasional traffic spikes close to our maximum. The good news is there's still more we can do to reduce or divert the traffic:

  • we're not done transitioning all FTP/HG traffic to public routes
  • there's still plenty of room to reduce test zip size by splitting them up and changing the compression format used (bug 917999)
  • we can use caching in S3 from the test machines to avoid having to download identical builds/tests from FTP multiple times

Extra special thanks also to Hal who has been keeping all of us focused and on track on this issue.

Blobber is live - upload ALL the things!

Last week without any fanfare, we closed bug 749421 - Allow test slaves to save and upload files somewhere. This has actually been working well for a few months now; it's just taken a while to close it out properly, and I completely failed to announce it anywhere. mea culpa!

This was a really important project, and deserves some fanfare! cue trumpets, parades and skywriters

The goal of this project was to make it easier for developers to get important data out of the test machines reporting to TBPL. Previously the only real output from a test job was the textual log. That meant if you wanted a screen shot from a failing test, or the dump from a crashing process, you needed to format it somehow into the log. For screen shots we would base64 encode a png image and print it to the log as a data URL!

With blobber successfully running now, it's now possible to upload extra files from your test runs on TBPL. Things like screen shots, minidump logs and zip files are already supported.

Getting new files uploaded is pretty straightforward. If the environment variable MOZ_UPLOAD_DIR is set in your test's environment, you can simply copy files there and they will be uploaded after the test run is complete. Links to the files are output in the log. e.g.

15:21:18     INFO -  (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
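From the test's point of view that's really all there is to it - copy the file into the directory and blobber handles the rest. An illustrative sketch (hypothetical helper name; any language works the same way):

```python
import os
import shutil

def save_artifact(path):
    """Copy a file into MOZ_UPLOAD_DIR so it gets uploaded after the run.

    If the variable isn't set (e.g. the job type doesn't support uploads
    yet), skip quietly rather than failing the test.
    """
    upload_dir = os.environ.get("MOZ_UPLOAD_DIR")
    if not upload_dir:
        return None
    dest = os.path.join(upload_dir, os.path.basename(path))
    shutil.copy(path, dest)
    return dest
```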

Your thanks and praise should go to our awesome intern, Mihai Tabara, who did most of the work here.

Most test jobs are already supported; if you're unsure whether the job type you're interested in is supported, just search for MOZ_UPLOAD_DIR in the log on tbpl. If it's not there and you need it, please file a bug!

Valgrind now running per-push

This week we started running valgrind jobs per push (bug 946002) on mozilla-central and project branches (bug 801955).

We've been running valgrind jobs nightly on mozilla-central for years, but they were very rarely green. Few people looked at their results, and so they were simply ignored.

Recently Nicholas Nethercote has taken up the torch and put in a lot of hard work to get these valgrind jobs working again. They're now running successfully per-push on mozilla-central and project branches and on Try.

Thanks Nicholas! Happy valgrinding all!

Now using AWS Spot instances for tests

Release Engineering makes heavy use of Amazon's EC2 infrastructure. The vast majority of our Firefox builds for Linux and Android, as well as our Firefox OS builds, happen in EC2. We also do a ton of unit testing inside EC2.

Amazon offers a service inside EC2 called spot instances. This is a way for Amazon to sell off unused capacity by auction. You place a bid for how much you're willing to pay per hour for a particular type of VM, and if your bid is above the current market price, you get a cheap VM! For example, we're able to run tests on machines for $0.025/hr instead of the regular $0.12/hr on-demand price. We started experimenting with spot instances back in November.
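To put those prices in perspective, here's the back-of-the-envelope math (prices from above; the pool size is made up purely for illustration):

```python
ON_DEMAND = 0.12  # $/hr, regular on-demand price quoted above
SPOT = 0.025      # $/hr, typical spot price we're paying

def monthly_cost(price_per_hr, instances, hours_per_day=24, days=30):
    """Cost of running a fixed-size pool around the clock for a month."""
    return price_per_hr * instances * hours_per_day * days

# Hypothetical pool of 200 test machines running 24/7.
demand = monthly_cost(ON_DEMAND, 200)
spot = monthly_cost(SPOT, 200)
print(f"on-demand: ${demand:,.0f}/mo, spot: ${spot:,.0f}/mo "
      f"({100 * (1 - SPOT / ON_DEMAND):.0f}% cheaper)")
```

At those rates, spot capacity is roughly 79% cheaper than on-demand for the same number of machine-hours.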

There are a few downsides to using spot instances however. One is that your VM can be killed at any moment if somebody else bids a higher price than yours. The second is that your instances can't (easily) use extra EBS volumes for persistent storage. Once the VM powers off, the storage is gone.

These issues posed different challenges for us. In the first case, we were worried about the impact that interrupted jobs would have on the tree. We wanted to avoid the case where a job was run on a spot instance, interrupted when the market price changed, and then retried on another spot instance subject to the same termination. This required changes to two systems:

  • aws_watch_pending needed to know to start regular on-demand EC2 instances in the case of retried jobs. This has been landed and has been working well, but really needs the next piece to be complete.
  • buildbot needed to know to not pick a spot instance to run retried jobs. This work is being tracked in bug 936222. It turns out that we're not seeing too many spot instances being killed off due to market price [1], so this work is less urgent.
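The policy both changes implement boils down to one decision (a hypothetical sketch of the logic described above, not the actual aws_watch_pending/buildbot code):

```python
def pick_instance_pool(job_retries, spot_available=True):
    """Decide which EC2 pool should run a job.

    Retried jobs go to on-demand instances, so a job that was already
    killed by a spot price change can't be stranded a second time.
    Fresh jobs prefer cheap spot capacity when it's available.
    """
    if job_retries > 0:
        return "on-demand"
    return "spot" if spot_available else "on-demand"
```

First attempts land on spot instances; anything being retried gets the more expensive but uninterruptible pool.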

The second issue, the VM storage, turned out to be much more complicated to fix. We rely on puppet to make sure that VMs have consistent software packages and configuration files. Puppet requires per-host SSL certificates to be generated, and at Mozilla these certificates need to be signed by a central certificate authority. In our previous usage of EC2 we worked around this by puppetizing new instances on first boot and saving the disk image for later use.

With spot instances, we essentially need to re-puppetize every time we create a new VM.

Having fresh storage on boot also impacts the type of jobs we can run. We're starting with running test jobs on spot instances, since there's no state from previous tests that is valuable for the next test.

Builds are more complicated, since we depend on the state of previous builds to have faster incremental builds. In addition, the cost of having to retry a build is much higher than it is for a test. It could be that the spot instances stay up long enough or that we end up clobbering frequently enough that this won't have a major impact on build times. Try builds are always clobbers though, so we'll be running try builds on spot instances shortly.

All this work is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=935683

Big props to Rail for getting this done so quickly. With all this great work, we should be able to scale better while reducing our costs.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=935533#c21