
Posts about mozilla (old posts, page 7)

Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.
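For the curious, the client-side logic amounts to trying a per-region cache first and falling back to the origin server if that fails. Here's a minimal sketch of that idea; the cache hostname, the backend list, and the URL-mapping scheme below are illustrative, not proxxy's actual configuration.

    # Minimal sketch (not the actual proxxy client code): prefer a per-region
    # HTTP cache for known backends, and fall back to the origin on any error.
    # The cache host, backend list and URL mapping here are illustrative only.
    import urllib2
    from urlparse import urlparse

    CACHE_HOST = "proxxy1.srv.releng.use1.mozilla.com"  # hypothetical name
    CACHED_BACKENDS = ("ftp.mozilla.org", "tooltool.pvt.build.mozilla.org")

    def fetch(url):
        """Try the regional cache first, then fall back to the origin."""
        host = urlparse(url).netloc
        if host in CACHED_BACKENDS:
            cache_url = url.replace(host, "%s/%s" % (CACHE_HOST, host), 1)
            try:
                return urllib2.urlopen(cache_url, timeout=30).read()
            except Exception:
                pass  # cache unavailable; just use the origin below
        return urllib2.urlopen(url, timeout=30).read()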

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There are a few more low hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as routing uploads from build machines outside the VPN tunnel, or perhaps uploading directly to S3. Additionally, a big source of network traffic is signing various packages (gpg signatures, MAR files, etc.), and we're looking at ways to do that more efficiently. I'd also love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages across different platforms, and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

B2G now building using unified sources

Last week, with the help of Ehsan and John, we finally enabled unified source builds for B2G devices.

As a result we're building device builds approximately 40% faster than before.

Between June 12th and June 17th, 50% of our successful JB emulator builds on mozilla-inbound finished in 97 minutes or less. Using unified sources for these builds reduced the 50th percentile of build times down to 60 minutes (from June 19th to June 24th).

To mitigate the risks of changes landing that break non-unified builds, we're also doing periodic non-unified builds for these devices.

As usual, all our work here was done in the open. If you're interested, read along in bug 950676, and bug 942167.

Do you enjoy building, debugging and optimizing build, test & release pipelines? Great, because we're hiring!

Limiting coalescing on the build/test farm

tl;dr - as of yesterday we've limited coalescing on all builds/tests to merge at most 3 pending jobs together

Coalescing (aka queue collapsing aka merging) has been part of Mozilla's build/test CI for a long, long time. Back in the days of Tinderbox, a single machine would do a checkout/build/upload loop. If there were more checkins while the build was taking place, well, they would get built on the next iteration through the loop.

Fast forward a few years to our move to buildbot, with pools of machines all able to do the same builds. Now we create a separate job in the queue for each build of each push. However, we didn't always have the capacity to do all these builds in a reasonable amount of time, so we left buildbot's default behaviour (merging all pending jobs together) enabled for the majority of jobs. This means that if there are pending jobs for a particular build type, the first free machine skips all but the most recent item on the queue. The skipped jobs are "merged" into the job that was actually run.

In the case that all builds and tests are green, coalescing is actually a good thing most of the time. It saves you from doing a bunch of extra useless work.

However, not all pushes are perfect (just see how often the tree is closed due to build/test failures), and coalescing makes bisecting the failure very painful and time consuming, especially in the case that we've coalesced away intermediate build jobs.

To try and find a balance between capacity and sane results, we've recently added a limit to how many jobs can be coalesced at once.

By rigorous statistical analysis:

@catlee     so it's easiest to pick a single upper bound for coalescing and go with that at first
@catlee     did you have any ideas for what that should be?
@catlee     I was thinking 3
edmorley|sheriffduty        catlee: that sounds good to me as a first go :-)
mshal       chosen by fair dice roll? :)
@catlee     1d4
bhearsum    Saving throw failed. You are dead.
philor      wfm

we've chosen 3 as the upper bound on the number of jobs we'll coalesce, and we can tweak this as necessary.
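The actual change lives in our buildbot configs (see the bug below), but the policy itself is easy to describe: take a builder's pending requests oldest-first, and never fold more than three of them into a single build. A toy illustration, independent of buildbot's APIs:

    # Toy illustration of the coalescing cap (not the real buildbot hook):
    # group a builder's pending requests, oldest first, into batches of at
    # most MAX_COALESCE. Each batch becomes one build of the newest revision
    # in the batch; the older requests in the batch are recorded as merged.
    MAX_COALESCE = 3

    def coalesce(pending, max_coalesce=MAX_COALESCE):
        """pending: revisions with pending jobs, oldest first.
        Returns (revision_to_build, merged_away_revisions) batches."""
        batches = []
        for i in range(0, len(pending), max_coalesce):
            batch = pending[i:i + max_coalesce]
            batches.append((batch[-1], batch[:-1]))
        return batches

    # Five pending pushes now become two builds instead of one:
    print(coalesce(["r1", "r2", "r3", "r4", "r5"]))
    # [('r3', ['r1', 'r2']), ('r5', ['r4'])]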

I hope this makes the trees a bit more manageable! Please let us know what you think!

As always, all our work is done in the open. See the bug with the patch here: https://bugzilla.mozilla.org/show_bug.cgi?id=1008213

Digging into Firefox update sizes

Update sizes up 20-37% since last year!

Mozilla relies on our automatic update infrastructure to make sure that our users are kept up to date with the latest, most secure and fastest browser.

Having smaller updates means users are able to get the latest version of Firefox more quickly.

Since Firefox 19.0 (released just over a year ago - February 16th, 2013) our complete update size for Windows has grown from 25.6MB to 30.9MB for Firefox 28.0 (released March 15th, 2014). That's a 20% increase in just one year. On OSX it's grown from 37.8MB to 47.8MB, a 26% increase.

Partial updates have similarly grown. For Windows, a user coming from 27.0.1 to 28.0 would receive a 14.3MB update, compared to a 10.4MB update going from 18.0.2 to 19.0 - an increase of 37.5%.

This means it's taking users 20-37% longer to download updates than it did last year. Many of our users don't have fast reliable internet, so an increase in update size makes it even harder for them to stay up to date. In addition, this size increase translates directly into bandwidth costs to Mozilla. All else being equal, we're now serving 20-37% more data from our CDNs for Firefox updates this year compared to last year.
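For reference, those percentages fall straight out of the sizes quoted above:

    # The increases quoted above, computed from the update sizes (in MB):
    def pct_increase(old, new):
        return (new - old) / old * 100.0

    print(round(pct_increase(25.6, 30.9), 1))  # Windows complete: 20.7
    print(round(pct_increase(37.8, 47.8), 1))  # OSX complete:     26.5
    print(round(pct_increase(10.4, 14.3), 1))  # Windows partial:  37.5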

update-sizes.png

Reducing update size

How can we reduce the update size? There are a few ways:

  1. Make sure we're serving partial updates to as many users as possible, and that these updates are applied properly. More analysis is needed, but it appears that only roughly half of our users are updating via partial updates.

  2. Reduce the amount of code we ship in the update. This could mean more features and content are distributed at runtime, as needed. This is always a tricky trade-off between having features available for all users out of the box, and introducing a delay the first time a user wants to use a feature that requires remote content. It also adds complexity to the product.

  3. Change how we generate updates. This is going to be the focus of the rest of my post.

Smaller updates are more better

There are a few techniques I know of to reduce our update sizes:

  • Use xz compression instead of bzip2 compression inside the MAR files (bug 641212). xz generally gets better compression ratios at the cost of using more memory.

  • Use courgette instead of bsdiff for generating the binary diff between .exe and .dll files (bug 504624). Courgette is designed specifically for diffing executables, and generates much smaller patches than bsdiff.

  • Handle omni.ja files more effectively (bug 772868). omni.ja files are essentially zip files, and are using zip style compression. That means each member of the zip archive is individually compressed. This is problematic for two reasons: it makes generating binary diffs between omni.ja files much less effective, and it makes compressing the omni.ja file with bzip2 or xz less effective. You get far better results packing files into a zip file with 0 compression and using xz to compress it afterwards. Similarly for generating diffs, the diff between two omni.ja files using no zip compression is much smaller than the diff between two omni.ja files using the default zip -9 compression.

  • Don't use per-file compression inside the MAR file at all; instead, compress the entire archive with xz. I simulated this by xz-compressing tar files of the MAR contents. (A rough sketch of how these ideas fit together follows this list.)
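Here's a rough sketch of the combined omni.ja idea from the list above: rewrite the archive with store-only members, then let xz compress (or bsdiff diff) the whole thing at once. This is illustrative only, not the real update-packaging code, and the file paths are placeholders.

    # Rough sketch (not Mozilla's packaging code): repack a zip-style archive
    # with no per-member compression, then compress the whole file with xz.
    # xz then sees one large, redundant stream instead of many small,
    # already-compressed members, so both compression and binary diffs improve.
    import subprocess
    import zipfile

    def repack_stored(src="omni.ja", dst="omni-stored.ja"):
        """Rewrite a zip-style archive so every member is stored uncompressed."""
        with zipfile.ZipFile(src) as zin:
            with zipfile.ZipFile(dst, "w", zipfile.ZIP_STORED) as zout:
                for name in zin.namelist():
                    zout.writestr(name, zin.read(name))
        return dst

    subprocess.check_call(["xz", "-9e", "--keep", repack_stored()])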

27% smaller complete updates

complete-updates.png

We can see that using xz alone saves about 10.9%. There's not a big difference between xz -6 and xz -9e - generally only a few kilobytes. ("xz -6" and "xz -9e" series in the chart)

Disabling zip compression in the omni.ja files and using the standard bzip2 compression saves about 9.7%. ("zip0 .ja" in the chart)

Combining xz compression with the above yields a 24.8% saving, which is 7.6MB. ("zip0 .ja, xz -9e" in the chart)

Finally, disabling both zip compression for omni.ja and per-file compression, and compressing the entire archive at once, yields a 27.2% saving, or 8.4MB.

66% smaller partial updates

partial-updates.png

Very similar results here for xz, 8.4% savings with xz -9e.

Disabling zip compression in the omni.ja files has a much bigger impact for partial updates because we're able to produce a much smaller binary diff. This method alone saves 42%, or 6.0MB.

Using courgette for executable files yields a 19.1% savings. ("courgette" in the chart)

Combining courgette for executable files, no zip level compression, and per-file xz compression reduces the partial update by 61%. ("courgette, zip0 .ja, xz -9e" in the chart)

And if we compress the entire archive at once rather than per-file, we can reduce the update by 65.9%. ("courgette, zip0 .ja, tar, xz -9e" in the chart)

Some notes on my methodology: I'm focusing on 32-bit Windows, en-US only. Most of the optimizations, with the exception of courgette, are applicable to other platforms. I'm measuring the total size of the files inside the MAR file, rather than the size of the MAR file itself. The MAR file format overhead is small, only a few kilobytes, so this doesn't significantly impact the results, and significantly simplifies the analysis.

Finally, the code I used for analysis is here.

The amazing RelEng Jacuzzis

RelEng has jacuzzis???

On Tuesday, we enabled the first of our "jacuzzis" in production, and initial results look great.

A few weeks ago, Ben blogged about some initial experiments with running builds on smaller pools of machines ("hot tubs", get it? we've since renamed them as "jacuzzis"). His results confirmed glandium's findings on the effectiveness (or lack thereof!) of incremental builds on mozilla-inbound.

The basic idea behind smaller pools of workers is that by restricting which machines are eligible to run jobs, you get much faster incremental builds since you have more recent checkouts, object dirs, etc.

Additionally, we've made some improvements to how we initialize mock environments. We don't reset and re-install packages into the mock chroot if the previous package list is the same as the new package list. This works especially well with jacuzzis, as we can arrange for machines to run jobs with the same package lists.
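A simplified sketch of that optimization (illustrative only; the real logic lives in our build scripts, and the cache-file path here is made up):

    # Simplified sketch: only reset and repopulate the mock chroot when the
    # requested package list differs from what the previous job installed.
    # STATE_FILE is a hypothetical location, not what production uses.
    import json
    import os
    import subprocess

    STATE_FILE = "/builds/mock-last-packages.json"

    def setup_mock(chroot, packages):
        previous = []
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                previous = json.load(f)
        if sorted(packages) != sorted(previous):
            subprocess.check_call(["mock", "-r", chroot, "--init"])
            subprocess.check_call(["mock", "-r", chroot, "--install"] + packages)
            with open(STATE_FILE, "w") as f:
                json.dump(packages, f)
        # Otherwise the existing chroot already matches; reuse it as-is.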

On Tuesday we enabled jacuzzis for some build types on b2g-inbound: hamachi device builds, and opt/debug ICS emulator builds.

jacuzzi-results.png

We've dropped build times by more than 50%!

The spikes earlier this morning look like they're caused by running on fresh spot instances. When spot nodes first come online, they have no previous state, and so their first builds will always be slower. The machines should stay up most of the day, so you really only have to pay this cost once per day.

For the B2G emulator builds, this means we're getting tests started earlier and therefore get much faster feedback as to the quality of recent patches.

I'm super happy with these results. What's next? Well, turning on MOAR JACUZZIS for MOAR THINGS! Additionally, having fewer build types per machine means our disk footprint is much lower, and we should be able to use local SSDs for builds on AWS.

As usual, all our work has a tracking bug: bug 970738

Implementation details

There are three major pieces involved in pulling this together: the jacuzzi allocations themselves, buildbot integration, and finally AWS integration.

Jacuzzis

Ben has published the allocations here: http://jacuzzi-allocator.pub.build.mozilla.org/v1/

Each builder (or job type) has a specific list of workers associated with it. Ben has been working on ways of automatically managing these allocations so we don't need to tune them by hand too much.

Buildbot

The bulk of the work required was in buildbot. Prior to jacuzzis, we had several large pools of workers, each capable of doing any one of hundreds of different job types. Each builder in buildbot lists every worker in a pool as able to do that job. We wanted to avoid having to reconfigure buildbot every time we needed to change jacuzzi allocations, which is why we decided to put the allocations in an external service.

There are two places where buildbot fetches the allocation data: nextSlave functions and our builder prioritization function. The first is straightforward, and was the only place I was expecting to make changes. These nextSlave functions are called with a list of available machines, and the function's job is to pick one of the connected machines to do the job. It was relatively simple to add support for this to buildbot. The need to handle the latter case, modifying our builder prioritization, didn't become apparent until testing. The reasoning is a bit convoluted, so I'll explain below for those interested.
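To give a flavour of the nextSlave side, here's a stripped-down sketch of a jacuzzi-aware picker. The allocator endpoint path, response shape, and slave attribute names below are assumptions for illustration; the production code also caches allocator responses and handles failures more carefully.

    # Stripped-down sketch of a jacuzzi-aware nextSlave function. The
    # allocator URL layout, JSON shape, and slave attributes here are
    # assumptions for illustration, not the production implementation.
    import json
    import random
    import urllib2

    ALLOCATOR = "http://jacuzzi-allocator.pub.build.mozilla.org/v1"

    def allocated_slaves(buildername):
        """Return the set of slaves allocated to this builder, or None if
        the builder has no jacuzzi and can use the general pool."""
        try:
            url = "%s/builders/%s" % (ALLOCATOR, buildername)
            data = json.load(urllib2.urlopen(url, timeout=10))
            return set(data["machines"])
        except Exception:
            return None

    def next_slave(builder, available_slaves):
        allocated = allocated_slaves(builder.name)
        if allocated is not None:
            available_slaves = [s for s in available_slaves
                                if s.slave.slavename in allocated]
        if not available_slaves:
            return None  # nothing eligible right now; try again later
        return random.choice(available_slaves)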

AWS Support

Now that we had buildbot using the right workers, we needed to make sure that we were starting those workers in Amazon!

We adjusted our tools that monitor the job queue and start new EC2 instances to also check the jacuzzi allocations, and start the correct instances.

The gory details of build prioritizations

We have a function in buildbot which handles a lot of the prioritization of the job queue. For example, pending jobs for mozilla-central will get priority over jobs for any of the twigs, like ash or birch. Older jobs also tend to get priority over newer jobs. The function needs to return the list of builders in priority sorted order. Buildbot then processes each builder in turn, trying to assign pending jobs to any idle workers.

There are two factors which make this function complicated: each buildbot master is doing this prioritization independently of the others, and workers are becoming idle while buildbot is still processing the sorted list of builders. This second point caused prioritization to be broken (bug 659222) for a long time.

Imagine a case where you have 3 pending jobs (A, B, C), all for the same set of workers (W1, W2, W3). Job A is the most important, job C is the least. All the workers are busy. prioritizeBuilders has sorted our list of builders, and buildbot looks at A first. No workers are available, so it moves on to B next. Still no free workers. But now worker W1 connects, and buildbot examines job C and finds there is an available worker (W1). So job C jumps the queue and gets run before jobs A or B.

Our fix for this was to maintain a list of pending jobs for each set of available workers, and then discard all but the most important pending job for each worker set. In our example, we would see that all 3 pending jobs have the same worker set (W1, W2, W3), and so would temporarily ignore pending jobs B and C. This leaves buildbot only job A to consider. In our example above, it would find no available workers and stop processing. When W1 becomes available, it triggers another prioritization run, and again job A is the sole job under consideration and gets the worker.
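In code, the "keep only the most important pending job per worker set" rule looks roughly like this toy version (the real logic operates on buildbot's builder and request objects):

    # Toy version of the "only consider the most important pending job per
    # set of workers" rule described above (the real logic lives in our
    # prioritizeBuilders function and works on buildbot objects).

    def prune_pending(pending_jobs, worker_sets):
        """pending_jobs: list of job names, highest priority first.
        worker_sets: dict mapping job name -> frozenset of eligible workers.
        Returns the jobs buildbot should consider this pass."""
        seen = set()
        keep = []
        for job in pending_jobs:
            workers = worker_sets[job]
            if workers in seen:
                continue  # a more important job already covers these workers
            seen.add(workers)
            keep.append(job)
        return keep

    # The example from the post: A, B and C all target {W1, W2, W3}, so only
    # A survives, and C can no longer jump the queue when W1 frees up.
    same_pool = frozenset(["W1", "W2", "W3"])
    print(prune_pending(["A", "B", "C"],
                        {"A": same_pool, "B": same_pool, "C": same_pool}))
    # ['A']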

Unfortunately, this behaviour conflicted with what we were trying to do with jacuzzis. Imagine that in the same example above, W1 is allocated to a jacuzzi that only does jobs of type C. If W1 is the only available worker, and C is getting discarded each time the prioritization is done, we're not making any forward progress. In fact, we've triggered a bit of an infinite loop, since currently we trigger another round of prioritizing/job assignments whenever we have available workers and have temporarily discarded lower-priority jobs.

The fix was to integrate the jacuzzi allocations into this prioritization logic as well. I'm a little concerned about the runtime impact of this, since we need to query the allocated workers for every pending job type. One thing we're considering is to change the allocator to return the allocations as a single monolithic blob, rather than having per-job-type requests. Or, we could support both types.

AWS, networks, and burning trees

Help! The trees are on fire!

You may have noticed that the trees were closed quite a bit last week due to infrastructure issues that all stem from network flakiness. We're really sorry about that! RelEng and IT have been working very hard to stabilize the network.

Symptoms of the problem have been pretty varied, but the two most visible ones are:

  • bug 957502 - All trees closed due to timeouts as usw2 slaves download or clone

  • bug 964804 - jobs not being scheduled

We've been having more and more problems with our VPN tunnel to our Amazon infrastructure, particularly in the us-west-2 region. Prior to any changes we put in place last week, ALL traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center.

While debugging the scheduling bug early last week, we discovered that latency to our mysql database used for scheduling was nearly 500ms. No wonder scheduling jobs was slow!

Digging deeper - the network is in bad shape

Investigation of this latency revealed that one of the core firewalls deployed for RelEng traffic was overloaded, and that a major contributor to the firewall load was traffic to/from AWS over the VPN tunnels. We were consistently pushing around 1 gigabit/s to our private network inside Amazon. The extra load on the firewall required for the VPN encryption caused the latency to go up for all traffic passing through the firewall, even for traffic not bound for AWS!

Our next step was to do a traffic analysis to see how we could reduce the amount of traffic going over the VPN tunnel.

Michal was able to get the traffic analysis done, and his report indicated that the worst offender was traffic coming from ftp.m.o. All of our test jobs pull builds, tests and crash symbols from ftp. With the increase in push load, more types of jobs, and more tests, our traffic from ftp has really exploded in recent months. Because all of that traffic was going over the VPN tunnel, it put a huge load on the VPN component of the firewall.

Firefighting

Since all of the content on ftp is public, we decided we could route traffic to/from ftp over the public internet rather than our VPN tunnel, provided we could ensure file integrity at the same time. This required a few changes to our infrastructure:

  • Rail re-created all of our EC2 instances to have public IP addresses, in addition to private IP addresses they had already. Without a public IP, Amazon can't route traffic to the public internet. You can set up extra instances to act as NAT gateways, but that's much more complicated and introduces another point of failure. (bug 965001)

  • We needed a new IP endpoint for ftp so that we could be sure that only SSL traffic was going over the public routes. Chris Turra set up ftp-ssl.m.o, and then I changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel (bug 965289).

  • I landed a change to mozharness to download files from ftp-ssl instead of ftp.

In addition, we also looked at other sources of traffic. The next highest source of traffic was hg.m.o, followed by pvtbuilds.m.o.

Ben quickly rolled out a fix to our test slaves to allow them to cache the gaia repository locally rather than re-cloning gaia each time (bug 964411). We were surprised to discover the gaia repository has grown to 1.2 GB, so this was a big win!

It was clear we would need to divert traffic to hg in a similar way we did for ftp. Unfortunately, adding a DNS/IP endpoint for hg isn't as simple as for ftp, so aki has been going through our code changing all our references of http://hg.mozilla.org to https://hg.mozilla.org (bug 960571). Once we've found and fixed all usages of unsecured access to hg, we can change the routing for hg traffic like we did for FTP.

Aki also patched up some of our FirefoxOS build configs to limit which device builds upload per-checkin (bug 966412), and reduce the amount of data we're sending back to pvtbuilds.m.o over the VPN tunnel.

Ted tracked down a regression in how we were packaging some files in our test zips which cut the size by about 100MB (bug 963651).

On Monday, Adam added some more capacity to the firewall, which should allow us to cope with the remaining load better.

State of the network

This week we're in much better shape than we were last week. If you look at traffic this week vs last week, you can see that traffic is down significantly (see far right of graph):

aws-traffic.png

and latency has been kept under control:

aws-ping.png

We're not out of the woods yet though - we're still seeing occasional traffic spikes close to our maximum. The good news is there's still more we can do to reduce or divert the traffic:

  • we're not done transitioning all FTP/HG traffic to public routes

  • there's still plenty of room to reduce test zip size by splitting them up and changing the compression format used (bug 917999)

  • we can use caching in S3 from the test machines to avoid having to download identical builds/tests from FTP multiple times

Extra special thanks also to Hal who has been keeping all of us focused and on track on this issue.

Blobber is live - upload ALL the things!

Last week without any fanfare, we closed bug 749421 - Allow test slaves to save and upload files somewhere. This has actually been working well for a few months now, it's just taken a while to close it out properly, and I completely failed to announce it anywhere. mea culpa!

This was a really important project, and deserves some fanfare! cue trumpets, parades and skywriters

The goal of this project was to make it easier for developers to get important data out of the test machines reporting to TBPL. Previously the only real output from a test job was the textual log. That meant if you wanted a screen shot from a failing test, or the dump from a crashing process, you needed to format it somehow into the log. For screen shots we would base64 encode a png image and print it to the log as a data URL!

With blobber running successfully, it's now possible to upload extra files from your test runs on TBPL. Things like screen shots, minidump logs and zip files are already supported.

Getting new files uploaded is pretty straightforward. If the environment variable MOZ_UPLOAD_DIR is set in your test's environment, you can simply copy files there and they will be uploaded after the test run is complete. Links to the files are output in the log. e.g.

15:21:18     INFO -  (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
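From a Python test or harness, using it looks roughly like this (the file name below is made up):

    # Illustrative only: copy an artifact into MOZ_UPLOAD_DIR and blobber
    # uploads it after the run, printing the resulting URL to the log.
    import os
    import shutil

    upload_dir = os.environ.get("MOZ_UPLOAD_DIR")
    if upload_dir:
        # e.g. a screenshot or minidump the test produced (hypothetical path)
        shutil.copy("screenshot-on-failure.png", upload_dir)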

Your thanks and praise should go to our awesome intern, Mihai Tabara, who did most of the work here.

Most test jobs are already supported; if you're unsure whether the job type you're interested in is supported, just search for MOZ_UPLOAD_DIR in the log on TBPL. If it's not there and you need it, please file a bug!

Valgrind now running per-push

This week we started running valgrind jobs per push (bug 946002) on mozilla-central and project branches (bug 801955).

We've been running valgrind jobs nightly on mozilla-central for years, but they were very rarely ever green. Few people looked at their results, and so they were simply ignored.

Recently Nicholas Nethercote has taken up the torch and put in a lot of hard work to get these valgrind jobs working again. They're now running successfully per-push on mozilla-central and project branches and on Try.

Thanks Nicholas! Happy valgrinding all!

Now using AWS Spot instances for tests

Release Engineering makes heavy use of Amazon's EC2 infrastructure. The vast majority of our Firefox builds for Linux and Android, as well as our Firefox OS builds, happen in EC2. We also do a ton of unit testing inside EC2.

Amazon offers a service inside EC2 called spot instances. This is a way for Amazon to sell off unused capacity by auction. You can place a bid for how much you want to pay by the hour for a particular type of VM, and if your price is more than the current market price, you get a cheap VM! For example, we're able to run tests on machines for $0.025/hr instead of the regular $0.12/hr on-demand price. We started experimenting with spot instances back in November.
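For example, placing a spot bid with the boto library looks roughly like this; the AMI id, instance type and bid price below are placeholders, not our real values.

    # Rough example of placing a spot bid with boto; the AMI id, instance
    # type and bid price are placeholders, not our real values.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-west-2")
    requests = conn.request_spot_instances(
        price="0.025",              # maximum hourly bid in USD
        image_id="ami-12345678",    # hypothetical test-slave AMI
        instance_type="m1.medium",
        count=1,
    )
    print(requests[0].id)  # fulfilled whenever the market price allows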

There are a few downsides to using spot instances however. One is that your VM can be killed at any moment if somebody else bids a higher price than yours. The second is that your instances can't (easily) use extra EBS volumes for persistent storage. Once the VM powers off, the storage is gone.

These issues posed different challenges for us. In the first case, we were worried about the impact that interrupted jobs would have on the tree. We wanted to avoid the case where a job was run on a spot instance, interrupted because the market price changed, and then retried on another spot instance subject to the same termination. This required changes to two systems:

  • aws_watch_pending needed to know to start regular on-demand EC2 instances in the case of retried jobs. This has been landed and has been working well, but really needs the next piece to be complete.

  • buildbot needed to know to not pick a spot instance to run retried jobs. This work is being tracked in bug 936222. It turns out that we're not seeing too many spot instances being killed off due to market price [1], so this work is less urgent.

The second issue, the VM storage, turned out to be much more complicated to fix. We rely on puppet to make sure that VMs have consistent software packages and configuration files. Puppet requires per-host SSL certificates to be generated, and at Mozilla these certificates need to be signed by a central certificate authority. In our previous usage of EC2, we worked around this by puppetizing new instances on first boot and saving the disk image for later use.

With spot instances, we essentially need to re-puppetize every time we create a new VM.

Having fresh storage on boot also impacts the type of jobs we can run. We're starting with running test jobs on spot instances, since there's no state from previous tests that is valuable for the next test.

Builds are more complicated, since we depend on the state of previous builds to have faster incremental builds. In addition, the cost of having to retry a build is much higher than it is for a test. It could be that the spot instances stay up long enough or that we end up clobbering frequently enough that this won't have a major impact on build times. Try builds are always clobbers though, so we'll be running try builds on spot instances shortly.

All this work is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=935683

Big props to Rail for getting this done so quickly. With all this great work, we should be able to scale better while reducing our costs.

Next generation job scheduling

As coop mentioned, we had a really great brainstorming session on Tuesday about the kinds of things we'd like to do with job scheduling in the RelEng infrastructure.

Our idea is to implement a "job graph": a representation of a set of jobs to run and the dependencies between them. For example, right now we have sets of tests that depend on builds finishing, or l10n repacks that depend on the en-US nightly build finishing. These graphs are currently implicit in our buildbot configs, and are pretty inflexible, opaque and hard to test.

One of our design goals for any new system or improvement is to make this job graph explicit, and to have it checked into the tree (a toy sketch follows the list below). This has a few really nice features:

  • Makes it easier for developers to modify the set of jobs that run on their branch or push.

  • Other tools like try chooser and self-serve can use this information to control what jobs get run.

  • The sets of builds and tests running on branches follow merges. This is really helpful for our 6-week uplifts.

  • It will be possible to predict the set of builds and tests that would happen for a push in advance. This isn't possible right now without horrible hacks.
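To give a flavour of what an explicit, in-tree job graph could look like (purely illustrative; neither the format nor the job names are settled):

    # Purely illustrative sketch of an explicit, in-tree job graph: jobs plus
    # their dependencies, with a tiny scheduler that releases jobs whose
    # prerequisites have finished. The job names are examples, not a real
    # configuration.
    JOB_GRAPH = {
        "build-linux64-opt":        [],
        "test-linux64-mochitest-1": ["build-linux64-opt"],
        "test-linux64-xpcshell":    ["build-linux64-opt"],
        "nightly-en-US":            [],
        "l10n-repack-de":           ["nightly-en-US"],
    }

    def runnable(graph, finished):
        """Return the jobs whose dependencies are all satisfied."""
        return [job for job, deps in graph.items()
                if job not in finished and all(d in finished for d in deps)]

    print(sorted(runnable(JOB_GRAPH, finished={"build-linux64-opt"})))
    # ['nightly-en-US', 'test-linux64-mochitest-1', 'test-linux64-xpcshell']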

Our plan is to implement the graph parser and generator first so we can validate some of our assumptions, and make sure we can generate equivalent job graphs to what exists now. After we have that working, we can focus on integrating the new job graphs with the existing infrastructure.