Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.
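To make the saving concrete, here is a minimal sketch of a cache sitting between test machines and a file server. It is not the proxxy implementation: the `RegionalCache` class, the made-up URL, the 1000-byte package and the count of 50 machines are all assumptions for illustration, and a real HTTP cache also has to deal with expiry and eviction.

```python
class RegionalCache:
    """Toy stand-in for an HTTP cache living inside one AWS region."""

    def __init__(self, fetch_upstream):
        self._fetch_upstream = fetch_upstream  # callable: url -> bytes
        self._store = {}                       # url -> cached body
        self.upstream_bytes = 0                # bytes pulled over the VPN link

    def get(self, url):
        # Only the first request for a URL goes upstream.
        if url not in self._store:
            body = self._fetch_upstream(url)
            self.upstream_bytes += len(body)
            self._store[url] = body
        return self._store[url]


def fake_file_server(url):
    # Hypothetical origin: every package is 1000 bytes.
    return b"x" * 1000


cache = RegionalCache(fake_file_server)
url = "http://files.example.invalid/build/tests.zip"  # made-up URL
machines = 50                                         # assumed number of test machines
for _ in range(machines):
    cache.get(url)

served_bytes = machines * 1000
saved = 1 - cache.upstream_bytes / served_bytes
```

In this toy run only one of the 50 requests crosses the link, so the fraction saved is (N - 1) / N for N downloads of the same file.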

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see from the graph above that we've cut our bandwidth by about 50%!

What's next?

There is still some low-hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages across different platforms and even across different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

Automated Talos Analysis

As part of one of our goals in Release Engineering this quarter, I'm investigating whether we can automatically detect variance in Talos performance data. Automatically detecting these changes in performance results would be a great help to developers and tree sheriffs.

Imagine if the Tinderbox tree could be made to burn whenever a performance regression was detected!

There are lots of possibilities if we can get this working: regressions could cause the tree to burn, firebot could spam #developers with information, try-talos data could be compared more easily to the baseline data, or we could automatically back out changes that cause regressions! :P

This is also exciting, because it allows us to consider moving towards a pool-o'-slaves model for the Talos machines, just like we have for build and unittests right now. Having Talos use a pool-o'-slaves allows us to scale to additional project / release branches much more quickly, and allows us to be more flexible in allocating machines across branches.

I've spent some time over the past few weeks playing around with data from graph server, bugging Johnathan, and having fun with flot, and I think I've come up with a workable solution.

How it works

I grab all the data for a test/branch/platform combination, and merge it into a single data series, ordered by buildid (the closest thing we've got right now to being able to sort the data in the same order in which changes landed).
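The merge step can be sketched roughly as follows. The input shape (a dict of per-machine rows) and the field names are assumptions, not the graph server's actual format, and breaking ties on timestamp is a choice made for this example.

```python
def merge_series(per_machine):
    """Merge per-machine results into one series ordered by buildid.

    per_machine: dict mapping machine name -> list of
                 (buildid, timestamp, value) tuples (assumed shape).
    """
    points = [
        {"machine": machine, "buildid": buildid, "time": ts, "value": value}
        for machine, rows in per_machine.items()
        for (buildid, ts, value) in rows
    ]
    # Build IDs are fixed-width date strings, so sorting them as strings
    # matches chronological order; timestamp breaks ties.
    points.sort(key=lambda p: (p["buildid"], p["time"]))
    return points


merged = merge_series({
    "linux-1": [("20090130083434", 1233315240, 235.8),
                ("20090129031500", 1233210000, 418.2)],
    "linux-2": [("20090129120000", 1233240000, 417.5)],
})
order = [p["buildid"] for p in merged]
```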

Individual data points are classified into one of four buckets:

  • "Good" data. We think these data points are within a certain tolerance of the expected value. Determining the expected value is a bit tricky, so read on!
  • "Spikes". These data points are outside of the specified tolerance, but don't seem to be part of an ongoing problem (yet). Spikes can be caused by having the tolerance set too low, random machine voodoo, or not having enough data to make a definitive call as to whether it's a code regression or a machine problem.
  • "Regressions". When 3 or more data points are outside of the tolerance in the same direction, we assume this is due to a problem with the code, and flag it as a regression.
  • "Machine problem". When the last 2 data points from the same machine have been outside of the tolerance, then we assume this is due to a problem with the machine.
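The four buckets can be sketched as a single pass over the merged series. This is an illustration only: it takes the expected value and tolerance as fixed inputs, reads "3 or more data points" as consecutive points in the merged series, and checks for a regression before a machine problem. Those choices are assumptions, not necessarily what the real analysis does.

```python
def classify(points, expected, tolerance):
    """Label each (machine, value) point, given in landing order."""
    labels = []
    directions = []            # +1 above, -1 below, 0 within tolerance
    last_dir_by_machine = {}   # direction of each machine's previous point
    for machine, value in points:
        delta = value - expected
        if abs(delta) <= tolerance:
            d = 0
        else:
            d = 1 if delta > 0 else -1

        if d == 0:
            label = "good"
        elif len(directions) >= 2 and directions[-1] == d and directions[-2] == d:
            # Third (or later) point in a row outside tolerance, same direction.
            label = "regression"
        elif last_dir_by_machine.get(machine, 0) != 0:
            # This machine's previous point was also outside tolerance.
            label = "machine"
        else:
            label = "spike"

        labels.append(label)
        directions.append(d)
        last_dir_by_machine[machine] = d
    return labels


step_change = classify(
    [("a", 100), ("b", 101), ("c", 60), ("a", 61), ("b", 59), ("c", 60)],
    expected=100, tolerance=10)
flaky_box = classify(
    [("a", 100), ("b", 140), ("a", 101), ("c", 99), ("b", 135)],
    expected=100, tolerance=10)
```

With these inputs a sustained drop shows up as two spikes followed by regressions, while a single misbehaving machine is flagged on its second bad result.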

For the purposes of the algorithm (and this post!), a regression is a deviation from the expected value, regardless of whether it's a performance gain or loss.

At this point the tolerance is being set semi-manually. For each test/branch/platform combination, the tolerance is set as a certain number of standard deviations. The expected value is then determined by going back in the performance data history and looking for a window of data of a certain size where no point is more than the configured number of standard deviations from the average. This can change over time, so we re-calculate the expected value at each point in the graph.
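The backwards search for a stable window might look like the sketch below. The function name and return value are made up for illustration, and using the sample standard deviation of the window itself is an assumption; the post does not say which estimator is used.

```python
from statistics import mean, stdev


def expected_value(history, window, threshold):
    """Find the most recent stable window in `history`.

    Scans backwards for `window` consecutive points in which no point is
    more than `threshold` standard deviations from the window's mean.
    Returns (mean, stdev) of that window, or None if there is none.
    Requires window >= 2 so that a standard deviation can be computed.
    """
    for end in range(len(history), window - 1, -1):
        chunk = history[end - window:end]
        avg = mean(chunk)
        sd = stdev(chunk)
        if all(abs(v - avg) <= threshold * sd for v in chunk):
            return avg, sd
    return None


history = [100, 101, 99, 100, 101, 99, 100, 101, 99, 100,
           100, 101, 99, 100, 150, 99, 100, 101, 99, 100]
result = expected_value(history, window=10, threshold=2.5)
```

Here every window containing the 150 outlier is rejected, so the search walks back to the last ten points before it. Calling this at each new point, on the history up to that point, gives the re-calculated expected value described above.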

Initial Results

As an example, here's how data from Linux Tp3 tests on the Mozilla 1.9.2 branch is categorized:

Linux Tp3 Data for Mozilla 1.9.2

Or, if you have a canvas-enabled browser, check out this interactive graph.

A window size of 20 and a standard deviation threshold of 2.5 were used for this data set.

The green line represents all the good data. The orange line (which is mostly hidden by the green line) represents the raw data from the 3 Linux machines running that test.

The orange circles represent spikes in the data, red circles represent regressions, and blue circles represent possible machine problems.

For the most part we can ignore the spikes. Too many spikes probably means the tolerance is set too low and we need to loosen it a bit.

There are two periods to take notice of on this graph:

  • Jan 12, around noon, a regression was detected. Two orange spike circles are followed by three red regression circles. Recall that we wait for the 3rd data point to confirm an actual regression.
  • Jan 30, around noon, a similar case. Two orange spike circles, followed by regression points.

Although the regression in both of these cases was actually a win in terms of performance, it shows that the algorithm works. The second regression is due to Alice unthrottling the Talos boxes.

In both cases, a new expected value is found after the data levels off again.

The analysis also produces some textual output more suitable for e-mail, nagios or irc notification, e.g.:

Regression: Tp3 decrease from 417.974 to 235.778 (43.59%) on Fri Jan 30 11:34:00 2009. Linux 1.9.2 build 20090130083434

http://graphs.mozilla.org/#show=395125,395135,395166&sel=1233236074,1233408874

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7f5292b5b9e2&tochange=f1493cf102b9
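The first line of a notification like the one above could be built along these lines. `describe_change` and its parameters are hypothetical names for this sketch, not taken from talos-grokker; the percentage is the change relative to the old expected value.

```python
def describe_change(test, old, new, when, platform, branch, buildid):
    """Format a one-line summary of a detected change."""
    pct = abs(new - old) / old * 100
    direction = "decrease" if new < old else "increase"
    return (f"Regression: {test} {direction} from {old:.3f} to {new:.3f} "
            f"({pct:.2f}%) on {when}. {platform} {branch} build {buildid}")


message = describe_change(
    "Tp3", 417.974, 235.778,
    "Fri Jan 30 11:34:00 2009", "Linux", "1.9.2", "20090130083434")
```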

My code can be found on http://hg.mozilla.org/users/catlee_mozilla.com/talos-grokker.

Patches or comments welcome!