Automated Talos Analysis

chris

2009-02-05 23:44

As part of one of our goals in Release Engineering this quarter, I'm investigating whether we can automatically detect variance in Talos performance data. Automatically detecting these changes in performance results would be a great help to developers and tree sheriffs. Imagine if the Tinderbox tree could be made to burn if a performance regression was detected? There are lots of possibilities if we can get this working: regressions could cause the tree to burn, firebot could spam #developers with information, try-talos data could be compared more easily to the baseline data, or we could automatically back out changes that cause regressions! :P This is also exciting, because it allows us to consider moving towards a pool-o'-slaves model for the Talos machines, just like we have for build and unittests right now. Having Talos use a pool-o'-slaves allows us to scale to additional project / release branches much more quickly, and allows us to be more flexible in allocating machines across branches. I've spent some time over the past few weeks playing around with data from graph server, bugging Johnathan, and having fun with flot, and I think I've come up with a workable solution.

How it works

I grab all the data for a test/branch/platform combination, and merge it into a single data series, ordered by buildid (the closest thing we've got right now to being able to sort the data in the same order in which changes landed). Individual data points are classified into one of four buckets:

"Good" data. We think these data points are within a certain tolerance of the expected value. Determining what the expected value is a bit tricky, so read on!
"Spikes". These data points are outside of the specified tolerance, but don't seem to be part of an ongoing problem (yet). Spikes can be caused by having the tolerance set too low, random machine voodoo, or not having enough data to make a definitive call as to if it's a code regression or machine problem.
"Regressions". When 3 or more data points are outside of the tolerance in the same direction, we assume this is due to a problem with the code, and flag it as a regression.
"Machine problem". When the last 2 data points from the same machine have been outside of the tolerance, then we assume this is due to a problem with the machine.

For the purposes of the algorithm (and this post!), a regression is a deviation from the expected value, regardless of it's a performance gain or loss. At this point the tolerance criteria is being set semi-manually. For each test/branch/platform combination, the tolerance is set as a certain number of standard deviations. The expected value is then determined by going back in the performance data history and looking for a certain sized window of data where no point is more than the configured number of standard deviations from the average. This can change over time, so we re-calculate the expected value at each point in the graph.

Initial Results

As an example, here's how data from Linux Tp3 tests on the Mozilla 1.9.2 branch is categorized:

Or, if you have a canvas-enabled browser, check out this interactive graph. A window size of 20 and a standard deviation threshold of 2.5 was used here for this data set. The green line represents all the good data. The orange line (which is mostly hidden by the green line), represents the raw data from the 3 Linux machines running that test. The orange circles represent spikes in the data, red circles represent regressions, and blue circles represent possible machine problems. For the most part we can ignore the spikes. Too many spikes probably means we need to tighten our tolerance a bit There are two periods of to take notice of on this graph:

Jan 12, around noon, a regression was detected. Two orange spike circles are followed by three red regression circles. Recall that we wait for the 3rd data point to confirm an actual regression.
Jan 30, around noon, a similar case. Two orange spike circles, followed by regression points.

Although in these cases, the regression was actually a win in terms of performance, it shows that the algorithm works. The second regression is due to Alice unthrottling the Talos boxes. In both cases, a new expected value is found after the data levels off again. The analysis also produces some textual output more suitable for e-mail, nagios or irc notification, e.g.:

Regression: Tp3 decrease from 417.974 to 235.778 (43.59%) on Fri Jan 30 11:34:00 2009. Linux 1.9.2 build 20090130083434

http://graphs.mozilla.org/#show=395125,395135,395166&sel=1233236074,1233408874

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7f5292b5b9e2&tochange=f1493cf102b9

My code can be found on http://hg.mozilla.org/users/catlee_mozilla.com/talos-grokker. Patches or comments welcome!

chris' random ramblings

Posts about statistics

Automated Talos Analysis

How it works

Initial Results