Posts about performance

Efficient Hash Lookups: Adventures with Benchmarks!

Efficient data retrieval is crucial for high-performance applications. Whether you're configuring a complex system or handling large datasets, the speed of hash lookups can have a significant impact. In this post, we'll explore various methods to optimize hash lookups and benchmark their performance.

My benchmarking code is available here.

The Challenge

Let's say that you have a hash (aka dictionary aka map aka associative array; something with key/value pairs) representing configuration for your system.

At runtime, let's assume that your application only needs access to a small number of these configuration items to handle a given request.

Let's also say that this hash can get pretty large; some users have very complex configurations.

{
  "setting_1":     "value 1",
  "setting_2":     "value 2",
  /* ... */
  "setting_10000": "value 10000",
}

A common way of storing this configuration is in a JSON file. JSON is really simple to use for this type of problem. Just use something like JSON.parse to parse your file, and do a lookup for whatever key you want:

require "json"

config_data = File.read("config.json")
config = JSON.parse(config_data)
puts config["setting_500"]
# => "value 500"

Awesome, what's the problem?

Assuming that we just want to fetch a small number of items from this configuration, there are a few problems here:

  1. We have to load the entire file into memory
  2. We then have to parse the entire file to get our hash object. This involves inserting every item in the configuration into a new hash object, which is an O(n) operation.

If our configuration is quite large, then the parsing & hash construction time will be far greater than our lookup time.

$ ruby benchmark.rb -f json -s hash
Calculating -------------------------------------
                load    829.185 (±30.0%) i/s    (1.21 ms/i)
               parse     24.779 (± 8.1%) i/s   (40.36 ms/i)
             get_key     4.242M (± 3.7%) i/s  (235.72 ns/i)
  load/parse/get_key     22.561 (± 8.9%) i/s   (44.32 ms/i)

(I've removed some superfluous information from the output snippets here in the interests of clarity; the number I'm paying attention to is the time per iteration represented in the last column.)

Lookups into the hash, once created, are very fast: about 200ns per lookup.

However, for 100,000 entries, just parsing the data and creating the hash takes about 40ms on my system.

In fact, the hash construction time is the fundamental issue to overcome when trying to speed up our overall time here. Simply creating a hash from the data takes about 20ms on my system:

$ ruby benchmark-to_h.rb < output/array.json
Calculating -------------------------------------
                to_h     37.764 (±18.5%) i/s   (26.48 ms/i)

It makes sense when you think about it. Before we can get our hash object containing our configuration, we need to iterate through each key/value pair and insert it into the hash object. Only then can we do a lookup for the key we're interested in.
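
In code, that construction step is essentially the following (a trivial sketch, assuming output/array.json holds an array of [key, value] pairs, which is what the to_h benchmark above appears to consume):

require "json"

# Building the hash touches every entry once: O(n) inserts happen before we
# can do even a single lookup.
pairs = JSON.parse(File.read("output/array.json"))
config = pairs.to_h
puts config["setting_500"]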

How can we make this better?

A bunch of different ideas come to mind:

We can try different ways to encode the data: for example, MessagePack, protobufs, or capnproto. Each of these uses a more efficient way to serialize/deserialize the data, speeding up the parsing time.

Another idea is to encode our data as a sorted list of key/value pairs instead of as a hash, and use a binary search to lookup the key that we want.

Or, we can try and use perfect hashes; here we would encode our data as a list of key/value pairs as above, but augmented with additional information that makes it possible to treat the array as a pre-computed hash table.

Other ideas: sqlite, or dbm/lmdb. These benefit from encoding the index within the file itself, so we don't need to reconstruct a hash table/index when loading the data. Storing more structured data in these is more challenging, though, and lookups in these types of data stores are typically O(log n) instead of O(1).
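
To make the first of those concrete, here's a rough sketch of the sqlite approach using the sqlite3 gem; the config.db filename and the config(key, value) schema are my own assumptions, not something from the benchmark repo:

require "sqlite3"

# The B-tree index lives inside the file itself, so there's nothing to rebuild
# at load time; each lookup is an O(log n) probe.
# Assumed schema: CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT);
db = SQLite3::Database.new("config.db")
puts db.get_first_value("SELECT value FROM config WHERE key = ?", "setting_500")
# => "value 500"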

MessagePack

MessagePack is an easy thing to try first since it's almost a drop-in replacement for JSON in our use case.
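
Something like the following should work, assuming the config has been converted to a config.msgpack file ahead of time (an assumed filename, produced with something like config_hash.to_msgpack):

require "msgpack"

# Same shape as the JSON version, just a different (binary) encoding.
config_data = File.binread("config.msgpack")
config = MessagePack.unpack(config_data)
puts config["setting_500"]
# => "value 500"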

$ ruby benchmark.rb -f msgpack -s hash
Calculating -------------------------------------
                load    679.466 (± 5.3%) i/s    (1.47 ms/i)
               parse     27.958 (± 3.6%) i/s   (35.77 ms/i)
             get_key     4.692M (± 3.4%) i/s  (213.13 ns/i)
  load/parse/get_key     27.031 (± 3.7%) i/s   (36.99 ms/i)

Well, we've improved the parsing time by about 5ms, which is a nice improvement, but still pretty slow overall. Again, we have that large hash construction time to overcome, which MessagePack doesn't help us with.

Protobufs

To experiment with Protobufs, I've created a simple definition using the native map type:

syntax = "proto3";

message PBHash {
  map<string, string> data = 1;
}
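
For reference, the Ruby side looks roughly like this; pb_hash_pb.rb (the protoc-generated file) and config.pb are assumed names, not necessarily what the benchmark repo uses:

# Assumed to be generated with: protoc --ruby_out=. pb_hash.proto
require_relative "pb_hash_pb"

config = PBHash.decode(File.binread("config.pb"))
puts config.data["setting_500"]
# => "value 500"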

The results here are surprising:

$ ruby benchmark.rb -f pb -s hash
Calculating -------------------------------------
                load      1.311k (±35.9%) i/s  (762.93 μs/i)
               parse      21.496 (± 4.7%) i/s   (46.52 ms/i) # !!
             get_key      3.098M (± 3.2%) i/s  (322.75 ns/i)
  load/parse/get_key      20.707 (± 4.8%) i/s   (48.29 ms/i)

It looks like parsing protobuf maps is significantly slower than parsing the equivalent JSON or msgpack objects.

Encode as an array

Let's see what happens if we encode our data as an array.

This may improve parsing time, since we don't need to construct a large hash object in memory. The trade-off is that lookup times become O(log n) instead of O(1), assuming we can sort our input. As a first approximation, we can compare times to load & parse the array.
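
Here's a minimal sketch of that lookup using Ruby's built-in Array#bsearch, assuming the config is re-encoded as an array of [key, value] pairs sorted by key (config_array.json is an assumed filename):

require "json"

pairs = JSON.parse(File.read("config_array.json"))  # array of [key, value] pairs, sorted by key

def lookup(pairs, key)
  # Array#bsearch in find-any mode: the block behaves like (key <=> candidate_key)
  match = pairs.bsearch { |(k, _v)| key <=> k }
  match && match[1]
end

puts lookup(pairs, "setting_500")
# => "value 500"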

Using JSON I see marginal improvement to load & parse the data (~30ms), but using Protobufs we can load & parse the data in about 6ms. Unfortunately I couldn't get good measurements for the full load/parse/get_key with protobufs; it seems like doing a binary search for elements in the protobuf arrays is relatively slow in Ruby. Fetching individual items is fast.

$ ruby -Iproto benchmark.rb -f pb -s array
Calculating -------------------------------------
                load     1.641k (±10.7%) i/s  (609.32 μs/i)
               parse    163.886 (± 3.1%) i/s    (6.10 ms/i)
             get_key     27.325 (±11.0%) i/s   (36.60 ms/i)
  load/parse/get_key      3.000 (±33.3%) i/s  (333.32 ms/i) # ???

Perfect hashes

Maybe we can get the best of both worlds by using a perfect hash to order our data in a particular way to make it cheap to find?

There are several ways of constructing a perfect hash; the CHD algorithm is fairly easy to implement (without the "compress" part of compress-hash-displace), and can compute a minimal perfect hash function in O(n) time.

The way it works is that we compute a secondary array of seeds that are used to prevent collisions between keys in the data set. To find the index of an element by its key, we compute:

# Find the index for `key`
seed  = seeds[hashfunc(seed: 0, data: key) % seeds.size]
index = hashfunc(seed: seed, data: key) % data.size
value = data[index]
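
For completeness, here's a toy sketch of how the seeds array can be built. It uses a generic seeded hash as a stand-in for the cityhash binding used in the actual benchmarks, so treat it as an illustration of the algorithm rather than the benchmarked code:

require "digest"

# Stand-in for a real seeded hash (the benchmarks use cityhash); it only needs
# to produce a stable integer for a (seed, key) pair.
def hashfunc(seed:, data:)
  Digest::SHA256.digest("#{seed}:#{data}").unpack1("Q<")
end

# Build the seeds array plus a displaced copy of the data: a minimal perfect
# hash, with exactly one slot per key.
def build_chd(pairs, num_buckets: [pairs.size / 4, 1].max)
  n = pairs.size
  buckets = Array.new(num_buckets) { [] }
  pairs.each { |k, v| buckets[hashfunc(seed: 0, data: k) % num_buckets] << [k, v] }

  seeds = Array.new(num_buckets, 0)
  slots = Array.new(n)

  # Place the largest buckets first; for each bucket, search for a seed that
  # sends all of its keys to distinct, still-free slots.
  buckets.each_with_index.sort_by { |bucket, _| -bucket.size }.each do |bucket, i|
    next if bucket.empty?
    seed = 1
    loop do
      positions = bucket.map { |k, _| hashfunc(seed: seed, data: k) % n }
      if positions.uniq.size == bucket.size && positions.all? { |p| slots[p].nil? }
        bucket.zip(positions).each { |(k, v), p| slots[p] = [k, v] }
        seeds[i] = seed
        break
      end
      seed += 1
    end
  end

  [seeds, slots]
end

# Lookup mirrors the pseudocode above.
def chd_lookup(seeds, slots, key)
  seed = seeds[hashfunc(seed: 0, data: key) % seeds.size]
  entry = slots[hashfunc(seed: seed, data: key) % slots.size]
  entry && entry[0] == key ? entry[1] : nil
end

seeds, slots = build_chd([["setting_1", "value 1"], ["setting_2", "value 2"]])
puts chd_lookup(seeds, slots, "setting_2")
# => "value 2"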

The results of using CHD with Protobufs are very good (using cityhash as the hashing function):

$ ruby benchmark.rb -f pb -s chd
Calculating -------------------------------------
                load     1.920k (± 7.0%) i/s  (520.85 μs/i)
               parse    173.076 (± 4.0%) i/s    (5.78 ms/i)
             get_key   491.510k (± 1.9%) i/s    (2.03 μs/i)
  load/parse/get_key    154.828 (± 1.9%) i/s    (6.46 ms/i)

Summary

Wow, that was a lot! What did we learn?

Benchmarks are hard! I wouldn't take any results here as absolute proof that X is better or worse than Y. Your data and use cases are most likely very different.

Using perfect hashes or binary search over an array of key/value pairs looks much faster in general than doing simple lookups into a hash object, because the time to construct the hash object plays such a large part in the overall timings.

I'm surprised at the performance of protobufs in general. They're much slower than I expected, in everything except the perfect hash (CHD) case. It makes me wonder if something is wrong with my implementation, or with Ruby's protobuf gem.

$  ruby benchmark-final.rb
Comparison:
load/parse/get_key pb chd:        124.3 i/s
load/parse/get_key msgpack array: 50.2 i/s - 2.48x  slower
load/parse/get_key msgpack chd:   41.1 i/s - 3.02x  slower
load/parse/get_key json array:    37.3 i/s - 3.33x  slower
load/parse/get_key json chd:      27.9 i/s - 4.46x  slower
load/parse/get_key msgpack hash:  23.8 i/s - 5.22x  slower
load/parse/get_key json hash:     20.7 i/s - 6.00x  slower
load/parse/get_key pb hash:       19.4 i/s - 6.40x  slower
load/parse/get_key pb array:      3.1 i/s - 40.07x  slower

To follow up, I'd like to explore using capnproto, and running similar benchmarks in other languages to see if the protobuf performance improves.

GCS Performance Experiments

Recently I've been digging into performance optimizations for some code that transfers dozens of assets into a bucket in Google Cloud Storage (GCS).

The code in question is using Google's Ruby gem, and uploads a set of files in sequence, something like this:

require "google/cloud/storage"
storage = Google::Cloud::Storage.new(...)
bucket = storage.bucket(...)

files.each do |filename|
  bucket.create_file(filename, filename)
end

For this naive implementation, it takes about 45 seconds to upload 75 files, totalling about 350k of data. That works out to about 600ms per file! This seemed awfully slow, so I wanted to understand what was going on.

(note that all the timings here were collected on my laptop running on my home wifi)

Jump down to the TL;DR for spoilers, or keep reading for way too much detail.

Simplify the dataset

First, I wanted to see if the size of the files had any impact on the speed here, so I created a test with just 50 empty text files:

$ touch assets/{1..50}.txt
$ time ruby gcs.rb !$
ruby gcs.rb assets/{1..50}.txt  0.61s user 0.20s system 2% cpu 27.951 total

That's still about 28 seconds, or about 550ms per asset, which seems super slow for a set of empty files.

Start digging

Looking at the HTTP traffic generated by running this code, it appears as though Google's gem is creating two (!!) new TCP/HTTPS connections per asset. The default behaviour of the gem is to use a "resumable upload", where one request is issued to start the upload, and then a second connection transfers the data and finalizes the upload. It doesn't appear as though any connection pooling is happening either.

It looks like there's room for some improvement.

A few ideas came to mind:

  • Use a single request per asset
  • Use connection pooling
  • Run requests in parallel (using threading or async)
  • Use HTTP/2

Single request per asset

The GCS API is pretty simple, so it's straightforward to implement a basic upload function in Ruby.

require "net/http"
require "uri"

def upload_object(bucket, filename)
  Net::HTTP.post(
    URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}"),
    File.read(filename),
    {"Authorization" => "Bearer #{ENV["AUTH"]}"}
  )
end

This uses one connection per asset, and brings the time down to about 11 seconds for our 50 empty assets, which is less than half of the naive version.

Re-using the same connection

Net::HTTP supports re-using the same connection; we just need to restructure the code a little bit:

def upload_object(client, bucket, filename)
  uri = URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{ENV["AUTH"]}"
  req.body = File.read(filename)
  client.request(req)
end

Net::HTTP.start("storage.googleapis.com", 443, use_ssl: true) do |http|
  files.each do |filename|
    upload_object(http, bucket, filename)
  end
end

This runs a bit faster now, in 8 seconds total.

Parallelization

Ruby has a few concurrency models we can play with. I try to avoid threading wherever possible, and use async libraries. Luckily, Ruby's async gem handles wrapping Net::HTTP quite well:

require "async"
require "async/barrier"

Async do
  barrier = Async::Barrier.new
  files.each do |filename|
    barrier.async do
      resp = upload_object(bucket, filename)
    end
  end
  barrier.wait
end

Now we can finish all 50 uploads in about 1.3s.

HTTP/2

There are various options for making HTTP/2 requests in Ruby. One we can use is async-http, which integrates well with the async gem used above. Another gem that's worked well for me is httpx.

require "async"
require "async/barrier"
require "async/http/internet"  # from the async-http gem

Async do
  barrier = Async::Barrier.new
  internet = Async::HTTP::Internet.new

  files.each do |filename|
    barrier.async do
      resp = internet.post(
        "https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}",
        {"Authorization" => "Bearer #{ENV["AUTH"]}"},
        File.read(filename))
    end
  end
  barrier.wait
end

This finishes all 50 requests in 0.4s!

Looking back at our initial data set, which took 45 seconds to run, we can now do it in 0.9 seconds.

TL;DR

To summarize, here are the times for uploading 50 empty files to a bucket:

Method                     Time
naive (google gem)         28s
single request per asset   11s
single http connection     8s
async                      1.3s
HTTP/2                     0.4s

I'm really shocked at how much faster HTTP/2 is here.

Consumers of Google Cloud APIs should take a look at the libraries that they're using, and see if they can switch over to ones that support HTTP/2. I'm curious if other client libraries for the Google Cloud APIs support HTTP/2.

Can the Ruby gem support HTTP/2? There's an open issue on the github repo to switch the underlying client implementation to Faraday, which would allow one to use an HTTP/2 aware client under the hood. I've started working on a draft PR to see what would be involved in switching, but there are some major barriers at this point.

Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There are a few more low hanging fruit for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is doing signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages between different platforms and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

Investigating hg performance

(Caveat lector: this is a long post with lots of shell snippets and output; it's mostly a brain dump of what I did to investigate performance issues on hg.mozilla.org. I hope you find it useful. Scroll to the bottom for the summary.)

Everybody knows that pushing to try can be slow. But why? While waiting for my push to try to complete, I wondered what exactly was slow. I started by cloning my own version of try:

$ hg clone http://hg.mozilla.org try
destination directory: try
requesting all changes
adding changesets
adding manifests
adding file changes
added 95917 changesets with 447521 changes to 89564 files (+2446 heads)
updating to branch default
53650 files updated, 0 files merged, 0 files removed, 0 files unresolved

Next I instrumented hg so I could get some profile information:

$ sudo vi /usr/local/bin/hg

python -m cProfile -o /tmp/hg.profile /usr/bin/hg $*

Then I timed how long it took to check what would be pushed:

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.57s user 0.04s system 54% cpu 1.114 total

That's not too bad. Let's check our profile:

import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:25:02 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 0.593 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    0.462    0.036    0.462    0.036 {method 'readline' of 'file' objects}
        1    0.039    0.039    0.039    0.039 {mercurial.parsers.parse_index2}
       40    0.031    0.001    0.031    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.009    0.000    0.019    0.000 {__import__}
     6326    0.004    0.000    0.006    0.000 cmdutil.py:15(parsealiases)
       13    0.003    0.000    0.003    0.000 {method 'read' of 'file' objects}
       93    0.002    0.000    0.008    0.000 cmdutil.py:18(findpossible)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}
  392/313    0.001    0.000    0.007    0.000 demandimport.py:92(_demandimport)

The top item is readline() on file objects? I wonder if that's socket operations. I'm ssh'ing to localhost, so it's really fast. Let's add 100ms latency:

$ sudo tc qdisc add dev lo root handle 1:0 netem delay 100ms

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.58s user 0.05s system 14% cpu 4.339 total


import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:42:09 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 2.728 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    2.583    0.199    2.583    0.199 {method 'readline' of 'file' objects}
        1    0.054    0.054    0.054    0.054 {mercurial.parsers.parse_index2}
       40    0.028    0.001    0.028    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.010    0.000    0.019    0.000 {__import__}
       13    0.006    0.000    0.006    0.000 {method 'read' of 'file' objects}
     6326    0.002    0.000    0.004    0.000 cmdutil.py:15(parsealiases)
       93    0.002    0.000    0.006    0.000 cmdutil.py:18(findpossible)
  392/313    0.002    0.000    0.008    0.000 demandimport.py:92(_demandimport)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}

Yep, definitely getting worse with more latency on the network connection. Oh, and I'm using a recent version of hg:

$ hg --version

Mercurial Distributed SCM (version 2.0)

$ echo hello | ssh localhost hg -R /home/catlee/mozilla/try serve --stdio
145
capabilities: lookup changegroupsubset branchmap pushkey known getbundle unbundlehash batch stream unbundle=HG10GZ,HG10BZ,HG10UN httpheader=1024

This doesn't match what hg.mozilla.org is running:

$ echo hello | ssh hg.mozilla.org hg -R /mozilla-central serve --stdio
67
capabilities: unbundle lookup changegroupsubset branchmap stream=1

So it must be using an older version. Let's see what mercurial 1.6 does:

$ mkvirtualenv hg16
New python executable in hg16/bin/python
Installing setuptools...

(hg16)$ pip install mercurial==1.6
Downloading/unpacking mercurial==1.6
  Downloading mercurial-1.6.tar.gz (2.2Mb): 2.2Mb downloaded
...

(hg16)$ hg --version
Mercurial Distributed SCM (version 1.6)

(hg16)$ echo hello | ssh localhost /home/catlee/.virtualenvs/hg16/bin/hg -R /home/catlee/mozilla/mozilla-central serve --stdio
75
capabilities: unbundle lookup changegroupsubset branchmap pushkey stream=1

That looks pretty close to what hg.mozilla.org claims it supports, so let's time 'hg out' again:

(hg16)$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.73s user 0.04s system 3% cpu 24.278 total

tl;dr

Finding missing changesets between two local repositories is 6x slower with hg 1.6 (4 seconds with hg 2.0 vs. 24 seconds with hg 1.6). Add a few hundred people and machines hitting the same repository at the same time, and I imagine things can get bad pretty quickly. Some further searching reveals that mercurial does support a faster method of finding missing changesets in "newer" versions, although I can't figure out exactly when this change was introduced. There's already a bug on file for upgrading mercurial on hg.mozilla.org, so hopefully that improves the situation for pushes to try.

The tools we use every day aren't magical; they're subject to normal debugging and profiling techniques. If a tool you're using is holding you back, find out why!

cURL and paste

cURL and paste...two great tastes that apparently don't go well at all together! I've been writing a bunch of simple wsgi apps lately, some of which handle file uploads. Take this tiny application:

import webob

def app(environ, start_response):
    req = webob.Request(environ)
    req.body_file.read()
    return webob.Response("OK!")(environ, start_response)

import paste.httpserver
paste.httpserver.serve(app, port=8090)

Then throw some files at it with cURL:

[catlee] % for f in $(find -type f); do time curl -s -o /dev/null --data-binary @$f http://localhost:8090; done

curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 0% cpu 1.013 total

curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 63% cpu 0.013 total

curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 64% cpu 0.012 total

curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 81% cpu 0.015 total

curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 0% cpu 1.014 total

curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 0% cpu 1.009 total

Huh? Some files take a second to upload? After much digging, and rewriting my (more complicated) app several times, I discovered that the problem is that cURL sends an extra "Expect: 100-continue" header. This is supposed to let a web server respond with "100 Continue" immediately, or reject an upload based on the request headers. Unfortunately, paste's httpserver doesn't send the interim "100 Continue" response by default, so cURL waits for about a second before giving up and sending the rest of the request anyway. The magic to turn this off is the '-0' flag to cURL, which forces HTTP/1.0 mode:

[catlee] % for f in $(find -type f); do time curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090; done

curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 66% cpu 0.012 total

curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 64% cpu 0.012 total

curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.01s system 58% cpu 0.014 total

curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 66% cpu 0.012 total

curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 59% cpu 0.013 total

curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 65% cpu 0.012 total

Faster try builds!

When we run a try build, we wipe out the build directory between each job; we want to make sure that every user's build has a fresh environment to build in. Unfortunately this means that we also wipe out the clone of the try repo, and so we have to re-clone try every time. On Linux and OSX we were spending an average of 30 minutes to re-clone try, and on Windows 40 minutes. The majority of that is simply 'hg clone' time, but a good portion is due to locks: we need to limit how many simultaneous build slaves are cloning from try at once, otherwise the hg server blows up.

Way back in September, Steve Fink suggested using hg's share extension to make cloning faster. Then in November, Ben Hearsum landed some changes that paved the way to actually turning this on. Today we've enabled the share extension for Linux (both 32 and 64-bit) and OSX 10.6 builds on try. Windows and OSX 10.5 are coming too; we need to upgrade hg on the build machines first.

Average times for the 'clone' step are down to less than 5 minutes now. This means you get your builds 25 minutes faster! It also means we're not hammering the try repo so badly, and so hopefully won't have to reset it for a long long time. We're planning on rolling this out across the board, so nightly builds get faster, release builds get faster, clobber builds get faster, etc. Enjoy!

What do you want to know about builds?

Mozilla has been quite involved in recent buildbot development, in particular, helping to make it scale across multiple machines. More on this in another post!

Once deployed, these changes will give us the ability to give real-time access to various information about our build queue: the list of jobs waiting to start, and which jobs are in progress. This should help other tools like Tinderboxpushlog show more accurate information. One limitation of the upstream work so far is that it only captures a very coarse level of detail about builds: start/end time and result code is pretty much it. No further detail about the build is captured, like which slave it executed on, what properties it generated (which could include useful information like the URL to the generated binaries), etc.

We've also been exporting a json dump of our build status for many months now. It's been useful for some analysis, but it also has limitations: the data is always at least 5 minutes old by the time you look, and in-progress builds are not represented at all.

We're starting to look at ways of exporting all this detail in a way that's useful to more people. You want to get notified when your try builds are done? You want to look at which test suites are taking the most time? You want to determine how our build times change over time? You want to find out what the last all-green revision was on trunk? We want to make this data available, so anybody can write these tools.

Just how big is that firehose?

I think we have one of the largest buildbot setups out there and we generate a non-trivial amount of data:
  • 6-10 buildbot master processes generating updates, on different machines in 2 or 3 data centers
  • around 130 jobs per hour, composed of 4,773 individual steps per hour in total. That works out to about 1.4 updates generated per second

How you can help

This is where you come in. I can think of two main classes of interfaces we could set up: a query-type interface where you poll for information that you are interested in, and a notification system where you register a listener for certain types (or all!) events. What would be the best way for us to make this data available to you? Some kind of REST API? A message or event brokering system? pubsubhubbub? Is there some type of data or filtering that would be super helpful to you?

Buildbot performance and scaling

It seems like it was ages ago when I posted about profiling buildbot. One of the hot spots identified there was the dataReceived call. This has been sped up a little bit in recent versions of twisted, but our buildbot masters were still severely overloaded.

It turns out that the buildbot slaves make a lot of RPC calls when sending log data, which results in tens of thousands of dataReceived calls. Multiply that by several dozen build slaves sending log data, peaking at a combined throughput of 10 megabits/s, and you've got an awful lot of data to handle.

By adding a small slave-side buffer, the number of RPC calls to send log data is drastically reduced, by an order of magnitude in some tests, resulting in a much better load situation on the master. This is good for us, because it means the masters are much more responsive, and it's good for everybody else because it means we have fewer failures and wasted time due to the master being too busy to handle everything. It also means we can throw more build slaves onto the masters! The new code was deployed towards the end of the day on the 26th, or the end of the 12th week.

Faster build machines, faster end-to-end time

One metric Release Engineering focuses on a lot is the time between a commit to the source tree and when all the builds and tests for that revision are completed. We call this our end-to-end time. Our approach to improving this time has been to identify the longest bits in the total end-to-end time, and to make them faster.

One big chunk of the end-to-end time used to be wait times: how long a build or test waited for a machine to become free. We've addressed this by adding more build and test slaves into the pool. We've also focused on parallelizing the entire process, so instead of doing everything serially (the "before" diagram) we now do much of it in parallel (the "after" diagram; not exactly to scale, but you get the idea).

Faster builds please!

After splitting up tests and reducing wait times, the longest part of the entire process was now the actual build time. Last week we added 25 new machines to the pool of build slaves for windows. These new machines are real hardware machines (not VMs), with a quad-core 2.4 GHz processor, 4 GB RAM, and dedicated hard drives. We've been anticipating that the new build machines can crank out builds much faster. And they can. (The break in the graph is the time when mozilla-central was closed due to the PGO link failure bug.)

We started rolling the new builders into production on February 22nd (where the orange diamonds start on the graph). You can see the end-to-end time immediately start to drop. From February 1st to the 22nd, our average end-to-end time for win32 on mozilla-central was 4h09. Since the 22nd the average time has dropped down to 3h02. That's over an hour better on average, a 26% improvement.

Faster builds means tests can start faster, which means tests can be done sooner, which means a better end-to-end time. It also means that build machines become free to do another build sooner, and so we're hoping that these faster build machines will also improve our wait time situation (but see below). Currently we're limited to running the build on these machines with -j1 because of random hangs in the build when using -j2 or higher (bug 524149). Once we fix that, or move to pymake, we should see even better improvements.

What about OSX?

In preparation for deploying these new hardware build machines, we also implemented some prioritization algorithms to choose fast build machines over slow ones, and also to try and choose a machine that has a recent object directory (to make our incremental builds faster). This has helped us out quite a bit on our OSX end-to-end times as well, where we have a mixed pool of xserves and minis doing builds and tests. Simply selecting the right machine for the job reduced our OSX end-to-end time from 3h12 to 2h13, again almost an hour improvement, or 30% better.

What's next?

We have 25 linux machines that should be coming into our production pools this week. We'll continue to monitor the end-to-end and wait times over the next weeks to see how everything is behaving.

One thing I'm watching out for is that having faster builds means we can produce more builds in less time... which means more builds to test! Without enough machines running tests, we may end up making wait times, and therefore our end-to-end times, worse! We've already begun work on handling this. Our plan is to start doing the unittest runs on the talos hardware... but that's another post!

Faster signing

Once upon a time, when Mozilla released a new version of Firefox, signing all of the .exes and .dlls for all of the locales took a few hours. As more locales were added and file sizes increased, this grew to over 8 hours to sign all the locales in a Firefox 3.5 release. This has been a huge bottleneck for getting new releases out the door.

For our most recent Firefox releases, we've started using some new signing infrastructure that I've been working on over the past few months. There are quite a few reasons why the new infrastructure is faster:

  • Faster hardware. We've moved from an aging single core system to a new quad-core 2.5 GHz machine with 4 GB of RAM.
  • Concurrent signing of locales. Since we've got more cores on the system, we should take advantage of them! The new signing scripts spawn off 4 child processes, each one grabs one locale at a time to process.
  • In-process compression/decompression. Our complete.mar files use bz2 compression for every file in the archive. The old scripts would call the 'bzip2' and 'bunzip2' binaries to do compression and decompression. It's significantly faster to do these operations in-process.
  • Better caching of signed files. The biggest win came from the simple observation that after you sign a file, and re-compress it to include in a .mar file, you should be caching the compressed version to use later. The old scripts did cache signed files, but only the decompressed versions. So to sign the contents of our mar files, the contents would have to be completely unpacked and decompressed, then the cache was checked, files were signed or pulled from cache as necessary, and then re-compressed again. Now, we unpack the mar file, check the cache, and only decompress / sign / re-compress any files that need signing but aren't in the cache. We don't even bother decompressing files that don't need signing, another difference from the original scripts. Big thanks to Nick Thomas for having the idea for this at the Mozilla All-Hands in April.
As a result of all of this, signing all our locales can be done in less than 15 minutes now! See bug 470146 for the gory details. The main bottleneck at this point is the time it takes to transfer all of the files to the signing machine and back again. For a 3.5 release, there's around 3.5 GB of data to transfer back and forth, which takes on the order of 10 minutes to pull down to the signing machine, and another few minutes to push back after signing is done.