
Posts about performance

GCS Performance Experiments

Recently I've been digging into performance optimizations for some code that transfers dozens of assets into a bucket in Google Cloud Storage (GCS).

The code in question uses Google's Ruby gem and uploads a set of files in sequence, something like this:

require "google/cloud/storage"
storage = Google::Cloud::Storage.new(...)
bucket = storage.bucket(...)

files.each do |filename|
  bucket.create_file(filename, filename)
end

For this naive implementation, it takes about 45 seconds to upload 75 files, totalling about 350k of data. That works out to about 600ms per file! This seemed awfully slow, so I wanted to understand what was going on.

(note that all the timings here were collected on my laptop running on my home wifi)

Jump down to the TL;DR for spoilers, or keep reading for way too much detail.

Simplify the dataset

First, I wanted to see if the size of the files had any impact on the speed here, so I created a test with just 50 empty text files:

$ touch assets/{1..50}.txt
$ time ruby gcs.rb !$
ruby gcs.rb assets/{1..50}.txt  0.61s user 0.20s system 2% cpu 27.951 total

That's still about 28 seconds, or about 550ms per asset, which seems super slow for a set of empty files.
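To get a per-file breakdown rather than a single wall-clock total, one option is to time each upload from inside Ruby. Here's a minimal sketch (not from the original code), assuming the 'bucket' and 'files' variables from the naive snippet above:

require "benchmark"

# Time each create_file call individually to see whether any particular
# file is slow, or whether the cost is roughly constant per request.
files.each do |filename|
  elapsed = Benchmark.realtime { bucket.create_file(filename, filename) }
  puts format("%-25s %.3fs", filename, elapsed)
end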

Start digging

Looking at the HTTP traffic generated by running this code, it appears as though Google's gem is creating two(!!) new TCP/HTTPS connections per asset. The default behaviour of the gem is to use a "resumable upload", where one connection is used to start the upload, and then a second connection transfers the data and finalizes the upload. It doesn't appear as though any connection pooling is happening either.

It looks like there's room for some improvement.

A few ideas came to mind:

  • Use a single request per asset
  • Use connection pooling
  • Run requests in parallel (using threading or async)
  • Use HTTP/2

Single request per asset

The GCS API is pretty simple, so it's straightforward to implement a basic upload function in Ruby.

require "net/http"
require "uri"

# Upload a single object with one plain (non-resumable) POST request.
# ENV["AUTH"] holds an OAuth2 access token for GCS.
def upload_object(bucket, filename)
  Net::HTTP.post(
    URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}"),
    File.read(filename),
    {"Authorization" => "Bearer #{ENV["AUTH"]}"}
  )
end

This uses one connection per asset, and brings the time down to about 11 seconds for our 50 empty assets, less than half the time of the naive version.
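For completeness, a driver loop for this version might look something like the sketch below. This is hypothetical glue code, not from the original post: 'my-bucket' is a placeholder bucket name, 'files' is the same list of paths as before, and the AUTH environment variable is assumed to hold a valid OAuth2 access token (for example, the output of 'gcloud auth print-access-token').

files.each do |filename|
  resp = upload_object("my-bucket", filename)
  # Net::HTTP.post returns a Net::HTTPResponse; anything outside 2xx is a failure.
  warn "upload of #{filename} failed: #{resp.code}" unless resp.is_a?(Net::HTTPSuccess)
end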

Re-using the same connection

Net::HTTP supports re-using the same connection; we just need to restructure the code a little bit:

def upload_object(client, bucket, filename)
  uri = URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{ENV["AUTH"]}"
  req.body = File.read(filename)
  client.request(req)
end

Net::HTTP.start("storage.googleapis.com", 443, use_ssl: true) do |http|
  files.each do |filename|
    upload_object(http, bucket, filename)
  end
end

This runs a bit faster now, in 8 seconds total.

Parallelization

Ruby has a few concurrency models we can play with. I try to avoid threading wherever possible, and use async libraries. Luckily, Ruby's async gem handles wrapping Net::HTTP quite well:

require "async"
require "async/barrier"

Async do
  barrier = Async::Barrier.new
  files.each do |filename|
    # Each upload runs in its own fiber; upload_object here is the
    # one-request-per-asset version from above.
    barrier.async do
      resp = upload_object(bucket, filename)
    end
  end
  barrier.wait
end

Now we can finish all 50 uploads in about 1.3 seconds.
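One caveat with this approach: the loop above starts all of the uploads at once, each with its own connection. That was fine for this test, but if the file count grows you may want to cap how many requests are in flight. Here's a sketch using the async gem's Async::Semaphore; the limit of 10 is an arbitrary value chosen for illustration:

require "async"
require "async/barrier"
require "async/semaphore"

Async do
  barrier = Async::Barrier.new
  # Allow at most 10 uploads to be in flight at any one time.
  semaphore = Async::Semaphore.new(10, parent: barrier)

  files.each do |filename|
    semaphore.async do
      upload_object(bucket, filename)
    end
  end
  barrier.wait
end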

HTTP/2

There are various options for making HTTP/2 requests in Ruby. One we can use is async-http, which integrates well with the async gem used above. Another gem that's worked well for me is httpx.

require "async"
require "async/barrier"
require "async/http/internet"

Async do
  barrier = Async::Barrier.new
  # Async::HTTP negotiates HTTP/2 and multiplexes requests over one connection.
  internet = Async::HTTP::Internet.new

  files.each do |filename|
    barrier.async do
      resp = internet.post(
        "https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}",
        {"Authorization" => "Bearer #{ENV["AUTH"]}"},
        File.read(filename))
    end
  end
  barrier.wait
end

This finishes all 50 requests in 0.4s!

Looking back at our initial data set, which took 45 seconds with the naive code, we can now upload it in about 0.9 seconds.

TL;DR

To summarize, here are the times for uploading 50 empty files to a bucket:

Method                     Time
naive (google gem)         28s
single request per asset   11s
single http connection     8s
async                      1.3s
HTTP/2                     0.4s

I'm really shocked at how much faster HTTP/2 is here.

Consumers of Google Cloud APIs should take a look at the libraries that they're using, and see if they can switch over to ones that support HTTP/2. I'm curious if other client libraries for the Google Cloud APIs support HTTP/2.

Can the Ruby gem support HTTP/2? There's an open issue on the GitHub repo to switch the underlying client implementation to Faraday, which would allow using an HTTP/2-aware client under the hood. I've started working on a draft PR to see what would be involved in switching, but there are some major barriers at this point.

Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There's a bit more low-hanging fruit for caching. We have internal PyPI repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines outside the VPN tunnel, or perhaps sending them to S3 directly. Additionally, a big source of network traffic is signing various packages (gpg signatures, MAR files, etc.); we're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages across different platforms, and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

Investigating hg performance

(Caveat lector: this is a long post with lots of shell snippets and output; it's mostly a brain dump of what I did to investigate performance issues on hg.mozilla.org. I hope you find it useful. Scroll to the bottom for the summary.) Everybody knows that pushing to try can be slow. But why? While waiting for my push to try to complete, I wondered what exactly was slow. I started by cloning my own version of try:

$ hg clone http://hg.mozilla.org try
destination directory: try
requesting all changes
adding changesets
adding manifests
adding file changes
added 95917 changesets with 447521 changes to 89564 files (+2446 heads)
updating to branch default
53650 files updated, 0 files merged, 0 files removed, 0 files unresolved

Next I instrumented hg so I could get some profile information:

$ sudo vi /usr/local/bin/hg

python -m cProfile -o /tmp/hg.profile /usr/bin/hg $*

Then I timed how long it took to check what would be pushed:

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.57s user 0.04s system 54% cpu 1.114 total

That's not too bad. Let's check our profile:

import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:25:02 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 0.593 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    0.462    0.036    0.462    0.036 {method 'readline' of 'file' objects}
        1    0.039    0.039    0.039    0.039 {mercurial.parsers.parse_index2}
       40    0.031    0.001    0.031    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.009    0.000    0.019    0.000 {__import__}
     6326    0.004    0.000    0.006    0.000 cmdutil.py:15(parsealiases)
       13    0.003    0.000    0.003    0.000 {method 'read' of 'file' objects}
       93    0.002    0.000    0.008    0.000 cmdutil.py:18(findpossible)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}
  392/313    0.001    0.000    0.007    0.000 demandimport.py:92(_demandimport)

The top item is readline() on file objects? I wonder if that's socket operations. I'm ssh'ing to localhost, so it's really fast. Let's add 100ms latency:

$ sudo tc qdisc add dev lo root handle 1:0 netem delay 100ms

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.58s user 0.05s system 14% cpu 4.339 total


import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:42:09 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 2.728 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    2.583    0.199    2.583    0.199 {method 'readline' of 'file' objects}
        1    0.054    0.054    0.054    0.054 {mercurial.parsers.parse_index2}
       40    0.028    0.001    0.028    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.010    0.000    0.019    0.000 {__import__}
       13    0.006    0.000    0.006    0.000 {method 'read' of 'file' objects}
     6326    0.002    0.000    0.004    0.000 cmdutil.py:15(parsealiases)
       93    0.002    0.000    0.006    0.000 cmdutil.py:18(findpossible)
  392/313    0.002    0.000    0.008    0.000 demandimport.py:92(_demandimport)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}

Yep, definitely getting worse with more latency on the network connection. Oh, and I'm using a recent version of hg:

$ hg --version

Mercurial Distributed SCM (version 2.0)



$ echo hello | ssh localhost hg -R /home/catlee/mozilla/try serve --stdio

145

capabilities: lookup changegroupsubset branchmap pushkey known getbundle unbundlehash batch stream unbundle=HG10GZ,HG10BZ,HG10UN httpheader=1024

This doesn't match what hg.mozilla.org is running:

$ echo hello | ssh hg.mozilla.org hg -R /mozilla-central serve --stdio  

67

capabilities: unbundle lookup changegroupsubset branchmap stream=1

So it must be using an older version. Let's see what mercurial 1.6 does:

$ mkvirtualenv hg16

New python executable in hg16/bin/python

Installing setuptools...



(hg16)$ pip install mercurial==1.6

Downloading/unpacking mercurial==1.6
  Downloading mercurial-1.6.tar.gz (2.2Mb): 2.2Mb downloaded
...



(hg16)$ hg --version

Mercurial Distributed SCM (version 1.6)



(hg16)$ echo hello | ssh localhost /home/catlee/.virtualenvs/hg16/bin/hg -R /home/catlee/mozilla/mozilla-central serve --stdio

75

capabilities: unbundle lookup changegroupsubset branchmap pushkey stream=1

That looks pretty close to what hg.mozilla.org claims it supports, so let's time 'hg out' again:

(hg16)$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.73s user 0.04s system 3% cpu 24.278 total

tl;dr

Finding missing changesets between two local repositories is 6x slower with hg 1.6 (4 seconds with hg 2.0 versus 24 seconds with hg 1.6). Add a few hundred people and machines hitting the same repository at the same time, and I imagine things can get bad pretty quickly. Some further searching reveals that mercurial does support a faster method of finding missing changesets in "newer" versions, although I can't figure out exactly when this change was introduced. There's already a bug on file for upgrading mercurial on hg.mozilla.org, so hopefully that improves the situation for pushes to try. The tools we use every day aren't magical; they're subject to normal debugging and profiling techniques. If a tool you're using is holding you back, find out why!

cURL and paste

cURL and paste... two great tastes that apparently don't go well together at all! I've been writing a bunch of simple WSGI apps lately, some of which handle file uploads. Take this tiny application:

import webob
import paste.httpserver


def app(environ, start_response):
    req = webob.Request(environ)
    req.body_file.read()
    return webob.Response("OK!")(environ, start_response)


paste.httpserver.serve(app, port=8090)

Then throw some files at it with cURL:

[catlee] % for f in $(find -type f); do time curl -s -o /dev/null --data-binary @$f http://localhost:8090; done
curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 0% cpu 1.013 total
curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 63% cpu 0.013 total
curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 64% cpu 0.012 total
curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 81% cpu 0.015 total
curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 0% cpu 1.014 total
curl -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 0% cpu 1.009 total

Huh? Some files take a second to upload? After much digging, and rewriting my (more complicated) app several times, I discovered that the problem is that cURL sends an extra "Expect: 100-continue" header. This is supposed to let a web server either respond with "100 Continue" immediately or reject the upload based on the request headers. The problem is that paste's httpserver doesn't send the "100 Continue" response, so cURL waits for a second before giving up and sending the rest of the request anyway. The magic to turn this off is the '-0' flag to cURL, which forces HTTP/1.0 mode (Expect: 100-continue is an HTTP/1.1 feature):

[catlee] % for f in $(find -type f); do time curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090; done
curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 66% cpu 0.012 total
curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 64% cpu 0.012 total
curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.01s system 58% cpu 0.014 total
curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 66% cpu 0.012 total
curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.00s user 0.00s system 59% cpu 0.013 total
curl -0 -s -o /dev/null --data-binary @$f http://localhost:8090  0.01s user 0.00s system 65% cpu 0.012 total

Faster try builds!

When we run a try build, we wipe out the build directory between each job; we want to make sure that every user's build has a fresh environment to build in. Unfortunately this means that we also wipe out the clone of the try repo, and so we have to re-clone try every time. On Linux and OSX we were spending an average of 30 minutes re-cloning try, and on Windows 40 minutes. The majority of that is simply 'hg clone' time, but a good portion is due to locks: we need to limit how many build slaves are cloning from try at once, otherwise the hg server blows up.

Way back in September, Steve Fink suggested using hg's share extension to make cloning faster. Then in November, Ben Hearsum landed some changes that paved the way to actually turning this on. Today we've enabled the share extension for Linux (both 32 and 64-bit) and OSX 10.6 builds on try. Windows and OSX 10.5 are coming too; we need to upgrade hg on the build machines first.

Average times for the 'clone' step are down to less than 5 minutes now. This means you get your builds 25 minutes faster! It also means we're not hammering the try repo so badly, and so hopefully we won't have to reset it for a long, long time. We're planning on rolling this out across the board, so nightly builds get faster, release builds get faster, clobber builds get faster, etc... Enjoy!

What do you want to know about builds?

Mozilla has been quite involved in recent buildbot development, in particular, helping to make it scale across multiple machines. More on this in another post!

Once deployed, these changes will give us the ability to give real time access to various information about our build queue: the list of jobs waiting to start, and which jobs are in progress. This should help other tools like Tinderboxpushlog show more accurate information.

One limitation of the upstream work so far is that it only captures a very coarse level of detail about builds: start/end time, and result code is pretty much it. No further detail about the build is captured, like which slave it executed on, what properties it generated (which could include useful information like the URL to the generated binaries), etc.

We've also been exporting a json dump of our build status for many months now. It's been useful for some analysis, but it also has limitations: the data is always at least 5 minutes old by the time you look, and in-progress builds are not represented at all.

We're starting to look at ways of exporting all this detail in a way that's useful to more people. You want to get notified when your try builds are done? You want to look at which test suites are taking the most time? You want to determine how our build times change over time? You want to find out what the last all-green revision was on trunk? We want to make this data available, so anybody can write these tools.

Just how big is that firehose?

I think we have one of the largest buildbot setups out there and we generate a non-trivial amount of data:
  • 6-10 buildbot master processes generating updates, on different machines in 2 or 3 data centers
  • around 130 jobs per hour, composed of 4,773 individual steps per hour in total. That works out to about 1.4 updates generated per second

How you can help

This is where you come in. I can think of two main classes of interfaces we could set up: a query-type interface where you poll for the information you're interested in, and a notification system where you register a listener for certain types of events (or all of them!). What would be the best way for us to make this data available to you? Some kind of REST API? A message or event brokering system? pubsubhubbub? Is there some type of data or filtering that would be super helpful to you?

Buildbot performance and scaling

It seems like it was ages ago when I posted about profiling buildbot. One of the hot spots identified there was the dataReceived call. This has been sped up a little bit in recent versions of twisted, but our buildbot masters were still severely overloaded. It turns out that the buildbot slaves make a lot of RPC calls when sending log data, which results in tens of thousands of dataReceived calls. Multiply that by several dozen build slaves sending log data, peaking at a combined throughput of 10 megabits/s, and you've got an awful lot of data to handle.

By adding a small slave-side buffer, the number of RPC calls to send log data is drastically reduced, by an order of magnitude in some tests, resulting in a much better load situation on the master. This is good for us, because it means the masters are much more responsive, and it's good for everybody else because it means we have fewer failures and less wasted time due to the master being too busy to handle everything. It also means we can throw more build slaves onto the masters! The new code was deployed towards the end of the day on the 26th, or the end of the 12th week.

Faster build machines, faster end-to-end time

One metric Release Engineering focuses on a lot is the time between a commit to the source tree and when all the builds and tests for that revision are completed. We call this our end-to-end time. Our approach to improving this time has been to identify the longest bits in the total end-to-end time, and to make them faster. One big chunk of the end-to-end time used to be wait times: how long a build or test waited for a machine to become free. We've addressed this by adding more build and test slaves into the pool. We've also focused on parallelizing the entire process, so instead of running everything serially (the "before" diagram), we now run builds and tests in parallel (the "after" diagram; not exactly to scale, but you get the idea).

Faster builds please!

After splitting up tests and reducing wait times, the longest part of the entire process was now the actual build time. Last week we added 25 new machines to the pool of build slaves for Windows. These new machines are real hardware (not VMs), with a quad-core 2.4 GHz processor, 4 GB RAM, and dedicated hard drives. We've been anticipating that the new build machines can crank out builds much faster. And they can. (The break in the graph is the time when mozilla-central was closed due to the PGO link failure bug.)

We started rolling the new builders into production on February 22nd (where the orange diamonds start on the graph). You can see the end-to-end time immediately start to drop. From February 1st to the 22nd, our average end-to-end time for win32 on mozilla-central was 4h09. Since the 22nd the average time has dropped to 3h02. That's over an hour better on average, a 26% improvement.

Faster builds mean tests can start sooner, which means tests are done sooner, which means a better end-to-end time. It also means that build machines become free to do another build sooner, so we're hoping that these faster build machines will also improve our wait time situation (but see below). Currently we're limited to running the build on these machines with -j1 because of random hangs in the build when using -j2 or higher (bug 524149). Once we fix that, or move to pymake, we should see even better improvements.

What about OSX?

In preparation for deploying these new hardware build machines, we also implemented some prioritization algorithms to choose fast build machines over slow ones, and to try to choose a machine that has a recent object directory (to make our incremental builds faster). This has helped us out quite a bit on our OSX end-to-end times as well, where we have a mixed pool of xserves and minis doing builds and tests.

OSX end-to-end time

Simply selecting the right machine for the job reduced our end-to-end time from 3h12 to 2h13, again almost an hour of improvement, or 30% better.

What's next?

We have 25 Linux machines that should be coming into our production pools this week. We'll continue to monitor the end-to-end and wait times over the next few weeks to see how everything is behaving. One thing I'm watching out for is that having faster builds means we can produce more builds in less time... which means more builds to test! Without enough machines running tests, we may end up making wait times, and therefore our end-to-end times, worse! We've already begun work on handling this. Our plan is to start doing the unittest runs on the Talos hardware... but that's another post!

Faster signing

Once upon a time, when Mozilla released a new version of Firefox, signing all of the .exes and .dlls for all of the locales took a few hours. As more locales were added and file sizes increased, that grew to over 8 hours to sign all the locales in a Firefox 3.5 release. This has been a huge bottleneck for getting new releases out the door. For our most recent Firefox releases, we've started using some new signing infrastructure that I've been working on over the past few months. There are quite a few reasons why the new infrastructure is faster:

  • Faster hardware. We've moved from an aging single core system to a new quad-core 2.5 GHz machine with 4 GB of RAM.
  • Concurrent signing of locales. Since we've got more cores on the system, we should take advantage of them! The new signing scripts spawn off 4 child processes, each one grabs one locale at a time to process.
  • In-process compression/decompression. Our complete.mar files use bz2 compression for every file in the archive. The old scripts would call the 'bzip2' and 'bunzip2' binaries to do compression and decompression. It's significantly faster to do these operations in-process.
  • Better caching of signed files. The biggest win came from the simple observation that after you sign a file, and re-compress it to include in a .mar file, you should be caching the compressed version to use later. The old scripts did cache signed files, but only the decompressed versions. So to sign the contents of our mar files, the contents would have to be completely unpacked and decompressed, then the cache was checked, files were signed or pulled from cache as necessary, and then re-compressed again. Now, we unpack the mar file, check the cache, and only decompress / sign / re-compress any files that need signing but aren't in the cache. We don't even bother decompressing files that don't need signing, another difference from the original scripts. Big thanks to Nick Thomas for having the idea for this at the Mozilla All-Hands in April.
As a result of all of this, signing all our locales can be done in less than 15 minutes now! See bug 470146 for the gory details. The main bottleneck at this point is the time it takes to transfer all of the files to the signing machine and back again. For a 3.5 release, there's around 3.5 GB of data to transfer back and forth, which takes on the order of 10 minutes to pull down to the signing machine, and another few minutes to push back after signing is done.

Profiling Buildbot

Buildbot is a critical part of our build infrastructure at Mozilla. We use it to manage builds on 5 different platforms (Linux, Mac, Windows, Maemo and Windows Mobile), and 5 different branches (mozilla-1.9.1, mozilla-central, TraceMonkey, Electrolysis, and Places). All in all we have 80 machines doing builds across 150 different build types (not counting Talos; all the Talos test runs and slaves are managed by a different master). And buildbot is at the center of it all. The load on our machine running buildbot is normally fairly high, and occasionally spikes so that the buildbot process is unresponsive. It normally restores itself within a few minutes, but I'd really like to know why it's doing this! Running our staging buildbot master with python's cProfile module for almost two hours yields the following profile:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   416377 4771.188    0.011 4796.749    0.012 {select.select}
       52  526.891   10.133  651.043   12.520 /tools/buildbot/lib/python2.5/site-packages/buildbot-0.7.10p1-py2.5.egg/buildbot/status/web/waterfall.py:834(phase2)
     6518  355.370    0.055  355.370    0.055 {posix.fsync}
   232582  238.943    0.001 1112.039    0.005 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:150(dataReceived)
 10089681  104.395    0.000  130.089    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:36(b1282int)
36798140/36797962   83.536    0.000   83.537    0.000 {len}
 29913653   70.458    0.000   70.458    0.000 {method 'append' of 'list' objects}
      311   63.775    0.205   63.775    0.205 {bz2.compress}
 10088987   56.581    0.000  665.982    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:141(gotItem)
4010792/1014652   56.079    0.000  176.693    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/jelly.py:639(unjelly)
2343910/512709   47.954    0.000  112.446    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:281(_encode)

Interpreting the results

select shows up in the profile because we're profiling wall-clock time, not CPU time. The more time we're spending in select, the better: time spent inside select is idle time, where we're just waiting for data to arrive. The overall run time for this profile was 7,532 seconds, so select accounts for around 63% of our total time.

We already knew that the buildbot waterfall was slow (the second line in the profile). fsync isn't too surprising either: buildbot calls fsync after writing log files to disk. We've considered removing this call, and this profile lends support to our original guess.

The next entries really surprised me: twisted's dataReceived and a decoding function, b1282int. These are called when processing data received from the slaves. If I'm reading this correctly, dataReceived and its children account for around 40% of our total time once you remove the time spent in select: 1112 / (7532 - 4796) ≈ 40%.

These results are from our staging buildbot master, which doesn't have anywhere near the same load as the production buildbot master. I would expect that the time spent waiting in select would go down on the production master (there's more data being received, more often), and that time spent in fsync and dataReceived would go up.

What to do about it?

A few ideas....
  • Use psyco to get some JIT goodness applied to some of the slower python functions.
  • Remove the fsync call after saving logs.
  • Profile CPU time rather than wall-clock time. This will give a different perspective on the performance of buildbot, which should give better information about where we're spending time processing data.
  • Implement slow pieces in C (or cython). Twisted's Banana library looks do-able in C, and also is high up in the profile.
  • Send less data from the slaves. We're currently logging all stdout/stderr produced by the slaves. All of this data is processed by the master process and then saved to disk.
  • Rearchitect buildbot to handle this kind of load.
  • Have more than one buildbot master, each one handling fewer slaves. We're actively looking into this approach, since it also allows us to have some redundancy for this critical piece of our infrastructure.