Nightly build times getting slower over time

chris

2011-02-11 18:04

Yesterday some folks in #developers mentioned they felt their builds were getting slower over time. I wondered if the same was true for our build machines.

Here's a chart of build times for the past year. This is just the compile + link step for nightly builds, restricted to a single class of hardware per OS. Same machines. Slower builds. Something isn't right here. Windows builds have gone from an average of 90 minutes last March to 150 minutes this January. The big jump for OSX builds at the end of September is when we turned on the universal x86/x86_64 builds. There's a pretty clear upward trend; some of this is to be expected given new features being added, but at the same time more complexity is creeping into the Makefiles. Each little bit costs developers extra time every day doing their own builds, and it also means slower builds in the build infrastructure. Which means you'll wait longer to get try results, our build pools will have longer wait times, dogs and cats living together, and mass hysteria! I'm sure there are places in our build process that can be sped up. Think you can help? Are you a build system rock star? Do you refactor Makefiles in your sleep? Great! We're hiring!

Just who am I talking to? (verifying https connections with python)

chris

2011-02-10 15:43

Comments

Did you know that python's urllib module supports connecting to web servers over HTTPS? It's easy!


import urllib

data = urllib.urlopen("https://www.google.com").read()

print data

Did you also know that it provides absolutely zero guarantees that your "secure" data isn't being observed by a man-in-the-middle? Run this:


from paste import httpserver

def app(environ, start_response):
    start_response("200 OK", [])
    return "Thanks for your secrets!"


httpserver.serve(app, host='127.0.0.1', port='8080', ssl_pem='*')

This little web app will generate a random SSL certificate for you each time it's run. A self-signed, completely untrustworthy certificate. Now modify your first script to look at https://localhost:8080 instead. Or, for more fun, keep it pointing at google and mess with your IP routing to redirect google.com:443 to localhost:8080.


iptables -t nat -A OUTPUT -d google.com -p tcp --dport 443 -j DNAT --to-destination 127.0.0.1:8080

Run your script again, and see what it says. Instead of the raw HTML of google.com, you now get "Thanks for your secrets!". That's right, python will happily accept without complaint or warning the random certificate generated this little python app pretending to be google.com. Sometimes you want to know who you're talking to, you know?


import httplib, socket, ssl, urllib2

def buildValidatingOpener(ca_certs):
    class VerifiedHTTPSConnection(httplib.HTTPSConnection):
        def connect(self):
            # overrides the version in httplib so that we do
            #    certificate verification
            sock = socket.create_connection((self.host, self.port),
                                            self.timeout)
            if self._tunnel_host:
                self.sock = sock
                self._tunnel()

            # wrap the socket using verification with the root
            #    certs in trusted_root_certs
            self.sock = ssl.wrap_socket(sock,
                                        self.key_file,
                                        self.cert_file,
                                        cert_reqs=ssl.CERT_REQUIRED,
                                        ca_certs=ca_certs,
                                        )

    # wraps https connections with ssl certificate verification
    class VerifiedHTTPSHandler(urllib2.HTTPSHandler):
        def __init__(self, connection_class=VerifiedHTTPSConnection):
            self.specialized_conn_class = connection_class
            urllib2.HTTPSHandler.__init__(self)

        def https_open(self, req):
            return self.do_open(self.specialized_conn_class, req)

    https_handler = VerifiedHTTPSHandler()
    url_opener = urllib2.build_opener(https_handler)

    return url_opener


opener = buildValidatingOpener("/usr/lib/ssl/certs/ca-certificates.crt")

req = urllib2.Request("https://www.google.com")

print opener.open(req).read()

Using the this new validating url opener, we can make sure we're talking to someone with a validly signed certificate. With our IP redirection in place, or pointing at localhost:8080 explicitly we get a certificate invalid error. We still don't know for sure that it's google (could be some other site with a valid ssl certificate), but maybe we'll tackle that in a future post!

Faster try builds!

chris

2011-02-04 17:55

Comments

When we run a try build, we wipe out the build directory between each job; we want to make sure that every user's build has a fresh environment to build in. Unfortunately this means that we also wipe out the clone of the try repo, and so we have to re-clone try every time. On Linux and OSX we were spending an average of 30 minutes to re-clone try, and on Windows 40 minutes. The majority of that is simply 'hg clone' time, but a good portion is due to locks: we need to limit how many simultaneous build slaves are cloning from try at once, otherwise the hg server blows up. Way back in September, Steve Fink suggested using hg's share extension to make cloning faster. Then in November, Ben Hearsum landed some changes that paved the way to actually turning this on. Today we've enabled the share extension for Linux (both 32 and 64-bit) and OSX 10.6 builds on try. Windows and OSX 10.5 are coming too, we need to upgrade hg on the build machines first. Average times for the 'clone' step are down to less than 5 minutes now. This means you get your builds 25 minutes faster! It also means we're not hammering the try repo so badly, and so hopefully won't have to reset it for a long long time. We're planning on rolling this out across the board, so nightly builds get faster, release builds get faster, clobber builds get faster, etc... Enjoy!

Better nightly builds

chris

2010-12-08 15:18

Comments

On November 24th we landed some changes we think are a big improvement to how we've been doing nightly builds. We've started doing nightly builds on the same revision across platforms, and where possible, on revisions that we already know compile. In addition, all of the nightly builds will share the same buildid. We pick the revision to build by looking at the past 24 hours of builds, and finding the latest one that built successfully on all platforms. If there is no such revision, we'll build the latest revision on the branch and hope for the best. We also do some extra checking to make sure we don't build a revision that's older than the previous nightly. Prior to this, we would trigger the nightly builds at the same time every day, but which revision and buildid each platform got was completely dependent on what time the builds actually started. There were many instances of nightly builds having different revisions between the various platforms. These changes are a big deal because they mean that we have a much better chance of getting a working nightly build every day (or at least one that compiles!), and all the platforms will be based on the same revision, regardless of when they are run. No more guessing if today's Windows build includes the same fixes as the Mac build! If you're interested in the implementation, which includes some pretty fun buildbot hacking, the details are in bug 570814. Background discussion as to how to choose the revision to use can be found in this thread.

3 days of fun: a journey into the bowels of buildbot

chris

2010-10-29 10:00

Comments

I've just spent 3 days trying to debug some new code in buildbot. The code in question is to implement a change to how we do nightly builds such that they use the same revision for all platforms. I was hitting a KeyError exception inside buildbot's util.loop code, specifically at a line where it is trying to delete a key from a dictionary. In simple form, the loop is doing this:


for k in d.keys():
    if condition:
        del d[k] # Raises KeyError....sometimes...

Tricky bit was, it didn't happen every time. I'd have to wait at least 3 minutes between attempts. So I added a bunch of debugging code:


print d

print d.keys()

for k in d.keys():
    print k
    if condition:
        try:
            del d[k] # Raises KeyError....sometimes...
        except KeyError:
            print k in d # sanity check 1
            print k in d.keys() # sanity check 2

Can you guess what the results of sanity checks 1 and 2 were? 'k in d' is False, but 'k in d.keys()' is True. whhhaaaaa? Much head scratching and hair pulling ensued. I tried many different variations of iterating through the loop, all with the same result. In the end, I posted a question on Stack Overflow. At the same time, Bear and Dustin were zeroing in on a solution. The crucial bit here is that the keys of d are (follow me here...) methods of instances of my new scheduler classes, which inherit from buildbot.util.ComparableMixin...which implements __cmp__ and __hash__. __cmp__ is used in the 'k in d.keys()' test, but __hash__ is used in the 'k in d' test. Some further digging revealed that my scheduler was modifying state that ComparableMixin.__hash__ was referring to, resulting in the scheduler instances not having stable hashes over time. Meanwhile, on stackoverflow, adw came up with an answer that confirmed what Dustin and Bear were saying, and katrielalex came up with a simple example to reproduce the problem. In the end, the fix was simple, just a few extra lines of python code. Too bad it took me 3 days to figure out!

The long road to victory: build & test logs on FTP

chris

2010-10-28 09:00

Comments

One of the biggest pieces of infrastructure keeping us attached to Tinderbox is the issue of build logs. Tinderbox has (and continues to be) the canonical source of a build or test's output. A while ago we made some changes that should help pry us loose from this close dependency on Tinderbox. Build & test logs for all regular depend & try builds are now available on FTP. For example, today's nightly build log for windows is available at http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2010-10-27-06-mozilla-central/mozilla-central-win32-nightly-build26.txt.gz. It's also in the same directory as the build itself! Test logs are available for depend builds, e.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/1288215864/mozilla-central_leopard_test-crashtest-build1283.txt.gz. Again, all the builds, logs and other files (test archives, symbols, etc.) are in the same directory. Logs for your try builds and test are also available, e.g. http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/[email protected]/tryserver-macosx64/. Here too all the logs and binaries are in the same directory. In addition you should be getting links to those logs in your try server notification emails. There's room for improvement here, and a few places where logs aren't getting put in all the right places, but this is still a major step forward. Enjoy! Note to future readers that a some of the links to the logs above will be invalid a few weeks after this post as old directories are purged.

Linux on a new Thinkpad T510

chris

2010-10-27 14:19

Comments

I got a new Thinkpad T510 at work to replace my aging MacBook Pro. I asked for a Thinkpad instead of another MacBook because I wanted hardware with better hardware support, in particular the trackpad. I got into the habit of bringing a USB mouse everywhere I went because the trackpad on the MacBook was so unreliable on linux. So when my new T510 arrived, I was pretty excited. And, except for one tiny problem (of the PEBKAC kind), transferring all my files from the old machine to the new one went flawlessly. Here's how I set up the new machine:

Download image from sysresccd.org. Follow the instructions to make a bootable USB drive.
Boot up computer off USB drive. Resize the existing NTFS partition to be really small. Add 2 new partitions in the new-free space: one for the boot partition for linux, and one to be encrypted and be formatted with lvm.
Format boot partition as ext3. Setup encrypted partition with 'cryptsetup luksFormat /dev/sda6; cryptsetup luksOpen /dev/sda6 crypt_sda6'. Setup LVM with 'pvcreate /dev/mapper/crypt_sda6'. Setup two volumes, one for swap, and one for the root partition.
Connect network cable between old laptop and new one. Configure local network.
Copy files from old /boot to new /boot.
Copy files from old / to new /. Here's where I messed up. My command was: 'rsync -aPxX 192.168.2.1:/ /target/'.
Install grub.
Reboot!

At this point the machine came up ok, but wasn't prompting to decrypt my root drive, and so I had to do some manual steps to get the root drive to mount initially. Fixing up /etc/crypttab and the initramfs solved this. However even after this I was having some problems. I couldn't connect to wireless networks with Network Manager. I couldn't run gnome-power-manager. Files in /var/lib/mysql were owned by ntp! Then I realized that my initial rsync had copied over files preserving the user/group names, not the uid/gid values. And since I wasn't booting off a Debian image, the id/name mappings were quite different. Re-running rsync with '--numeric-ids' got all the ownerships fixed up. After the next reboot things were working flawlessly. Now after a few weeks of using it, I'm enjoying it a lot more than my MacBook Pro. It boots up faster. It connects to wireless networks faster. It suspends/unsuspends faster. It's got real, live, page-up/page-down keys! The trackpad actually works!

Doing the branch dance

chris

2010-10-21 13:24

Comments

This Monday (October 25, 2010), we will be renaming the branches in our buildbotcustom hg repository. The current 'default' branch will be renamed to be called 'buildbot-0.7', and the current 'buildbot-0.8.0' branch will be renamed to 'default'. What does this mean for you? You need to explicitly specify which branch you want to update to after pulling. If you're currently on 'default', you should update to the 'buildbot-0.7' branch. If you're currently on 'buildbot-0.8.0', you should update to the 'default' branch. I'll be running these steps, for those who are interested in the gory details:


#!/bin/sh

set -e

test -d buildbotcustom-branchdance && rm -rf buildbotcustom-branchdance

hg clone http://hg.mozilla.org/build/buildbotcustom buildbotcustom-branchdance

cd buildbotcustom-branchdance



hg update default

echo 'This is the old default branch' > README

hg add README

hg commit --close-branch -m "Closing old default branch"

hg branch buildbot-0.7

echo 'This is the buildbot-0.7 branch' > README

hg commit -m 'Moving default to buildbot-0.7 branch'



hg update buildbot-0.8.0

echo 'This is the old buildbot-0.8.0 branch' > README

hg add README

hg commit --close-branch -m "Closing old buildbot-0.8.0 branch"

hg branch -f default

echo 'This is the default branch' > README

hg commit -m 'Moving buildbot-0.8.0 to the default branch'



echo "Out"

hg out



echo "Heads"

hg heads

A year in RelEng

chris

2010-05-11 10:15

Comments

Something prompted me to look at the size of our codebase here in RelEng, and how much it changes over time. This is the code that drives all the build, test and release automation for Firefox, project branches, and Try, as well as configuration management for the various build and test machines that we have. Here are some simple stats: 2,193 changesets across 5 repositories...that's about 6 changes a day on average. We grew from 43,294 lines of code last year to 73,549 lines of code as of today. That's 70% more code today than we had last year. We added 88,154 lines to our code base, and removed 51,957. I'm not sure what this means, but it seems like a pretty high rate of change!

What do you want to know about builds?

chris

2010-05-04 17:44

Comments

Mozilla has been quite involved in recent buildbot development, in particular, helping to make it scale across multiple machines. More on this in another post! Once deployed, these changes will give us the ability to give real time access to various information about our build queue: the list of jobs waiting to start, and which jobs are in progress. This should help other tools like Tinderboxpushlog show more accurate information. One limitation of the upstream work so far is that it only captures a very coarse level of detail about builds: start/end time, and result code is pretty much it. No further detail about the build is captured, like which slave it executed on, what properties it generated (which could include useful information like the URL to the generated binaries), etc. We've also been exporting a json dump of our build status for many months now. It's been useful for some analysis, but it also has limitations: the data is always at least 5 minutes old by the time you look, and in-progress builds are not represented at all. We're starting to look at ways of exporting all this detail in a way that's useful to more people. You want to get notified when your try builds are done? You want to look at which test suites are taking the most time? You want to determine how our build times change over time? You want to find out what the last all-green revision was on trunk? We want to make this data available, so anybody can write these tools.

Just how big is that firehose?

I think we have one of the largest buildbot setups out there and we generate a non-trivial amount of data:

6-10 buildbot master processes generating updates, on different machines in 2 or 3 data centers
around 130 jobs per hour composed of 4,773 individual steps total per hour. That works out to about 1.4 updates per second that are generated

How you can help

This is where you come in. I can think of two main classes of interfaces we could set up: a query-type interface where you poll for information that you are interested in, and a notification system where you register a listener for certain types (or all!) events. What would be the best way for us to make this data available to you? Some kind of REST API? A message or event brokering system? pubsubhubbub? Is there some type of data or filtering that would be super helpful to you?

chris' random ramblings

Posts about mozilla (old posts, page 4)

Nightly build times getting slower over time

Just who am I talking to? (verifying https connections with python)

Faster try builds!

Better nightly builds

3 days of fun: a journey into the bowels of buildbot

The long road to victory: build & test logs on FTP

Linux on a new Thinkpad T510

Doing the branch dance

A year in RelEng

What do you want to know about builds?

Just how big is that firehose?

How you can help