A tale of python profiling and optimizing
Release Engineering at Mozilla relies heavily on buildbot for much of our infrastructure. For various reasons we're running an older version based on buildbot 0.8.2, which we maintain in our own mercurial repository. We have many different masters with all sorts of different configurations.
To make sure that we don't break things too often, we wrote a tool called test-masters.sh that creates local versions of each unique master configuration and then runs a configuration check on it. Currently there are 20 unique master configurations to test, and it takes 4 minutes to run test-masters.sh on all of them on my machine. Recently sfink landed some changes that test all the masters in parallel, which brought the time down from a previously interminable 11 minutes.
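For illustration, here's a minimal sketch of that kind of parallel config check. It's not the real test-masters.sh (which is a shell script); it just reuses the same ConfigLoader entry point that shows up later in this post, and takes whatever master.cfg paths you pass in:

# Minimal sketch of checking several master.cfg files in parallel.
# Assumes buildbot 0.8.x's ConfigLoader, as used later in this post;
# not the actual test-masters.sh.
import sys
from multiprocessing import Pool

from buildbot.scripts.checkconfig import ConfigLoader

def check_master(configfile):
    # ConfigLoader raises if the config file is broken
    try:
        ConfigLoader(configFileName=configfile)
        return configfile, None
    except Exception, e:
        return configfile, str(e)

if __name__ == "__main__":
    pool = Pool()
    for configfile, error in pool.map(check_master, sys.argv[1:]):
        print configfile, ("FAILED: %s" % error) if error else "OK"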
Four minutes is a long time to wait! What's taking so much time?
My go-to tool for profiling python code is cProfile. I ended up writing a small script to do the equivalent of 'buildbot checkconfig':
import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

cProfile.run("loadMaster(sys.argv[1])", "master.prof")
Running this against my buildbot master's master.cfg file will produce a master.prof file I can load to look at the profile.
>>> import pstats
>>> pstats.Stats("master.prof").sort_stats('time').print_stats(5)
Thu Nov  7 21:42:25 2013    master.prof

         26601516 function calls (24716688 primitive calls) in 439.756 seconds

   Ordered by: internal time
   List reduced from 1997 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  409.381  409.381  438.936  438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig)
   170046    3.907    0.000   10.706    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__)
   222809    3.745    0.000    4.124    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__)
        1    3.025    3.025   29.352   29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>)
   170046    2.385    0.000    6.033    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)
Looks like buildbot's loadConfig is taking a long time! Unfortunately we don't get any more detail than that from cProfile. To get line-by-line profiling info, I've started using kernprof. This requires a few changes to our setup. First, we don't want to run cProfile any more, so modify our test script like this:
import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

#cProfile.run("loadMaster(sys.argv[1])", "master.prof")
loadMaster(sys.argv[1])
And since we want to get line-by-line profiling of buildbot's loadConfig, we need to annotate buildbot's code with the @profile decorator. In buildbot's master.py, I added @profile to loadConfig so it looks like this now:
@profile
def loadConfig(self, f, check_synchronously_only=False):
    """Internal function to load a specific configuration file. Any
    errors in the file will be signalled by raising an exception.
    <snip>
    """
Now run kernprof.py to get our profile:
kernprof.py -l -v ../profile_master.py master.cfg
Line # Hits Time Per Hit % Time Line Contents
==============================================================
621 @profile
622 def loadConfig(self, f, check_synchronously_only=False):
623 """Internal function to load a specific configuration file. Any
624 errors in the file will be signalled by raising an exception.
625
626 If check_synchronously_only=True, I will return (with None)
627 synchronously, after checking the config file for sanity, or raise an
628 exception. I may also emit some DeprecationWarnings.
629
630 If check_synchronously_only=False, I will return a Deferred that
631 fires (with None) when the configuration changes have been completed.
632 This may involve a round-trip to each buildslave that was involved."""
633
634 1 17 17.0 0.0 localDict = {'basedir': os.path.expanduser(self.basedir)}
635 1 7 7.0 0.0 try:
636 1 68783020 68783020.0 12.0 exec f in localDict
637 except:
638 log.msg("error while parsing config file")
639 raise
<snip>
785 13303 86781 6.5 0.0 for b in builders:
786 13302 92920 7.0 0.0 if b.has_key('slavename') and b['slavename'] not in slavenames:
787 raise ValueError("builder %s uses undefined slave %s" \
788 % (b['name'], b['slavename']))
789 6935914 42782768 6.2 7.5 for n in b.get('slavenames', []):
790 6922612 449928915 65.0 78.4 if n not in slavenames:
791 raise ValueError("builder %s uses undefined slave %s" \
792 % (b['name'], n))
793 13302 2478517 186.3 0.4 if b['name'] in buildernames:
Wow! Line 790 is taking 78% of our runtime! What's going on?
If I look at my config, I see that I have 13,302 builders and 13,988 slaves configured. Each builder has a subset of slaves attached, but we're still doing 6,922,612 checks to see if each slave that's configured for the builder is configured as a top-level slave. slavenames happens to be a list, so each check does a full scan of the list to see if the slave exists or not!
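To get a feel for how much that costs at this scale, here's a quick, illustrative comparison of list vs. set membership with timeit (the names and counts below are made up; only the rough sizes match our config):

# Illustrative only: list vs. set membership at roughly our scale.
import random
import timeit

slavenames_list = ["slave-%05d" % i for i in range(14000)]
slavenames_set = set(slavenames_list)
lookups = [random.choice(slavenames_list) for _ in range(1000)]

def check(names):
    # same shape as the loadConfig check: is each slave a known slave?
    for n in lookups:
        _ = n in names

print "list:", timeit.timeit(lambda: check(slavenames_list), number=10)
print "set: ", timeit.timeit(lambda: check(slavenames_set), number=10)

At this size the set lookups should be orders of magnitude cheaper per check, which is exactly what the profile is telling us.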
A very simple patch fixes this:
diff --git a/master/buildbot/master.py b/master/buildbot/master.py
--- a/master/buildbot/master.py
+++ b/master/buildbot/master.py
@@ -761,19 +761,19 @@ class BuildMaster(service.MultiService):
         errmsg = "c['schedulers'] must be a list of Scheduler instances"
         assert isinstance(schedulers, (list, tuple)), errmsg
         for s in schedulers:
             assert interfaces.IScheduler(s, None), errmsg
         assert isinstance(status, (list, tuple))
         for s in status:
             assert interfaces.IStatusReceiver(s, None)
 
-        slavenames = [s.slavename for s in slaves]
+        slavenames = set(s.slavename for s in slaves)
 
         # convert builders from objects to config dictionaries
         builders_dicts = []
         for b in builders:
             if isinstance(b, BuilderConfig):
                 builders_dicts.append(b.getConfigDict())
             elif type(b) is dict:
                 builders_dicts.append(b)
Now our membership checks go against a set instead of a list, which is an average O(1) operation instead of O(n). Let's re-run our profile with this patch:
File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py
Function: loadConfig at line 621
Total time: 109.885 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
621 @profile
622 def loadConfig(self, f, check_synchronously_only=False):
623 """Internal function to load a specific configuration file. Any
624 errors in the file will be signalled by raising an exception.
625
626 If check_synchronously_only=True, I will return (with None)
627 synchronously, after checking the config file for sanity, or raise an
628 exception. I may also emit some DeprecationWarnings.
629
630 If check_synchronously_only=False, I will return a Deferred that
631 fires (with None) when the configuration changes have been completed.
632 This may involve a round-trip to each buildslave that was involved."""
633
634 1 30 30.0 0.0 localDict = {'basedir': os.path.expanduser(self.basedir)}
635 1 13 13.0 0.0 try:
636 1 46268025 46268025.0 42.1 exec f in localDict
637 except:
638 log.msg("error while parsing config file")
639 raise
<snip>
785 13303 56199 4.2 0.1 for b in builders:
786 13302 60438 4.5 0.1 if b.has_key('slavename') and b['slavename'] not in slavenames:
787 raise ValueError("builder %s uses undefined slave %s" \
788 % (b['name'], b['slavename']))
789 6935914 27430456 4.0 25.0 for n in b.get('slavenames', []):
790 6922612 29028880 4.2 26.4 if n not in slavenames:
791 raise ValueError("builder %s uses undefined slave %s" \
792 % (b['name'], n))
793 13302 1286628 96.7 1.2 if b['name'] in buildernames:
794 raise ValueError("duplicate builder name %s"
795 % b['name'])
We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!
I've landed the resulting patch into our repo. I encourage you to upgrade!


We have a script called gittool that is the analog of hgtool. I modified it to support b2g-style xml manifests, pointed it at a local shared repository, and let it loose. It works!
But...
I noticed that the more repositories were cloned into the shared repository, the longer it took to clone new ones. Repositories that normally take 5 seconds to clone were now taking 5 minutes to fetch the first time into the shared repository. What was going on?
One clue was that there are over 144,000 refs across all the b2g repositories. The vast majority of these are tags...there are only 12k non-tag refs.
I created a test script to create new repositories and measure how long it took to fetch them into a shared repository. The most significant factor affecting fetch speed was the number of refs! The new repositories each had 40 commits, and I created 100 of them. If I created one tag per commit, fetching a new repository into the shared repo added 0.0028s. However, if I created 10 tags per commit, fetching the new repositories into the shared repo added 0.0113s per repo.
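The test script itself isn't shown here, but the approach was roughly along these lines (a sketch; the paths, the refspec layout, and the exact git invocations are my assumptions, not the real script):

# Rough sketch of the measurement: build throwaway repos with a configurable
# number of tags per commit, then time fetching each one into a single shared
# bare repository. Paths and counts are illustrative.
import os
import subprocess
import time

def git(args, cwd):
    subprocess.check_call(["git"] + args, cwd=cwd)

def make_repo(path, commits=40, tags_per_commit=10):
    os.makedirs(path)
    git(["init", "-q"], path)
    git(["config", "user.email", "test@example.com"], path)
    git(["config", "user.name", "test"], path)
    for i in range(commits):
        with open(os.path.join(path, "file.txt"), "a") as f:
            f.write("%d\n" % i)
        git(["add", "file.txt"], path)
        git(["commit", "-q", "-m", "commit %d" % i], path)
        for t in range(tags_per_commit):
            git(["tag", "c%d-t%d" % (i, t)], path)

shared = "/tmp/shared.git"
os.makedirs(shared)
git(["init", "-q", "--bare"], shared)

for i in range(100):
    repo = "/tmp/repo-%d" % i
    make_repo(repo)
    start = time.time()
    # keep each repo's refs in their own namespace so tags don't collide
    git(["fetch", "-q", repo, "+refs/*:refs/repo-%d/*" % i], shared)
    print "repo %d fetched in %.4fs" % (i, time.time() - start)

The interesting number is how the per-repo fetch time creeps up as the shared repository accumulates refs.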
For now I've given up on the idea of having giant local shared repositories, and have moved to a model that keeps each repository separate, like we do with hg. Hopefully we can revisit this later to avoid having multiple copies of related repositories cloned on the same machine.
The Ubuntu32/64 test platforms have been enabled on most branches this week. Wherever we've had consistent green results on the new test platforms, we've disabled running those tests on the older Fedora32/64 platforms. Currently these are crashtests, jsreftests and marionette tests. We're working closely with the awesome 