Posts about mozilla (old posts, page 6)

A tale of python profiling and optimizing

The Release Engineering infrastructure at Mozilla relies heavily on buildbot. For various reasons we're running an older version based on buildbot 0.8.2, which we maintain in our own mercurial repository. We have many different masters with all sorts of different configurations.

To make sure that we don't break things too often, we wrote a tool called test-masters.sh that creates local versions of each unique master configuration and then runs a configuration check on each one. Currently there are 20 unique master configurations to test, and running test-masters.sh on all of them takes 4 minutes on my machine. Recently sfink landed some changes to test all the masters in parallel, which brought the time down from a previously interminable 11 minutes.

Four minutes is a long time to wait! What's taking so much time?

My go-to tool for profiling python code is cProfile. I ended up writing a small script to do the equivalent of 'buildbot checkconfig':

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

cProfile.run("loadMaster(sys.argv[1])", "master.prof")

Running this against my buildbot master's master.cfg file will produce a master.prof file I can load to look at the profile.

>>> import pstats
>>> pstats.Stats("master.prof").sort_stats('time').print_stats(5)
Thu Nov  7 21:42:25 2013    master.prof

         26601516 function calls (24716688 primitive calls) in 439.756 seconds

   Ordered by: internal time
   List reduced from 1997 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  409.381  409.381  438.936  438.936 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py:621(loadConfig)
   170046    3.907    0.000   10.706    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/steps/shell.py:65(__init__)
   222809    3.745    0.000    4.124    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:611(__init__)
        1    3.025    3.025   29.352   29.352 /home/catlee/mozilla/buildbot-configs/tests-master/master.cfg:2(<module>)
   170046    2.385    0.000    6.033    0.000 /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/process/buildstep.py:1027(__init__)

Looks like buildbot's loadConfig is taking a long time! Unfortunately we don't get any more detail than that from cProfile. To get line-by-line profiling info, I've started using kernprof. This requires a few changes to our setup. First, we don't want to run cProfile any more, so we modify our test script like this:

import cProfile
import sys
from buildbot.scripts.checkconfig import ConfigLoader

def loadMaster(filename):
    ConfigLoader(configFileName=filename)

#cProfile.run("loadMaster(sys.argv[1])", "master.prof")
loadMaster(sys.argv[1])

And since we want to get line-by-line profiling of buildbot's loadConfig, we need to annotate buildbot's code with the @profile decorator. In buildbot's master.py, I added @profile to loadConfig so it looks like this now:

@profile
def loadConfig(self, f, check_synchronously_only=False):
    """Internal function to load a specific configuration file. Any
    errors in the file will be signalled by raising an exception.
    <snip>
    """

Now run kernprof.py to get our profile:

kernprof.py -l -v ../profile_master.py master.cfg

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           17     17.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1            7      7.0      0.0          try:
   636         1     68783020 68783020.0     12.0              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        86781      6.5      0.0          for b in builders:
   786     13302        92920      7.0      0.0              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     42782768      6.2      7.5              for n in b.get('slavenames', []):
   790   6922612    449928915     65.0     78.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      2478517    186.3      0.4              if b['name'] in buildernames:

Wow! Line 790 is taking 78% of our runtime! What's going on?

If I look at my config, I see that I have 13,302 builders and 13,988 slaves configured. Each builder has a subset of slaves attached, but we're still doing 6,922,612 checks to see if each slave that's configured for the builder is configured as a top-level slave. slavenames happens to be a list, so each check does a full scan of the list to see if the slave exists or not!
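To see the difference concretely, here's a small standalone demonstration (not the buildbot code itself) of list versus set membership at roughly our slave count:

```python
import timeit

# Standalone demonstration of why the membership check was slow:
# 'x in list' scans every element, while 'x in set' is a hash lookup
# that takes constant time on average.
names = ["slave%05d" % i for i in range(14000)]  # roughly our slave count
as_list = list(names)
as_set = set(names)

target = names[-1]  # worst case for the list: the very last element
list_time = timeit.timeit(lambda: target in as_list, number=1000)
set_time = timeit.timeit(lambda: target in as_set, number=1000)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

Even at only 1,000 lookups the set wins by orders of magnitude, and our config does nearly 7 million of them.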

A very simple patch fixes this:

diff --git a/master/buildbot/master.py b/master/buildbot/master.py
--- a/master/buildbot/master.py
+++ b/master/buildbot/master.py
@@ -761,19 +761,19 @@ class BuildMaster(service.MultiService):
         errmsg = "c['schedulers'] must be a list of Scheduler instances"
         assert isinstance(schedulers, (list, tuple)), errmsg
         for s in schedulers:
             assert interfaces.IScheduler(s, None), errmsg
         assert isinstance(status, (list, tuple))
         for s in status:
             assert interfaces.IStatusReceiver(s, None)

-        slavenames = [s.slavename for s in slaves]
+        slavenames = set(s.slavename for s in slaves)

         # convert builders from objects to config dictionaries
         builders_dicts = []
         for b in builders:
             if isinstance(b, BuilderConfig):
                 builders_dicts.append(b.getConfigDict())
             elif type(b) is dict:
                 builders_dicts.append(b)

Now our membership checks go against a set instead of a list. Set lookups are hash-based and take O(1) time on average, instead of the O(n) scan a list requires. Let's re-run our profile with this patch:

File: /home/catlee/.virtualenvs/buildbot-mozilla-080/lib/python2.6/site-packages/buildbot-0.8.2_hg_b4673f1f2a86_default-py2.6.egg/buildbot/master.py
Function: loadConfig at line 621
Total time: 109.885 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   621                                               @profile
   622                                               def loadConfig(self, f, check_synchronously_only=False):
   623                                                   """Internal function to load a specific configuration file. Any
   624                                                   errors in the file will be signalled by raising an exception.
   625
   626                                                   If check_synchronously_only=True, I will return (with None)
   627                                                   synchronously, after checking the config file for sanity, or raise an
   628                                                   exception. I may also emit some DeprecationWarnings.
   629
   630                                                   If check_synchronously_only=False, I will return a Deferred that
   631                                                   fires (with None) when the configuration changes have been completed.
   632                                                   This may involve a round-trip to each buildslave that was involved."""
   633
   634         1           30     30.0      0.0          localDict = {'basedir': os.path.expanduser(self.basedir)}
   635         1           13     13.0      0.0          try:
   636         1     46268025 46268025.0     42.1              exec f in localDict
   637                                                   except:
   638                                                       log.msg("error while parsing config file")
   639                                                       raise

   <snip>

   785     13303        56199      4.2      0.1          for b in builders:
   786     13302        60438      4.5      0.1              if b.has_key('slavename') and b['slavename'] not in slavenames:
   787                                                           raise ValueError("builder %s uses undefined slave %s" \
   788                                                                            % (b['name'], b['slavename']))
   789   6935914     27430456      4.0     25.0              for n in b.get('slavenames', []):
   790   6922612     29028880      4.2     26.4                  if n not in slavenames:
   791                                                               raise ValueError("builder %s uses undefined slave %s" \
   792                                                                                % (b['name'], n))
   793     13302      1286628     96.7      1.2              if b['name'] in buildernames:
   794                                                           raise ValueError("duplicate builder name %s"
   795                                                                            % b['name'])

We're down to 25% of the runtime for this check now. If we apply the same treatment to a few of the other data structures here, we get the total time for test-masters down to 45 seconds!

I've landed the resulting patch in our repo. I encourage you to upgrade!

Archive of mozilla-inbound builds for regression hunting

Nick Thomas, of recent Balrog fame, has also been hard at work improving life for our intrepid regression hunters.

Sometimes we don't detect problems in Firefox until many weeks or months after code has landed on mozilla-inbound. We do builds for almost every checkin that happens, but end up having to delete them after a month to keep disk usage under control. Unfortunately that means that if a problem is detected while in the Beta cycle or even after release, we're only left with nightly builds for regression hunting, and so the regression window can be up to 24 hours. A lot can change in 24 hours!

Nick's hard work means we are keeping a full release cycle's worth of mozilla-inbound builds archived on S3. We're focused on mozilla-inbound for now, but I imagine we'll be adding other integration branches shortly.

The archive is currently available at http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/

Please send Nick your thanks, and if you run into any issues, please file bugs blocking bug 765258.

Dealing with bursty load

John has been doing a regular series of posts about load on our infrastructure, going back years. Recently, GPS also posted about infrastructure efficiency here. What I've always found interesting is the bursty nature of the load at different times throughout the day, so thought people might find the following data useful.

Every time a developer checks in code, we schedule a huge number of build and test jobs. Right now on mozilla-central every checkin generates just over 10 CPU days worth of work. How many jobs do we end up running over the course of the day? How much time do they take to run?

I created a few graphs to explore these questions - these get refreshed hourly, so should serve as a good dashboard to monitor recent load on an ongoing basis.

# of jobs running over time

[graph: running jobs] This shows how many jobs are running at any given hour over the past 7 days.

# of cpu hours run over time

[graph: compute times] This shows how many CPU hours were spent on jobs that started in any given hour over the same time period.

A few observations about the range in the load:

  • Weekends can be really quiet! Our peak weekday load is about 20x the lowest weekend load.
  • Weekday load varies by about 2x within any given day.

Over the years RelEng have scaled our capacity to meet peak demand. After all, we're trying to give the best experience to developers; making people wait for builds or tests to start is frustrating and can also really impact the productivity of our development teams, and ultimately the quality of Firefox/Fennec/B2G.

Scaling physical capacity takes a lot of time and advance planning. Unfortunately this means that many of our physical machines are idle during non-peak times, since there isn't any work to do. We've been able to make use of some of this idle time by doing things like running fuzzing jobs for our security team.

Scaling quickly for peak load is something that Amazon is great for, and we've been taking advantage of scaling build/test capacity in EC2 for nearly a year. We've been able to move a significant amount of our Linux build and test infrastructure to run inside Amazon's EC2 infrastructure. This means we can start machines in response to increasing load, and shut them down when they're not needed. (Aside - not all of our reporting tools know about this, which can cause some reporting confusion because it appears as if the slaves are broken or not taking work when in reality they're shut off because of low demand - we're working on fixing that up!)

Hopefully this data and these dashboards help give some context about why our infrastructure isn't always 100% busy 100% of the time.

New on TBPL - hot pink

jcranmer noticed a funny new colour on TBPL last week - hot pink!

This colour indicates jobs that have been cancelled manually. It's an important distinction to make between these jobs and jobs that have failed due to build or test or infrastructure problems.

It took a long time, but I finally finished deploying bug 704006 - add a new status for "user cancelled" last week.

With the help of fabric, I was able to script the upgrade so that we didn't require a downtime. Each master was first disabled in slavealloc so that slaves would connect to different masters after a reboot. Next I gracefully shut down the master. Once the master was shut down cleanly, buildbot was updated to pick up the new change, and the master started back up again. Each master could take several hours to shut down depending on what jobs were running, and we have quite a few masters now, so this whole process took most of the week to complete. It's awesome to be able to do this kind of thing without a downtime though... it makes me feel a bit like a ninja :P
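The sequence above can be sketched as plain Python. The step functions here are hypothetical stand-ins for the real fabric tasks, just to show the shape of the no-downtime loop:

```python
# Hypothetical sketch of the rolling upgrade described above. Each step
# function is a stand-in for a real fabric task, injected as a callable
# so the loop itself is easy to test.
def upgrade_master(master, disable, shutdown, update, start):
    disable(master)   # pull the master out of slavealloc rotation
    shutdown(master)  # graceful shutdown: let running jobs finish
    update(master)    # update buildbot to pick up the new change
    start(master)     # bring the master back up

def upgrade_all(masters, **steps):
    done = []
    for m in masters:  # one master at a time, so the pool stays up overall
        upgrade_master(m, **steps)
        done.append(m)
    return done
```

Because only one master is ever down at a time, the slaves simply reconnect elsewhere and the service as a whole never stops.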

Big thanks to ewong for the initial patch, and making sure it got accepted upstream as well!

Experiments with shared git repositories

I've been working on git support for Mozilla's build infrastructure over the past while.

RelEng's current model for creating our build machines is to create a homogeneous set of machines, each one capable of doing any type of build. This means that build machines can be doing a debug mozilla-central linux64 build one minute, and then an opt mozilla-b2g18 panda build the next. This model has worked well for us for scalability and load balancing: machines run on branches where there's currently activity, and we can focus our efforts on overall capacity. It's a huge improvement over the Firefox 2.x and 3.0 days where we had only one or two dedicated machines per branch!

However, there are challenges when using this model. We need to be reasonably efficient with disk space since the same disk is shared between all types of builds across all branches. We also need to make build setup time as quick as possible.

For mercurial based builds we've developed a utility called hgtool that manages shared local copies of repositories that are then locally cloned into each specific build directory. By using hg's share extension, we are able to save quite a bit of disk space and also have efficient pulling/cloning operations. However, because of the way Mozilla treats branches in hg (where each branch is in a separate repo), we need to keep each local clone of gecko branches separate from each other. We have a separate local clone of mozilla-central and another of mozilla-inbound. (although, maybe we could use bookmarks to keep them distinct in the same local repo...hmmm...)

What about git?

git's branches are quite different from mercurial's. Everything depends on "refs", which are names that refer to specific commit ids. It's actually possible to pull in completely unrelated git repositories into the same local repository. As long as you keep the refs from different repositories separate, all the commits and metadata coexist happily. In contrast, mercurial uses metadata on each commit to distinguish branches from each other. Most commits in the various gecko repositories are on the "default" branch. This makes it harder to combine mozilla-central and mozilla-inbound into the same repository without using some other means to keep each repository's "default" branch separate.

This gave me a crazy idea: what if we kept one giant shared git repository on the build machines that contained ALL THE REPOS.

[image: combine ALL THE REPOS!]

We have a script called gittool that is the analog of hgtool. I modified it to support b2g-style xml manifests, and pointed it at a local shared repository and let it loose. It works!

But...

I noticed that the more repositories were cloned into the shared repository, the longer it took to clone new ones. Repositories that normally take 5 seconds to clone were now taking 5 minutes to fetch the first time into the shared repository. What was going on?

One clue was that there are over 144,000 refs across all the b2g repositories. The vast majority of these are tags...there are only 12k non-tag refs.

I created a test script to create new repositories and measure how long it took to fetch them into a shared repository. The most significant factor affecting fetch speed was the number of refs! The new repositories each had 40 commits, and I created 100 of them. If I created one tag per commit then fetching a new repository into the shared repo added 0.0028s. However if I created 10 tags per commit then fetching the new repositories into the shared repo added 0.0113s per repo.
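A simplified sketch of that kind of measurement might look like the following. This is illustrative, not the actual test script; the helper names and the refspec layout are my own:

```python
import os
import subprocess
import tempfile
import time

# Pass identity via -c so the sketch works without global git config.
GIT = ["git", "-c", "user.name=test", "-c", "user.email=test@example.com"]

def run(args, cwd):
    subprocess.check_call(GIT + args, cwd=cwd,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

def make_repo(path, commits, tags_per_commit):
    # Create a throwaway repo with the given number of commits and tags.
    os.makedirs(path)
    run(["init", "-q"], cwd=path)
    for c in range(commits):
        run(["commit", "--allow-empty", "-m", "commit %d" % c], cwd=path)
        for t in range(tags_per_commit):
            run(["tag", "c%d-t%d" % (c, t)], cwd=path)

def timed_fetch(shared, src, name):
    # Fetch src into the shared repo under its own ref namespace,
    # returning how long the fetch took.
    start = time.time()
    run(["fetch", "-q", src, "+refs/*:refs/remotes/%s/*" % name],
        cwd=shared)
    return time.time() - start
```

Looping over freshly created repos with varying `tags_per_commit` and plotting `timed_fetch` against the shared repo's total ref count is what exposed refs as the dominant cost.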

For now I've given up on the idea to have giant local shared repositories, and have moved to a model that keeps each repository separate, like we do with hg. Hopefully we can revisit this later to avoid having multiple copies of related repositories cloned on the same machine.

Now testing Firefox in AWS

You may have noticed some new tests appearing on TBPL this week: [screenshot: new tests]

The Ubuntu32/64 test platforms have been enabled on most branches this week. Wherever we've had consistent green results on the new test platforms, we've disabled running those tests on the older Fedora32/64 platforms. Currently these are crashtests, jsreftests and marionette tests. We're working closely with the awesome A*Team to migrate over any remaining test suites that make sense. As always, if you notice anything that looks like it's not working right, please let us know - filing a bug is the best way. We expect there to be differences between the test platforms and therefore in some test results. We're committed to tracking down what those differences are so we can make sure the new test machines continue to give us good test results.

A lot of what we do in RelEng flies under the radar. When we're doing our jobs well, most of the time you shouldn't notice changes we make!

I wanted to highlight this change in particular, because it's a HUGE win for test scalability. If needed, we're able to add more Ubuntu test machines in a matter of minutes. And the more tests we can move over to this new pool of test machines, the more we can improve turnaround time on the overloaded Fedora slaves.

Rail deserves most of the credit for this awesome work, so send kudos and/or beer his way :)

Behind the clouds: how RelEng do Firefox builds on AWS

RelEng have been expanding our usage of Amazon's AWS over the past few months as the development pace of the B2G project increases. In October we began moving builds off of Mozilla-only infrastructure and into a hybrid model where some jobs are done in Mozilla's infra, and others are done in Amazon. Since October we've expanded into 3 amazon regions, and now have nearly 300 build machines in Amazon. Within each AWS region we've distributed our load across 3 availability zones.

That's great! But how does it work?

Behind the scenes, we've written quite a bit of code to manage our new AWS infrastructure. This code is in our cloud-tools repo (github|hg.m.o) and uses the excellent boto library extensively.

The two workhorses in there are aws_watch_pending and aws_stop_idle. aws_stop_idle's job is pretty easy: it goes around looking for EC2 instances that are idle and shuts them off safely. If an EC2 slave hasn't done any work in more than 10 minutes, it is shut down.
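A minimal sketch of that rule, with (name, last_active) tuples as stand-ins for real boto instance objects:

```python
from datetime import datetime, timedelta

# Minimal sketch of aws_stop_idle's core rule: an EC2 slave that hasn't
# done any work for more than 10 minutes is a candidate for shutdown.
# The (name, last_active) tuples are stand-ins, not real boto instances.
IDLE_CUTOFF = timedelta(minutes=10)

def idle_instances(instances, now):
    return [name for name, last_active in instances
            if now - last_active > IDLE_CUTOFF]

now = datetime(2013, 11, 7, 12, 0)
instances = [("bld-ec2-001", now - timedelta(minutes=30)),
             ("bld-ec2-002", now - timedelta(minutes=2))]
print(idle_instances(instances, now))  # ['bld-ec2-001']
```

The real script also has to check that the slave isn't mid-job before pulling the plug, which is where the "safely" part comes in.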

aws_watch_pending is a little more involved. Its job is to notice when there are pending jobs (like your build waiting to start!) and to resume EC2 instances. We take a few factors into account when starting up instances:

  • We wait until a pending job is more than a minute old before starting anything. This allows in-house capacity to grab the job if possible, and other EC2 slaves that are online but idle also have a chance to take it.
  • Use any reserved instances first. As our AWS load stabilizes, we've been able to purchase some reserved instances to reduce our cost. Obviously, to reduce our cost, we have to use those reservations wherever possible! The code to do this is a bit more complicated than I'd like it to be since AWS reservations are specific to individual availability zones rather than whole regions.
  • Some regions are cheaper than others, so we prefer to start instances in the cheaper regions first.
  • Start instances that were most recently running. This should give both better depend-build time, and also helps with billing slightly. Amazon bills for complete hours. So if you start one instance twice in an hour, you're charged for a single hour. If you start two instances once in the hour, you're charged for two hours.
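Those preferences can be encoded as a single sort key. This is an illustrative sketch only; the record fields and the region cost ranking are made up for the example, not taken from cloud-tools:

```python
# Illustrative start-up preference order: reserved instances first, then
# cheaper regions, then most recently running. The region ranking here
# is a made-up example, not real pricing data.
REGION_COST_RANK = {"us-west-2": 0, "us-east-1": 1, "us-west-1": 2}

def start_order(candidates):
    # Tuples compare element by element, so the key encodes the three
    # preferences in priority order. False sorts before True, hence the
    # 'not reserved'; negating last_running puts most recent first.
    return sorted(candidates,
                  key=lambda c: (not c["reserved"],
                                 REGION_COST_RANK.get(c["region"], 99),
                                 -c["last_running"]))
```

With a key like this, the watcher just walks the sorted list and starts instances until the pending queue is covered.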

Overall we're really happy with Amazon's services. Having APIs for nearly everything has made development really easy.

What's next?

Since test capacity is always woefully behind, we're hoping to be able to run a large number of our linux-based unittests on EC2, particularly those that don't require an accelerated graphics context.

After that? Maybe windows builds? Maybe automated regression hunting? What do you want to see?

RelEng - we do builds, releases and updates!

Mozilla's Release Engineering team is responsible for making sure that our products get built, released and updated properly.

If you're working on a project which changes how or what we ship, or requires changes in how updates work, please get in touch with us as early in your project as possible. We can work with you to find the best solutions to your release/update problems, and save everybody a lot of last minute panic.

Firefox builds in the cloud!

Did you know that since last Thursday we've been doing the majority of our Linux Firefox builds in Amazon?

John Hopkins posted about this last week, but I wanted to highlight how important this is for Release Engineering. We can now scale up the number of Linux build machines in proportion to load. If there are no builds happening, great! Shut off the machines! If there are builds pending, we can start up more machines within minutes.

Migrating to these new build systems means that we can now convert excess in-house Linux build capacity into additional Windows build capacity.

Very shortly we'll be looking at running certain unit test suites in this environment as well.

For all the gory details, follow along in bug 772446.

Try usage

ehsan, mconnor and I were chatting yesterday about ways to improve how people use try. One hypothesis is that people aren't aware of how much their try pushes actually cost in terms of infrastructure time.

In order to remedy this, I present to you the Try High Scores list!

[graph: try usage] (click through for interactive goodness)

This is a report of how much time people have racked up on the try server over the last 7 days. This isn't meant to be a wall of shame (or fame...). I hope that by publishing this, people get a better sense of the cost their try pushes impose on the infrastructure. This cost impacts not only you, but also other people waiting for results on other branches.

If you really need those results from try, then by all means use it, it's there for you!

Please keep in mind though, if you have the results you need from try already, cancelling the remaining jobs is just a click away on tbpl.