Diving into python logging

Python has a very rich logging system. It's easy to add structured or unstructured log output to your Python code and have it written to a file, output to the console, or sent to syslog, and it's just as easy to customize the output format.

We're in the middle of re-examining how logging works in mozharness to make it easier to factor-out code and have fewer mixins.

Here are a few tips and tricks that have really helped me with python logging:

There can be only more than one

Well, there can be only one logger with a given name. There is a special "root" logger with no name. Multiple getLogger(name) calls with the same name will return the same logger object. This is an important property because it means you don't need to explicitly pass logger objects around in your code; you can retrieve them by name if you wish. The logging module maintains a global registry of logger objects.
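
A quick way to see this in action (the name "myapp" here is just for illustration):

import logging

a = logging.getLogger("myapp")
b = logging.getLogger("myapp")
assert a is b               # same object, retrieved from the global registry

root = logging.getLogger()  # no name: the special root logger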

You can have multiple loggers active, each specific to its own module or even class or instance.

Each logger has a name, typically the name of the module it's being used from. A common pattern you see in python modules is this:

# in module foo.py
import logging
log = logging.getLogger(__name__)

This works because inside foo.py, __name__ is equal to "foo". So inside this module the log object is specific to this module.

Loggers are hierarchical

The names of the loggers form their own namespace, with "." separating levels. This means that if you have loggers called foo.bar and foo.baz, you can do things on logger foo that will impact both of the children. In particular, you can set the logging level of foo to show or ignore debug messages for both submodules.

# Let's enable all the debug logging for all the foo modules
import logging
logging.getLogger('foo').setLevel(logging.DEBUG)

Log messages are like events that flow up through the hierarchy

Let's say we have a module foo.bar:

import logging
log = logging.getLogger(__name__)  # __name__ is "foo.bar" here

def make_widget():
    log.debug("made a widget!")

When we call make_widget(), the code generates a debug log message. Each logger in the hierarchy has a chance to output something for the message, ignore it, or pass the message along to its parent.

The default configuration for loggers is to have their levels unset (or set to NOTSET). This means the logger will just pass the message on up to its parent. Rinse & repeat until you get up to the root logger.

So if the foo.bar logger hasn't specified a level, the message will continue up to the foo logger. If the foo logger hasn't specified a level, the message will continue up to the root logger.

This is why you typically configure the logging output on the root logger; it typically gets ALL THE MESSAGES!!! Because this is so common, there's a dedicated method for configuring the root logger: logging.basicConfig()

This also allows us to use mixed levels of log output depending on where the messages are coming from:

import logging

# Enable debug logging for all the foo modules
logging.getLogger("foo").setLevel(logging.DEBUG)

# Configure the root logger to log only INFO calls, and output to the console
# (the default)
logging.basicConfig(level=logging.INFO)

# This will output the debug message
logging.getLogger("foo.bar").debug("ohai!")

If you comment out the setLevel(logging.DEBUG) call, you won't see the message at all.

exc_info is teh awesome

All the built-in logging calls support a keyword called exc_info, which, if it isn't false, causes the current exception information to be logged in addition to the log message. e.g.:

import logging
logging.basicConfig(level=logging.INFO)

log = logging.getLogger(__name__)

try:
    assert False
except AssertionError:
    log.info("surprise! got an exception!", exc_info=True)

There's a special case for this, log.exception(), which is equivalent to log.error(..., exc_info=True).
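
For example, the snippet above could use log.exception() instead; a minimal sketch:

import logging
logging.basicConfig(level=logging.INFO)

log = logging.getLogger(__name__)

try:
    assert False
except AssertionError:
    # logs at ERROR level and appends the traceback automatically
    log.exception("surprise! got an exception!")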

Python 3.2 introduced a new keyword, stack_info, which will output the current call stack up to the logging call. Very handy to figure out how you got to a certain point in the code, even if no exceptions have occurred!
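
Here's a minimal sketch of what that looks like (Python 3.2 or later; the function name is made up for illustration):

import logging
logging.basicConfig(level=logging.DEBUG)

log = logging.getLogger(__name__)

def frobnicate():
    # stack_info=True includes the current call stack in the output
    log.debug("how did we get here?", stack_info=True)

frobnicate()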

"No handlers found..."

You've probably come across this message, especially when working with 3rd party modules. What this means is that you don't have any logging handlers configured, and something is trying to log a message. The message has gone all the way up the logging hierarchy and fallen off the...top of the chain (maybe I need a better metaphor).

import logging
log = logging.getLogger()
log.error("no log for you!")

outputs:

No handlers could be found for logger "root"

There are two things that can be done here:

  1. Configure logging in your module with basicConfig() or similar

  2. Library authors should add a NullHandler at the root of their module to prevent this. See the cookbook and this blog post for more details; a minimal sketch follows below.
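
For the second option, the cookbook's recommendation boils down to a couple of lines at the top of the library package; a minimal sketch:

# e.g. in foo/__init__.py
import logging
logging.getLogger(__name__).addHandler(logging.NullHandler())

With that in place, the library's messages are quietly discarded unless the application configures its own handlers.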

Want more?

I really recommend that you read the logging documentation and cookbook, which have a lot more great information (and are also very well written!). There's a lot more you can do, with custom log handlers, different output formats, outputting to many locations at once, etc. Have fun!

Upcoming hotness from RelEng

To kick off the new year, I'd like to share some of the exciting projects we have underway in Release Engineering.

Balrog

First off we have Balrog, our next generation update server. Work on Balrog has been underway for quite some time. Last fall we switched beta users to use it. Shortly after, we did some additional load testing to see if we were ready to flip over release traffic. The load testing revealed some areas that needed optimization, which isn't surprising since almost no optimization work had been done up to that point!

Ben and Nick added the required caching, and our subsequent load testing was a huge success. We're planning on flipping the switch to divert release users over on January 19th. \o/

Funsize

Next up we have Funsize. (Don't ask about the name; it's all Laura's fault). Funsize is a web service to generate partial updates between two versions of Firefox. There are a number of places where we want to generate these partial updates, so wrapping the logic up into a service makes a lot of sense, and also affords the possibility of faster generation due to caching.

We're aiming to have nightly builds use funsize for partial update generation this quarter.

I'd really like to see us get away from the model where the "nightly build" job is responsible for not only the builds, but generating and publishing the complete and partial updates. The problem with this is that the single job is responsible for too many deliverables, and touches too many systems. It's hard to make and test changes in isolation.

The model we're trying to move to is where the build jobs are responsible only for generating the required binaries. It should be the responsibility of a separate system to generate partials and publish updates to users. I believe splitting up these functions into their own systems will allow us to be more flexible in how we work on changes to each piece independently.

S3 uploads from automation

This quarter we're also working on migrating build and test files off our aging file server infrastructure (aka "FTP", which is a bit of a misnomer...) and onto S3. All of our build and test binaries are currently uploaded and downloaded via a central file server in our data center. It doesn't make sense to do this when most of our builds and tests are being generated and consumed inside AWS now. In addition, we can get much better cost-per-GB by moving the storage to S3.

No reboots

Morgan has been doing awesome work with runner. One of the primary aims here is to stop rebooting build and test machines between every job. We're hoping that by not rebooting between builds, we can get a small speedup in build times since a lot of the build tree should be cached in memory already. Also, by not rebooting we can have shorter turnaround times between jobs on a single machine; we can effectively save 3-4 minutes of overhead per job by not rebooting. There's also the opportunity to move lots of machine maintenance work from inside the build/test jobs themselves to instead run before buildbot starts.

Release build promotion

Finally I'd like to share some ideas we have about how to radically change how we do release builds of Firefox.

Our plan is to create a new release pipeline that works with already built binaries and "promotes" them to the release/beta channel. The release pipeline we have today creates a fresh new set of release builds that are distinct from the builds created as part of continuous integration.

This new approach should cut the amount of time required to release nearly in half, since we only need to do one set of builds instead of two. It also has the benefit of aligning the release and continuous-integration pipelines, which should simplify a lot of our code.

... and much more!

This is certainly not an exhaustive list of the things we have planned for this year. Expect to hear more from us over the coming weeks!

Gotta Cache 'Em All

TOO MUCH TRAFFIC!!!!

Waaaaaaay back in February we identified overall network bandwidth as a cause of job failures on TBPL. We were pushing too much traffic over our VPN link between Mozilla's datacentre and AWS. Since then we've been working on a few approaches to cope with the increased traffic while at the same time reducing our overall network load. Most recently we've deployed HTTP caches inside each AWS region.

Network traffic from January to August 2014

The answer - cache all the things!

Obligatory XKCD

Caching build artifacts

The primary target for caching was downloads of build/test/symbol packages by test machines from file servers. These packages are generated by the build machines and uploaded to various file servers. The same packages are then downloaded many times by different machines running tests. This was a perfect candidate for caching, since the same files were being requested by many different hosts in a relatively short timespan.

Caching tooltool downloads

Tooltool is a simple system RelEng uses to distribute static assets to build/test machines. While the machines do maintain a local cache of files, the caches are often empty because the machines are newly created in AWS. Having the files in local HTTP caches speeds up transfer times and decreases network load.
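
In both cases the pattern is the same: try a cache in your own region first, and fall back to the origin server if that fails. A rough sketch of the idea (this is not the actual deployment; the host names and helper are hypothetical, written for Python 2's urllib2):

import urllib2

# Hypothetical per-region cache hosts
REGION_CACHES = {
    "us-east-1": "http://cache.use1.example.com",
    "us-west-2": "http://cache.usw2.example.com",
}

def fetch(url, region):
    # Try the regional HTTP cache first, then fall back to the origin
    candidates = []
    cache = REGION_CACHES.get(region)
    if cache:
        # e.g. http://cache.usw2.example.com/ftp.mozilla.org/pub/...
        candidates.append("%s/%s" % (cache, url.split("://", 1)[1]))
    candidates.append(url)
    for candidate in candidates:
        try:
            return urllib2.urlopen(candidate, timeout=30).read()
        except Exception:
            continue
    raise RuntimeError("could not fetch %s" % url)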

Results so far - 50% decrease in bandwidth

Initial deployment was completed on August 8th (end of week 32 of 2014). You can see by the graph above that we've cut our bandwidth by about 50%!

What's next?

There's still some low-hanging fruit left for caching. We have internal pypi repositories that could benefit from caches. There's a long tail of other miscellaneous downloads that could be cached as well.

There are other improvements we can make to reduce bandwidth as well, such as moving uploads from build machines to be outside the VPN tunnel, or perhaps to S3 directly. Additionally, a big source of network traffic is doing signing of various packages (gpg signatures, MAR files, etc.). We're looking at ways to do that more efficiently. I'd love to investigate more efficient ways of compressing or transferring build artifacts overall; there is a ton of duplication between the build and test packages between different platforms and even between different pushes.

I want to know MOAR!

Great! As always, all our work has been tracked in a bug, and worked out in the open. The bug for this project is 1017759. The source code lives in https://github.com/mozilla/build-proxxy/, and we have some basic documentation available on our wiki. If this kind of work excites you, we're hiring!

Big thanks to George Miroshnykov for his work on developing proxxy.

B2G now building using unified sources

Last week, with the help of Ehsan and John, we finally enabled unified source builds for B2G devices.

As a result we're building device builds approximately 40% faster than before.

Between June 12th and June 17th, 50% of our successful JB emulator builds on mozilla-inbound finished in 97 minutes or less. Using unified sources for these builds reduced the 50th percentile of build times down to 60 minutes (from June 19th to June 24th).

To mitigate the risks of changes landing that break non-unified builds, we're also doing periodic non-unified builds for these devices.

As usual, all our work here was done in the open. If you're interested, read along in bug 950676, and bug 942167.

Do you enjoy building, debugging and optimizing build, test & release pipelines? Great, because we're hiring!

First brew day

This weekend I finally took the plunge into home brewing. I've been interested in home brewing for a while, lurking on /r/homebrewing and other lists, putting together kit lists, revising kit lists, and learning a lot as I went.

brulosopher's post on /r/homebrewing finally convinced me to jump straight into all-grain brewing with BIAB rather than the normal beginner's method of extract brewing. To get started, I decided on this beginner BIAB kit from ontariobeerkegs.com. It was a bit more than I was hoping to spend initially, but I was having a really hard time putting together a reasonable kit for anything less.

Prep

The kit finally arrived last Thursday! I was really excited to open it up, and see all the shiny new equipment. It was all somewhat familiar from my months of research, but still excitingly new and real now that it was actually here.

equipment.png

In preparation for brew day, I cleaned out the kettle and fermenter buckets. I discovered a small leak in the ball valve on the kettle, and tightened up a little bit on the nuts holding it together. That slowed the leak down a bit, but I think I'll need to put some new teflon tape on it to stop it completely.

I used an old wooden dowel as a way to measure how much liquid was in the kettle. I filled up the kettle 2 quarts at a time, and used a knife to make a notch in the dowel for each additional gallon of liquid. This took a lot longer than I expected. 6 gallons is a lot of water!

On Friday night I filled up the kettle again with water from the tap, to let it sit overnight. I've read this helps to get rid of some of the chlorine in the water, but later wondered if it also allowed dissolved oxygen to escape.

I also tried putting the lid on the fermenter just to see how it worked. It was a real struggle to snap it on, and I could not for the life of me get it off again! I ended up tearing the strip off the base of the lid in order to remove the lid. Hopefully this doesn't increase risk of infection...

Brew Day

(May 24th, 2014) I was planning on getting started on Saturday afternoon, which should have given me plenty of time to get everything mashed, boiled, cooled and cleaned up. We were having such a great time at our family pilgrimage / picnic that we didn't get home until 6pm or so. No problem, I should only need 4 hours to brew, right? Onwards!

The kit came with a 5gal ingredient kit included, and I chose their American IPA. It comes with:

  • 10lb 2-row malt
  • 1lb Carafoam
  • 1lb Crystal 60L
  • 2oz Centennial Hops
  • 2oz Columbus Hops
  • 1pkg US-05 dry yeast

I kept notes as I went, and here's a summary of how the night went.

18:54   6.25gal water on high heat
19:49   water temp hit 150°F.
19:54   put in bag, grains
        heat off
        trying to keep temp at 150°F
        front dial thermometer shoots up to 170-180°F, while probe thermometers read 145°F

at this point I wasn't really sure what to do...I didn't want to over-heat the mash, and I trusted my two probe thermometers more than the dial thermometer on the kettle. I wonder if the heat from the burner is affecting the dial thermometer?

mash.png
20:54   Turn heat on to begin mash out. Target 168°F
        Heat sparge water. ~0.5gal
21:06   Temperature @ 164°F according to probe; remove bag & grains to strainer in bowl on counter
        Dial thermometer was reading 180°F. Removed bag early because unsure of actual temp.

21:13   Drain, squeeze, pour sparge water over grain bag
        Pour collected runnings from bowl back into kettle

        I forgot to measure how much runnings I poured back in :\

21:15   I had about 5.8gal in the kettle. Took a sample:
        62°C 1.044 gravity = 1.052 gravity at 20°C according to online
        calculator
        Turn heat on high for boil
        Tasted sample after - tastes great!
21:52   Water boiling
hotbreak.png
21:54   Add 1oz columbus

22:45   50 minutes into boil. Checked gravity because it didn't look like
        volume had reduced enough. Looks like I still have ~5.75gal, but I
        had added more of the runnings from the grain bag since my initial
        pre-boil measurement. Again, something to fix in my process next
        time.
        Hydrometer reads 1.044 again at 70°C. I was a little disappointed
        since it seemed the gravity hadn't changed. Adjusted for
        temperature this works out to 1.062, which is the target gravity,
        so feeling better.
22:52   Add 1oz columbus
23:02   Turn heat off
        Add 2oz centennial
        Didn't take a volume reading at this point either. I assumed I
        could measure the amount in the fermenter, but ended up not
        transferring all the wort over...So this volume would have been
        good to have!
        Begin cooling wort
        kettle in sink with ice bath
        3x 1L frozen sanitized water bottles in wort
23:53   wort at 43°C
00:11   wort at 41°C
00:48   wort at 38°C
01:23   wort at 35°C
02:00   wort at 29°C
        too tired to wait longer!
        transferred to fermenter by pouring. Left hop sludge in
        kettle...not sure how much
        take gravity, volume, temp reading
        4.25gal 1.062 @ 29°C = 1.064 OG. Target was 1.062.
        pitch yeast dry
        put in airlock
02:05   move to basement
02:20   done cleanup; go to bed!

Did it work?

I must have checked the bucket 20 times the next day. I didn't see any airlock activity on Sunday at all. I peeked inside the bucket by removing the airlock, and it looked like there was a nice foam developing on top of the wort. I don't think the bucket has a great seal around the lid after removing the tab strip at the bottom, and I also noticed that the rubber stopper has a tendency to pop out of the hole in the lid, leaving a small gap. So CO2 could be escaping from a few places other than the airlock! I think I'd really prefer to ferment in something like a better bottle so I can see what's going on, and have an easier time keeping the airlock in place.

Much to my relief, this morning the airlock was bubbling a bit, and it looked like there was a nice krausen forming. It smells great!

What's next?

I'm hoping to bottle in a few weeks. The recipe says to leave in primary for 21 days.

There are definitely some parts of my process that need improvement. Because I didn't take accurate volume readings post-boil, including how much extra water I added and how much crud I left behind in the kettle, I don't have a good idea of the overall efficiency. I did hit the target gravity, which I was worried most about.

I still wonder if I should have poured the entire contents of the kettle into the fermenter. On the one hand I thought I didn't want all that crud clouding up my beer. On the other hand, I figured that an IPA should be sitting on hops for a long time...

I also meant to aerate the wort more in the fermenter by shaking it around for a few minutes. Again, I was pretty tired at 2am, and starting to worry about the wort being infected, so decided against anything fancy.

If I brew again, I'll definitely be buying or making a wort chiller. Waiting 3 hours to cool down to 29°C is insane, and is probably a bit on the warm side to be pitching yeast. My basement is a pretty consistent 19°C right now, which I hope is ok for fermenting. The bucket reads about 20-21°C. I've also considered building a fermentation chamber to control temperature.

Other things for next time would be getting some kind of wine thief or long turkey baster for drawing samples out of the wort for measuring. I used a Pyrex measuring cup to take a sample from the kettle right after the mash, and I just dipped the hydrometer sample jar into the fermenter bucket to take the final sample. Both of these ended up being messy, and I was worried about infecting the wort with my hands while removing the sample jar from the fermenting bucket.

All in all, I did have a lot of fun, even though I ended much later than I was planning. The beer smells great so far, and tasted great at each point along the way. I'm really hopeful the final result will be as tasty!

Limiting coalescing on the build/test farm

tl;dr - as of yesterday we've limited coalescing on all builds/tests to merge at most 3 pending jobs together

Coalescing (aka queue collapsing aka merging) has been part of Mozilla's build/test CI for a long, long time. Back in the days of Tinderbox, a single machine would do a checkout/build/upload loop. If there were more checkins while the build was taking place, well, they would get built on the next iteration through the loop.

Fast forward a few years to our move to buildbot, and to having pools of machines all able to do the same builds. Now we create separate jobs in the queue for each build for each push. However, we didn't always have capacity to do all these builds in a reasonable amount of time, so we left buildbot's default behaviour (merging all pending jobs together) enabled for the majority of jobs. This means that if there are pending jobs for a particular build type, the first free machine skips all but the most recent item on the queue. The skipped jobs are "merged" into the job that was actually run.

In the case that all builds and tests are green, coalescing is actually a good thing most of the time. It saves you from doing a bunch of extra useless work.

However, not all pushes are perfect (just see how often the tree is closed due to build/test failures), and coalescing makes bisecting the failure very painful and time consuming, especially in the case that we've coalesced away intermediate build jobs.

To try and find a balance between capacity and sane results, we've recently added a limit to how many jobs can be coalesced at once.

By rigorous statistical analysis:

@catlee     so it's easiest to pick a single upper bound for coalescing and go with that at first
@catlee     did you have any ideas for what that should be?
@catlee     I was thinking 3
edmorley|sheriffduty        catlee: that sounds good to me as a first go :-)
mshal       chosen by fair dice roll? :)
@catlee     1d4
bhearsum    Saving throw failed. You are dead.
philor      wfm

we've chosen 3 as the upper bound on the number of jobs we'll coalesce, and we can tweak this as necessary.
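
To illustrate the behaviour (this is a simplified sketch, not the actual buildbot patch), limiting coalescing means the pending queue gets processed in batches of at most 3, with the skipped requests in each batch merged into the one that runs:

MAX_COALESCE = 3

def coalesce(pending_requests, limit=MAX_COALESCE):
    # pending_requests is oldest-first; each batch of up to `limit`
    # requests runs as a single job (the newest in the batch), and the
    # older ones are recorded as merged into it
    batches = []
    for i in range(0, len(pending_requests), limit):
        batch = pending_requests[i:i + limit]
        batches.append({"run": batch[-1], "merged": batch[:-1]})
    return batches

# e.g. 7 pending pushes now produce 3 jobs instead of 1
print(coalesce(["r1", "r2", "r3", "r4", "r5", "r6", "r7"]))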

I hope this makes the trees a bit more manageable! Please let us know what you think!

As always, all our work is done in the open. See the bug with the patch here: https://bugzilla.mozilla.org/show_bug.cgi?id=1008213

Digging into Firefox update sizes

Update sizes up 20-37% since last year!

Mozilla relies on our automatic update infrastructure to make sure that our users are kept up to date with the latest, most secure and fastest browser.

Having smaller updates means users are able to get the latest version of Firefox more quickly.

Since Firefox 19.0 (released just over a year ago - February 16th, 2013) our complete update size for Windows has grown from 25.6MB to 30.9MB for Firefox 28.0 (released March 15th, 2014). That's a 20% increase in just one year. On OSX it's grown from 37.8MB to 47.8MB, a 26% increase.

Partial updates have similarly grown. For Windows, a user coming from 27.0.1 to 28.0 would receive a 14.3MB update, compared to a 10.4MB update going from 18.0.2 to 19.0: an increase of 37.5%.

This means it's taking users 20-37% longer to download updates than it did last year. Many of our users don't have fast reliable internet, so an increase in update size makes it even harder for them to stay up to date. In addition, this size increase translates directly into bandwidth costs to Mozilla. All else being equal, we're now serving 20-37% more data from our CDNs for Firefox updates this year compared to last year.

update-sizes.png

Reducing update size

How can we reduce the update size? There are a few ways:

  1. Make sure we're serving partial updates to as many users as possible, and that these updates are applied properly. More analysis is needed, but it appears that only roughly half of our users are updating via partial updates.
  2. Reduce the amount of code we ship in the update. This could mean more features and content are distributed at runtime as needed. This is always a tricky trade-off to make between having features available for all users out of the box, and introducing a delay the first time the user wants to use a feature that requires remote content. It also adds complexity to the product.
  3. Change how we generate updates. This is going to be the focus of the rest of my post.

Smaller updates are more better

There are a few techniques I know of to reduce our update sizes:

  • Use xz compression instead of bzip2 compression inside the MAR files (bug 641212). xz generally gets better compression ratios at the cost of using more memory.
  • Use courgette instead of bsdiff for generating the binary diff between .exe and .dll files (bug 504624). Courgette is designed specifically for diffing executables, and generates much smaller patches than bsdiff.
  • Handle omni.ja files more effectively (bug 772868). omni.ja files are essentially zip files, and use zip-style compression. That means each member of the zip archive is individually compressed. This is problematic for two reasons: it makes generating binary diffs between omni.ja files much less effective, and it makes compressing the omni.ja file with bzip2 or xz less effective. You get far better results packing files into a zip file with 0 compression and using xz to compress it afterwards. Similarly for generating diffs, the diff between two omni.ja files using no zip compression is much smaller than the diff between two omni.ja files using the default zip -9 compression. (See the sketch after this list.)
  • Don't use per-file compression inside the MAR file at all, rather compress the entire archive with xz. I simulated this by xz-compressing tar files of the MAR contents.
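
As a rough sketch of the omni.ja experiment above (paths are illustrative, and real omni.ja files aren't quite vanilla zips, so treat this purely as an outline of the technique):

import subprocess
import zipfile

def repack_stored(src_ja, dst_ja):
    # rewrite a jar/zip archive with no per-member compression
    with zipfile.ZipFile(src_ja) as src, \
         zipfile.ZipFile(dst_ja, "w", zipfile.ZIP_STORED) as dst:
        for name in src.namelist():
            dst.writestr(name, src.read(name))

def xz_size(path):
    # size of the file after compressing it with xz -9e
    return len(subprocess.check_output(["xz", "-9e", "-c", path]))

repack_stored("omni.ja", "omni-stored.ja")
print("xz of original omni.ja:   %d bytes" % xz_size("omni.ja"))
print("xz of store-only omni.ja: %d bytes" % xz_size("omni-stored.ja"))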

27% smaller complete updates

complete-updates.png

We can see that using xz alone saves about 10.9%. There's not a big difference between xz -6 and xz -9e, only a few kb generally. ("xz -6" and "xz -9e" series in the chart)

Disabling zip compression in the omni.ja files and using the standard bzip2 compression saves about 9.7%. ("zip0 .ja" in the chart)

Combining xz compression with the above yields a 24.8% saving, which is 7.6MB. ("zip0 .ja, xz -9e" in the chart)

Finally, disabling zip compression for omni.ja, skipping per-file compression entirely, and compressing the entire archive at once yields a 27.2% saving, or 8.4MB.

66% smaller partial updates

partial-updates.png

Very similar results here for xz, 8.4% savings with xz -9e.

Disabling zip compression in the omni.ja files has a much bigger impact for partial updates because we're able to produce a much smaller binary diff. This method alone saves 42%, or 6.0MB.

Using courgette for executable files yields a 19.1% savings. ("courgette" in the chart)

Combining courgette for executable files, no zip level compression, and per-file xz compression reduces the partial update by 61%. ("courgette, zip0 .ja, xz -9e" in the chart)

And if we compress the entire archive at once rather than per-file, we can reduce the update by 65.9%. ("courgette, zip0 .ja, tar, xz -9e" in the chart)

Some notes on my methodology: I'm focusing on 32-bit Windows, en-US only. Most of the optimizations, with the exception of courgette, are applicable to other platforms. I'm measuring the total size of the files inside the MAR file, rather than the size of the MAR file itself. The MAR file format overhead is small, only a few kilobytes, so this doesn't significantly impact the results, and significantly simplifies the analysis.

Finally, the code I used for analysis is here.

The amazing RelEng Jacuzzis

RelEng has jacuzzis???

On Tuesday, we enabled the first of our "jacuzzis" in production, and initial results look great.

A few weeks ago, Ben blogged about some initial experiments with running builds on smaller pools of machines ("hot tubs", get it? we've since renamed them as "jacuzzis"). His results confirmed glandium's findings on the effectiveness (or lack thereof!) of incremental builds on mozilla-inbound.

The basic idea behind smaller pools of workers is that by restricting which machines are eligible to run jobs, you get much faster incremental builds since you have more recent checkouts, object dirs, etc.

Additionally, we've made some improvements to how we initialize mock environments. We don't reset and re-install packages into the mock chroot if the previous package list is the same as the new package list. This works especially well with jacuzzis, as we can arrange for machines to run jobs with the same package lists.

On Tuesday we enabled jacuzzis for some build types on b2g-inbound: hamachi device builds, and opt/debug ICS emulator builds.

jacuzzi-results.png

We've dropped build times by more than 50%!

The spikes earlier this morning look like they're caused by running on fresh spot instances. When spot nodes first come online, they have no previous state, and so their first builds will always be slower. The machines should stay up most of the day, so you really only have to pay this cost once per day.

For the B2G emulator builds, this means we're getting tests started earlier and therefore get much faster feedback as to the quality of recent patches.

I'm super happy with these results. What's next? Well, turning on MOAR JACUZZIS for MOAR THINGS! Additionally, having fewer build types per machine means our disk footprint is much lower, and we should be able to use local SSDs for builds on AWS.

As usual, all our work has a tracking bug: bug 970738

Implementation details

There are three major pieces involved in pulling this together: the jacuzzi allocations themselves, buildbot integration, and finally AWS integration.

Jacuzzis

Ben has published the allocations here: http://jacuzzi-allocator.pub.build.mozilla.org/v1/

Each builder (or job type) has a specific list of workers associated with it. Ben has been working on ways of automatically managing these allocations so we don't need to tune them by hand too much.

Buildbot

The bulk of the work required was in buildbot. Prior to jacuzzis, we had several large pools of workers, each capable of doing any one of hundreds of different job types. Each builder in buildbot has each of the workers in a pool listed as able to do that job. We wanted to avoid having to reconfigure buildbot every time we needed to change jacuzzi allocations, which is why we decided to put the allocations in an external service.

There are two places where buildbot fetches the allocation data: nextSlave functions and our builder prioritizing function. The first is straightforward, and was the only place I was expecting to make changes. These nextSlave functions are called with a list of available machines, and the function's job is to pick one of the connected machines to do the job. It was relatively simple to add support for this to buildbot. The need to handle the latter case, modifying our builder prioritization, didn't become apparent until testing... The reasoning is a bit convoluted, so I'll explain below for those interested.
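
A rough sketch of the first piece (this isn't the actual patch; the allocator lookup and the attribute names on the slave objects are assumptions for illustration):

import random

def make_next_slave(get_allocation):
    # get_allocation(builder_name) is assumed to return the set of worker
    # names allocated to that builder, e.g. fetched from the jacuzzi
    # allocator service
    def next_slave(builder, available_slaves):
        allocated = get_allocation(builder.name)
        if allocated:
            candidates = [s for s in available_slaves
                          if s.slave.slavename in allocated]
        else:
            # no jacuzzi for this builder: fall back to the whole pool
            candidates = available_slaves
        return random.choice(candidates) if candidates else None
    return next_slave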

AWS Support

Now that we had buildbot using the right workers, we needed to make sure that we were starting those workers in Amazon!

We adjusted our tools that monitor the job queue and start new EC2 instances to also check the jacuzzi allocations, and start the correct instances.

The gory details of build prioritizations

We have a function in buildbot which handles a lot of the prioritization of the job queue. For example, pending jobs for mozilla-central will get priority over jobs for any of the twigs, like ash or birch. Older jobs also tend to get priority over newer jobs. The function needs to return the list of builders in priority sorted order. Buildbot then processes each builder in turn, trying to assign pending jobs to any idle workers.

There are two factors which make this function complicated: each buildbot master is doing this prioritization independently of the others, and workers are becoming idle while buildbot is still processing the sorted list of builders. This second point caused prioritization to be broken (bug 659222) for a long time.

Imagine a case where you have 3 pending jobs (A, B, C), all for the same set of workers (W1, W2, W3). Job A is the most important, job C is the least. All the workers are busy. prioritizeBuilders has sorted our list of builders, and buildbot looks at A first. No workers are available, so it moves on to B next. Still no free workers. But now worker W1 connects, and buildbot examines job C and finds there are available workers (W1). So job C cuts in line and gets run before jobs A or B.

Our fix for this was to maintain a list of pending jobs for each set of available workers, and then discard all but the most important pending job for each worker set. In our example, we would see that all 3 pending jobs have the same worker set (W1, W2, W3), and so would temporarily ignore pending jobs B and C. This leaves buildbot only job A to consider. In our example above, it would find no available workers and stop processing. When W1 becomes available, it triggers another prioritization run, and again job A is the sole job under consideration and gets the worker.
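
A minimal sketch of that filtering step (data structures simplified for illustration):

def filter_pending(pending_jobs, worker_set_for):
    # pending_jobs is assumed to be sorted most important first;
    # worker_set_for(job) returns the workers eligible to run it
    seen_worker_sets = set()
    keep = []
    for job in pending_jobs:
        workers = frozenset(worker_set_for(job))
        if workers in seen_worker_sets:
            # a more important job already covers these workers
            continue
        seen_worker_sets.add(workers)
        keep.append(job)
    return keep

# With jobs A > B > C all sharing workers {W1, W2, W3}, only A survives,
# so a newly-free worker can't be grabbed by the lower-priority C.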

Unfortunately, this behaviour conflicted with what we were trying to do with jacuzzis. Imagine in the same example above we have jacuzzis allocated so that W1 is allocated to only do jobs of type C. If W1 is the only available worker, and C is getting discarded each time the prioritization is done, we're not making any forward progress. In fact, we've triggered a bit of an infinite loop, since currently we trigger another round of prioritizing/job assignments if we have available workers and have temporarily discarded lower priority jobs.

The fix was to integrate the jacuzzi allocations into this prioritization logic as well. I'm a little concerned about the runtime impact of this, since we need to query the allocated workers for every pending job type. One thing we're considering is to change the allocator to return the allocations as a single monolithic blob, rather than having per-job-type requests. Or, we could support both types.

AWS, networks, and burning trees

Help! The trees are on fire!

You may have noticed that the trees were closed quite a bit last week due to infrastructure issues that all stem from network flakiness. We're really sorry about that! RelEng and IT have been working very hard to stabilize the network.

Symptoms of the problem have been pretty varied, but the two most visible ones are:

  • bug 957502 - All trees closed due to timeouts as usw2 slaves download or clone
  • bug 964804 - jobs not being scheduled

We've been having more and more problems with our VPN tunnel to our Amazon infrastructure, particularly in the us-west-2 region. Prior to any changes we put in place last week, ALL traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center.

While debugging the scheduling bug early last week, we discovered that latency to our mysql database used for scheduling was nearly 500ms. No wonder scheduling jobs was slow!

Digging deeper - the network is in bad shape

Investigation of this latency revealed that one of the core firewalls deployed for RelEng traffic was overloaded, and that a major contributor to the firewall load was traffic to/from AWS over the VPN tunnels. We were consistently pushing around 1 gigabit/s to our private network inside Amazon. The extra load on the firewall required for the VPN encryption caused the latency to go up for all traffic passing through the firewall, even for traffic not bound for AWS!

Our next step was to do a traffic analysis to see how we could reduce the amount of traffic going over the VPN tunnel.

Michal was able to get the traffic analysis done, and his report indicated that the worst traffic offender was traffic coming from ftp.m.o. All of our test jobs pull the builds, tests and crash symbols from ftp. With the increase in push load, more types of jobs, and more tests, our traffic from ftp has really exploded in the past months. Because all of our traffic was going over the VPN tunnel, this put a huge load on the VPN component of the firewall.

Firefighting

Since all of the content on ftp is public, we decided we could route traffic to/from ftp over the public internet rather than our VPN tunnel, provided we could ensure file integrity at the same time. This required a few changes to our infrastructure:

  • Rail re-created all of our EC2 instances to have public IP addresses, in addition to private IP addresses they had already. Without a public IP, Amazon can't route traffic to the public internet. You can set up extra instances to act as NAT gateways, but that's much more complicated and introduces another point of failure. (bug 965001)
  • We needed a new IP endpoint for ftp so that we could be sure that only SSL traffic was going over the public routes. Chris Turra set up ftp-ssl.m.o, and then I changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel (bug 965289).
  • I landed a change to mozharness to download files from ftp-ssl instead of ftp.

In addition, we also looked at other sources of traffic. The next highest source of traffic was hg.m.o, followed by pvtbuilds.m.o.

Ben quickly rolled out a fix to our test slaves to allow them to cache the gaia repository locally rather than re-cloning gaia each time (bug 964411). We were surprised to discover the gaia repository has grown to 1.2 GB, so this was a big win!

It was clear we would need to divert traffic to hg in a similar way we did for ftp. Unfortunately, adding a DNS/IP endpoint for hg isn't as simple as for ftp, so aki has been going through our code changing all our references of http://hg.mozilla.org to https://hg.mozilla.org (bug 960571). Once we've found and fixed all usages of unsecured access to hg, we can change the routing for hg traffic like we did for FTP.

Aki also patched up some of our FirefoxOS build configs to limit which device builds upload per-checkin (bug 966412), and reduce the amount of data we're sending back to pvtbuilds.m.o over the VPN tunnel.

Ted tracked down a regression in how we were packaging some files in our test zips which cut the size by about 100MB (bug 963651).

On Monday, Adam added some more capacity to the firewall, which should allow us to cope with the remaining load better.

State of the network

This week we're in much better shape than we were last week. If you look at traffic this week vs last week, you can see that traffic is down significantly (see far right of graph):

aws-traffic.png

and latency has been kept under control:

aws-ping.png

We're not out of the woods yet though - we're still seeing occasional traffic spikes close to our maximum. The good news is there's still more we can do to reduce or divert the traffic:

  • we're not done transitioning all FTP/HG traffic to public routes
  • there's still plenty of room to reduce test zip size by splitting them up and changing the compression format used (bug 917999)
  • we can use caching in S3 from the test machines to avoid having to download identical builds/tests from FTP multiple times

Extra special thanks also to Hal who has been keeping all of us focused and on track on this issue.

Blobber is live - upload ALL the things!

Last week without any fanfare, we closed bug 749421 - Allow test slaves to save and upload files somewhere. This has actually been working well for a few months now; it's just taken a while to close it out properly, and I completely failed to announce it anywhere. Mea culpa!

This was a really important project, and deserves some fanfare! cue trumpets, parades and skywriters

The goal of this project was to make it easier for developers to get important data out of the test machines reporting to TBPL. Previously the only real output from a test job was the textual log. That meant if you wanted a screen shot from a failing test, or the dump from a crashing process, you needed to format it somehow into the log. For screen shots we would base64 encode a png image and print it to the log as a data URL!

With blobber running successfully, it's now possible to upload extra files from your test runs on TBPL. Things like screen shots, minidump logs and zip files are already supported.

Getting new files uploaded is pretty straightforward. If the environment variable MOZ_UPLOAD_DIR is set in your test's environment, you can simply copy files there and they will be uploaded after the test run is complete. Links to the files are output in the log. e.g.

15:21:18     INFO -  (blobuploader) - INFO - TinderboxPrint: Uploaded 70485077-b08a-4530-8d4b-c85b0d6f9bc7.dmp to http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5778e0be8288fe8c91ab69dd9c2b4fbcc00d0ccad4d3a8bd78d3abe681af13c664bd7c57705822a5585655e96ebd999b0649d7b5049fee1bd75a410ae6ee55af
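
For example, a test (or anything else running inside the job) can just drop files into that directory; a minimal sketch, with a made-up file name:

import os
import shutil

upload_dir = os.environ.get("MOZ_UPLOAD_DIR")
if upload_dir:
    # anything copied here is uploaded after the test run and linked in the log
    shutil.copy("failure-screenshot.png", upload_dir)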

Your thanks and praise should go to our awesome intern, Mihai Tabara, who did most of the work here.

Most test jobs are already supported; if you're unsure whether the job type you're interested in is supported, just search for MOZ_UPLOAD_DIR in the log on tbpl. If it's not there and you need it, please file a bug!