Spread Firefox Affiliate Button

Pages

One useful script, a linux version

Johnathan posted links to 3 scripts he finds useful. His sattap script looked handy, so I hacked it up for linux. Run it to do a screen capture, and upload the image to a website you have ssh access into. The link is printed out, and put into the clipboard.

Hope you find this useful!

#!/bin/sh
# sattap - Send a thing to a place
set -e
 
SCP_USER='catlee'
SCP_HOST='people.mozilla.org'
SCP_PATH='~/public_html/sattap/'
 
HTTP_URL="http://people.mozilla.org/~catlee/sattap/"
 
FILENAME=`date | md5sum | head -c 8`.png
FILEPATH=/tmp/$FILENAME
 
echo Capturing...
import $FILEPATH
echo Copying to $SCP_HOST
scp $FILEPATH ${SCP_USER}@${SCP_HOST}:$SCP_PATH
echo Deleting local copy
rm $FILEPATH
 
echo $HTTP_URL$FILENAME | xclip -selection clipboard
echo Your file should be at $HTTP_URL$FILENAME, which is also in your paste buffer

What happens when you push

As of November 1st, when you push a change to mozilla-central, the following builds and tests get triggered:

  • Linux optimized build
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
    • Talos
    • Talos nochrome
    • Talos jss
    • Talos dirty
    • Talos tp4
    • Talos cold
  • Linux debug build + leak tests
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
  • Linux optimized + refcounting build
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
  • Windows optimized build
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
    • XP Talos
    • XP Talos nochrome
    • XP Talos jss
    • XP Talos dirty
    • XP Talos tp4
    • Vista Talos
    • Vista Talos nochrome
    • Vista Talos jss
    • Vista Talos dirty
    • Vista Talos tp4
  • Windows debug build + leak tests
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
  • Windows optimized + refcounting build
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
  • Mac OSX optimized build
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
    • Leopard Talos
    • Leopard Talos nochrome
    • Leopard Talos jss
    • Leopard Talos dirty
    • Leopard Talos tp4
    • Leopard Talos cold
  • Mac OSX debug build + leak tests
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
  • Mac OSX optimized + refcounting build
    • mochitest 1/5
    • mochitest 2/5
    • mochitest 3/5
    • mochitest 4/5
    • mochitest 5/5
    • everythingelse
  • Linux 64-bit build
  • Maemo Build
    • mochitest chrome
    • crashtest
    • mochitest 1/4
    • mochitest 2/4
    • mochitest 3/4
    • mochitest 4/4
    • reftest
    • xpcshell
    • Talos Tdhtml
    • Talos Tgfx
    • Talos Tp3
    • Talos Tp4
    • Talos Tp4 nochrome
    • Talos Ts
    • Talos Tsspider
    • Talos Tsvg
    • Talos Twinopen
    • Talos non-Tp1
    • Talos non-Tp2
  • WinCE build
  • Windows Mobile build
  • Linux Fennec Desktop build
  • Windows Fennec Desktop build
  • Mac OSX Fennec Desktop build

That’s 111 distinct build and test jobs that get spread out across our build and tests pools. A total of 40 machine hours per checkin in our main build, test and talos pools is used, plus an additional 25 machine hours on the mobile devices!!!

In addition, we also do certain types of jobs on a periodic basis:

  • Nightly builds
  • XULRunner builds
  • Shark builds
  • Code coverage runs
  • L10n repacks for 72 locales and 7 platforms (Windows, Mac OSX, Linux, Windows Fennec, Mac OSX Fennec, Linux Fennec, Maemo); that’s 504 individual repacks!

In the course of collecting the data for this post, I’ve been constantly amazed at the amount of stuff that we’re doing, and the scale of the infrastructure! The list above is just for our mozilla-central branch, and I’ve most likely missed something. We do similar amounts of work for our other branches as well: Try, mozilla-1.9.2, mozilla-1.9.1, TraceMonkey, Electrolysis, and Places. Things have certainly changed a lot in the past year.

When do tests get run?

Continuing our RelEng Blogging Blitz, I’m going to be discussing how and when tests get triggered in our build automation systems.

We’ve got two basic classes of tests right now: unit tests, and performance tests, a.k.a. Talos. The unit tests are run on the same pool of machines that the builds are done on, while the performance tests are run on a separate pool of around 100 Mac Minis. Both kinds of tests are triggered in similar ways.

For refcounting (“unittest”) builds, once the compile step is complete, the binaries are packaged up with make package, the tests are packaged up with make package-tests, the symbols are packaged up with make buildsymbols, and then the whole lot is uploaded to stage.mozilla.org using make upload. Once they’re uploaded, we have valid URLs that refer to the builds, tests, and symbols. We then trigger the relevant unit test runs on that build. When a slave is assigned this test run, it then downloads the build, tests, and symbols from stage and starts running the tests.

On mozilla-central, we’ve also recently started to run unittests on optimized and debug builds. We’re hoping to bring this functionality to mozilla-1.9.2 once all the kinks are worked out.

For regular optimized builds, in addition to unittests, we also trigger performance tests on the freshly minted build. OSX builds are currently tested on Tiger and Leopard for mozilla-1.9.1 and mozilla-1.9.2, and on Leopard only for mozilla-central and project branches. Windows builds are tested on XP and Vista, and Linux builds are tested on Ubuntu.

In addition to having tests triggered automatically by builds, the Release Engineering Sheriff can re-run unittests or performance tests on request!

When do builds happen?

As part of our RelEng Blogging Blitz, I’ll give a quick overview of when and how builds get triggered on our main build infrastructure.

There are three ways builds can be triggered.

The first, and most common way, is when a developer pushes his or her changes to hg.mozilla.org. Our systems check for new changes every minute or so, and put new changes into a queue. Once the tree has been quiet for 3 minutes (i.e. no changes for 3 minutes), a new build request is triggered with all queued changes. If there is a free slave available, then a new build starts immediately, otherwise the build request is put in a queue.

The second way builds are triggered is via a nightly scheduler. We start triggering builds on branches at 3:02am pacific local time (some branches are triggered at 3:32am or 4:02 am). We run at 3:02am to avoid problems with daylight savings <-> standard time transitions. In the fall there are two 2:59am’s when we go back to standard time, and in the spring transition there is no 2:59am. The start times are staggered to avoid slamming hg.mozilla.org, or other shared resources.

The last way builds can be triggered is manually. The Release Engineering Sheriff can trigger builds on specific revisions, or rebuild past builds pretty easily, so if you need a build triggered, contact your friendly neighbourhood RelEng Sheriff!

RelEng Blogging Blitz!

Folks from RelEng are going to be blogging about various bits of our build, test, and release automation infrastructure this week.

Stay tuned for more!

RelEng Blogging Blitz is coming soon!

Several members of the Release Engineering team are going to be blogging next week about various bits of the build, test, and release automation infrastructure for Firefox.

If there’s something about our infrastructure you’ve always wondered about, give us a shout and we’ll do our best to explain it!

poster 0.5 released

I’ve just released version 0.5 of poster, the streaming http upload library for python. It’s easy_installable or downloadable directly from the cheeseshop.

Thanks again to everybody who’s written in with bug fixes and suggestions!

Faster signing

Once upon a time, when Mozilla released a new version of Firefox, signing all of the .exes and .dlls for all of the locales took a few hours. As more locales were added, and the file sizes increased, this time has increased to over 8 hours to sign all the locales in a Firefox 3.5 release. This has been a huge bottleneck for getting new releases out the door.

For our most recent Firefox releases, we’ve started using some new signing infrastructure that I’ve been working on over the past few months. There are quite a few reasons why the new infrastructure is faster:

  • Faster hardware. We’ve moved from an aging single core system to a new quad-core 2.5 GHz machine with 4 GB of RAM.
  • Concurrent signing of locales. Since we’ve got more cores on the system, we should take advantage of them! The new signing scripts spawn off 4 child processes, each one grabs one locale at a time to process.
  • In-process compression/decompression. Our complete.mar files use bz2 compression for every file in the archive. The old scripts would call the ‘bzip2′ and ‘bunzip2′ binaries to do compression and decompression. It’s significantly faster to do these operations in-process.
  • Better caching of signed files. The biggest win came from the simple observation that after you sign a file, and re-compress it to include in a .mar file, you should be caching the compressed version to use later. The old scripts did cache signed files, but only the decompressed versions. So to sign the contents of our mar files, the contents would have to be completely unpacked and decompressed, then the cache was checked, files were signed or pulled from cache as necessary, and then re-compressed again.

    Now, we unpack the mar file, check the cache, and only decompress / sign / re-compress any files that need signing but aren’t in the cache. We don’t even bother decompressing files that don’t need signing, another difference from the original scripts.

    Big thanks to Nick Thomas for having the idea for this at the Mozilla All-Hands in April.

As a result of all of this, signing all our locales can be done in less than 15 minutes now! See bug 470146 for the gory details.

The main bottleneck at this point is the time it takes to transfer all of the files to the signing machine and back again. For a 3.5 release, there’s around 3.5 GB of data to transfer back and forth, which takes on the order of 10 minutes to pull down to the signing machine, and another few minutes to push back after signing is done.

Profiling Buildbot

Buildbot is a critical part of our build infrastructure at Mozilla. We use it to manage builds on 5 different platforms (Linux, Mac, Windows, Maemo and Windows Mobile), and 5 different branches (mozilla-1.9.1, mozilla-central, TraceMonkey, Electrolysis, and Places). All in all we have 80 machines doing builds across 150 different build types (not counting Talos; all the Talos test runs and slaves are managed by a different master).

And buildbot is at the center of it all.

The load on our machine running buildbot is normally fairly high, and occasionally spikes so that the buildbot process is unresponsive. It normally restores itself within a few minutes, but I’d really like to know why it’s doing this!

Running our staging buildbot master with python’s cProfile module for almost two hours yields the following profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   416377 4771.188    0.011 4796.749    0.012 {select.select}
       52  526.891   10.133  651.043   12.520 /tools/buildbot/lib/python2.5/site-packages/buildbot-0.7.10p1-py2.5.egg/buildbot/status/web/waterfall.py:834(phase2)
     6518  355.370    0.055  355.370    0.055 {posix.fsync}
   232582  238.943    0.001 1112.039    0.005 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:150(dataReceived)
 10089681  104.395    0.000  130.089    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:36(b1282int)
36798140/36797962   83.536    0.000   83.537    0.000 {len}
 29913653   70.458    0.000   70.458    0.000 {method 'append' of 'list' objects}
      311   63.775    0.205   63.775    0.205 {bz2.compress}
 10088987   56.581    0.000  665.982    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:141(gotItem)
4010792/1014652   56.079    0.000  176.693    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/jelly.py:639(unjelly)
2343910/512709   47.954    0.000  112.446    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:281(_encode)

Interpreting the results

select shows up in the profile because we’re profiling wall clock time, not cpu time. So the more time we’re spending in select, the better, since that means we’re just waiting for data. The overall run time for this profile was 7,532 seconds, so select is taking around 63% of our total time. I believe the more time spent here, the better. Time spent inside select is idle time.

We already knew that the buildbot waterfall was slow (the second line in profile).

fsync isn’t too surprising either. buildbot calls fsync after writing log files to disk. We’ve considered removing this call, and this profile lends support to our original guess.

The next entries really surprised me, twisted’s dataReceived and a decoding function, b1282int. These are called when processing data received from the slaves. If I’m reading this correctly, this means that dataReceived and children account for around 40% of our total time after you remove the time spent in select. 1112 / (7532-4796) = 40%.

These results are from our staging buildbot master, which doesn’t have anywhere near the same load as the production buildbot master. I would expect that the time spent waiting in select would go down on the production master (there’s more data being received, more often), and that time spent in fsync and dataReceived would go up.

What to do about it?

A few ideas….

  • Use psyco to get some JIT goodness applied to some of the slower python functions.
  • Remove the fsync call after saving logs.
  • Use the cpu-time to profile rather than wallclock time. This will give a different perspective on the performance of buildbot, which should give better information about where we’re spending time processing data.
  • Implement slow pieces in C (or cython). Twisted’s Banana library looks do-able in C, and also is high up in the profile.
  • Send less data from the slaves. We’re currently logging all stdout/stderr produced by the slaves. All of this data is processed by the master process and then saved to disk.
  • Rearchitect buildbot to handle this kind of load.
  • Have more than one buildbot master, each one handling fewer slaves. We’re actively looking into this approach, since it also allows us to have some redundancy for this critical piece of our infrastructure.

Parallelizing Unit Tests

Last week we flipped the switch and turned on running unit tests on packaged builds for our mozilla-1.9.1, mozilla-central, and tracemonkey branches.

What this means is that our current unit test builds are uploaded to a web server along with all their unit tests. Another machine will then download the build and tests, and run various test suites on them.

Splitting up the tests this way allows us to run the test suites in parallel, so the mochitest suite will run on one machine, and all the other suites will be run on another machine (this group of tests is creatively named ‘everythingelse’ on Tinderbox).

paralleltests

Splitting up the tests is a critical step towards reducing our end-to-end time, which is the total time elapsed between when a change is pushed into one of the source repositories, and when all of the results from that build are available. Up until now, you had to wait for all the test suites to be completed in sequence, which could take over an hour in total. Now that we can split the tests up, the wait time is determined by the longest test suite. The mochitest suite is currently the biggest chunk here, taking somewhere around 35 minutes to complete, and all of the other tests combined take around 20 minutes. One of the next steps for us to do is to look at splitting up the mochitests into smaller pieces.

For the time being, we will continue to run the existing unit tests on the same machine that is creating the build. This is so that we can make sure that running tests on the packaged builds is giving us the same results (there are already some known differences: bug 491675, bug 475383)

Parallelizing the unit tests, and the infrastructure required to run them, is the first step towards achieving a few important goals.

- Reducing end-to-end time.

- Running unit tests on debug, as well as on optimized builds. Once we’ve got both of these going, we can turn off the builds that are currently done solely to be able to run tests on them.

- Running unit tests on the same build multiple times, to help isolate intermittent test failures.

All of the gory details can be found in bug 383136.