Release Engineering makes heavy use of Amazon's EC2 infrastructure. The vast majority of our Firefox builds for Linux and Android, as well as our Firefox OS builds, happen in EC2. We also do a ton of unit testing inside EC2.
Amazon offers a service inside EC2 called spot instances. This is a way for Amazon to sell off unused capacity by auction. You can place a bid for how much you want to pay by the hour for a particular type of VM, and if your price is more than the current market price, you get a cheap VM! For example, we're able to run tests on machines for $0.025/hr instead of the regular $0.12/hr on-demand price. We started experimenting with spot instances back in November.
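To make the numbers above concrete, here is a minimal sketch of the arithmetic. The function names and the 1000-hour figure are made up for illustration, and the bid check is a simplification of the auction as described above, not Amazon's actual pricing logic.

```python
def spot_savings(on_demand_hourly, spot_hourly, hours):
    """Return (dollars saved, fraction saved) for running `hours` on spot."""
    dollars_saved = (on_demand_hourly - spot_hourly) * hours
    fraction_saved = 1 - spot_hourly / on_demand_hourly
    return dollars_saved, fraction_saved


def bid_is_fulfilled(bid, market_price):
    """Simplified auction rule from the text: you get a VM while your bid
    is above the current market price."""
    return bid > market_price


# Prices from the post; the 1000 machine-hours are a hypothetical workload.
dollars, fraction = spot_savings(on_demand_hourly=0.12, spot_hourly=0.025, hours=1000)
print(f"saved ${dollars:.2f} ({fraction:.0%} cheaper)")
```

At those prices a test machine costs roughly 79% less per hour on spot than on-demand.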
There are a few downsides to using spot instances, however. One is that your VM can be killed at any moment if somebody else bids a higher price than yours. The second is that your instances can't (easily) use extra EBS volumes for persistent storage. Once the VM powers off, the storage is gone.
These issues posed different challenges for us. In the first case, we were worried about the impact that interrupted jobs would have on the tree. We wanted to avoid the case where jobs were run on a spot instance, interrupted because the market price changed, and then retried on another spot instance subject to the same termination. This required changes to two systems:
- aws_watch_pending needed to know to start regular on-demand EC2 instances in the case of retried jobs. This has landed and has been working well, but it really needs the next piece to be complete.
- buildbot needed to know not to pick a spot instance to run retried jobs. This work is being tracked in bug 936222. It turns out that we're not seeing too many spot instances being killed off due to market price, so this work is less urgent.
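The retry rule behind both changes can be sketched roughly as follows. This is not the actual aws_watch_pending or buildbot code; `PendingJob`, `is_retry`, and `instance_kind_for` are hypothetical names used only to illustrate the policy.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    # Hypothetical record of a pending job; the real systems track more state.
    name: str
    is_retry: bool


def instance_kind_for(job):
    """Send retried jobs to on-demand instances so they cannot be
    interrupted a second time by a spot price change."""
    return "on-demand" if job.is_retry else "spot"


jobs = [PendingJob("mochitest-1", is_retry=False),
        PendingJob("mochitest-1", is_retry=True)]
for job in jobs:
    print(job.name, "->", instance_kind_for(job))
```

The same check has to exist in both places: once when deciding which kind of instance to start, and again when picking a machine for the job. Otherwise a retried job could still land on an already-running spot instance.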
The second issue, the VM storage, turned out to be much more complicated to fix. We rely on puppet to make sure that VMs have consistent software packages and configuration files. Puppet requires per-host SSL certificates to be generated, and at Mozilla these certificates need to be signed by a central certificate authority. In our previous usage of EC2, we worked around this by puppetizing new instances on first boot and saving the disk image for later use.
With spot instances, we essentially need to re-puppetize every time we create a new VM.
Having fresh storage on boot also impacts the type of jobs we can run. We're starting with running test jobs on spot instances, since there's no state from previous tests that is valuable for the next test.
Builds are more complicated, since we depend on the state of previous builds to have faster incremental builds. In addition, the cost of having to retry a build is much higher than it is for a test. It could be that the spot instances stay up long enough or that we end up clobbering frequently enough that this won't have a major impact on build times. Try builds are always clobbers though, so we'll be running try builds on spot instances shortly.
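As a rough summary of the reasoning above, the current eligibility rule could be written like this. The function and its parameters are hypothetical, and the rule for regular builds may change once we know how long spot instances tend to stay up.

```python
def can_run_on_spot(job_type, always_clobbers=False):
    """Illustrative policy: tests carry no useful state between runs, and
    builds that always clobber (such as try builds) gain nothing from a
    warm disk, so both can tolerate fresh storage on every boot."""
    if job_type == "test":
        return True
    if job_type == "build":
        return always_clobbers
    return False


print(can_run_on_spot("test"))                         # unit tests
print(can_run_on_spot("build", always_clobbers=True))  # try builds
print(can_run_on_spot("build"))                        # incremental builds
```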
All this work is being tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=935683
Big props to Rail for getting this done so quickly. With all this great work, we should be able to scale better while reducing our costs.