<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>chris' random ramblings (Posts about cloud)</title><link>https://atlee.ca/</link><description></description><atom:link href="https://atlee.ca/categories/cloud.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><lastBuildDate>Sat, 22 Feb 2025 20:04:30 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Gotta Cache 'Em All</title><link>https://atlee.ca/posts/cache-em-all/</link><dc:creator>chris</dc:creator><description>&lt;section id="too-much-traffic"&gt;
&lt;h2&gt;TOO MUCH TRAFFIC!!!!&lt;/h2&gt;
&lt;p&gt;&lt;a class="reference external" href="https://atlee.ca/blog/posts/aws-networks-and-burning-trees.html"&gt;Waaaaaaay back in February&lt;/a&gt; we identified overall network bandwidth as a
cause of job failures on &lt;a class="reference external" href="https://tbpl.mozilla.org"&gt;TBPL&lt;/a&gt;. We were pushing too much traffic over our
VPN link between Mozilla's datacentre and AWS.  Since then we've been
working on a few approaches to cope with the increased traffic while at the
same time reducing our overall network load.  Most recently we've deployed
HTTP caches inside each AWS region.&lt;/p&gt;
&lt;img alt="Network traffic from January to August 2014" class="align-center" src="https://atlee.ca/posts/cache-em-all/releng-traffic-2014.png"&gt;
&lt;/section&gt;
&lt;section id="the-answer-cache-all-the-things"&gt;
&lt;h2&gt;The answer - cache all the things!&lt;/h2&gt;
&lt;a class="reference external image-reference" href="http://xkcd.com/908/"&gt;&lt;img alt="Obligatory XKCD" class="align-center" src="http://imgs.xkcd.com/comics/the_cloud.png"&gt;&lt;/a&gt;
&lt;section id="caching-build-artifacts"&gt;
&lt;h3&gt;Caching build artifacts&lt;/h3&gt;
&lt;p&gt;The primary target for caching was downloads of build/test/symbol packages
by test machines from file servers. These packages are generated by the
build machines and uploaded to various file servers. The same packages are
then downloaded many times by different machines running tests. This was a
perfect candidate for caching, since the same files were being requested by
many different hosts in a relatively short timespan.&lt;/p&gt;
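&lt;p&gt;To give a rough idea of the pattern, here's a hedged sketch in Python of how a test machine might prefer a regional cache and fall back to the origin file server. The cache hostname and the URL-rewriting scheme are invented for illustration; the real details live in the proxxy source.&lt;/p&gt;

```python
# Hypothetical sketch: prefer a regional HTTP cache for artifact downloads,
# falling back to the origin file server if the cache is unreachable.
# The cache hostname and path scheme below are made up for illustration.
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.error import URLError

REGIONAL_CACHE = "proxxy.us-east-1.example.com"  # assumed cache hostname

def cached_url(origin_url, cache_host=REGIONAL_CACHE):
    """Rewrite an origin URL so it is fetched via the regional cache."""
    parsed = urlparse(origin_url)
    # Encode the origin host into the path so the cache can proxy to it.
    return "http://%s/%s%s" % (cache_host, parsed.netloc, parsed.path)

def fetch(origin_url):
    try:
        return urlopen(cached_url(origin_url), timeout=10).read()
    except URLError:
        # Cache host unreachable: go straight to the origin server.
        return urlopen(origin_url).read()
```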
&lt;/section&gt;
&lt;section id="caching-tooltool-downloads"&gt;
&lt;h3&gt;Caching tooltool downloads&lt;/h3&gt;
&lt;p&gt;&lt;a class="reference external" href="https://wiki.mozilla.org/ReleaseEngineering/Applications/Tooltool"&gt;Tooltool&lt;/a&gt; is a simple system RelEng uses to distribute static assets to
build/test machines. While the machines do maintain a local cache of files,
the caches are often empty because the machines are newly created in AWS.
Having the files in local HTTP caches speeds up transfer times and
decreases network load.&lt;/p&gt;
&lt;/section&gt;
&lt;/section&gt;
&lt;section id="results-so-far-50-decrease-in-bandwidth"&gt;
&lt;h2&gt;Results so far - 50% decrease in bandwidth&lt;/h2&gt;
&lt;p&gt;Initial deployment was completed on August 8th (end of week 32 of 2014).
As the graph above shows, we've cut our bandwidth by about 50%!&lt;/p&gt;
&lt;/section&gt;
&lt;section id="what-s-next"&gt;
&lt;h2&gt;What's next?&lt;/h2&gt;
&lt;p&gt;There is some more low-hanging fruit for caching. We have internal PyPI
repositories that could benefit from caches, and there's a long tail of
other miscellaneous downloads that could be cached as well.&lt;/p&gt;
&lt;p&gt;There are other improvements we can make to reduce bandwidth as well,
such as routing uploads from build machines outside the VPN tunnel, or
perhaps sending them directly to S3. Another big source of network traffic
is signing various packages (GPG signatures, MAR files, etc.), and we're
looking at ways to do that more efficiently. I'd also love to investigate
more efficient ways of compressing or transferring build artifacts overall;
there is a ton of duplication in the build and test packages between
different platforms, and even between different pushes.&lt;/p&gt;
&lt;/section&gt;
&lt;section id="i-want-to-know-moar"&gt;
&lt;h2&gt;I want to know MOAR!&lt;/h2&gt;
&lt;p&gt;Great! As always, all our work has been tracked in a bug, and worked out in
the open. The bug for this project is &lt;a class="reference external" href="https://bugzilla.mozilla.org/show_bug.cgi?id=1017759"&gt;1017759&lt;/a&gt;. The source code lives in
&lt;a class="reference external" href="https://github.com/mozilla/build-proxxy/"&gt;https://github.com/mozilla/build-proxxy/&lt;/a&gt;, and we have some basic
documentation available on our &lt;a class="reference external" href="https://wiki.mozilla.org/ReleaseEngineering/Applications/Proxxy"&gt;wiki&lt;/a&gt;. If this kind of work excites you,
&lt;a class="reference external" href="https://careers.mozilla.org/en-US/position/ohz2YfwA"&gt;we're hiring!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Big thanks to &lt;a class="reference external" href="https://github.com/laggyluke"&gt;George Miroshnykov&lt;/a&gt; for his work on developing proxxy.&lt;/p&gt;
&lt;/section&gt;</description><category>aws</category><category>build</category><category>cloud</category><category>graph</category><category>make-stuff-fast</category><category>mozilla</category><category>performance</category><guid>https://atlee.ca/posts/cache-em-all/</guid><pubDate>Tue, 26 Aug 2014 14:21:00 GMT</pubDate></item><item><title>Behind the clouds: how RelEng do Firefox builds on AWS</title><link>https://atlee.ca/posts/blog20121214behind-the-clouds/</link><dc:creator>chris</dc:creator><description>&lt;p&gt;RelEng have been expanding our usage of Amazon's AWS over the past few months as the development pace of the B2G project increases. In October we began moving builds off of Mozilla-only infrastructure and into a hybrid model where some jobs are done in Mozilla's infra, and others are done in Amazon. Since October we've &lt;a href="http://oduinn.com/blog/2012/11/27/releng-production-systems-now-in-3-aws-regions/"&gt;expanded into 3 amazon regions&lt;/a&gt;, and now have nearly 300 build machines in Amazon. Within each AWS region we've distributed our load across 3 &lt;a href="http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;availability zones&lt;/a&gt;.


&lt;/p&gt;&lt;h3&gt;That's great! But how does it work?&lt;/h3&gt;

Behind the scenes, we've written quite a bit of code to manage our new AWS infrastructure. This code is in our cloud-tools repo (&lt;a href="https://github.com/mozilla/build-cloud-tools"&gt;github&lt;/a&gt;|&lt;a href="http://hg.mozilla.org/build/cloud-tools/"&gt;hg.m.o&lt;/a&gt;) and uses the excellent &lt;a href="https://github.com/boto/boto"&gt;boto&lt;/a&gt; library extensively.



The two workhorses in there are &lt;a href="https://github.com/mozilla/build-cloud-tools/blob/master/aws/aws_watch_pending.py"&gt;aws_watch_pending&lt;/a&gt; and &lt;a href="https://github.com/mozilla/build-cloud-tools/blob/master/aws/aws_stop_idle.py"&gt;aws_stop_idle&lt;/a&gt;. &lt;a href="https://github.com/mozilla/build-cloud-tools/blob/master/aws/aws_stop_idle.py"&gt;aws_stop_idle&lt;/a&gt;'s job is pretty easy: it looks for EC2 instances that are idle and shuts them down safely. If an EC2 slave hasn't done any work in more than 10 minutes, it is shut down.
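The core idea behind aws_stop_idle can be sketched in a few lines. This is a simplified illustration, not the real implementation: the last_activity field is a stand-in for however idleness is measured, and the real tool also checks that the buildslave on each instance is safe to stop.

```python
# Simplified sketch of the aws_stop_idle idea: stop any instance that has
# been idle longer than a threshold. The instance dicts and last_activity
# field are hypothetical stand-ins for the real bookkeeping.
import time

IDLE_LIMIT = 10 * 60  # seconds; idle slaves are stopped after 10 minutes

def find_idle(instances, now=None):
    """Return the instances whose last recorded activity is too old."""
    now = now or time.time()
    return [i for i in instances if now - i["last_activity"] > IDLE_LIMIT]

def stop_idle(conn, instances):
    # conn is assumed to be a boto EC2 connection; stop_instances takes
    # a list of instance ids.
    for inst in find_idle(instances):
        conn.stop_instances(instance_ids=[inst["id"]])
```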



&lt;a href="https://github.com/mozilla/build-cloud-tools/blob/master/aws/aws_watch_pending.py"&gt;aws_watch_pending&lt;/a&gt; is a little more involved. Its job is to notice when there are pending jobs (like your build waiting to start!) and to resume EC2 instances. We take a few factors into account when starting up instances:

&lt;ul&gt;&lt;li&gt;We wait until a pending job is more than a minute old before starting anything. This allows in-house capacity to grab the job if possible, and other EC2 slaves that are online but idle also have a chance to take it.&lt;/li&gt;
    &lt;li&gt;Use any &lt;a href="http://aws.amazon.com/ec2/reserved-instances/"&gt;reserved instances&lt;/a&gt; first. As our AWS load stabilizes, we've been able to purchase some reserved instances to reduce our cost. Obviously, to reduce our cost, we have to use those reservations wherever possible! The code to do this is a bit more complicated than I'd like it to be since AWS reservations are specific to individual availability zones rather than whole regions.&lt;/li&gt;
    &lt;li&gt;Some regions are cheaper than others, so we prefer to start instances in the cheaper regions first.&lt;/li&gt;
    &lt;li&gt;Start instances that were most recently running. This should give better depend-build times, and it also helps slightly with billing. Amazon bills by the full instance-hour, so if you start one instance twice in an hour, you're charged for a single hour; if you instead start two different instances in that hour, you're charged for two hours.&lt;/li&gt;
&lt;/ul&gt;
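The priorities above boil down to a sort over the stopped instances. Here's a hedged sketch of that ordering; the instance fields and the price table are made up, and the real code in aws_watch_pending is more involved (for one thing, reservations are tied to specific availability zones, not whole regions):

```python
# Illustrative only: order candidate instances so that reserved instances
# come first, then cheaper regions, then the most recently running.
# The price table and instance fields are invented for this sketch.
REGION_COST = {"us-east-1": 0.45, "us-west-2": 0.45, "us-west-1": 0.90}

def start_order(candidates):
    """Sort stopped instances into the order we'd start them in."""
    return sorted(
        candidates,
        key=lambda i: (
            not i["reserved"],                 # False sorts first: use reservations
            REGION_COST.get(i["region"], 99),  # prefer cheaper regions
            -i["last_running"],                # most recently running first
        ),
    )
```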



Overall we're really happy with Amazon's services. Having APIs for nearly everything has made development &lt;em&gt;really&lt;/em&gt; easy.



&lt;h3&gt;What's next?&lt;/h3&gt;

Seeing as how test capacity is always woefully behind, we're hoping to be able to run a large number of our linux-based unittests on EC2, particularly those that don't require an accelerated graphics context.



After that? Maybe windows builds? Maybe automated regression hunting? What do you want to see?</description><category>aws</category><category>cloud</category><category>firefox</category><category>mozilla</category><guid>https://atlee.ca/posts/blog20121214behind-the-clouds/</guid><pubDate>Fri, 14 Dec 2012 23:15:49 GMT</pubDate></item></channel></rss>