Help! The trees are on fire!
You may have noticed that the trees were closed quite a bit last week due to infrastructure issues that all stem from network flakiness. We're really sorry about that! RelEng and IT have been working very hard to stabilize the network.
Symptoms of the problem have been pretty varied, but the two most visible ones are:
- bug 957502 - All trees closed due to timeouts as usw2 slaves download or clone
- bug 964804 - jobs not being scheduled
We've been having more and more problems with our VPN tunnel to our Amazon infrastructure, particularly in the us-west-2 region. Prior to any changes we put in place last week, ALL traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center.
While debugging the scheduling bug early last week, we discovered that latency to our mysql database used for scheduling was nearly 500ms. No wonder scheduling jobs was slow!
Digging deeper - the network is in bad shape
Investigation of this latency revealed that one of the core firewalls deployed for RelEng traffic was overloaded, and that a major contributor to the firewall load was traffic to/from AWS over the VPN tunnels. We were consistently pushing around 1 gigabit/s to our private network inside Amazon. The extra load on the firewall required for the VPN encryption caused the latency to go up for all traffic passing through the firewall, even for traffic not bound for AWS!
Our next step was to do a traffic analysis to see how we could reduce the amount of traffic going over the VPN tunnel.
Michal was able to get the traffic analysis done, and his report indicated that the worst traffic offender was traffic coming from ftp.m.o. All of our test jobs pull the builds, tests and crash symbols from ftp. With the increase in push load, more types of jobs, and more tests, our traffic from ftp has really exploded in the past months. Because all of our traffic was going over the VPN tunnel, this put a huge load on the VPN component of the firewall.
Since all of the content on ftp is public, we decided we could route traffic to/from ftp over the public internet rather than our VPN tunnel, provided we could ensure file integrity at the same time. This required a few changes to our infrastructure:
- Rail re-created all of our EC2 instances to have public IP addresses, in addition to private IP addresses they had already. Without a public IP, Amazon can't route traffic to the public internet. You can set up extra instances to act as NAT gateways, but that's much more complicated and introduces another point of failure. (bug 965001)
- We needed a new IP endpoint for ftp so that we could be sure that only SSL traffic was going over the public routes. Chris Turra set up ftp-ssl.m.o, and then I changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel (bug 965289).
- I landed a change to mozharness to download files from ftp-ssl instead of ftp.
Ben quickly rolled out a fix to our test slaves to allow them to cache the gaia repository locally rather than re-cloning gaia each time (bug 964411). We were surprised to discover the gaia repository has grown to 1.2 GB, so this was a big win!
It was clear we would need to divert traffic to hg in a similar way we did for ftp. Unfortunately, adding a DNS/IP endpoint for hg isn't as simple as for ftp, so aki has been going through our code changing all our references of http://hg.mozilla.org to https://hg.mozilla.org (bug 960571). Once we've found and fixed all usages of unsecured access to hg, we can change the routing for hg traffic like we did for FTP.
Aki also patched up some of our FirefoxOS build configs to limit which device builds upload per-checkin (bug 966412), and reduce the amount of data we're sending back to pvtbuilds.m.o over the VPN tunnel.
On Monday, Adam added some more capacity to the firewall, which should allow us to cope with the remaining load better.
State of the network
This week we're in much better shape than we were last week. If you look at traffic this week vs last week, you can see that traffic is down significantly (see far right of graph):
and latency has been kept under control:
We're not out of the woods yet though - we're still seeing occasional traffic spikes close to our maximum. The good news is there's still more we can do to reduce or divert the traffic:
- we're not done transitioning all FTP/HG traffic to public routes
- there's still plenty of room to reduce test zip size by splitting them up and changing the compression format used (bug 917999)
- we can use caching in S3 from the test machines to avoid having to download identical builds/tests from FTP multiple times
Extra special thanks also to Hal who has been keeping all of us focused and on track on this issue.