
Posts about releng (old posts, page 1)

PyCon 2016 report

I had the opportunity to spend last week in Portland for PyCon 2016. I'd like to share some of my thoughts and some pointers to good talks I was able to attend. The full schedule can be found here and all the videos are here.

Monday

Brandon Rhodes' Welcome to PyCon was one of the best introductions to a conference I've ever seen. Unfortunately I can't find a link to a recording... What I liked about it was that he made everyone feel very welcome to PyCon and to Portland. He explained some of the simple (but important!) practical details like where to find the conference rooms, how to take transit, etc. He noted that for the first time, live transcriptions of the talks were being displayed on screens beside the speaker slides for hearing-impaired attendees.

He also emphasized the importance of keeping questions short during Q&A after the regular sessions: "Please form your question in the form of a question." I've been to way too many Q&A sessions where the person asking the question took the opportunity to go off on a long, unrelated tangent. For the most part, this advice was followed at PyCon: I didn't see very many long-winded questions or statements during Q&A sessions.

Machete-mode Debugging

(abstract; video)

Ned Batchelder gave this great talk about using Python's language features to debug problematic code. He ran through several examples of tricky problems that could come up, and how to use things like monkey patching and the debug trace hook to find out where the problem is. One piece of advice I liked was when he said that it doesn't matter how ugly the code is, since it's only going to last 10 minutes. The point is to get the information you need out of the system the easiest way possible, and then you can undo your changes.
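
For instance, here's a minimal sketch of the monkey-patching trick (my own illustration, with a hypothetical mylib.frob standing in for the suspect function):

```python
import traceback

import mylib  # hypothetical module containing the suspect function

# Temporarily wrap the suspect function to print a stack trace on
# every call -- ugly, but this code only has to live for 10 minutes.
_original = mylib.frob

def noisy_frob(*args, **kwargs):
    traceback.print_stack()  # show exactly who is calling us
    return _original(*args, **kwargs)

mylib.frob = noisy_frob
```

Once you've found where the calls come from, you just revert the patch.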

Refactoring Python

(abstract; video)

I found this session pretty interesting. We certainly have lots of code that needs refactoring!

Security with object-capabilities

(abstract; video; slides)

I found this interesting, but a little too theoretical. Object capabilities are a completely orthogonal alternative to access control lists as a way to model security and permissions. It was hard for me to see how we could apply this to the systems we're building.

Awaken your home

(abstract; video)

A really cool intro to the Home Assistant project, which integrates all kinds of IoT type things in your home. E.g. Nest, Sonos, IFTTT, OpenWrt, light bulbs, switches, automatic sprinkler systems. I'm definitely going to give this a try once I free up my Raspberry Pi.

Finding closure with closures

(abstract; video)

A very entertaining session about closures in Python. Does Python even have closures? (yes!)
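
For anyone wondering what that looks like, a quick example of a Python closure (standard textbook stuff, not from the talk):

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count  # close over the enclosing function's variable
        count += 1
        return count

    return increment

counter = make_counter()
print(counter(), counter(), counter())  # 1 2 3
```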

Life cycle of a Python class

(abstract; video)

Lots of good information about how classes work in Python, including some details about meta-classes. I think I understand meta-classes better after having attended this session. I still don't get descriptors though!

(I hope Mike learns soon that __new__ is pronounced "dunder new" and not "under under new"!)
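
To make metaclasses a bit more concrete, here's a toy example of my own (note the dunder new!); the metaclass's __new__ runs when the class statement itself executes:

```python
class Meta(type):
    def __new__(mcls, name, bases, namespace):
        print("creating class", name)
        return super().__new__(mcls, name, bases, namespace)

class Widget(metaclass=Meta):  # prints: creating class Widget
    pass
```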

Deep learning

(abstract; video)

Very good presentation about getting started with deep learning. There are lots of great libraries and pre-trained neural networks out there to get started with!

Building protocol libraries the right way

(abstract; video)

I really enjoyed this talk. Cory Benfield describes the importance of keeping a clean separation between your protocol parsing code, and your IO. It not only makes things more testable, but makes code more reusable. Nearly every HTTP library in the Python ecosystem needs to re-implement its own HTTP parsing code, since all the existing code is tightly coupled to the network IO calls.
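
Here's a rough sketch of the kind of separation he advocates (a toy line-based protocol of my own invention, not his actual API): the protocol object consumes bytes and returns parsed events, and never touches a socket.

```python
class LineProtocol:
    """Toy sans-IO parser: feed it bytes, get complete lines back."""

    def __init__(self):
        self._buffer = b""

    def receive_data(self, data):
        self._buffer += data
        *lines, self._buffer = self._buffer.split(b"\n")
        return lines  # complete lines; any partial line stays buffered

def read_lines(sock, parser):
    # The IO layer is a thin, separate shell around the parser.
    while True:
        data = sock.recv(4096)
        if not data:
            return
        for line in parser.receive_data(data):
            yield line

# Because the parser never does IO, it's trivial to test:
parser = LineProtocol()
assert parser.receive_data(b"GET / HT") == []
assert parser.receive_data(b"TP/1.1\n") == [b"GET / HTTP/1.1"]
```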

Tuesday

Guido's Keynote

(video)

Some interesting notes in here about the history of Python, and a look at what's coming in 3.6.

Click

(abstract; video)

An intro to the click module for creating beautiful command line interfaces.

I like that click helps you to build testable CLIs.
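
A minimal example of the kind of CLI click gives you, including an in-process test with its CliRunner:

```python
import click
from click.testing import CliRunner

@click.command()
@click.option("--count", default=1, help="Number of greetings.")
@click.argument("name")
def hello(count, name):
    """Greet NAME the given number of times."""
    for _ in range(count):
        click.echo("Hello, %s!" % name)

# CliRunner invokes the command in-process -- no subprocess needed.
runner = CliRunner()
result = runner.invoke(hello, ["--count", "2", "world"])
assert result.output == "Hello, world!\nHello, world!\n"
```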

HTTP/2 and asynchronous APIs

(abstract; video)

A good introduction to what HTTP/2 can do, and why it's such an improvement over HTTP/1.x.

Remote calls != local calls

(abstract; video)

Really good talk about failing gracefully. He covered some familiar topics like adding timeouts and retries to things that can fail, but also introduced to me the concept of circuit breakers. The idea with a circuit breaker is to prevent talking to services you know are down. For example, if you have failed to get a response from service X the past 5 times due to timeouts or errors, then open the circuit breaker for a set amount of time. Future calls to service X from your application will be intercepted, and will fail early. This can avoid hammering a service while it's in an error state, and works well in combination with timeouts and retries of course.
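
A bare-bones sketch of the idea as I understood it (thresholds invented for illustration):

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=60):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to keep the circuit open
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing early")
            self.opened_at = None  # cooldown elapsed; close and retry
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
            raise
        self.failures = 0  # any success resets the count
        return result
```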

I was thinking quite a bit about Ben's redo module during this talk. It's a great module for handling retries!

Diving into the wreck

(abstract; video)

A look into diagnosing performance problems in applications. Some neat tools and techniques introduced here, but I felt he blamed the DB a little too much :)

Wednesday

Magic Wormhole

(abstract; video; slides)

I didn't end up going to this talk, but I did have a chance to chat with Brian beforehand. magic-wormhole is a tool to safely transfer files from one computer to another. Think scp, but without needing ssh keys set up already, or direct network flows. Very neat tool!

Computational Physics

(abstract; video)

How to do planetary orbit simulations in Python. Pretty interesting talk; he introduced me to Feynman, and covered some important characteristics of the simulation methods he presented.
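
For flavor, here's a tiny orbit integration of my own (not the speaker's code), using the Euler-Cromer method in units where GM = 1:

```python
import math

GM = 1.0           # gravitational parameter, in convenient units
dt = 0.001         # time step
x, y = 1.0, 0.0    # start one unit away from the sun
vx, vy = 0.0, 1.0  # the speed for a circular orbit at r = 1

for step in range(10000):
    r = math.hypot(x, y)
    ax, ay = -GM * x / r**3, -GM * y / r**3
    vx += ax * dt  # Euler-Cromer: update velocity first...
    vy += ay * dt
    x += vx * dt   # ...then position, which keeps the orbit stable
    y += vy * dt

print(x, y)  # after ~1.6 orbits, still very close to r = 1
```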

Small batch artisanal bots

(abstract; video)

Hilarious talk about building bots with Python. Definitely worth watching, although unfortunately it's only a partial recording.

Gilectomy

(abstract; video)

The infamous GIL is gone! And your Python programs only run 25x slower!

Larry describes why the GIL was introduced, what it does, and what's involved with removing it. He's actually got a fork of Python with the GIL removed, but performance suffers quite a bit when run without the GIL.

Lars' Keynote

(video)

If you watch only one video from PyCon, watch this. It's just incredible.

Firefox builds on the Taskcluster Index

RIP FTP?

You may have heard rumblings that FTP is going away...


Over the past few quarters we've been working to migrate our infrastructure from the ageing "FTP" [1] system to Amazon S3.

We've maintained some backwards compatibility for the time being [2], so that current Firefox CI and release builds are still available via ftp.mozilla.org, or preferably archive.mozilla.org, since we don't support the ftp protocol any more!

Our long-term plan is to make the builds available via the Taskcluster Index, and stop uploading builds to archive.mozilla.org.

How do I find my builds???


This is a pretty big change, but we really think this will make it easier to find the builds you're looking for.

The Taskcluster Index allows us to attach multiple "routes" to a build job. Think of a route as a kind of hierarchical tag, or directory. Unlike regular directories, a build can be tagged with multiple routes, for example, according to the revision or buildid used.

A great tool for exploring the Taskcluster Index is the Indexed Artifact Browser.

Here are some recent examples of nightly Firefox builds:

The latest win64 nightly Firefox build is available via this route:

gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt

This same build (as of this writing) is also available via its revision:

gecko.v2.mozilla-central.nightly.revision.47b49b0d32360fab04b11ff9120970979c426911.firefox.win64-opt

Or the date:

gecko.v2.mozilla-central.nightly.2015.11.27.latest.firefox.win64-opt

The artifact browser is simply an interface on top of the index API. Using this API, you can also fetch files directly using wget, curl, python requests, etc.:

https://index.taskcluster.net/v1/task/gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe [3]
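
For example, a quick sketch with python requests (the exact artifact filename will change as versions move on):

```python
import requests

url = ("https://index.taskcluster.net/v1/task/"
       "gecko.v2.mozilla-central.nightly.latest.firefox.win64-opt"
       "/artifacts/public/build/firefox-45.0a1.en-US.win64.installer.exe")

# Stream the download to disk rather than buffering it all in memory.
response = requests.get(url, stream=True)
response.raise_for_status()
with open("firefox-installer.exe", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)
```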

Similar routes exist for other platforms, for B2G and mobile, and for opt/debug variations. I encourage you to explore the gecko.v2 namespace, and see if it makes things easier for you to find what you're looking for! [4]

Can't find what you want in the index? Please let us know!

RelEng 2015 (Part 2)

This is the second part of my report on the RelEng 2015 conference that I attended a few weeks ago. Don't forget to read part 1!

I've put up links to the talks whenever I've been able to find them.

Defect Prediction

Davide Giacomo Cavezza gave a presentation about his research into defect prediction in rapidly evolving software. The question he was trying to answer was whether it's possible to predict if any given commit is defective. He compared static models against dynamic models that adapt over the lifetime of a project. The models look at various metrics about the commit, such as:

  • Number of files the commit is changing

  • Entropy of the changes

  • The amount of time since the files being changed were last modified

  • Experience of the developer

His simulations showed that a dynamic model yields superior results to a static model, but that even a dynamic model doesn't give sufficiently accurate results.
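
To make one of those features concrete, here's a rough sketch of my own (not the paper's code) of how the entropy of a commit's changes might be computed:

```python
import math

def change_entropy(lines_changed_per_file):
    """Shannon entropy of how a commit's changes spread across files.

    A commit touching a single file has entropy 0; one spread evenly
    across many files has high entropy, which tends to signal risk.
    """
    total = sum(lines_changed_per_file)
    probs = [n / total for n in lines_changed_per_file if n]
    return -sum(p * math.log2(p) for p in probs)

print(change_entropy([120]))         # 0.0  -- a focused change
print(change_entropy([40, 40, 40]))  # ~1.58 -- a scattered change
```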

I wonder about adopting something like this at Mozilla. Would a warning that a given commit looks particularly risky be a useful signal from automation?

Continuous Deployment and Schema Evolution

Michael de Jong spoke about a technique for coping with SQL schema evolution in continuously deployed applications. There are a few major issues with changing SQL schemas in production, including:

  • schema changes typically block other reads/writes from occurring

  • application logic needs to be synchronized with the schema change. If the schema change takes non-trivial time, which version of the application should be running in the meanwhile?

Michael's solution was to essentially create forks of all tables whenever the schema changes, and to identify each by a unique version. Applications need to specify which version of the schema they're working with. There's a special DB layer that manages the double writes to both versions of the schema that are live at the same time.
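
A crude sketch of the double-write idea (table naming and API invented for illustration):

```python
import sqlite3

LIVE_VERSIONS = ["v1", "v2"]  # schema versions currently deployed

conn = sqlite3.connect(":memory:")
for v in LIVE_VERSIONS:
    conn.execute(
        "CREATE TABLE users_%s (id INTEGER PRIMARY KEY, name TEXT)" % v)

def save_user(conn, user_id, name):
    # The DB layer fans each write out to every live version of the
    # table, so applications pinned to either schema see the change.
    for version in LIVE_VERSIONS:
        conn.execute(
            "INSERT OR REPLACE INTO users_%s (id, name) VALUES (?, ?)"
            % version,
            (user_id, name),
        )
    conn.commit()

save_user(conn, 1, "alice")
```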

Securing a Deployment Pipeline

Paul Rimba spoke about securing deployment pipelines.

One of the areas of focus was on the "build server". He started with the observation that a naive implementation of a Jenkins infrastructure has the master and worker on the same machine, and so the workers have the opportunity to corrupt the state of the master, or of other job types' workspaces. His approach to hardening this part of the pipeline was to isolate each job type into its own docker worker. He also mentioned using a microservices architecture to isolate each task into its own discrete component - easier to secure and reason about.

We've been moving this direction at Mozilla for our build infrastructure as well, so it's good to see our decision corroborated!

Makefile analysis

Shurui Zhou presented some data about extracting configuration data from build systems. Her goal was to determine if all possible configurations of a system were valid.

The major problem here is make (of course!). Expressions such as $(shell uname -r) make static analysis impossible.

We had a very good discussion about build systems following this, and how various organizations have moved themselves away from using make. Somebody described these projects as requiring a large amount of "activation energy". I thought this was a very apt description: lots of upfront cost for little visible change.

However, people unanimously agreed that make was not optimal, and that the benefits of moving to a more modern system were worth the effort.

Yak Shaving

Benjamin Smedberg wrote about shaving yaks a few weeks ago. This is an experience all programmers (and probably most other professions!) encounter regularly, and yet it still baffles and frustrates us!

Just last week I got tangled up in a herd of yaks.

I started the week wanting to deploy the new bundleclone extension that :gps has been working on. If you're not familiar with bundleclone, it's a combination of server- and client-side extensions to mercurial that allow the server to advertise an externally hosted static bundle to clients requesting full clones. hg.mozilla.org advertises bundles hosted on S3 for certain repositories. This can significantly reduce load on the mercurial servers, something that has caused us problems in the past!

With the initial puppet patch written and sanity-checked, I wanted to verify it worked on other platforms before deploying it more widely.

Linux...looks ok. I was able to make some minor tweaks to existing puppet modules to make testing a bit easier as well. Always good to try and leave things in better shape for the next person!

OSX...looks ok...er, wait! All our test jobs are re-cloning tools and mozharness for every job! What's going on? I'm worried that we could be paying for unnecessary S3 bandwidth if we don't fix this.


I find bug 1103700 where we started deploying and relying on machine-wide caches of mozharness and tools. Unfortunately this wasn't done on OSX yet. Ok, time to file a new bug and start a new puppet branch to get these shared repositories working on OSX. The patch is pretty straightforward, and I start testing it out.

Because these tasks run at boot as part of runner, I need to be watching logs on the test machines very early on in the bootup process. Hmmm, the machine seems really sluggish when I'm trying to interact with it right after startup. A quick glance at top shows something fishy:


There's a system daemon called fseventsd taking a substantial amount of system resources! I wonder if this is having any impact on build times. After filing a new bug, I do a few web searches to see if this service can be disabled. Most of the articles I find suggest writing a no_log file into the volume's .fseventsd directory to stop fseventsd from writing out event logs for the volume. Time for another puppet branch!

And that's the current state of this project! I'm currently testing the no_log fix to see if it impacts build times, or if it causes other problems on the machines. I'm hoping to deploy the new runner tasks this week, and switch our buildbot-configs and mozharness configs to make use of them.

And finally, the first yak I started shaving could be the last one I finish shaving; I'm hoping to deploy the bundleclone extension this week as well.

RelEng 2015 (part 1)

Last week, I had the opportunity to attend RelEng 2015 - the 3rd International Workshop of Release Engineering. This was a fantastic conference, and I came away with lots of new ideas for things to try here at Mozilla.

I'd like to share some of my thoughts and notes I took about some of the sessions. As of yet, the speakers' slides aren't collected or linked to from the conference website. Hopefully they'll get them up soon! The program and abstracts are available here.

For your sake (and mine!) I've split up my notes into a few separate posts. This post covers the introduction and keynote.

tl;dr

"Continuous deployment" of web applications is basically a solved problem today. What remains is for organizations to adopt best practices. Mobile/desktop applications remain a challenge.

Cisco relies heavily on collecting and analyzing metrics to inform their approach to software development. Statistically speaking, quality is the best driver of customer satisfaction. There are many aspects to product quality, but new lines of code introduced per release gives a good predictor of how many new bugs will be introduced. It's always challenging to find enough resources to focus on software quality; being able to correlate quality to customer satisfaction (and therefore market share, $$$) is one technique for getting organizational support for shipping high quality software. Common release criteria such as bugs found during testing, or bug fix rate, are used to inform stakeholders as to the quality of the release.

Introductory Session

Bram Adams and Foutse Khomh kicked things off with an overview of "continuous deployment" over the last 5 years. Back in 2009 we were already talking about systems where pushing to version control would trigger tens of thousands of tests, and do canary deployments up to 50 times a day.

Today we see companies like Facebook demonstrating that continuous deployment of web applications is basically a solved problem. Many organizations are still trying to implement these techniques. Mobile [and desktop!] applications still present a challenge.

Keynote

Pete Rotella from Cisco discussed how he and his team measured and predicted release quality for various projects at Cisco. His team is quite focused on data and analytics.

Cisco has relatively long release cycles compared to what we at Mozilla are used to now. They release 2-3 times per year, with each release representing approximately 500kloc of new code. Their customers really like predictable release cycles, and also don't like releases that are too frequent. Many of their customers have their own testing / validation cycles for releases, and so are only willing to update for something they deem critical.

Pete described how he thought software projects had four degrees of freedom in which to operate, and how quality ends up being the one sacrificed most often in order to compensate for constraints in the others:

  • resources (people / money): It's generally hard to hire more people or find room in the budget to meet the increasing demands of customers. You also run into the mythical man month problem by trying to throw more people at a problem.

  • schedule (time): Having standard release cycles means organizations don't usually have a lot of room to push out the schedule so that features can be completed properly.

    I feel that at Mozilla, the rapid release cycle has helped us out to some extent here. The theory is that if your feature isn't ready for the current version, it can wait for the next release which is only 6 weeks behind. However, I do worry that we have too many features trying to get finished off in aurora or even in beta.

  • content (features): Another way to get more room to operate is to cut features. However, it's generally hard to cut content or features, because those are what customers are most interested in.

  • quality: Pete believes this is where most organizations steal resources from to make up for people/schedule/content constraints. It's a poor long-term play, and despite "quality is our top priority" being the Official Party Line, most organizations don't invest enough here. What's working against quality?

    • plethora of releases: lots of projects / products / special requests for releases. Attempts to reduce the # of releases have failed on most occasions.

    • monetization of quality is difficult. Pete suggests putting a dollar figure on the cost of a poor quality release: how many customers will we lose with a buggy release?

    • having RelEng and QA embedded in Engineering teams is a problem; they should be independent organizations so that their recommendations can have more weight.

    • "control point exceptions" are common. e.g. VP overrides recommendations of QA / RelEng and ships the release.

Why should we focus on quality? Pete's metrics show that it's the strongest driver of customer satisfaction. Your product's customer satisfaction needs to be more than 4.3/5 to get more than marginal market share.

How can RelEng improve metrics?

  • simple dashboards

  • actionable metrics - people need to know how to move the needle

  • passive - use existing data. Everybody's stretched thin, so requiring other teams to add more metadata for your metrics isn't going to work.

  • standardized quality metrics across the company

  • informing engineering teams about risk

  • correlation with customer experience.

Interestingly, handling the backlog of bugs has minimal impact on customer satisfaction. In addition, there's substantial risk introduced whenever bugs are fixed late in a release cycle. There's an exponential relationship between new lines of code added and # of defects introduced, and therefore customer satisfaction.

Another good indicator of customer satisfaction is the number of "Customer found defects" - i.e. the number of bugs found and reported by their customers vs. bugs found internally.

Pete's data shows that if they can find more than 80% of the bugs in a release prior to it being shipped, then the remaining bugs are very unlikely to impact customers. He uses the lines of code added and historical bug counts from previous releases to estimate the number of bugs introduced in the current version, given its new lines of code. This 80% figure represents one of their "Release Criteria". If less than 80% of predicted bugs have been found, then the release is considered risky.

Another "Release Criteria" Pete discussed was the weekly rate of fixing bugs. Data shows that good quality releases have the weekly bug fix rate drop to 43% of the maximum rate at the end of the testing cycle. This data demonstrates that changes late in the cycle have a negative impact on software quality. You really want to be fixing fewer and fewer bugs as you get closer to release.

I really enjoyed Pete's talk! There are definitely a lot of things to think about, and how we might apply them at Mozilla.

Release Engineering Sheriffs

Picking up a thread that was being discussed last year about suggested changes to sheriffing... Starting next week (on the 17th), there will be one person from release engineering designated as the RelEng Sheriff for the week. This person will be responsible for various duties, the most important of which is being available in #developers to help the developer sheriff track down issues with build machines, test failures, or other infrastructure problems. For more information, and for the current schedule of RelEng Sheriffs, please see the ReleaseEngineering:Sheriffing wiki page.