
Posts about python (old posts, page 2)

Exporting MQ patches

I've been trying to use Mercurial Queues to manage my work on different tasks in several repositories. I try to name each patch after the number of the bug it's related to; so for my recent work on getting Talos to stop skipping builds, I would call my patch 'bug468731'. I noticed that I was running this series of steps a lot:

cd ~/mozilla/buildbot-configs
hg qdiff > ~/patches/bug468731-buildbot-configs.patch
cd ~/mozilla/buildbotcustom
hg qdiff > ~/patches/bug468731-buildbotcustom.patch

...and then uploading the resulting patch files as attachments to the bug. There's a lot of repetition and extra mental work in those steps:

  • I have to type the bug number manually twice. This is annoying, and error-prone. I've made a typo on more than one occasion and then wasted a few minutes trying to track down where the file went.
  • I have to type the correct repository name for each patch. Again, I've managed to screw this up in the past. Often I have several terminals open, one for each repository, and I can get mixed up as to which repository I've currently got active.
  • mercurial already knows the bug number, since I've used it in the name of my patch.
  • mercurial already knows which repository I'm in.
I wrote the mercurial extension below to help with this. It will take the current patch name, and the basename of the current repository, and save a patch in ~/patches called [patch_name]-[repo_name].patch. It will also compare the current patch to any previous ones in the patches directory, and save a new file if the patches are different, or tell you that you've already saved this patch. To enable this extension, save the code below somewhere like ~/.hgext/mkpatch.py, and then add "mkpatch = ~/.hgext/mkpatch.py" to your .hgrc's extensions section. Then you can run 'hg mkpatch' to automatically create a patch for you in your ~/patches directory!
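
For reference, the .hgrc configuration looks something like this; the [mkpatch] section is optional, and just overrides the default ~/patches location via the 'patchdir' setting that the code below reads with ui.config():

[extensions]
mkpatch = ~/.hgext/mkpatch.py

[mkpatch]
# optional: where to save patches (defaults to ~/patches)
patchdir = ~/patches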

import os, hashlib

from mercurial import commands, util
from hgext import mq


def mkpatch(ui, repo, *pats, **opts):
    """Saves the current patch to a file called <patch_name>-<repo_name>.patch
    in your patch directory (defaults to ~/patches)
    """
    repo_name = os.path.basename(ui.config('paths', 'default'))
    if opts.get('patchdir'):
        patch_dir = opts.get('patchdir')
        del opts['patchdir']
    else:
        patch_dir = os.path.expanduser(ui.config('mkpatch', 'patchdir', "~/patches"))

    ui.pushbuffer()
    mq.top(ui, repo)
    patch_name = ui.popbuffer().strip()

    if not os.path.exists(patch_dir):
        os.makedirs(patch_dir)
    elif not os.path.isdir(patch_dir):
        raise util.Abort("%s is not a directory" % patch_dir)

    ui.pushbuffer()
    mq.diff(ui, repo, *pats, **opts)
    patch_data = ui.popbuffer()
    patch_hash = hashlib.new('sha1', patch_data).digest()

    full_name = os.path.join(patch_dir, "%s-%s.patch" % (patch_name, repo_name))
    i = 0
    while os.path.exists(full_name):
        file_hash = hashlib.new('sha1', open(full_name).read()).digest()
        if file_hash == patch_hash:
            ui.status("Patch is identical to %s; not saving\n" % full_name)
            return
        full_name = os.path.join(patch_dir, "%s-%s.patch.%i" % (patch_name, repo_name, i))
        i += 1

    open(full_name, "w").write(patch_data)
    ui.status("Patch saved to %s\n" % full_name)


mkpatch_options = [
        ("", "patchdir", '', "patch directory"),
        ]
cmdtable = {
    "mkpatch": (mkpatch, mkpatch_options + mq.cmdtable['^qdiff'][1], "hg mkpatch [OPTION]... [FILE]...")
}
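
With the extension enabled, the series of commands at the top of this post collapses to something like the following; the output and the expanded path are illustrative, based on what ui.status() prints above:

$ cd ~/mozilla/buildbot-configs
$ hg mkpatch
Patch saved to /home/catlee/patches/bug468731-buildbot-configs.patch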

Automated Talos Analysis

As part of one of our goals in Release Engineering this quarter, I'm investigating whether we can automatically detect variance in Talos performance data. Automatically detecting these changes in performance results would be a great help to developers and tree sheriffs. Imagine if the Tinderbox tree could be made to burn when a performance regression was detected. There are lots of possibilities if we can get this working: regressions could cause the tree to burn, firebot could spam #developers with information, try-talos data could be compared more easily to the baseline data, or we could automatically back out changes that cause regressions! :P

This is also exciting because it allows us to consider moving towards a pool-o'-slaves model for the Talos machines, just like we have for builds and unittests right now. Having Talos use a pool-o'-slaves allows us to scale to additional project / release branches much more quickly, and allows us to be more flexible in allocating machines across branches. I've spent some time over the past few weeks playing around with data from the graph server, bugging Johnathan, and having fun with flot, and I think I've come up with a workable solution.

How it works

I grab all the data for a test/branch/platform combination, and merge it into a single data series, ordered by buildid (the closest thing we've got right now to being able to sort the data in the same order in which changes landed). Individual data points are classified into one of four buckets:
  • "Good" data. We think these data points are within a certain tolerance of the expected value. Determining what the expected value should be is a bit tricky, so read on!
  • "Spikes". These data points are outside of the specified tolerance, but don't seem to be part of an ongoing problem (yet). Spikes can be caused by having the tolerance set too low, random machine voodoo, or not having enough data to make a definitive call as to whether it's a code regression or a machine problem.
  • "Regressions". When 3 or more data points are outside of the tolerance in the same direction, we assume this is due to a problem with the code, and flag it as a regression.
  • "Machine problem". When the last 2 data points from the same machine have been outside of the tolerance, then we assume this is due to a problem with the machine.
For the purposes of the algorithm (and this post!), a regression is any deviation from the expected value, regardless of whether it's a performance gain or loss. At this point the tolerance criteria are set semi-manually: for each test/branch/platform combination, the tolerance is a certain number of standard deviations. The expected value is then determined by going back through the performance data history and looking for a window of data of a certain size where no point is more than the configured number of standard deviations from the window's average. The expected value can change over time, so we re-calculate it at each point in the graph.
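
To make the four buckets a bit more concrete, here's a rough, self-contained sketch of this kind of classification. It is not the actual talos-grokker code: the (buildid, machine, value) data format, the window size and the threshold are placeholders, and the machine-problem rule is simplified to "the last two out-of-tolerance points came from the same machine".

def mean(values):
    return sum(values) / float(len(values))

def stddev(values):
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / float(len(values))) ** 0.5

def find_expected(history, window=20, threshold=2.5):
    """Walk backwards through history (a list of values) looking for a
    window where every point is within `threshold` standard deviations
    of the window's own average; return (average, stddev), or None."""
    for end in range(len(history), window - 1, -1):
        vals = history[end - window:end]
        m, s = mean(vals), stddev(vals)
        if all(abs(v - m) <= threshold * s for v in vals):
            return m, s
    return None

def classify(points, window=20, threshold=2.5):
    """points is a list of (buildid, machine, value) tuples, ordered by
    buildid.  Yields ((buildid, machine, value), bucket) pairs, where
    bucket is 'good', 'spike', 'regression' or 'machine'."""
    history = []   # values seen so far
    bad_run = []   # consecutive out-of-tolerance points, same direction
    for buildid, machine, value in points:
        expected = find_expected(history, window, threshold)
        if expected is None:
            bucket = 'good'       # not enough history to judge yet
        else:
            m, s = expected
            delta = value - m
            if abs(delta) <= threshold * s:
                bucket = 'good'
                bad_run = []
            else:
                # keep only earlier bad points in the same direction
                bad_run = [p for p in bad_run if (p[2] > 0) == (delta > 0)]
                bad_run.append((buildid, machine, delta))
                if len(bad_run) >= 3:
                    bucket = 'regression'
                elif len(bad_run) == 2 and bad_run[0][1] == bad_run[1][1]:
                    bucket = 'machine'
                else:
                    bucket = 'spike'
        history.append(value)
        yield (buildid, machine, value), bucket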

Initial Results

As an example, here's how data from Linux Tp3 tests on the Mozilla 1.9.2 branch is categorized: [graph: Linux Tp3 data for Mozilla 1.9.2] Or, if you have a canvas-enabled browser, check out this interactive graph. A window size of 20 and a standard deviation threshold of 2.5 were used for this data set. The green line represents all the good data. The orange line (which is mostly hidden by the green line) represents the raw data from the 3 Linux machines running that test. The orange circles represent spikes in the data, red circles represent regressions, and blue circles represent possible machine problems. For the most part we can ignore the spikes; too many spikes probably means we need to tighten our tolerance a bit. There are two periods to take note of on this graph:
  • Jan 12, around noon, a regression was detected. Two orange spike circles are followed by three red regression circles. Recall that we wait for the 3rd data point to confirm an actual regression.
  • Jan 30, around noon, a similar case. Two orange spike circles, followed by regression points.
Although in both cases the regression was actually a win in terms of performance, it shows that the algorithm works. The second regression is due to Alice unthrottling the Talos boxes. In both cases, a new expected value is found after the data levels off again. The analysis also produces some textual output more suitable for e-mail, nagios or irc notification, e.g.:

Regression: Tp3 decrease from 417.974 to 235.778 (43.59%) on Fri Jan 30 11:34:00 2009. Linux 1.9.2 build 20090130083434
http://graphs.mozilla.org/#show=395125,395135,395166&sel=1233236074,1233408874
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7f5292b5b9e2&tochange=f1493cf102b9

My code can be found at http://hg.mozilla.org/users/catlee_mozilla.com/talos-grokker. Patches or comments welcome!

python reload: danger, here be dragons

At Mozilla, we use buildbot to coordinate builds, unit tests, performance tests, and l10n repacks across all of our build slaves. There is a lot of activity on a project the size of Firefox, which means that the build slaves are kept pretty busy most of the time. Unfortunately, like most software out there, our buildbot code has bugs in it.

buildbot provides two ways of picking up new changes to code and configuration: 'buildbot restart' and 'buildbot reconfig'. Restarting buildbot is the cleanest thing to do: it shuts down the existing buildbot process, and starts a new one once the original has shut down cleanly. The problem with restarting is that it interrupts any builds that are currently active. The second option, 'reconfig', is usually a great way to pick up changes to buildbot code without interrupting existing builds. 'reconfig' is implemented by sending SIGHUP to the buildbot process, which triggers a python reload() of certain files.

This is where the problem starts. Reloading a module basically re-initializes the module, including redefining any classes that are in the module...which is what you want, right? The whole reason you're reloading is to pick up changes to the code you have in the module! So let's say you have a module, foo.py, with these classes:


class Foo(object):
    def foo(self):
        print "Foo.foo"


class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        Foo.foo(self)
and you're using it like this:

>>> import foo
>>> b = foo.Bar()
>>> b.foo()
Bar.foo
Foo.foo

Looks good! Now, let's do a reload, which is what buildbot does on a 'reconfig':

>>> reload(foo)
<module 'foo' from '/Users/catlee/test/foo.py'>
>>> b.foo()
Bar.foo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/catlee/test/foo.py", line 13, in foo
    Foo.foo(self)
TypeError: unbound method foo() must be called with Foo instance as first argument (got Bar instance instead)

Whoops! What happened? The TypeError exception is complaining that Foo.foo must be called with an instance of Foo as the first argument. (NB: we're calling the unbound method on the class here, not a bound method on the instance, which is why we need to pass in 'self' as the first argument. This is typical when calling your parent class) But wait! Isn't Bar a sub-class of Foo? And why did this work before? Let's try this again, but let's watch what happens to Foo and Bar this time, using the id() function:

>>> import foo
>>> b = foo.Bar()
>>> id(foo.Bar)
3217664
>>> reload(foo)
<module 'foo' from '/Users/catlee/test/foo.py'>
>>> id(foo.Bar)
3218592

(The id() function returns a unique identifier for objects in python; if two objects have the same id, then they refer to the same object) The id's are different, which means that we get a new Bar class after we reload...I guess that makes sense. Take a look at our b object, which was created before the reload:

>>> b.__class__
<class 'foo.Bar'>
>>> id(b.__class__)
3217664

So b is an instance of the old Bar class, not the new one. Let's look deeper:

>>> b.__class__.__bases__
(<class 'foo.Foo'>,)
>>> id(b.__class__.__bases__[0])
3216336
>>> id(foo.Foo)
3218128

Aha! The old Bar's base class (Foo) is different from what's currently defined in the module. After we reloaded the foo module, the Foo class was redefined, which is presumably what we want. The unfortunate side effect is that any reference by name to the class 'Foo' will pick up the new Foo class, including code in methods of subclasses. There are probably other places where this has unexpected results, but for us, this is the biggest problem: reloading essentially breaks class inheritance for objects whose lifetime spans the reload. Using super() in the normal way doesn't even work, since you usually refer to your instance's class by name:

class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        super(Bar, self).foo()
If you're using new-style classes, it looks like you can get around this by looking at your __class__ attribute:

class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        super(self.__class__, self).foo()
Buildbot isn't using new-style classes...yet...so we can't use super(). Another workaround I'm playing around with is to use the inspect module to get at the class hierarchy:

def get_parent(obj, n=1):
    import inspect
    return inspect.getmro(obj.__class__)[n]


class Bar(Foo):
    def foo(self):
        print "Bar.foo"
        get_parent(self).foo(self)
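
As a quick sanity check of that workaround, a session like the following should survive a reload, since inspect.getmro() walks the old instance's actual base classes instead of looking the name 'Foo' up in the reloaded module (the output shown is what I'd expect, assuming foo.py now contains the get_parent version of Bar):

>>> import foo
>>> b = foo.Bar()
>>> reload(foo)
<module 'foo' from '/Users/catlee/test/foo.py'>
>>> b.foo()
Bar.foo
Foo.foo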

Announcing poster 0.1

I've just uploaded the first public release of poster to my website, and to the cheeseshop. I wrote poster to scratch an itch I've had with Python's standard library: it's hard to do HTTP file uploads. There are two reasons for this: the standard library doesn't provide a way to do multipart/form-data encoding, and there's no way to stream an upload to the remote server; you have to build the entire request in memory before sending it. poster addresses both of these issues. The poster.encode module provides multipart/form-data encoding, and the poster.streaminghttp module provides streaming http request support. Here's an example of how you might use it:


# test_client.py
from poster.encode import multipart_encode
from poster.streaminghttp import register_openers
import urllib2

# Register the streaming http handlers with urllib2
register_openers()

# Start the multipart/form-data encoding of the file "DSC0001.jpg"
# "image1" is the name of the parameter, which is normally set
# via the "name" parameter of the HTML <input> tag.

# headers contains the necessary Content-Type and Content-Length
# datagen is a generator object that yields the encoded parameters
datagen, headers = multipart_encode({"image1": open("DSC0001.jpg")})

# Create the Request object
request = urllib2.Request("http://localhost:5000/upload_image", datagen, headers)

# Actually do the request, and get the response
print urllib2.urlopen(request).read()

Download it as a tarball or egg for python 2.5, or easy_install it from cheeseshop. Bugs, patches, comments or complaints are welcome!

Validating credit card numbers in python

For various reasons I've needed to validate some credit card numbers in Python. For future reference, here's what I've come up with:


import re

def validate_cc(s):
    """
    Returns True if the credit card number ``s`` is valid,
    False otherwise.

    Returning True doesn't imply that a card with this number has ever been,
    or ever will be issued.

    Currently supports Visa, Mastercard, American Express, Discover
    and Diners Club cards.

    >>> validate_cc("4111-1111-1111-1111")
    True
    >>> validate_cc("4111 1111 1111 1112")
    False
    >>> validate_cc("5105105105105100")
    True
    >>> validate_cc(5105105105105100)
    True
    """
    # Strip out any non-digits
    # Jeff Lait for Prime Minister!
    s = re.sub("[^0-9]", "", str(s))
    regexps = [
            r"^4\d{15}$",          # Visa
            r"^5[1-5]\d{14}$",     # Mastercard
            r"^3[47]\d{13}$",      # American Express
            r"^3[068]\d{12}$",     # Diners Club
            r"^6011\d{12}$",       # Discover
            ]

    if not any(re.match(r, s) for r in regexps):
        return False

    # Luhn check: double every second digit, starting from the right;
    # subtract 9 from any doubled value greater than 9, then sum everything
    chksum = 0
    x = len(s) % 2

    for i, c in enumerate(s):
        j = int(c)
        if i % 2 == x:
            k = j*2
            if k >= 10:
                k -= 9
            chksum += k
        else:
            chksum += j

    return chksum % 10 == 0
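
Since the docstring doubles as a set of doctests, a small (hypothetical) addition at the bottom of the module would let you run them directly; the module filename here is just an assumption:

if __name__ == "__main__":
    # run the doctests above with: python validate_cc.py -v
    import doctest
    doctest.testmod()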

Getting free diskspace in python

To calculate the amount of free disk space in Python, you can use the os.statvfs() function. For some reason, I can never find the docs for os.statvfs() on the first or second try (it's in the "Files and Directories" section of the os module documentation), and I never remember how it works, so I'm posting this as a note to myself, and maybe to help out anybody else wanting to do the same thing. A simple free space function can be written as:

import os

def freespace(p):
    """
    Returns the number of free bytes on the drive that ``p`` is on
    """
    s = os.statvfs(p)
    return s.f_bsize * s.f_bavail
I use the f_bavail attribute instead of f_bfree, since the latter includes blocks that are reserved for the super-user's use. I'm not sure, however, about the distinction between f_bsize and f_frsize.
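
For what it's worth, here's the sort of one-liner I end up using with it; the path and the gigabyte conversion are just for illustration (os.statvfs, and therefore this function, is only available on Unix-like platforms):

# report free space on the root filesystem, in gigabytes
print "%.2f GB free" % (freespace("/") / float(1024 ** 3))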

Python Warts, part 2 - the infamous GIL

I'm going to come out and say that the global interpreter lock (GIL) in Python bothers me. For those who don't know, the GIL in the C implementation of Python allows only one thread to be running Python code at any one time. Extension modules executing C/C++ (non-Python) code can release the GIL so that other threads can run, but this doesn't apply in general to regular Python code.

Ian Bicking posted a while back about the GIL of Doom. Granted, his post was originally written in October 2003, so things have changed a bit since then. I believe the main thrust of his argument was that there are only a few cases where the GIL would really get in your way: basically, where you are doing some CPU-bound task that isn't easily separated into separate processes, and is running on a multi-processor machine. The way to get around the GIL in Python is to split up your application into separate processes, and use some kind of inter-process communication (IPC) mechanism to transfer work/results between processes. The message seems to be: "You don't really want to use a shared address space threading model, do you? I'm sure you'd much rather just use a separate process. Everybody knows that's better." Suggestions such as calling time.sleep(0), or fiddling around with sys.setcheckinterval, are hacky, and clutter up your code for no good reason other than to work around deficiencies of the interpreter.

Yes, sharing an address space with multiple threads of execution can be tricky. But IPC is no picnic either. Starting a new process can be expensive. os.fork() isn't available on all platforms. There is, AFAIK, no portable shared memory module for Python (POSH seems to be dead?), so to send data between processes you need to set up a socket, pipe or use temporary files, leading to extra code (more to write, more to read and understand afterwards), setup overhead (system calls aren't free), performance impact (serializing data isn't free), and room for buggy implementations (did you clean up your temporary files? did you close your socket? did you set restrictive permissions on your socket file?). In many ways, threading is much simpler: it's simple to set up, has low overhead, no data copying costs, and is self-contained in the process, so you're not leaking out-of-process resources (socket files, bound addresses, temporary files, etc.).

Python has never been the type of language to prevent the developer from doing "unsafe" things; that's why there aren't really private members on classes. Ian Bicking again writes (in a different post),

An important rule in the Python community is: we are all consenting adults. That is, it is not the responsibility of the language designer or library author to keep people from doing bad things. It is their responsibility to prevent people doing bad things accidentally. But if you really want to do something bad, who are we to say you are wrong? It's your program. Maybe you even have a good reason.
Python's GIL is getting in my way. Yes I can do bad things with multiple threads sharing one address space, but that should be my problem, not a restriction of the language implementation. With multi-core CPUs becoming more and more common, and not only in the server domain, I think this will become more and more of an issue for Python. In the short term, some slick IPC would be nice, but in the long term a truly multithreaded Python interpreter would benefit everybody. Talk is cheap, I know...code is what counts here. Maybe a PEP or SIG could be started to flesh out what would be required to get this accomplished for Python 3000.

Rebooting linux faster with kexec (and even faster with kexec-chooser!)

Somehow, while reading through the Linux 2.6.17 changelog last week, I came across a few articles discussing the kexec feature of recent Linux kernels. It's pretty neat: you can boot directly into another kernel image without having to go through a hardware / BIOS reboot. There's a Debian package called kexec-tools which gives you the ability to load these kernel images into memory and to boot into them. I found kexec a bit cumbersome to use, especially since all the kernels I care about booting into are the stock Debian kernels, and they all ship with ramdisk images that need to be used properly to boot. Using kexec by itself also requires that you manually bring the machine into a rebootable state first, or hack up some system scripts; you shouldn't just boot into a new kernel directly without shutting down devices, unmounting file systems, etc.

So to scratch this itch, I wrote kexec-chooser. It's a small Python script that allows you to easily warm-reboot into any of the stock Debian kernels installed on your system. It'll probably work with custom kernels as well, but I haven't tested that yet :) Downloads and more information can be found on the kexec-chooser page.