3 days of fun: a journey into the bowels of buildbot

chris

2010-10-29 10:00

I've just spent 3 days trying to debug some new code in buildbot. The code in question is to implement a change to how we do nightly builds such that they use the same revision for all platforms. I was hitting a KeyError exception inside buildbot's util.loop code, specifically at a line where it is trying to delete a key from a dictionary. In simple form, the loop is doing this:


for k in d.keys():
    if condition:
        del d[k] # Raises KeyError....sometimes...

Tricky bit was, it didn't happen every time. I'd have to wait at least 3 minutes between attempts. So I added a bunch of debugging code:


print d

print d.keys()

for k in d.keys():
    print k
    if condition:
        try:
            del d[k] # Raises KeyError....sometimes...
        except KeyError:
            print k in d # sanity check 1
            print k in d.keys() # sanity check 2

Can you guess what the results of sanity checks 1 and 2 were? 'k in d' is False, but 'k in d.keys()' is True. whhhaaaaa? Much head scratching and hair pulling ensued. I tried many different variations of iterating through the loop, all with the same result. In the end, I posted a question on Stack Overflow. At the same time, Bear and Dustin were zeroing in on a solution. The crucial bit here is that the keys of d are (follow me here...) methods of instances of my new scheduler classes, which inherit from buildbot.util.ComparableMixin...which implements __cmp__ and __hash__. __cmp__ is used in the 'k in d.keys()' test, but __hash__ is used in the 'k in d' test. Some further digging revealed that my scheduler was modifying state that ComparableMixin.__hash__ was referring to, resulting in the scheduler instances not having stable hashes over time. Meanwhile, on stackoverflow, adw came up with an answer that confirmed what Dustin and Bear were saying, and katrielalex came up with a simple example to reproduce the problem. In the end, the fix was simple, just a few extra lines of python code. Too bad it took me 3 days to figure out!

Comments