Posts about yaks

Packing bits with Rust & Ruby

The missing C of CHD

One element of the CHD (compress-hash-displace) algorithm that I didn't implement in my previous post was the "compress" part.

This algorithm generates an auxiliary table of seeds that are used to prevent hash collisions in the data set. These seeds need to be encoded somehow and transmitted along with the rest of the data in order to perform lookups later on. The number of seeds (called r in the algorithm) is usually proportional to the number of elements in the input. A larger r makes it easier to find seeds that avoid collisions, and therefore faster to compute the perfect hash; a smaller r gives a more compact data structure at the expense of more compute up-front.

Packing seeds

Seeds are generally tried starting from 0, and typically don't end up being very large. Encoding these values as a basic array of 8/16/32-bit integers is a waste of space.

[Image: lots of zeros]

I wanted to improve on my implementation of efficient encoding of hashes by doing some simple bit packing of the seeds.

The basic idea is that for a set of integers, we find the maximum value, and therefore the maximum number of bits (b) needed to represent that value. We can then encode all the integers using b bits instead of a fixed number of bits.

[Image: fewer zeros]
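
To make the idea concrete, here's a minimal hand-rolled sketch of that scheme: scan the seeds for the widest value, then write each one out using only that many bits. The pack_seeds helper is just an illustration for this post, not the code I actually shipped, and a real encoder would handle blocks and alignment more carefully.

/// Pack each value using only as many bits as the largest value needs.
/// A naive illustration, not optimized.
fn pack_seeds(seeds: &[u32]) -> (u8, Vec<u8>) {
    // Bits needed for the largest seed (at least 1, so an all-zero input still works).
    let max = seeds.iter().copied().max().unwrap_or(0);
    let bits = (32 - max.leading_zeros()).max(1) as u8;

    let mut out = vec![0u8; (seeds.len() * bits as usize + 7) / 8];
    for (i, &seed) in seeds.iter().enumerate() {
        // Each seed occupies bit positions [i * bits, (i + 1) * bits), LSB first.
        let mut bit_pos = i * bits as usize;
        for b in 0..bits {
            if seed & (1 << b) != 0 {
                out[bit_pos / 8] |= 1 << (bit_pos % 8);
            }
            bit_pos += 1;
        }
    }
    (bits, out)
}

fn main() {
    let seeds: Vec<u32> = vec![0, 3, 1, 0, 7, 2, 0, 5];
    let (bits, packed) = pack_seeds(&seeds);
    // 8 seeds * 3 bits = 24 bits = 3 bytes, instead of 32 bytes as u32s.
    println!("{} bits per seed, {} bytes total", bits, packed.len());
}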

There's a Rust crate bitpacking that does exactly this! And it runs super duper fast, assuming that you can arrange your data into groups of 32/128/256 integers. The API is really simple to use as well:

use bitpacking::{BitPacker, BitPacker4x};

fn main() {
    // 128 values in the range 0..8, so each one fits in 3 bits.
    let data: Vec<u32> = (0..128).map(|i| i % 8).collect();
    let packer = BitPacker4x::new();
    // Number of bits needed for the largest value in the block.
    let num_bits = packer.num_bits(&data);
    // Worst-case output size: 128 values * 4 bytes each.
    let mut compressed = vec![0u8; 4 * BitPacker4x::BLOCK_LEN];
    let len = packer.compress(&data, compressed.as_mut_slice(), num_bits);
    compressed.truncate(len);

    println!("Compressed data: {:?}", compressed);
}
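
Decompression is the mirror image. Here's a quick round-trip sketch using the same crate, just to verify that unpacking gives back the original values:

use bitpacking::{BitPacker, BitPacker4x};

fn main() {
    let data: Vec<u32> = (0..128).map(|i| i % 8).collect();
    let packer = BitPacker4x::new();
    let num_bits = packer.num_bits(&data);

    let mut compressed = vec![0u8; 4 * BitPacker4x::BLOCK_LEN];
    let len = packer.compress(&data, compressed.as_mut_slice(), num_bits);
    compressed.truncate(len);

    // Unpack back into a full block of 128 u32s and check the round trip.
    let mut decompressed = vec![0u32; BitPacker4x::BLOCK_LEN];
    packer.decompress(&compressed, decompressed.as_mut_slice(), num_bits);
    assert_eq!(decompressed, data);
}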

Bridging the gap between Rust & Ruby

I wanted to use this from Ruby code though...time to bust out magnus!

Magnus is a crate which makes it really easy to write Ruby extensions using Rust. It takes care of most of the heavy lifting of converting to/from Ruby & Rust types.

use bitpacking::BitPacker;
use magnus::{RString, Ruby};

#[magnus::wrap(class = "BitPacking::BitPacker4x")]
struct BitPacker4x(bitpacking::BitPacker4x);

impl BitPacker4x {
  const BLOCK_LEN: usize = bitpacking::BitPacker4x::BLOCK_LEN; // 128 integers per block

  // ...
  fn compress(
      ruby: &Ruby,
      rb_self: &Self,
      decompressed: Vec<u32>,
      num_bits: u8,
  ) -> RString {
      let mut compressed = vec![0u8; 4 * Self::BLOCK_LEN];
      let len = rb_self.0 // refers to underlying bitpacking::BitPacker4x struct
          .compress(&decompressed, compressed.as_mut_slice(), num_bits);
      compressed.truncate(len);
      ruby.str_from_slice(compressed.as_slice())
  }
}
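
The wrapper class also has to be registered with Ruby in the extension's init function. The real gem's setup may differ; this is just a rough sketch of what that registration could look like with magnus, assuming the wrapper also has new and num_bits methods (part of what's elided above):

use magnus::{function, method, prelude::*, Error, Ruby};

#[magnus::init]
fn init(ruby: &Ruby) -> Result<(), Error> {
    // Define BitPacking::BitPacker4x and hook up the wrapper's methods.
    let module = ruby.define_module("BitPacking")?;
    let class = module.define_class("BitPacker4x", ruby.class_object())?;
    class.define_singleton_method("new", function!(BitPacker4x::new, 0))?;
    // Arity counts only the Ruby-visible arguments.
    class.define_method("num_bits", method!(BitPacker4x::num_bits, 1))?;
    class.define_method("compress", method!(BitPacker4x::compress, 2))?;
    Ok(())
}

With something like that in place, requiring the compiled extension exposes the BitPacking::BitPacker4x class used below.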

This lets me write Ruby code like this:

data = 128.times.map { |i| i % 8 }
packer = BitPacking::BitPacker4x.new
num_bits = packer.num_bits(data)
compressed = packer.compress(data, num_bits)

Here we have these 128 integers represented in 48 bytes (128 × 3 bits = 384 bits), or 3 bits per integer instead of the 32 we started with.

BitPacking gem

I've packaged this up into the bitpacking gem.

I hope you find this useful!

Yak Shaving

Benjamin Smedberg wrote about shaving yaks a few weeks ago. This is an experience all programmers (and probably most other professions!) encounter regularly, and yet it still baffles and frustrates us!

Just last week I got tangled up in a herd of yaks.

I started the week wanting to deploy the new bundleclone extension that :gps has been working on. If you're not familiar with bundleclone, it's a combination of server- and client-side extensions to Mercurial that allow the server to advertise an externally hosted static bundle to clients requesting full clones. hg.mozilla.org advertises bundles hosted on S3 for certain repositories. This can significantly reduce load on the Mercurial servers, something that has caused us problems in the past!

With the initial puppet patch written and sanity-checked, I wanted to verify it worked on other platforms before deploying it more widely.

Linux...looks ok. I was able to make some minor tweaks to existing puppet modules to make testing a bit easier as well. Always good to try and leave things in better shape for the next person!

OSX...looks ok...er, wait! All our test jobs are re-cloning tools and mozharness for every job! What's going on? I'm worried that we could be paying for unnecessary S3 bandwidth if we don't fix this.

[Image: back to Bugzilla]

I find bug 1103700 where we started deploying and relying on machine-wide caches of mozharness and tools. Unfortunately this wasn't done on OSX yet. Ok, time to file a new bug and start a new puppet branch to get these shared repositories working on OSX. The patch is pretty straightforward, and I start testing it out.

Because these tasks run at boot as part of runner, I need to be watching logs on the test machines very early on in the bootup process. Hmmm, the machine seems really sluggish when I'm trying to interact with it right after startup. A quick glance at top shows something fishy:

[Image: top output showing fseventsd]

There's a system daemon called fseventsd taking a substantial amount of system resources! I wonder if this is having any impact on build times. After filing a new bug, I do a few web searches to see if this service can be disabled. Most of the articles I find suggest writing a no_log file into the volume's .fseventsd directory to stop fseventsd from writing out event logs for the volume. Time for another puppet branch!

And that's the current state of this project! I'm currently testing the no_log fix to see if it impacts build times, or if it causes other problems on the machines. I'm hoping to deploy the new runner tasks this week, and switch our buildbot-configs and mozharness configs to make use of them.

And finally, the first yak I started shaving could be the last one I finish shaving; I'm hoping to deploy the bundleclone extension this week as well.