Skip to main content

Posts about linux

Managing dotfiles with dotbot

We have an amazing development tool at Shopify called spin. Spin lets you create brand new cloud-based development environments in just a few seconds, right from your CLI. These environments are meant to be somewhat short-lived; I often spin one up just for the day to work on a PR, and then destroy it after my PR is merged.

As a longtime vim and linux user, I want to make sure that all my configuration files (aka "dotfiles") are made available automatically on any new environments that I'm setting up.

There are many different ways of managing your dotfiles. I started my journey with a "simple" shell script that tried to set up .vimrc, .zshrc, etc. Inevitably this script became overly complex, and was too cumbersome to maintain.

I briefly toyed around with ansible for managing my local system, but I found it to be overkill just managing dotfiles on remote systems.

Finally I discovered dotbot, which has a few features that I like:

  • No dependencies
  • Add into your dotfiles repo via a git submodule
  • Supports most of the stuff I need, without being too complicated
  • Configuration is easy to read & write
  • Safe to re-run

In particular, to set up neovim, I have something like this in my dotbot configuration:

# install.conf.yaml
---
- defaults:
    link:
      relink: true
      force: true

- link:
    # This line symlinks the astronvim repo from the dotfiles checkout 
    # into ~/.config/nvim
    ~/.config/nvim: vim/astronvim
    # Link in our user configuration for astronvim
    ~/.config/nvim/lua/user/init.lua:
      path: vim/user_init.lua

- shell:
    # Install all neovim plugins with packer
    - nvim --headless -c 'autocmd User PackerComplete quitall' -c PackerSync

Since setting up my dotfiles with dotbot, I've discovered that I'm much more willing to experiment with different configurations and tools. It's easy to blow away local directories, or git restore in my dotfiles repo, and then re-run the install script.

One small tweak I made to the generic install script is to split up configurations by operating system and environment. I have different configurations for Linux, macOS, and spin. I also have the install script send logs to the home directory, which makes it easier to debug when something has gone wrong. Adding exec &> >(tee -a $HOME/.dotfile-install.log) at the top of the install script copies the script output to a log file in my home directory, as well as outputting to stdout/stderr.

My dotfiles can be found on github.

Investigating hg performance

(caveat lector: this is a long post with lots of shell snippets and output; it's mostly a brain dump of what I did to investigate performance issues on hg.mozilla.org. I hope you find it useful. Scroll to the bottom for the summary.) Everybody knows that pushing to try can be slow. but why? while waiting for my push to try to complete, I wondered what exactly was slow. I started by cloning my own version of try:

$ hg clone http://hg.mozilla.org try

destination directory: try

requesting all changes

adding changesets

adding manifests

adding file changes

added 95917 changesets with 447521 changes to 89564 files (+2446 heads)

updating to branch default

53650 files updated, 0 files merged, 0 files removed, 0 files unresolved

Next I instrumented hg so I could get some profile information:

$ sudo vi /usr/local/bin/hg

python -m cProfile -o /tmp/hg.profile /usr/bin/hg $*

Then I timed out long it took me to check what would be pushed:

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.57s user 0.04s system 54% cpu 1.114 total

That's not too bad. Let's check our profile:

import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:25:02 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 0.593 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    0.462    0.036    0.462    0.036 {method 'readline' of 'file' objects}
        1    0.039    0.039    0.039    0.039 {mercurial.parsers.parse_index2}
       40    0.031    0.001    0.031    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.009    0.000    0.019    0.000 {__import__}
     6326    0.004    0.000    0.006    0.000 cmdutil.py:15(parsealiases)
       13    0.003    0.000    0.003    0.000 {method 'read' of 'file' objects}
       93    0.002    0.000    0.008    0.000 cmdutil.py:18(findpossible)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}
  392/313    0.001    0.000    0.007    0.000 demandimport.py:92(_demandimport)
The top item is readline() on file objects? I wonder if that's socket operations. I'm ssh'ing to localhost, so it's really fast. Let's add 100ms latency:

$ sudo tc qdisc add dev lo root handle 1:0 netem delay 100ms

$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.58s user 0.05s system 14% cpu 4.339 total


import pstats

pstats.Stats("/tmp/hg.profile").strip_dirs().sort_stats('time').print_stats(10)

Fri Dec  9 00:42:09 2011    /tmp/hg.profile


         38744 function calls (37761 primitive calls) in 2.728 seconds

   Ordered by: internal time
   List reduced from 476 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       13    2.583    0.199    2.583    0.199 {method 'readline' of 'file' objects}
        1    0.054    0.054    0.054    0.054 {mercurial.parsers.parse_index2}
       40    0.028    0.001    0.028    0.001 revlog.py:291(rev)
        1    0.019    0.019    0.019    0.019 revlog.py:622(headrevs)
   177/70    0.010    0.000    0.019    0.000 {__import__}
       13    0.006    0.000    0.006    0.000 {method 'read' of 'file' objects}
     6326    0.002    0.000    0.004    0.000 cmdutil.py:15(parsealiases)
       93    0.002    0.000    0.006    0.000 cmdutil.py:18(findpossible)
  392/313    0.002    0.000    0.008    0.000 demandimport.py:92(_demandimport)
     7212    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}
Yep, definitely getting worse with more latency on the network connection. Oh, and I'm using a recent version of hg:

$ hg --version

Mercurial Distributed SCM (version 2.0)



$ echo hello | ssh localhost hg -R /home/catlee/mozilla/try serve --stdio

145

capabilities: lookup changegroupsubset branchmap pushkey known getbundle unbundlehash batch stream unbundle=HG10GZ,HG10BZ,HG10UN httpheader=1024

This doesn't match what hg.mozilla.org is running:

$ echo hello | ssh hg.mozilla.org hg -R /mozilla-central serve --stdio  

67

capabilities: unbundle lookup changegroupsubset branchmap stream=1

So it must be using an older version. Let's see what mercurial 1.6 does:

$ mkvirtualenv hg16

New python executable in hg16/bin/python

Installing setuptools...



(hg16)$ pip install mercurial==1.6

Downloading/unpacking mercurial==1.6
  Downloading mercurial-1.6.tar.gz (2.2Mb): 2.2Mb downloaded
...



(hg16)$ hg --version

Mercurial Distributed SCM (version 1.6)



(hg16)$ echo hello | ssh localhost /home/catlee/.virtualenvs/hg16/bin/hg -R /home/catlee/mozilla/mozilla-central serve --stdio

75

capabilities: unbundle lookup changegroupsubset branchmap pushkey stream=1

That looks pretty close to what hg.mozilla.org claims it supports, so let's time 'hg out' again:

(hg16)$ time hg out ssh://localhost//home/catlee/mozilla/try

hg out ssh://localhost//home/catlee/mozilla/try  0.73s user 0.04s system 3% cpu 24.278 total

tl;dr

Finding missing changesets between two local repositories is 6x slower with hg 1.6 (4 seconds with hg 2.0 to 24 seconds hg 1.6). Add a few hundred people and machines hitting the same repository at the same time, and I imagine things can get bad pretty quickly. Some further searching reveals that mercurial does support a faster method of finding missing changesets in "newer" versions, although I can't figure out exactly when this change was introduced. There's already a bug on file for upgrading mercurial on hg.mozilla.org, so hopefully that improves the situation for pushes to try. The tools we use everyday aren't magical; they're subject to normal debugging and profiling techniques. If a tool you're using is holding you back, find out why!

Linux on a new Thinkpad T510

I got a new Thinkpad T510 at work to replace my aging MacBook Pro. I asked for a Thinkpad instead of another MacBook because I wanted hardware with better hardware support, in particular the trackpad. I got into the habit of bringing a USB mouse everywhere I went because the trackpad on the MacBook was so unreliable on linux. So when my new T510 arrived, I was pretty excited. And, except for one tiny problem (of the PEBKAC kind), transferring all my files from the old machine to the new one went flawlessly. Here's how I set up the new machine:

  • Download image from sysresccd.org. Follow the instructions to make a bootable USB drive.
  • Boot up computer off USB drive. Resize the existing NTFS partition to be really small. Add 2 new partitions in the new-free space: one for the boot partition for linux, and one to be encrypted and be formatted with lvm.
  • Format boot partition as ext3. Setup encrypted partition with 'cryptsetup luksFormat /dev/sda6; cryptsetup luksOpen /dev/sda6 crypt_sda6'. Setup LVM with 'pvcreate /dev/mapper/crypt_sda6'. Setup two volumes, one for swap, and one for the root partition.
  • Connect network cable between old laptop and new one. Configure local network.
  • Copy files from old /boot to new /boot.
  • Copy files from old / to new /. Here's where I messed up. My command was: 'rsync -aPxX 192.168.2.1:/ /target/'.
  • Install grub.
  • Reboot!
At this point the machine came up ok, but wasn't prompting to decrypt my root drive, and so I had to do some manual steps to get the root drive to mount initially. Fixing up /etc/crypttab and the initramfs solved this. However even after this I was having some problems. I couldn't connect to wireless networks with Network Manager. I couldn't run gnome-power-manager. Files in /var/lib/mysql were owned by ntp! Then I realized that my initial rsync had copied over files preserving the user/group names, not the uid/gid values. And since I wasn't booting off a Debian image, the id/name mappings were quite different. Re-running rsync with '--numeric-ids' got all the ownerships fixed up. After the next reboot things were working flawlessly. Now after a few weeks of using it, I'm enjoying it a lot more than my MacBook Pro. It boots up faster. It connects to wireless networks faster. It suspends/unsuspends faster. It's got real, live, page-up/page-down keys! The trackpad actually works!

poster 0.7.0 released!

I've just pushed poster 0.7.0 to the cheeseshop. Thanks again to everybody who sent in bug reports, and for letting me know how you're using poster! It's really great to hear from users. poster 0.7.0 fixes a few problems with 0.6.0, most notably:

  • Added callback parameters to MutipartParam and multipart_encode so you can add progress indicators to your applications. Thanks to Ludvig Ericson for the suggestion.
  • Fixed a bug where posting to a url that returned a 401 code would hang. Thanks to Patrick Guido and Andreas Loupasakis for the bug reports.
  • MultipartParam.from_params will now accept MultipartParam instances as the values of a dict object passed in. The parameter name must match the key corresponding to the parameter in the dict. Thanks to Matthew King for the suggestion.
  • poster now works under python2.7
poster 0.7.0 can be downloaded from the cheeseshop, or from my website. Documentation can be found at https://atlee.ca/software/poster/ I'm planning on looking at python 3 compatibility soon. Also, if anybody has suggestions on a reliable way to test the streaming http code, I'm open to suggestions! My current methods result in intermittent failures because of the test harness I suspect. poster's code is now available on bitbucket.

poster 0.6.0 released

I've just pushed poster 0.6.0 to the cheeseshop. Thanks again to everybody who sent in bug reports, and for letting me know how you're using poster! It's really great to hear from users. poster 0.6.0 fixes a few problems with 0.5, most notably:

  • Documentation updates to clarify some common use cases.
  • Added a poster.version attribute. Thanks to JP!
  • Fix for unicode filenames. Thanks to Zed Shaw.
  • Handle StringIO file objects. Thanks to Christophe Combelles.
poster 0.6.0 can be downloaded from the cheeseshop, or from my website. Documentation can be found at https://atlee.ca/software/poster/

One useful script, a linux version

Johnathan posted links to 3 scripts he finds useful. His sattap script looked handy, so I hacked it up for linux. Run it to do a screen capture, and upload the image to a website you have ssh access into. The link is printed out, and put into the clipboard. Hope you find this useful!


#!/bin/sh

# sattap - Send a thing to a place

set -e



SCP_USER='catlee'

SCP_HOST='people.mozilla.org'

SCP_PATH='~/public_html/sattap/'



HTTP_URL="http://people.mozilla.org/~catlee/sattap/"



FILENAME=`date | md5sum | head -c 8`.png

FILEPATH=/tmp/$FILENAME



echo Capturing...

import $FILEPATH

echo Copying to $SCP_HOST

scp $FILEPATH ${SCP_USER}@${SCP_HOST}:$SCP_PATH

echo Deleting local copy

rm $FILEPATH



echo $HTTP_URL$FILENAME | xclip -selection clipboard

echo Your file should be at $HTTP_URL$FILENAME, which is also in your paste buffer

Profiling Buildbot

Buildbot is a critical part of our build infrastructure at Mozilla. We use it to manage builds on 5 different platforms (Linux, Mac, Windows, Maemo and Windows Mobile), and 5 different branches (mozilla-1.9.1, mozilla-central, TraceMonkey, Electrolysis, and Places). All in all we have 80 machines doing builds across 150 different build types (not counting Talos; all the Talos test runs and slaves are managed by a different master). And buildbot is at the center of it all. The load on our machine running buildbot is normally fairly high, and occasionally spikes so that the buildbot process is unresponsive. It normally restores itself within a few minutes, but I'd really like to know why it's doing this! Running our staging buildbot master with python's cProfile module for almost two hours yields the following profile:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   416377 4771.188    0.011 4796.749    0.012 {select.select}
       52  526.891   10.133  651.043   12.520 /tools/buildbot/lib/python2.5/site-packages/buildbot-0.7.10p1-py2.5.egg/buildbot/status/web/waterfall.py:834(phase2)
     6518  355.370    0.055  355.370    0.055 {posix.fsync}
   232582  238.943    0.001 1112.039    0.005 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:150(dataReceived)
 10089681  104.395    0.000  130.089    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:36(b1282int)
36798140/36797962   83.536    0.000   83.537    0.000 {len}
 29913653   70.458    0.000   70.458    0.000 {method 'append' of 'list' objects}
      311   63.775    0.205   63.775    0.205 {bz2.compress}
 10088987   56.581    0.000  665.982    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:141(gotItem)
4010792/1014652   56.079    0.000  176.693    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/jelly.py:639(unjelly)

2343910/512709   47.954    0.000  112.446    0.000 /tools/twisted-8.0.1/lib/python2.5/site-packages/twisted/spread/banana.py:281(_encode)

Interpreting the results

select shows up in the profile because we're profiling wall clock time, not cpu time. So the more time we're spending in select, the better, since that means we're just waiting for data. The overall run time for this profile was 7,532 seconds, so select is taking around 63% of our total time. I believe the more time spent here, the better. Time spent inside select is idle time. We already knew that the buildbot waterfall was slow (the second line in profile). fsync isn't too surprising either. buildbot calls fsync after writing log files to disk. We've considered removing this call, and this profile lends support to our original guess. The next entries really surprised me, twisted's dataReceived and a decoding function, b1282int. These are called when processing data received from the slaves. If I'm reading this correctly, this means that dataReceived and children account for around 40% of our total time after you remove the time spent in select. 1112 / (7532-4796) = 40%. These results are from our staging buildbot master, which doesn't have anywhere near the same load as the production buildbot master. I would expect that the time spent waiting in select would go down on the production master (there's more data being received, more often), and that time spent in fsync and dataReceived would go up.

What to do about it?

A few ideas....
  • Use psyco to get some JIT goodness applied to some of the slower python functions.
  • Remove the fsync call after saving logs.
  • Use the cpu-time to profile rather than wallclock time. This will give a different perspective on the performance of buildbot, which should give better information about where we're spending time processing data.
  • Implement slow pieces in C (or cython). Twisted's Banana library looks do-able in C, and also is high up in the profile.
  • Send less data from the slaves. We're currently logging all stdout/stderr produced by the slaves. All of this data is processed by the master process and then saved to disk.
  • Rearchitect buildbot to handle this kind of load.
  • Have more than one buildbot master, each one handling fewer slaves. We're actively looking into this approach, since it also allows us to have some redundancy for this critical piece of our infrastructure.

poster 0.4 released

I'm happy to announce the release of poster version 0.4. This is a bug fix release, which fixes problems when trying to use poster over a secure connection (with https). I've also reworked some of the code so that it can hopefully work with python 2.4. It passes all the unit tests that I have under python 2.4 now, but since I don't normally use python 2.4, I'd be interested to hear other people's experience using it. One of the things that I love about working on poster, and about open source software in general, is hearing from users all over the world who have found it helpful in some way. It's always encouraging to hear about how poster is being used, so thank you to all who have e-mailed me! poster can be downloaded from my website, or from the cheeseshop. As always, bug reports, comments, and questions are always welcome.

ssh on-the-fly port forwarding

Check out this great tip from nion's blog: ssh on-the-fly port forwarding. I've often wanted to open up new port forwards, but haven't wanted to shut down my existing session.

If you follow this by # character (and thus type ~#) you get a list of all forwarded connections. Using ~C you can open an internal ssh shell that enables you to add and remove local/remote port forwardings ssh> help Commands: -L[bind_address:]port:host:hostport Request local forward -R[bind_address:]port:host:hostport Request remote forward -KR[bind_address:]port Cancel remote forward ssh> -L 8080:localhost:8080