GCS Performance Experiments

Recently I've been digging into performance optimizations for some code that transfers dozens of assets into a bucket in Google Cloud Storage (GCS).

The code in question is using Google's Ruby gem, and uploads a set of files in sequence, something like this:

require "google/cloud/storage"
storage = Google::Cloud::Storage.new(...)
bucket = storage.bucket(...)

files.each do |filename|
  bucket.create_file(filename, filename)
end

For this naive implementation, it takes about 45 seconds to upload 75 files, totalling about 350k of data. That works out to about 600ms per file! This seemed awfully slow, so I wanted to understand what was going on.

(Note: all the timings here were collected on my laptop, over my home wifi.)

Jump down to the TL;DR for spoilers, or keep reading for way too much detail.

Simplify the dataset

First, I wanted to see if the size of the files had any impact on the speed here, so I created a test with just 50 empty text files:

$ touch assets/{1..50}.txt
$ time ruby gcs.rb !$
ruby gcs.rb assets/{1..50}.txt  0.61s user 0.20s system 2% cpu 27.951 total

That's still about 28 seconds, or about 550ms per asset, which seems super slow for a set of empty files.

Start digging

Looking at the HTTP traffic generated by running this code, it appears as though Google's gem is creating two(!!) new TCP/HTTPS connections per asset. The default behaviour of the gem is to use a "resumable upload", where one request starts the upload and a second request transfers the data and finalizes it, each on its own connection. It doesn't appear as though any connection pooling is happening either.
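For reference, this is roughly what that two-request resumable flow looks like against the GCS JSON API. This is a minimal sketch, not the gem's actual code; it assumes the same bucket and filename variables and an OAuth token in the AUTH environment variable as the snippets below, and error handling is omitted.

require "net/http"
require "uri"

# Step 1: initiate the resumable upload. GCS returns a session URI in the
# Location header (the URI embeds an upload id).
init_uri = URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o" \
               "?uploadType=resumable&name=#{filename}")
init = Net::HTTP.post(init_uri, "", {"Authorization" => "Bearer #{ENV["AUTH"]}"})
session_uri = URI(init["Location"])

# Step 2: send the object data to the session URI to finish the upload.
req = Net::HTTP::Put.new(session_uri)
req.body = File.read(filename)
Net::HTTP.start(session_uri.host, session_uri.port, use_ssl: true) do |http|
  http.request(req)
end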

It looks like there's room for some improvement.

A few ideas came to mind:

  • Use a single request per asset
  • Use connection pooling
  • Run requests in parallel (using threading or async)
  • Use HTTP/2

Single request per asset

The GCS API is pretty simple, so it's straightforward to implement a basic upload function in Ruby.

require "net/http"
require "uri"

def upload_object(bucket, filename)
  Net::HTTP.post(
    URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}"),
    File.read(filename),
    {"Authorization" => "Bearer #{ENV["AUTH"]}"}
  )
end

This uses one connection per asset, and brings the time down to about 11 seconds for our 50 empty assets, less than half the time of the naive version.
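For completeness, the rest of the experiment script is just a loop over the filenames passed on the command line. A minimal sketch (the BUCKET environment variable and the success check are my own additions, not part of the original script):

bucket = ENV["BUCKET"]

ARGV.each do |filename|
  resp = upload_object(bucket, filename)
  warn "upload of #{filename} failed: #{resp.code}" unless resp.is_a?(Net::HTTPSuccess)
end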

Re-using the same connection

Net::HTTP supports re-using the same connection; we just need to restructure the code a little bit:

def upload_object(client, bucket, filename)
  uri = URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{ENV["AUTH"]}"
  req.body = File.read(filename)
  client.request(req)
end

Net::HTTP.start("storage.googleapis.com", 443, use_ssl: true) do |http|
  files.each do |filename|
    upload_object(http, bucket, filename)
  end
end

This runs a bit faster now, in 8 seconds total.

Parallelization

Ruby has a few concurrency models we can play with. I try to avoid threading wherever possible and use async libraries instead. Luckily, Ruby's async gem handles wrapping Net::HTTP quite well:

require "async"
require "async/barrier"

Async do
  barrier = Async::Barrier.new

  files.each do |filename|
    barrier.async do
      # upload_object is the single-request version from above
      upload_object(bucket, filename)
    end
  end

  barrier.wait
end

Now we can finish all 50 uploads in about 1.3s.
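One caveat with the barrier-only version is that it starts every upload at once. That's fine for 50 files, but for much larger batches you may want to cap the number of in-flight requests. Here's a sketch using the async gem's Async::Semaphore; the limit of 10 is arbitrary, and with only 50 small files this shouldn't change the timings much.

require "async"
require "async/barrier"
require "async/semaphore"

Async do
  barrier = Async::Barrier.new
  # Allow at most 10 uploads in flight at any one time.
  semaphore = Async::Semaphore.new(10, parent: barrier)

  files.each do |filename|
    semaphore.async do
      upload_object(bucket, filename)
    end
  end

  barrier.wait
end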

HTTP/2

There are various options for making HTTP/2 requests in Ruby. One we can use is async-http, which integrates well with the async gem used above. Another gem that's worked well for me is httpx.

require "async"
require "async/barrier"
require "async/http/internet"

Async do
  barrier = Async::Barrier.new
  internet = Async::HTTP::Internet.new

  files.each do |filename|
    barrier.async do
      response = internet.post(
        "https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}",
        {"Authorization" => "Bearer #{ENV["AUTH"]}"},
        File.read(filename))
      # read the body so the underlying HTTP/2 stream is finished
      response.read
    end
  end

  barrier.wait
ensure
  # close pooled connections so the event loop can exit cleanly
  internet&.close
end

This finishes all 50 requests in 0.4s!

Looking back at our initial data set, which took 45 seconds with the naive implementation, we can now upload it in 0.9 seconds.

TL;DR

To summarize, here are the times for uploading 50 empty files to a bucket:

Method                     Time
naive (google gem)         28s
single request per asset   11s
single HTTP connection     8s
async                      1.3s
HTTP/2                     0.4s

I'm really shocked at how much faster HTTP/2 is here.

Consumers of Google Cloud APIs should take a look at the libraries they're using and see whether they can switch to ones that support HTTP/2. I'm curious whether other Google Cloud client libraries support HTTP/2 as well.

Can the Ruby gem support HTTP/2? There's an open issue on the GitHub repo to switch the underlying client implementation to Faraday, which would allow an HTTP/2-aware client to be used under the hood. I've started working on a draft PR to see what would be involved in switching, but there are some major barriers at this point.
