SSTables are a critical piece of technology that holds up the modern web. They’re the basis for most modern databases, search backends and many other technologies. What they provide is a reasonably fast way to perform lookups in large datasets. SSTables are sorted string tables, meaning both our lookup keys and the resulting values are simply sequences of bytes. The key-value pairs are stored on disk in sorted order.
Let’s make one!
Implementation
The basic structure of an SSTable is fairly straightforward: so much so that when I told Google’s Gemini AI that we were implementing one in Rust, it came up with a reasonable binary encoding:

The data block contains all the keys and values in order. You can simply iterate through them. The index block stores a subset of the keys and their locations in the file. You pick a number m and write every m-th key to the index. This index should be small enough that you can store it in memory. That way you can quickly binary search on the index to find where your key should be in the SSTable file, then do a linear search on disk in the right part of the data block. Precomputing the index and writing it to the file means you don’t have to read the whole file before you start serving requests.
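To make that lookup path concrete, here’s a toy in-memory version in Rust. It’s my own sketch, not the real on-disk format: the “data block” is just a sorted Vec, and the index keeps every m-th key along with its position.

```rust
const M: usize = 4; // chunk size: one index entry per M pairs

/// Every M-th key plus its position in the data block.
struct Index {
    entries: Vec<(Vec<u8>, usize)>,
}

fn build_index(data: &[(Vec<u8>, Vec<u8>)]) -> Index {
    let entries = data
        .iter()
        .enumerate()
        .step_by(M)
        .map(|(i, (k, _))| (k.clone(), i))
        .collect();
    Index { entries }
}

fn get<'a>(
    data: &'a [(Vec<u8>, Vec<u8>)],
    index: &Index,
    key: &[u8],
) -> Option<&'a [u8]> {
    // Binary search: find the first index entry whose key sorts after `key`...
    let pos = index.entries.partition_point(|(k, _)| k.as_slice() <= key);
    if pos == 0 {
        return None; // key sorts before the smallest key in the table
    }
    // ...so the chunk we want starts at the entry just before it.
    let start = index.entries[pos - 1].1;
    // Linear scan within one chunk; on disk this is the sequential read.
    data[start..]
        .iter()
        .take(M)
        .find(|(k, _)| k.as_slice() == key)
        .map(|(_, v)| v.as_slice())
}
```

The real thing seeks to the stored byte offset in the file and decodes entries as it scans, but the control flow is the same: one binary search in memory, then at most m sequential reads from disk.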
Now there’s a grab bag of improvements you can make on this structure. It’s no secret that companies like Google have their own SSTable formats. There’s also an SSTable format used in Apache Cassandra. Here’s what I chose to implement:
- varint encoding of the key and value lengths. I called this a run-length encoding, but the AI referred to it as LEB128.
- The keys in the index don’t have to be repeated in the data block, so we can drop them.
- Since the keys are in sorted order we can use an incremental encoding by just storing the delta between keys.
- Since iterating through elements in the data block is the slowest part, we can speed it up with a skip pointer to move us k elements ahead. We’ll store one of these skip pointers every k elements.
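Here are sketches of the first two tricks as I understand them (illustrative Rust, not the project’s actual code):

```rust
/// LEB128-style varint: 7 bits per byte, high bit set on every byte but the last.
fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

/// Incremental key encoding: store how much of the previous key is shared,
/// then only the differing suffix. Sorted keys share long prefixes, so
/// URL-like keys compress very well under this scheme.
fn write_key(prev: &[u8], key: &[u8], out: &mut Vec<u8>) {
    let shared = prev.iter().zip(key).take_while(|(a, b)| a == b).count();
    write_varint(shared as u64, out);
    write_varint((key.len() - shared) as u64, out);
    out.extend_from_slice(&key[shared..]);
}
```

With keys like https://en.wikipedia.org/wiki/…, each entry only pays for the bytes after the prefix it shares with its predecessor, which is where most of the key compression comes from.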
I’m calling each group of m keys a chunk. It’s interesting to me that in the Bigtable paper they talk about 64 KB chunks, rather than a fixed number of entries. That sounds useful if you’re going to read entire chunks into memory, process them as a group, and want to control how much memory they use.
Gemini was very helpful in building this and understood what I wanted from my admittedly rough descriptions. Working away, small feature by small feature, it was able to help me build and test it.
At this point, I built a CLI to help me build SSTable files, inspect their structure, and look up elements.
byte64:~/sstable
Usage: sstable <command> [args]
Commands:
convert <input.txt> <output.db>
get <db.db> <key> [--profile]
getn <db.db> <k> [--profile]
info <db.db>
profile <db.db> <L>
My chosen input format was a simple text file with comma-separated key-value pairs on each line. I downloaded a recent XML dump of all English Wikipedia pages (106 GiB), had Gemini write a parser to pull out URLs and markdown, and then converted it to SSTable format with my CLI. Sorting it was not fun.
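The parsing side of the converter is just a line splitter. Assuming the keys never contain a comma (the URL keys here are percent-encoded), splitting each line at its first comma works even when the value has commas in it. A sketch, not the actual converter code:

```rust
/// Split one "key,value" line at the first comma.
/// Assumes keys are comma-free; the value may contain commas.
fn parse_line(line: &str) -> Option<(&str, &str)> {
    line.split_once(',')
}
```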
byte64:~/sstable convert sstable/data/all/enwiki-latest-pages-articles-sorted.txt sstable/data/all/enwiki.sst --chunk-size 400 --skip-size 20
SSTable Conversion Stats:
Data Block: 6984089666 bytes (6.50 GiB)
Index Block: 1075291 bytes (1.03 MiB)
Footer: 48 bytes (48 B)
Total size: 6985165005 bytes (6.51 GiB)
Original keys: 365958411 bytes (349.01 MiB)
Compressed: 73413475 bytes (70.01 MiB)
Values: 6875849336 bytes (6.40 GiB)
Avg key size: 51.32 bytes (51 B)
Avg val size: 964.32 bytes (964 B)
byte64:~/sstable info sstable/data/all/enwiki.sst
SSTable Info:
Elements: 7130281
Chunk Size: 400
Skip Size: 20
Min Key: https://en.wikipedia.org/wiki/%21%21
Max Key: https://en.wikipedia.org/wiki/~_%28album%29
I cheated a bit by chopping off the values at 1 KiB. The resulting SSTable was modestly smaller than the original text file of keys and cropped values (6.5 GiB vs 6.9 GiB).
For performance, it’s not fair to use the CLI since that reads the index each time. Instead, my approach was to select a collection of random keys, load the SSTable index into memory and then query the random keys as a server would do.
byte64:~/sstable profile sstable/data/all/enwiki.sst 4000
Profile results for 4000 lookups:
Average time: 69.214µs
Min time: 2.583µs
Max time: 927.583µs
This is running off my local machine. I’ll update this blog when it’s running on a Google Compute Engine machine and you can try it for yourself.
Edit: It’s working now on Google Compute Engine. I’ll have a separate post on that, but you can try it out at https://byte64.com/sstable/. A recent run looked like this:
Profile results for 40000 lookups:
Average time: 6.919699ms
Min time: 7.443µs
Max time: 50.337263ms
Tuning and Asymptotics
You may have wondered how I picked the chunk size and skip size. My initial intuition was to balance the whole structure with:

n/m = m/k = k

where k is the skip size, m is the chunk size and n is the number of pairs.
This gives a runtime of O(log(n/m)) from the in-memory binary search of the index block, along with O(m/k) for the linear search through the skip pointers and another O(k) for the linear search between the skip pointers. With everything balanced, k = n^(1/3), so that’s an overall runtime of O(n^(1/3)) along with O(n^(1/3)) space in memory (assuming constant key size).
For the ~7.1 million records in Wikipedia this gives around 190 entries in the index block, with k ≈ 192 and m = k² ≈ 37,000.
Unfortunately, that’s not the way you want to use an SSTable. It would be great if we could keep a constant lookup time regardless of how many elements are in the SSTable. That takes memory. Neglecting the in-memory lookup, what you’re saying is that you’re limited to a fixed number of reads on disk. Just set m and k to whatever numbers you’re comfortable with and assume that your in-memory index is going to grow linearly as n/m elements. Setting m = k² should minimize on-disk reads (m/k + k = 2k) with this linear scheme. If you want to play with more optimization tricks, it’s about increasing the number of elements that can be searched on disk using a fixed number of reads.
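Those choices can be sanity-checked with a toy cost model (my own back-of-envelope helpers, using the same n, m and k):

```rust
/// Worst-case sequential reads per lookup: hop through the skip pointers
/// (m / k of them in a chunk), then scan up to k entries between two pointers.
fn disk_reads(m: u64, k: u64) -> u64 {
    m / k + k
}

/// Entries in the in-memory index block: one per chunk of m pairs.
fn index_entries(n: u64, m: u64) -> u64 {
    n.div_ceil(m)
}
```

With the CLI flags above (chunk size 400, skip size 20) that’s 400/20 + 20 = 40 reads per lookup at worst, and about 17,800 index entries for the 7,130,281 elements, which lines up with the ~1 MiB index block.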
For my table, I was happy with m = 400 and k = 20, which gave me ~70µs lookups and an index block of 1 MiB for ~17,800 keys of size ~50 bytes. Given the size of the index block, we can store a lot more keys before needing to shard the data.
There’s another way to balance the parameters that spends a proportionate amount of time in memory as on disk. Because of the logarithmic binary search in memory, you’ll want k ∝ log(n/m), or equivalently (with m = k²) m ∝ log²(n). Picking a generous constant and solving for k when n is 7.1 million gives a k of a few dozen.
AI Assistance
I used Google’s Antigravity with Gemini 3.1 Pro to assist me in making this SSTable, and it was incredibly helpful. This is my first foray into Rust, so having an assistant saved me so much time. I was able to have almost everything working in a single day, only running out of the default quota near the end.
I found it had certain weaknesses:
- I asked for an initial implementation with an interface and using a hashmap. I had to correct Gemini that I didn’t want it to be mutable with a set function; instead, we’d have a separate SSTable writer.
- I had to ask it to refactor some excessively long functions with deep nesting. It didn’t use helper functions without prompting, and even then it wrote the same anonymous helper function repeatedly rather than sharing one copy.
- Without prompting, it wouldn’t update the design document when we changed the encoding.
- It tried to read the entire input file into memory to sort it. I had it change to a streaming converter, which made me responsible for presorting the data.
- In the incremental encoding of the skip pointers, it used a common prefix length of u32::MAX to represent that there were no more elements in the chunk. It could have determined this instead by keeping track of how many times it had skipped ahead.
But overall, I still felt in charge of the readability of the code, its interface and testing.
What’s next?
As mentioned above, I’ll be putting this into a machine on Google’s Compute Engine as soon as I get that figured out. If we’re going the way of a database, we’ll maybe want a bloom filter, or a log of mutations to overlay. If we want to build a search engine, then we’ll use this for storing posting lists. Let’s see.
~Andrew