Recent Blog Posts
-
BM25 & Search Index Encoding
Okapi BM25 is a standard ranking formula that has been used in search engines since the 1980s. For each word in a query, it uses the frequency of that word in a document, the length of the document and the number documents that contain the word to decide how significant the word is for the…
-
Golden SSTables
When you’re testing, it’s common to transform whatever structure you’re dealing with into text and compare that text to a fixed string of expected text. This is convenient because you get a nice visual representation of what you’re expecting the output to be. This approach can cause problems if the text representation doesn’t capture everything…
-
Generating User Data
A search engine depends on a feedback loop of users making queries, following links, returning to the search page and rewriting their queries. All these feed into an understanding of whether they’re finding the results they’re looking for. Bootstrapping a system like this is difficult because you don’t start with any users and your search…
-
Search Indexes and Memory
I’ve been working on v0 of the search engine which requires building a search index in the form of “posting lists”. I’ve build a pipeline that reads wikipedia documents, outputs all the words it finds and emits key-value pairs of (word, url). Then we group by word so we can lookup all documents that word…
-
Writing SSTables with Beam
Apache Beam is an open source system for processing large datasets. It has both a realtime and a batch processing mode. The batch processing mode is based on Google’s internal Flume framework which I had the pleasure of using for 7 years while processing Android telemetry. It’s also the perfect system for building a search…
-
SSTable on SSD
When I first put the SSTable server on Google’s Compute Engine, I opted for the cheapest machine: an e2-small (2 cores and 2GB ram) with a standard persistent disk of 20 GiB. When I ran the server, each request was taking hundreds of milliseconds. Installed iotop and found I was getting at most 8 MiB/s…
-
Cloud Engine and SSL
Yesterday, I spent a lot of time setting up byte64.com, particularly trying to setup HTTPS and connecting it to a Google Compute Engine Machine. I learned lots and suffered just as much. I started with a single E2 compute engine VM and had Google Cloud install a wordpress service on it. From there, I looked…
-
Building an SSTable
SSTables are a critical piece of technology that holds up the modern web. It’s the basis for most modern databases, search backends and many other technologies. What it provides is a reasonably fast way to perform lookups in large datasets. SSTables are sorted string tables meaning both our lookup keys and the resulting values are…

