Data Structures

Categories and Suggestions

Apr 28, 2026

—

by

in Algorithms, Data Processing, Data Structures, Search

Andrew

Recently I’ve been trying, somewhat unsuccessfully, to find wikipedia search filters that don’t fit into my retrieval model. I was hoping to find some natural low cardinality, high coverage fields that people could use as filters in their queries. Imagine you’re on a product page for Bombas socks. You’ll find filters for product type (Men/Women/Kids/Sport),…

BM25 & Search Index Encoding

Apr 2, 2026

—

by

Andrew

in Data Structures, Search

Andrew

Okapi BM25 is a standard ranking formula that has been used in search engines since the 1980s. For each word in a query, it uses the frequency of that word in a document, the length of the document and the number documents that contain the word to decide how significant the word is for the…

Vector Retrieval

Mar 31, 2026

—

by

Andrew

in Data Structures, Google Cloud, Search

Andrew

Now that every document has been assigned a vector encoding its semantics, this opens the door to a new kind of retrieval. Rather than find documents that might be relevant to the query by searching through the search index for keywords, we can instead take the query’s vector and find nearby documents in the embedding.…

Vector Embeddings

Mar 26, 2026

—

by

Andrew

in Data Structures, Search

Andrew

Search engines make use of AI to improve their search results. There are AI models that can understand the meaning of a sentence or document. They often present their results as embeddings of the document space into a vector space: D→ℝnD \to \mathbb{R}^n. These are called vector embeddings. Once you’ve found the embeddings for your…

Golden SSTables

Mar 25, 2026

—

by

Andrew

in Data Structures

Andrew

When you’re testing, it’s common to transform whatever structure you’re dealing with into text and compare that text to a fixed string of expected text. This is convenient because you get a nice visual representation of what you’re expecting the output to be. This approach can cause problems if the text representation doesn’t capture everything…

Search Indexes and Memory

Mar 12, 2026

—

by

Andrew

in Data Structures, Search

Andrew

I’ve been working on v0 of the search engine which requires building a search index in the form of “posting lists”. I’ve build a pipeline that reads wikipedia documents, outputs all the words it finds and emits key-value pairs of (word, url). Then we group by word so we can lookup all documents that word…

Writing SSTables with Beam

Mar 10, 2026

—

by

Andrew

in Coding with AI, Data Processing, Data Structures

Andrew

Apache Beam is an open source system for processing large datasets. It has both a realtime and a batch processing mode. The batch processing mode is based on Google’s internal Flume framework which I had the pleasure of using for 7 years while processing Android telemetry. It’s also the perfect system for building a search…

Building an SSTable

Feb 28, 2026

—

by

Andrew

in Coding with AI, Data Structures

Andrew

SSTables are a critical piece of technology that holds up the modern web. It’s the basis for most modern databases, search backends and many other technologies. What it provides is a reasonably fast way to perform lookups in large datasets. SSTables are sorted string tables meaning both our lookup keys and the resulting values are…

Category: Data Structures