Category: Data Structures
-
Categories and Suggestions
Recently I’ve been trying, somewhat unsuccessfully, to find wikipedia search filters that don’t fit into my retrieval model. I was hoping to find some natural low cardinality, high coverage fields that people could use as filters in their queries. Imagine you’re on a product page for Bombas socks. You’ll find filters for product type (Men/Women/Kids/Sport),…
-
BM25 & Search Index Encoding
Okapi BM25 is a standard ranking formula that has been used in search engines since the 1980s. For each word in a query, it uses the frequency of that word in a document, the length of the document and the number documents that contain the word to decide how significant the word is for the…
-
Golden SSTables
When you’re testing, it’s common to transform whatever structure you’re dealing with into text and compare that text to a fixed string of expected text. This is convenient because you get a nice visual representation of what you’re expecting the output to be. This approach can cause problems if the text representation doesn’t capture everything…
-
Search Indexes and Memory
I’ve been working on v0 of the search engine which requires building a search index in the form of “posting lists”. I’ve build a pipeline that reads wikipedia documents, outputs all the words it finds and emits key-value pairs of (word, url). Then we group by word so we can lookup all documents that word…
-
Writing SSTables with Beam
Apache Beam is an open source system for processing large datasets. It has both a realtime and a batch processing mode. The batch processing mode is based on Google’s internal Flume framework which I had the pleasure of using for 7 years while processing Android telemetry. It’s also the perfect system for building a search…
-
Building an SSTable
SSTables are a critical piece of technology that holds up the modern web. It’s the basis for most modern databases, search backends and many other technologies. What it provides is a reasonably fast way to perform lookups in large datasets. SSTables are sorted string tables meaning both our lookup keys and the resulting values are…
