Category: Search
-
BM25 & Search Index Encoding
Okapi BM25 is a standard ranking formula that has been used in search engines since the 1980s. For each word in a query, it uses the frequency of that word in a document, the length of the document and the number documents that contain the word to decide how significant the word is for the…
-
Generating User Data
A search engine depends on a feedback loop of users making queries, following links, returning to the search page and rewriting their queries. All these feed into an understanding of whether they’re finding the results they’re looking for. Bootstrapping a system like this is difficult because you don’t start with any users and your search…
-
Search Indexes and Memory
I’ve been working on v0 of the search engine which requires building a search index in the form of “posting lists”. I’ve build a pipeline that reads wikipedia documents, outputs all the words it finds and emits key-value pairs of (word, url). Then we group by word so we can lookup all documents that word…
