Tag: doc ids
-
Retrieval Algorithms
I’ve been making incremental progress on the retrieval algorithm and now have what I think of as a full complement of algorithms and settings. In terms of search index encoding, I started with the PFOR encoding of docids and then added repeated varint encodings of the docids to save space. (The PFOR blocks contain 128…
-
BM25 & Search Index Encoding
Okapi BM25 is a standard ranking formula that has been used in search engines since the 1980s. For each word in a query, it uses the frequency of that word in a document, the length of the document, and the number of documents that contain the word to decide how significant the word is for the…
-
Search Indexes and Memory
I’ve been working on v0 of the search engine, which requires building a search index in the form of “posting lists”. I’ve built a pipeline that reads Wikipedia documents, extracts all the words it finds, and emits key-value pairs of (word, url). Then we group by word so we can look up all documents that word…
