{"id":63,"date":"2026-03-12T17:19:55","date_gmt":"2026-03-12T17:19:55","guid":{"rendered":"https:\/\/byte64.com\/?p=63"},"modified":"2026-03-12T17:25:56","modified_gmt":"2026-03-12T17:25:56","slug":"search-indexes-and-memory","status":"publish","type":"post","link":"https:\/\/byte64.com\/?p=63","title":{"rendered":"Search Indexes and Memory"},"content":{"rendered":"\n<p>I&#8217;ve been working on v0 of the search engine, which requires building a search index in the form of &#8220;posting lists&#8221;. I&#8217;ve built a pipeline that reads Wikipedia documents and emits a (word, url) key-value pair for every word it finds. Then we group by word so we can look up all the documents that word appears in.<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code>cat -&gt; http:\/\/w\/Dog, http:\/\/w\/Mammal, http:\/\/w\/Pet, ...\ndog -&gt; http:\/\/w\/Cat, http:\/\/w\/Mammal, http:\/\/w\/Pet, ...\npet -&gt; http:\/\/w\/Cat, http:\/\/w\/Dog, http:\/\/w\/Pet, ...<\/code><\/pre>\n\n\n\n<p>As with anything in a search engine, the data representations need to be efficient if the engine is going to scale. Typically, to get a compact representation of the posting list, you replace each document key with a local document id. The local document id is local to the shard and machine holding the particular chunk of the search index. To keep the representation as small as possible, it&#8217;s not a hash but an integer position of the document in your local shard.
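The grouping and local-id assignment might look something like this minimal sketch (the toy corpus and every name here are hypothetical; the real pipeline reads Wikipedia documents):

```python
# Sketch: build posting lists of local doc ids from (word, url) pairs.
from collections import defaultdict

# Toy stand-in for the document corpus (hypothetical contents).
docs = {
    "http://w/Cat": "cat pet mammal",
    "http://w/Dog": "dog pet mammal",
    "http://w/Pet": "cat dog pet",
}

# Assign each document a small local id: its integer position in the shard,
# not a hash, so ids stay small and dense.
doc_ids = {url: i for i, url in enumerate(sorted(docs))}

# Emit (word, url) pairs, then group by word into sorted posting lists.
postings = defaultdict(list)
for url, text in docs.items():
    for word in set(text.split()):
        postings[word].append(doc_ids[url])
for word in postings:
    postings[word].sort()

print(postings["pet"])  # sorted local ids of every doc containing "pet"
```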
<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cat -&gt; 3, 7, 20, 31, ...\ndog -&gt; 4, 7, 20, 36, ...\npet -&gt; 3, 4, 20, 35, ...<\/code><\/pre>\n\n\n\n<p>We can make this even more compact by storing the differences between consecutive local document ids:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cat -&gt; 3, 4, 13, 11, ...\ndog -&gt; 4, 3, 13, 16, ...\npet -&gt; 3, 1, 16, 15, ...<\/code><\/pre>\n\n\n\n<p>Finally, if the maximum value in a range of values is small, we can use a small number of bits to represent each value in the range. For example, the sequence [3, 4, 13, 11] doesn&#8217;t have any number larger than 15, so the values can be represented with 4-bit numbers: 0011, 0100, 1101, 1011. We&#8217;re going to have a ton of keys, so it&#8217;s important to minimize the value size.<\/p>\n\n\n\n<p>Once this is loaded into an SSTable and a search request for &#8220;cat AND dog&#8221; comes in, we grab the posting lists for cat and dog and find the documents in common: 7 and 20. These represent http:\/\/w\/Mammal and http:\/\/w\/Pet in our example.<\/p>\n\n\n\n<p>Going back from the local document ids to the document keys (URLs) takes some finesse. There can be thousands or millions of results, and looking up each one in a separate document SSTable isn&#8217;t feasible. In my initial implementation, a query for &#8220;quantum physics&#8221; returned 2441 documents. If looking up the key for each local doc id takes 10ms, then you have to wait at least 24.41s for your results, which is far too long. Not only that, but we need more information about these documents so we can score their relevance to the query.<\/p>\n\n\n\n<p>What&#8217;s the solution? Well, it has several parts.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First, I&#8217;ve created a separate ScoringInfo dataset containing just the document key (URL) and keywords for now.
By not storing the whole document, there&#8217;s less data to read and faster lookup, though snippetting will become a problem.<\/li>\n\n\n\n<li>Second, I&#8217;ve loaded the ScoringInfo into memory as a simple array. The local doc ids are small sequential integers which enumerate the documents, after all.<\/li>\n\n\n\n<li>Finally, a full search engine has two more features which help it: an ordering on the local documents which puts the most important documents first, and limited-depth retrieval. That is, we don&#8217;t need to process all 2441 documents if we&#8217;re only going to serve 10 of them to the end user and we can be assured that the important ones are early in the list.<\/li>\n<\/ul>\n\n\n\n<p>I haven&#8217;t implemented a document order yet, but by simply putting the ScoringInfo into RAM, I can get reasonable latencies. How much RAM, though? My enwiki.scoring.rio file is 346 MiB but unpacks to a whopping ~2.3 GiB in memory. This was so large that I needed to upgrade my VM from an e2-small (2 cores, 2 GiB RAM) to an e2-standard (2 cores, 8 GiB RAM). It also gives me concern about scaling, since this only covers 7 million records and I expect to add more fields to ScoringInfo over time. It may be that less frequently retrieved documents can live on disk, especially if I limit the number of retrievals.<\/p>\n\n\n\n<p>This tells me a bunch about the calculus of search. Since we need to bound the time to respond, each request is limited in how much data it can pull from disk. That seems like a critical metric to measure. In-memory storage and caching need to be optimized to keep that number low for the vast majority of requests.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been working on v0 of the search engine, which requires building a search index in the form of &#8220;posting lists&#8221;. I&#8217;ve built a pipeline that reads Wikipedia documents and emits a (word, url) key-value pair for every word it finds.
Then we group by word so we can look up all documents that word [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,18],"tags":[19,8,7],"class_list":["post-63","post","type-post","status-publish","format-standard","hentry","category-data-structures","category-search","tag-doc-ids","tag-sstable","tag-wikipedia"],"_links":{"self":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/63","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=63"}],"version-history":[{"count":3,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/63\/revisions"}],"predecessor-version":[{"id":66,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/63\/revisions\/66"}],"wp:attachment":[{"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=63"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=63"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=63"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
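The delta-encoding, bit-packing, and AND-intersection steps described in the post can be sketched as follows (a minimal sketch using the example posting lists; a real index would choose the bit width per block rather than assuming 4 bits):

```python
# Example posting lists from the post: sorted local doc ids.
cat = [3, 7, 20, 31]
dog = [4, 7, 20, 36]

def deltas(ids):
    """Store each id as the difference from the previous one."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def undeltas(ds):
    """Recover the original ids by running-summing the deltas."""
    out, total = [], 0
    for d in ds:
        total += d
        out.append(total)
    return out

def pack4(values):
    """Pack values into 4 bits each; only valid when max(values) <= 15."""
    assert all(0 <= v <= 15 for v in values)
    packed = 0
    for v in values:
        packed = (packed << 4) | v
    return packed

def intersect(a, b):
    """'cat AND dog': merge-walk two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(deltas(cat))          # [3, 4, 13, 11]
print(pack4(deltas(cat)))   # the nibbles 0011 0100 1101 1011 as one int
print(intersect(cat, dog))  # [7, 20]
```

Delta-then-pack works because sorted ids make the gaps much smaller than the ids themselves, and the merge walk relies on both lists staying sorted.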