{"id":141,"date":"2026-04-20T23:07:47","date_gmt":"2026-04-20T23:07:47","guid":{"rendered":"https:\/\/byte64.com\/?p=141"},"modified":"2026-04-20T23:10:21","modified_gmt":"2026-04-20T23:10:21","slug":"retrieval-algorithms","status":"publish","type":"post","link":"https:\/\/byte64.com\/?p=141","title":{"rendered":"Retrieval Algorithms"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">I&#8217;ve been making incremental progress on the retrieval algorithm and now have what I think of as a full complement of algorithms and settings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In terms of search index encoding, I started with the PFOR encoding of docids and then added repeated varint encodings of the docids to save space. (The PFOR blocks contain 128 docids each, an many keywords only appear a few times.) Next, I added frequencies to support BM25 scoring. This was added in fixed size (4-bit) quantized values so I could easily access them after intersection.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In terms of retrieval depth, I ordered all the documents using pagerank and I added a retrieval quota to limit the number of results returned from retrieval and sent to scoring.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In terms of the actual retrieval algorithm, I just had two options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Read everything<\/strong>. For each term, I&#8217;d go read the entire record from the search index SSTable. I&#8217;d read all document ids from either the PFOR or repeated varints and pull them into memory, along with all the frequency information. The retrieval quota was implemented simply as a filter after all the intersections and frequencies were calculated.<\/li>\n\n\n\n<li><strong>Read all document ids but read frequencies after intersection.<\/strong> This was a small improvement where we wouldn&#8217;t touch the frequency section of the SearchIndex record until after we had completed computed the intersection. But even then, the retrieval quota would be applied after grabbing all the frequencies.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Now, I&#8217;ve implemented a third option: progressive intersection.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Progressive Intersection<\/strong>. We&#8217;re going to implement the intersection as a progressive multi-way merge pulling in new document ids in chunks from the posting lists only as needed. If we reach the the retrieval quota, we stop without ever reading into memory the complete posting lists.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Here&#8217;s the algorithm. We imagine having the posting list for each term in sorted ascending order (by local doc id) and we want to find documents that are common to all the posting lists.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Untitled-drawing-4.png\" alt=\"\" class=\"wp-image-142\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Untitled-drawing-4.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Untitled-drawing-4-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We&#8217;re looking for the elements that are in common to all three. So, let&#8217;s make a max heap of the first elements. The largest element we find is 3. So we can discard anything lower than 3 from our posting lists:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval.png\" alt=\"\" class=\"wp-image-143\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Since the first elements are not all 3, and 3 is the minimum of the first elements, we know we don&#8217;t have 3 in our intersection. Now we&#8217;ll simply repeat this process. The max of the first elements is now 5. Removing everything lower gives:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-1.png\" alt=\"\" class=\"wp-image-144\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-1.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-1-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now since all first elements are the same, we&#8217;ve found that 5 is in the intersection. We continue this process:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-2-1.png\" alt=\"\" class=\"wp-image-148\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-2-1.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-2-1-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Removing 5, our max is now 14, so we can remove anything lower:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-3.png\" alt=\"\" class=\"wp-image-145\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-3.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-3-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now our max is 17 and we can remove anything lower:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-4.png\" alt=\"\" class=\"wp-image-149\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-4.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-4-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now our max is 20 and we can remove anything lower:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"436\" height=\"154\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-5.png\" alt=\"\" class=\"wp-image-150\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-5.png 436w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/Posting-list-progressive-retrieval-5-300x106.png 300w\" sizes=\"auto, (max-width: 436px) 100vw, 436px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">And that finds 20 is in the intersection. And, now I just noticed that poodle 30 should have been red and not poodle 35. Anyway, you get the point.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The advantage of this progressive approach is two-fold:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We get to stop as soon as we fill our retrieval quota. For example, I&#8217;m using a retrieval quota of 10k documents. So, if there are more than 10k documents in the intersection, we&#8217;ll only take the first 10k with respect to the document order (pagerank).<\/li>\n\n\n\n<li>We don&#8217;t need to read the entire PFOR from disk in order to compute the intersection. Since the PFOR is already in chunks of 128 document ids, I&#8217;ve setup buffers of 128 document ids for each term. We&#8217;ll read another 128 documents in for each term when the buffer runs out. This applies to both PFOR encoded and repeated varint encoded document ids.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Likely there&#8217;s further optimizations that can be done here. I expect 128 documents to only take around 256 bytes and typically the operating system is reading from disk in 4KiB pages. So, perhaps I should figure out where the page boundaries are and use that to determine how much to buffer.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Ignoring the page boundaries for now, I can count up the bytes I&#8217;m using and see what should lead to a difference in performance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here&#8217;s the fancy new settings to choose the retrieval algorithm:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"515\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-13-1024x515.png\" alt=\"\" class=\"wp-image-151\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-13-1024x515.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-13-300x151.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-13-768x386.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-13-1536x772.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-13.png 1616w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">And here&#8217;s the comparison:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"608\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/comparison-1024x608.png\" alt=\"\" class=\"wp-image-155\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/comparison-1024x608.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/comparison-300x178.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/comparison-768x456.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/comparison-1536x912.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/comparison.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The approximate total results clearly needs some work on the progressive algorithm. It thinks that if it continued it would find over a million results. But in terms of bytes read, we can see that we&#8217;ve done much better than the previous two algorithms. Again, this doesn&#8217;t take into account that we read pages. But some of the other statistics, like intersection retrieval time look improved. I&#8217;m looking into what keyword retrieval time means for the progressive algorithm. In the previous two, it was the time taken to read the PFOR. There seems to be a large swing in embedded retrieval time which could certainly account for the overall server time improvements. I don&#8217;t think these results should be entirely trusted until I can do some robust experimentation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been making incremental progress on the retrieval algorithm and now have what I think of as a full complement of algorithms and settings. In terms of search index encoding, I started with the PFOR encoding of docids and then added repeated varint encodings of the docids to save space. (The PFOR blocks contain 128 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30,18],"tags":[19,33,32],"class_list":["post-141","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-search","tag-doc-ids","tag-pfor","tag-retrieval"],"_links":{"self":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/141","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=141"}],"version-history":[{"count":3,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/141\/revisions"}],"predecessor-version":[{"id":156,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/141\/revisions\/156"}],"wp:attachment":[{"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=141"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=141"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}