{"id":158,"date":"2026-04-28T20:59:57","date_gmt":"2026-04-28T20:59:57","guid":{"rendered":"https:\/\/byte64.com\/?p=158"},"modified":"2026-04-28T20:59:57","modified_gmt":"2026-04-28T20:59:57","slug":"categories-and-suggestions","status":"publish","type":"post","link":"https:\/\/byte64.com\/?p=158","title":{"rendered":"Categories and Suggestions"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Recently I&#8217;ve been trying, somewhat unsuccessfully, to find wikipedia search filters that don&#8217;t fit into my retrieval model. I was hoping to find some natural low cardinality, high coverage fields that people could use as filters in their queries.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine you&#8217;re on a product page for Bombas socks. You&#8217;ll find filters for product type (Men\/Women\/Kids\/Sport), Category (socks, slippers, underwear, t-shirts) and material (cotton, wool, cashmere, etc.). These pose a problem for search because every product has a value for each. So, if I search for socks, say, I can expect that the posting list of products is a significant fraction of the entire database. (This isn&#8217;t a problem for Bombas which only has hundreds of products, but it does pose a problem for wikipedia) For large datasets this requires using bit arrays, perhaps with run length encoding, instead of PFORs or repeated varints. Even with more compact encodings for this dense data, it can take too long to read it from disk, especially combined with other keywords that could keep us from reaching our retrieval quota. So, we might have to keep these in memory or build caching systems. Finally, if you have a lot of fields like this, or if a field can take multiple values simultaneously, we&#8217;ve got to do some hard combinatorics to figure out how best to store it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Unfortunately, I couldn&#8217;t find any data like this.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, I tried parsing <strong>infoboxes<\/strong>. This is structured data the comes with each wikipedia page and shows up as little box in the right margin. There&#8217;s all sorts of templates for these infoboxes, and you get nested structured data. However, the values in these infoboxes are fairly unique, so if you build out the search index for them, there&#8217;s not a lot of pages for each value. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can lookup in the infoboxes using a simple key:value syntax. For example, &#8220;author:Shelley book&#8221; will give you Frankenstein among others:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"632\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-14-1024x632.png\" alt=\"\" class=\"wp-image-159\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-14-1024x632.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-14-300x185.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-14-768x474.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-14-1536x947.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-14.png 1618w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Next, I tried looking at <strong>quality classes<\/strong>. Wikipedia pages are judged on <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wikipedia:Content_assessment\">various criteria<\/a> and given scores like &#8220;Featured Article&#8221;, &#8220;Good Article&#8221;, &#8220;A-class&#8221;, &#8220;B-class&#8221; and so on. I had some problem pulling this information from the wikipedia dumps. Most of these annotations are in talk pages which require downloading the complete history of wikipedia revisions (no thanks). If you look on the articles themselves, you only see &#8220;Featured Articles&#8221; (6885), &#8220;Good Articles&#8221; (43598), &#8220;Stubs&#8221; (2318123) and a few others. Only &#8220;Stubs&#8221; has really high coverage and this data didn&#8217;t seem like a good test of the bit array filters I wanted to build.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, I looked at <strong>categories<\/strong>. If you go to the bottom of almost any wikipedia page, you&#8217;ll find a bunch of categories that it falls into. These categories are essentially collections of pages on a similar topic or having something else in common. For example, Frankenstein is in:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:1810s_science_fiction_novels\">1810s science fiction novels<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:British_Gothic_novels\">British Gothic novels<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:British_horror_novels\">British horror novels<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Censored_books\">Censored books<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Cultural_depictions_of_scientists\">Cultural depictions of scientists<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Novels_adapted_into_video_games\">Novels adapted into video games<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Novels_by_Mary_Shelley\">Novels by Mary Shelley<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Novels_set_in_the_Arctic\">Novels set in the Arctic<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Romantic_novels\">Romantic novels<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Science_fiction_horror_novels\">Science fiction horror novels<\/a><\/li>\n\n\n\n<li>and many more&#8230;<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Now, these categories appear as text in the article, so you can already search for them but what you couldn&#8217;t do restrict your results to the category. For example, if you search for &#8220;Romanic novels&#8221;, you&#8217;ll get a result for &#8220;Romantic fantasy&#8221;. This is page has both those words, but is not in the category.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"480\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-15-1024x480.png\" alt=\"\" class=\"wp-image-160\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-15-1024x480.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-15-300x141.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-15-768x360.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-15-1536x720.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-15.png 1612w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">So, I decided to add a category filter.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Category Filter<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">With a category filter, I wanted exact matches on the category name. You should be able to select a category, see the results in it and filter further using a plain search. The user experience for this is a bit different from our plain searches. I don&#8217;t know in advance what categories there are and I need to make an exact match. So, the first step was to index the categories with their full text phrase (lowercase) as a key. In our search index we&#8217;ll have &#8220;romanic novels&#8221; as a key with all the documents in that category as values.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That enforces that we require exact matches. In the user interface, I added an input fields specifically for categories, because it needs to suggest that partial matches aren&#8217;t allowed. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we need to fix the discovery problem. How does the user know what categories can be searched? There are 1,836,240 different categories which is too many for a drop-down list. Instead, we need to implement suggestions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When typing into a search box, you are often given options to complete what you might be typing. Often, your text is seen a prefix and the suggestions merely complete the line of text in some way. Other times, when you&#8217;re looking for something exact but you don&#8217;t know what it is, your text might appear in the middle of the result. That said, typically we don&#8217;t start typing in the middle of a word, so you can treat whatever someone types as a prefix of a word in the text they&#8217;re looking for. So maybe typing &#8220;sci fi&#8221; should match &#8220;Science Fiction&#8221;. Or even &#8220;fi sci&#8221; should work.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I expect something like this to be implemented with a data structure called a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Trie\">Trie<\/a>. Talking to Gemini about it, there seemed to be three options:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Use a Trie to store the prefixes of all words in the category phrases,<\/li>\n\n\n\n<li>Use a regular search index to store the words in the category phrases, or<\/li>\n\n\n\n<li>Use a map (hashmap or B-tree) to store all the prefixes separately and do a direct prefix lookup.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s talk it through:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Tries are an in-memory approach and relatively efficient on space. I was a little worried about how much time it would take to read them from disk on startup.<\/li>\n\n\n\n<li>I hadn&#8217;t really thought of reusing my other search infrastructure for suggestions. I didn&#8217;t think it&#8217;d be fast enough, but it does much of what&#8217;s needed.<\/li>\n\n\n\n<li>Storing all the prefixes separately (&#8220;s&#8221;, &#8220;sc&#8221;, &#8220;sci&#8221;, &#8220;scie&#8221;, &#8220;scien&#8221;, &#8220;scienc&#8221;, &#8220;science&#8221;) is really wasteful with memory. But it does give you a super fast lookup.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">One new idea I wanted to bring to this was rather than storing (prefix -> list of categories), there&#8217;s a natural in-between where (prefix -> list of words) and (word -> list of categories). This extra jump will take more time, but should keep everything more compact. So, I started out by building the (word -> list of categories) search index. Once that was done, I&#8217;d have options. I could serve it from disk, or pull it into memory. That would be a newer faster way to search one of my search indices. I&#8217;d also build out the (prefix -> list of words) mapping as either a trie or a simple map.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But as it turns out, once you have the (word -> list of categories) search index, you&#8217;re pretty much done. The sstables I use to store the search indices are really good at listing out keys with a given prefix since the keys are sorted. I had the whole thing implemented before doing any in-memory hacks and it turns out the latency is great:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"481\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-16-1024x481.png\" alt=\"\" class=\"wp-image-161\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-16-1024x481.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-16-300x141.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-16-768x361.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-16-1536x722.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-16.png 1630w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Every keystroke takes ~5ms on the server and 71ms for the round-trip request. There&#8217;s not much room to improve if it&#8217;s all in memory or I directly map to the prefixes. With only 10 results, there&#8217;s not much time spent on scoring &#8212; I&#8217;m not really doing any scoring other than looking up the URL from the local_doc_id (in the category space).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"633\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-17-1024x633.png\" alt=\"\" class=\"wp-image-162\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-17-1024x633.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-17-300x185.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-17-768x475.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-17-1536x950.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-17.png 1624w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can now see the full list of pages in the &#8220;Romantic Novels&#8221; category. You can also add more search terms and I&#8217;ve had it filter out semantic retrieval results that don&#8217;t meet the category criteria.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"480\" src=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-18-1024x480.png\" alt=\"\" class=\"wp-image-163\" srcset=\"https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-18-1024x480.png 1024w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-18-300x141.png 300w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-18-768x360.png 768w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-18-1536x721.png 1536w, https:\/\/byte64.com\/wp-content\/uploads\/2026\/04\/image-18.png 1624w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I&#8217;m still on the lookout for fields that have large coverage over the corpus. Currently, I&#8217;m only indexing English wikipedia, but if I expand that, a language filter could fit the bill.  <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently I&#8217;ve been trying, somewhat unsuccessfully, to find wikipedia search filters that don&#8217;t fit into my retrieval model. I was hoping to find some natural low cardinality, high coverage fields that people could use as filters in their queries. Imagine you&#8217;re on a product page for Bombas socks. You&#8217;ll find filters for product type (Men\/Women\/Kids\/Sport), [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30,16,5,18],"tags":[34,37,32,36,35,7],"class_list":["post-158","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-data-processing","category-data-structures","category-search","tag-filters","tag-prefix","tag-retrieval","tag-suggestions","tag-trie","tag-wikipedia"],"_links":{"self":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/158","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=158"}],"version-history":[{"count":1,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/158\/revisions"}],"predecessor-version":[{"id":164,"href":"https:\/\/byte64.com\/index.php?rest_route=\/wp\/v2\/posts\/158\/revisions\/164"}],"wp:attachment":[{"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=158"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=158"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/byte64.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=158"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}