Search – Byte64

Query Planning for Fun

Jul 7, 2026

—

by

Andrew

I’ve still been hard at work trying to bring the Byte64 search engine to its first release. I’ve been focused on online indexing which has a host of complexities. I don’t think it’s lent itself to blog posts because it’s a complicated system and there doesn’t appear to be one clear right way to do…

Categories and Suggestions

Apr 28, 2026

—

by

Andrew

in Algorithms, Data Processing, Data Structures, Search

Andrew

Recently I’ve been trying, somewhat unsuccessfully, to find wikipedia search filters that don’t fit into my retrieval model. I was hoping to find some natural low cardinality, high coverage fields that people could use as filters in their queries. Imagine you’re on a product page for Bombas socks. You’ll find filters for product type (Men/Women/Kids/Sport),…

Retrieval Algorithms

Apr 20, 2026

—

by

Andrew

in Algorithms, Search

Andrew

I’ve been making incremental progress on the retrieval algorithm and now have what I think of as a full complement of algorithms and settings. In terms of search index encoding, I started with the PFOR encoding of docids and then added repeated varint encodings of the docids to save space. (The PFOR blocks contain 128…

PageRank on Wikipedia

Apr 17, 2026

—

by

Andrew

in Algorithms, Data Processing, Search

Andrew

PageRank is a famous scoring function invented and deployed by the Google guys in early days of websearch. It assigns each webpage a score based on the scores of webpages that link to it. As you can see, it’s a recursive definition, but if you use the right formula, then it’ll converge to something meaningful.…

Hybrid Search

Apr 16, 2026

—

by

Andrew

in Search

Andrew

I’ve already written a post on using a piecewise-linear scaling to bring BM25 and my semantic score (from cosine similarity with our embeddings) into the same numerical space. After performing this scaling, I found some important results weren’t scoring as well as they should. In particular, any search query with a common term (e.g. “the”…

Score Normalization

Apr 3, 2026

—

by

Andrew

in Search

Andrew

Currently, I have two different scoring functions: BM25 and the semantic scoring function that comes from our sentence embedding. These scores take very different ranges, but need to be combined to make a final score. It’s not simply a matter of assigning different weights to these scores. We need to stretch them out to make…

BM25 & Search Index Encoding

Apr 2, 2026

—

by

Andrew

in Data Structures, Search

Andrew

Okapi BM25 is a standard ranking formula that has been used in search engines since the 1980s. For each word in a query, it uses the frequency of that word in a document, the length of the document and the number documents that contain the word to decide how significant the word is for the…

Vector Retrieval

Mar 31, 2026

—

by

Andrew

in Data Structures, Google Cloud, Search

Andrew

Now that every document has been assigned a vector encoding its semantics, this opens the door to a new kind of retrieval. Rather than find documents that might be relevant to the query by searching through the search index for keywords, we can instead take the query’s vector and find nearby documents in the embedding.…

Vector Embeddings

Mar 26, 2026

—

by

Andrew

in Data Structures, Search

Andrew

Search engines make use of AI to improve their search results. There are AI models that can understand the meaning of a sentence or document. They often present their results as embeddings of the document space into a vector space: D→ℝnD \to \mathbb{R}^n. These are called vector embeddings. Once you’ve found the embeddings for your…

Cookies and GDPR

Mar 24, 2026

—

by

Andrew

in Search, Web Development

Andrew

I’m trying to navigate the intricacies of privacy and building a search engine. This means understanding important privacy legislation, like GDPR, and what it means for collection of the data that’s needed to tune a search engine. Session Cookies. The classic approach to building a search engine is to give users session cookies, log their…

Generating User Data

Mar 18, 2026

—

by

Andrew

in Coding with AI, Search

Andrew

A search engine depends on a feedback loop of users making queries, following links, returning to the search page and rewriting their queries. All these feed into an understanding of whether they’re finding the results they’re looking for. Bootstrapping a system like this is difficult because you don’t start with any users and your search…

Search Indexes and Memory

Mar 12, 2026

—

by

Andrew

in Data Structures, Search

Andrew

I’ve been working on v0 of the search engine which requires building a search index in the form of “posting lists”. I’ve build a pipeline that reads wikipedia documents, outputs all the words it finds and emits key-value pairs of (word, url). Then we group by word so we can lookup all documents that word…

Category: Search