Tag: sstable

  • Golden SSTables

    When you’re testing, it’s common to transform whatever structure you’re dealing with into text and compare that text to a fixed string of expected text. This is convenient because you get a nice visual representation of what you’re expecting the output to be. This approach can cause problems if the text representation doesn’t capture everything…

  • Search Indexes and Memory

    I’ve been working on v0 of the search engine which requires building a search index in the form of “posting lists”. I’ve build a pipeline that reads wikipedia documents, outputs all the words it finds and emits key-value pairs of (word, url). Then we group by word so we can lookup all documents that word…

  • Writing SSTables with Beam

    Apache Beam is an open source system for processing large datasets. It has both a realtime and a batch processing mode. The batch processing mode is based on Google’s internal Flume framework which I had the pleasure of using for 7 years while processing Android telemetry. It’s also the perfect system for building a search…

  • SSTable on SSD

    When I first put the SSTable server on Google’s Compute Engine, I opted for the cheapest machine: an e2-small (2 cores and 2GB ram) with a standard persistent disk of 20 GiB. When I ran the server, each request was taking hundreds of milliseconds. Installed iotop and found I was getting at most 8 MiB/s…

  • Building an SSTable

    SSTables are a critical piece of technology that holds up the modern web. It’s the basis for most modern databases, search backends and many other technologies. What it provides is a reasonably fast way to perform lookups in large datasets. SSTables are sorted string tables meaning both our lookup keys and the resulting values are…