Algolia’s compression algorithm: inspired by lightning and coin sorters
Here at Algolia, we store a lot of data. And at such large scales, that efficiency has become our number one concern — but how do we manage…
Here at Algolia, we store a lot of data. And at such large scales, that efficiency has become our number one concern — but how do we manage to fit so much input data into an index that we can search through in milliseconds?
We tackle this incredibly difficult problem with an unusually effective solution called tree search. It can be a little tough to understand at first, but it’ll be helpful if we use some more familiar analogies first before diving in.
The algorithm of euros and electrons
Imagine with me a manual coin sorter (here’s an advertisement for one, if you’d like a visual), where you can drop coins down onto a guided track. According to a variation of the dimensions of the track that the coins ride on, the coins are dropped into different containers by their size. This simple mechanism doesn’t require sophisticated electronics to read and understand what the coins represent, but rather it notes a simple distinction between them — their small size differences — and exploits it by forcing a sort of “decision” at several key points along the track. In effect, you can imagine the sorter “asking” the coins, “are you big enough to continue along this track, or should you be split off onto a new path?”
Lightning is another good example of this algorithm in practice, so let’s skim the surface of the science behind why lightning takes the shape it does. Take a look at this image of lightning below, keeping in mind that lightning is essentially a bunch of very excited electrons trying to get to a place where there are fewer electrons:
image of a lightning strike
While many of us see lightning quite often, it may have never crossed our minds just how bizarre this pattern is. Really, if lightning is just electrons trying to get from point A to point B as fast as possible, surely they’d just go in a straight line, right? Why does lightning take such sharp corners? Why does it split into branches?
To be honest, this is a developing science (and our science focuses on a different kind of cloud), but the general explanation is that lightning is not made all at once. The electrons jump only a few dozen meters downwards from the cloud, pause for a moment, and make a decision of sorts. Depending on the conductivity of the air around them (which is very irregular since “air is not a perfect mixture”), the electrons will “choose” to take a different direction for the next jump. It’s by this process that they leapfrog to the ground, twisting and even branching with each step. While it seems random, it’s for this reason that lightning is actually deterministic — if we could recreate the same air mixture on a scale roughly equivalent to the length of each step, we could theoretically recreate a lightning bolt that has already existed.
Coin sorters and lightning seem like wildly different phenomena, but if we strip away the details, they actually follow a similar recursive process:
Some items travel down a track for some length. They analyze each item on the track by some specific facet, and decide whether it should branch onto a new path. For all of the new branches of the track, repeat step 1. It’s a fairly simplistic algorithm, but that makes it extremely versatile. When it comes to lightning, the strength of the various bolts that hit the ground map out the conductivity of the air above them (the mystery “factor” of step 2). Coin sorters are actually built to exploit this effect, as the whole point of a coin sorter is that the number of coins in each bucket reflects the difference in the sizes of the coins that first entered the track.
How tree search works similarly
What if we had a sort of branching structure that worked like the air that the lightning travels through, or the track that the coins slide down? At every step down this node-based structure, we’d decide where to branch to in the next level by matching the nodes to a search query. Essentially, our analogue to air conductivity or coin size would be query relevancy. By reverse-engineering how we expect to traverse this tree, we can build it!
Let’s imagine our dataset is seven strings: romane, romanus, romulus, rubens, ruber, rubicon, and rubicundus. Our first attempt at building a tree out of this data would start with an r at the top rung of the tree, and then each branch down would spell out one of our searchable items letter-by-letter, as shown in the image below:
Tree search branch example image
Now we can designate each string as a sequence of moves at each junction! For example, our original string romulus can be described as the third option at the first branch, and the first option at the rest of the junctions — 311111. That’s only saving us one character per string though, as there’s one less junction than there are characters. Let’s see if we can optimize this by combining identical characters laterally, like what we did with the r in the first row:
Tree search branch example image with characters combined laterally
This hasn’t improved the query time as it takes the same amount of steps to traverse the tree, but it’s so much cleaner now! It’s much easier to comprehend, and it only takes up about 65% of the storage space as the previous version. It also opens up the possibility of combining nodes vertically:
Tree search branch example image with nodes combined vertically
This new, shorter tree contains the same amount of data as the previous one, but it’s far quicker to traverse — that same query romulus only requires us to go through two junctions as opposed to the previous six. Of course, this was a fairly contrived example, and there exist situations with multiple optimization routes, but the gist is here.
What tree search is actually used for
And this isn’t just a theoretical thing either! Tree search is a very well-known and well-used technology, and you probably use it every day without noticing. Formats like JPEG and MP3 include versions of a tree search (specifically a related technology called Huffman coding) for its compression benefits, and it’s probably what’s powering the search through your phone’s contact list. Another common example is Abstract Syntax Trees, a form of these tree structures that hold tokens of a programming language. A simple line of code like let x = 100; won’t just be run directly by an interpreter or compiler, since it doesn’t really know what to make of the string in its full form. Instead, it’ll break it into something like this:
let x = 100 visualized
Syntax details like the = and the ; are helpful for resolving ambiguity, but they don’t actually help with the code’s functionality. Once the code is converted to a clearly understood tree structure, those bits of syntax just aren’t necessary for the compiler or interpreter to traverse the tree and run or compile it one command at a time.
Here at Algolia, we get asked a lot how we make our search tools so incredibly fast. Really, doesn’t it seem a little unreasonable at first glance that we could sort through millions of data points — JSON objects, string content, numbers, URLs, booleans — and still find what the user is looking for in a matter of mere milliseconds? Well, we’re not magicians; we’re using tree search!
As mentioned earlier, our example with just seven similar strings was chosen specifically to demonstrate the power of this algorithm. In reality, a typical dataset that small is barely going to see any improvement from tree search, so it was difficult to create a practical dataset for this article. The real advantage comes as the dataset gets bigger, and it becomes likelier that information in new entries to the dataset is already contained in part within the tree. This actually happens perhaps sooner than you’d expect, for a couple of reasons, the most significant of which is that certain letters just don’t follow each other in sentence case. Characters that sit alongside each other as different branches, for example, don’t follow the expected equal distribution, and character sequences that can clump into one node vertically happen far more often than they would for randomly generated text.
As an aside, that actually allows us to correct typos fairly easily. It would be unlikely for a node containing W to be the parent of a node containing g, so the user likely meant either the most common character that appears after W in our existing tree, or a character that happens to be nearby on a keyboard, criteria which will often converge on h. And this more confident the larger the tree becomes.
The average branching off point of new additions to the tree gets further and further down the structure, and we’ll start to need less new storage space per character of new entries. That means Algolia’s search indexes (which are essentially just massive tree structures) become more and more performant as the dataset grows. Some of the most well-known Algolia customers are building incredibly large indexes for this reason (think the guides in GitLab’s docs, or the recipes published by King Arthur Baking, or GoFundMe’s fundraisers, or details about Gucci products). This is all to say that the search speed and the size of those massive datasets don’t just scale linearly as they add more information, but actually tend to somewhere — if you had the time and the patience, you could theoretically calculate the limit of the storage space and search speed of Gucci’s product database as the record count approaches infinity!
Perhaps we should leave it at this for now, since we’d probably end up down a very deep rabbit-hole if we continued with that train of thought. But if you found this explanation interesting, here are some takeaways, links to further research, and questions to ponder:
You use tree search daily. They’re probably most often implemented as Huffman trees, or if you’re a programmer, as ASTs. Just like electrons in lightning step through the sky and take the most conductive path available, our tree search algorithm is essentially just our search query stepping down the tree and finding the closest matches available. What other natural phenomena work this way? Optimizing the tree’s size and optimizing the speed of searching through it are two different things: to reduce tree size, we’ll combine equivalent nodes laterally, but to reduce search time, we’ll combine sequential, single-child nodes vertically. How might those two processes interact with each other?
This general algorithm is at the heart of Algolia’s unmatched performance with large datasets. If you’re interested in testing it out on a smaller scale, our free tier is large enough to start feeling the benefits of tree search’s optimizations.
We hope that you enjoyed this deep-dive into tree search, and as always, feel free to reach out on Twitter if you want to keep the conversation going.