Optimization of String Key Comparison

Ordered data structures that index string keys are widespread, and provide the backbone for databases. Lookups on such data structures (e.g., balanced tree) is characterized by accessing keys that are very different from each other, at least until the traversal zooms on a small range of adjacent keys. When searching for a certain key, most comparisons are thus likely to determine the result (bigger or smaller) quickly.

Many string implementations offer optimizations that are useful in the general case, but are of little help on the particular case of key lookup: while key comparisons are very likely to result in a quick result due to the high variance, the strings are still compared character by character. This might incur several overheads:
1. A call to library functions such as strcmp
2. Dereferencing a character array pointer, possibly yielding cache misses and eviction of useful data in order to cache the array
3. Entering a character-by-character comparison loop, which include conditional branches that are likely to be mispredicted at least once.

To avoid the overheads of likely-short comparisons, the prefix of the string can be turned into a number, which will be stored in the string object. Comparisons should then first compare prefix numbers, and resort to a full comparison only if the two prefixes match. When key variance is high as described, all the overheads described above will be saved, along with the associated instructions.

Project description
The goal of the project is to evaluate the realistic gain of fast key prefix comparison. Implementing the described optimization is not free: the prefix number occupies space, and when prefixes match, comparing them is pure overhead. Given the trade-of between gain and lost, applying the optimization on a real database, storing real datasets, can reveal the actual benefit (if any).

In the proposed project, students will first apply the optimization on an open source C/C++ database. They will then evaluate the performance of the optimized database, focusing on the overheads described above. Lastly, they will further tune the optimization in light of the measured overheads, trying to make it efficient as possible.

Preliminary requirements
1. Reasonable programming skills in C or C++ - not a lot of code writing is expected, but the databases to be optimized include a non-trivial amount of code
2. Reasonable understanding of hardware mechanisms such as caches and predictors - the project will evaluate the overheads induced when the software misuses them.
3. Reasonable technical maturity - project work includes understanding industrial code, running benchmarks, using profiling tools and analyzing the results.

Prerequisites: Any programming language with ability to learn new concepts quickly