Introducing TrieHash, an order-preserving minimal perfect hash function generator for C(++)


I introduce TrieHash, an algorithm for constructing perfect hash functions from tries. The generated hash functions are pure C code; they are minimal, order-preserving, and outperform existing alternatives. Together with the generated header files, they can also be used as a generic string-to-enumeration mapper (the enums are created by the tool).


APT (and dpkg) spend a lot of time parsing various files, especially Packages files. APT currently uses a function called AlphaHash, which hashes the last 8 bytes of a word in a case-insensitive manner, to hash the fields in those files (dpkg just compares strings in an array of structs).

There is one obvious drawback to using a normal hash function: when we want to access the data in the hash table, we have to hash the key again, so every accessed key is hashed at least twice. It turned out that this accounts for something like 5 to 10% of the cache generation performance.

Enter perfect hash functions: A perfect hash function maps a set of words to distinct constant values without collisions. You can thus use the result directly as an index into your hash table, and do not have to hash again (if you generate the function at compile time and store the key constants) or handle collision resolution.
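To make this concrete, here is a minimal sketch of the idea (the names and the hand-written stand-in for the generated hash are mine, not APT's actual code): the value returned by the perfect hash doubles as the table index, so storing and looking up a field needs no second hash and no collision handling.

#include <stddef.h>
#include <string.h>

/* Hand-written stand-in for a generated perfect hash (illustrative only). */
enum Field { Package, Version, Unknown = -1 };

static enum Field PerfectHash(const char *key, size_t length)
{
    if (length == 7 && memcmp(key, "Package", 7) == 0) return Package;
    if (length == 7 && memcmp(key, "Version", 7) == 0) return Version;
    return Unknown;
}

static const char *values[2];   /* one slot per dictionary word, no collisions */

static void store(const char *key, size_t length, const char *value)
{
    enum Field f = PerfectHash(key, length);
    if (f != Unknown)
        values[f] = value;      /* the hash value is the index, nothing is rehashed */
}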

As the #debian-apt people know, I happened to play around a bit with tries this week before guillem suggested perfect hashing. Let me tell you one thing: my trie implementation was very naive and did not really improve things a lot…

Enter TrieHash

Now, how is this related to hashing? The answer is simple: I wrote a perfect hash function generator that is based on tries. You give it a list of words, it puts them in a trie and generates C code out of it, using recursive switch statements (see code generation below). The function achieves competitive performance with other hash functions; it usually even outperforms them.

Given a dictionary, it generates an enumeration (a C enum or C++ enum class) of all words in the dictionary, with the values corresponding to the order in the dictionary (the order-preserving property), and a function mapping strings to members of that enumeration.

By default, the first word is considered to be 0 and each word increases a counter by one (that is, it generates a minimal hash function). You can tweak that, however:

= 0
WordLabel ~ Word
OtherWord = 9

will return 0 for an unknown value, map “Word” to the enum member WordLabel, and map OtherWord to 9. That is, the input list functions like the body of a C enumeration. If no label is specified for a word, it will be generated from the word. For more details, see the documentation.
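For illustration, that example corresponds roughly to the following C enumeration body (I am guessing the name of the unknown member and the implicit value of WordLabel; the tool's exact output may differ):

enum Example {
    Unknown = 0,      /* "= 0": the value returned for unknown strings */
    WordLabel,        /* "WordLabel ~ Word": the string "Word" maps here */
    OtherWord = 9,    /* "OtherWord = 9": explicit value, just like in an enum body */
};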

C code generation

switch(string[0] | 32) {
case 't':
    switch(string[1] | 32) {
    case 'a':
        switch(string[2] | 32) {
        case 'g':
            return Tag;
        }
    }
}
return Unknown;

Yes, really recursive switches – they directly represent the trie. It is not quite a straightforward translation, though; there are some optimisations that make the whole thing faster and easier to look at:

First of all, the 32 you see is used to make the check case-insensitive when all cases of the switch body are alphabetical characters. If there are non-alphabetical characters, the generator emits two cases per character, one uppercase and one lowercase (sharing one break). I did not know before that lowercase and uppercase characters differ by only one bit; thanks to the clang compiler for pointing that out in its generated assembler code!
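The trick is easy to verify with a tiny self-contained check (this is not generated code, just an illustration):

#include <assert.h>

int main(void)
{
    /* ASCII upper and lower case letters differ only in bit 5 (0x20 == 32),
     * so OR-ing with 32 folds both cases to lower case. */
    assert(('T' | 32) == 't');
    assert(('t' | 32) == 't');
    /* For non-alphabetical bytes the fold is ambiguous, e.g. '\r' | 32 == '-',
     * which is why such positions get an upper and a lower case label instead. */
    assert(('\r' | 32) == '-');
    return 0;
}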

Secondly, we insert breaks only between cases. Initially, each case ended with a return Unknown, but guillem (the dpkg developer) suggested it might be faster to let them fall through where possible. It turned out not to be faster on a good compiler, but it is still more readable anyway.

Finally, we build one trie per word length and switch on the word length first. Like the 32 trick, this gives a huge improvement in performance.
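A sketch of that outer dispatch (the enum and function names are illustrative, not the exact generated code):

#include <stddef.h>

enum Field { Tag, Unknown = -1 };

/* Inner function for one key length: the 3-byte trie shown above. */
static enum Field PerfectHash3(const char *string)
{
    switch (string[0] | 32) {
    case 't':
        switch (string[1] | 32) {
        case 'a':
            switch (string[2] | 32) {
            case 'g':
                return Tag;
            }
        }
    }
    return Unknown;
}

/* The outer function switches on the key length first, so each
 * per-length trie only ever compares keys of one length. */
static enum Field PerfectHash(const char *string, size_t length)
{
    switch (length) {
    case 3:
        return PerfectHash3(string);
    /* ... one case per key length in the dictionary ... */
    default:
        return Unknown;
    }
}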

Digging into the assembler code

The whole code translates to roughly 4 instructions per byte:

  1. a memory load,
  2. an or with 32,
  3. a comparison, and
  4. a conditional jump.

(On x86, the case sensitive version actually only has a cmp-with-memory and a conditional jump).

On some architectures this may be one instruction more: an unneeded zero-extend-byte instruction is inserted – this causes a 20% performance loss.

Performance evaluation

I ran the hash against all 82 words understood by APT in Packages and Sources files, 1,000,000 times for each word, and summed up the average run-times:

host      arch     Trie   TrieCase  GPerfCase  GPerf   DJB
plummer   ppc64el   540        601       1914   2000  1345
eller     mipsel   4728       5255      12018   7837  4087
asachi    arm64    1000       1603       4333   2401  1625
asachi    armhf    1230       1350       5593   5002  1784
barriere  amd64     689        950       3218   1982  1776
x230      amd64     465        504       1200    837   693

Suffice to say, GPerf does not really come close.

All hosts except the x230 are Debian porterboxes. The x230 is my laptop with a Core i5-3320M, and barriere has an Opteron 23xx. I included the DJB hash function as another reference.

Source code

The generator is written in Perl, licensed under the MIT license and available from – I initially prototyped it in Python, but guillem complained that this would add new build dependencies to dpkg, so I rewrote it in Perl.

Benchmark is available from


See the script for POD documentation.

5 thoughts on “Introducing TrieHash, an order-preserving minimal perfect hash function generator for C(++)”

  1. This looks awesome.

    Have you tried generating a version that does word-at-a-time comparisons? You could load 4 or 8 bytes, bitwise-and with 0xDFDFDFDF or 0xDFDFDFDFDFDFDFDF, and compare. In some cases, that would produce smaller and faster code.
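    For instance, something like this sketch (my own illustration, untested against TrieHash; it assumes a little-endian machine, that the first four bytes of the key are letters, and the function name is made up):

    #include <stdint.h>
    #include <string.h>

    /* Compare four bytes at once, case-insensitively: AND-ing with 0xDF clears
     * the lower-case bit (0x20) in each byte; memcpy avoids unaligned-read UB. */
    static int starts_with_size(const char *s)
    {
        uint32_t word;
        memcpy(&word, s, sizeof word);
        return (word & 0xDFDFDFDFu) == 0x455A4953u;   /* "SIZE" on little-endian */
    }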

    1. I tried 2 bytes, but did not see a huge gain, except on aarch64 where it got rid of the zero-extend bug. I’m thinking about potentially starting to chunk keys into n * 8-byte, (0 to 1) 4-byte, (0 to 1) 2-byte, and (0 to 1) 1-byte chunks, and seeing how that performs.

      One problem with doing this in chunks is that it might end up doing unaligned reads, which means the code has undefined behavior and might just error out (SIGBUS or something?) on devices that don’t support them, like old ARM devices IIRC. Unaligned reads also seem to perform worse than normal reads on modern Intel CPUs, so I am not sure how much that optimizes there (I think the 2-byte version actually performed worse, but it was a bit broken – some cases were missing – so the result might be skewed).

      Another problem is complexity in the generator: I can only do the bit-masking optimization for some characters, so if some of the keys include special characters that make the fold ambiguous (like minus with | 32, or \r with & 0xDF), it does not work and I’d have to generate multiple cases.

      The way the multi-byte stuff was supposed to work is to grab as many 2**n-byte chunks (n <= 3) from the key as possible and use them as the heads of the insertions into the trie. So if you insert a 15-byte string you end up with the following (a sketch of the chunking rule follows the list):

      root.insert(key=15 bytes)
      root.child.insert(7 bytes)
      root.child.child.insert(3 bytes)
      root.child.child.child.insert(1 byte)
      root.child.child.child.child.insert(0 bytes)
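      A sketch of that chunking rule (illustrative, not the generator code): greedily take the largest power-of-two chunk, at most 8 bytes, from what remains of the key.

      #include <stdio.h>

      int main(void)
      {
          unsigned remaining = 15;
          while (remaining > 0) {
              unsigned chunk = remaining >= 8 ? 8 :
                               remaining >= 4 ? 4 :
                               remaining >= 2 ? 2 : 1;
              printf("chunk of %u bytes, %u left\n", chunk, remaining - chunk);
              remaining -= chunk;   /* 15 -> 7 -> 3 -> 1 -> 0, chunks of 8, 4, 2, 1 */
          }
          return 0;
      }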

      Then, when building the switches, you can rely on all children in your trie having the same length: because it creates per-length tries, all keys have the same length and are thus chunked the same way.

      With special characters that breaks a bit: we'd have to split at special characters, but that also means we have to split any other child already inserted at the same position, so that the generator can emit proper case statements.

      1. OK, so it turns out I can just create my own multi-byte integer types with __attribute__((aligned(1))) and gcc does the magic for me on architectures not supporting unaligned reads.
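        Something along these lines (a sketch; the type and function names are made up, the attribute is GNU C specific, and strict-aliasing caveats are ignored here):

        #include <stdint.h>

        /* An integer type whose alignment requirement is 1, so the compiler
         * emits whatever load sequence the target needs for unaligned data. */
        typedef uint32_t unaligned_uint32 __attribute__((aligned(1)));

        static uint32_t load32(const char *p)
        {
            return *(const unaligned_uint32 *)p;
        }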

        So the only thing missing now is the handling of non-alphabetical characters, then it should be ready to go.

      2. All that said, I stayed up until 5am, but the multibyte branch now seems to work correctly. Splitting at special (ambiguous) characters was fairly easy: I just rebuild the entire (per-length) trie, splitting at the earliest needed character at each level. The tree is then fully normalized: all children have indexes of the same length.

        This can be further optimized, though: if all ambiguous characters are at the same position in all cases, I can just or with 0x00 instead of 0x20. Your suggestion of and-ing with 0xDF was interesting too; I think they basically both work the same way, yours just always clears the bit.
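        For illustration (a hypothetical fragment, not TrieHash output): in a subtrie where every key has '-' at, say, index 5, that byte can be compared exactly while the alphabetical positions keep the fold.

        /* All keys in this subtrie have '-' at index 5 (think of a key like
         * "Multi-Arch"), so that position needs no case folding ("| 0x00"),
         * while the neighbouring positions keep the "| 0x20" fold. */
        static int match_tail(const char *string)
        {
            switch (string[5]) {
            case '-':
                switch (string[6] | 0x20) {
                case 'a':
                    return 1;
                }
            }
            return 0;
        }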

        I’ll do a more in-depth look in my next blog post.
