Ternary Search Tree


The ternary search tree (TST) is a 3-way tree. It can find all keys having a given prefix, suffix, or infix, and even keys that closely match a given pattern. You can easily search the tree for partial matches, and you can implement near-match functions, which let you suggest alternatives for misspelled words.

A TST stores key-value pairs, where keys are strings and values are objects. TST keys are stored and retrieved in sorted order, regardless of the order in which they are inserted into the tree. In addition, TSTs use memory efficiently to store large quantities of data. Best of all, the ternary search tree is lightning fast. The tremendous flexibility of TSTs provides ample opportunity for programming creatively.

A ternary search tree is a k-ary tree with k = 3, and searching it is a divide-and-conquer process: each character comparison discards the subtrees that cannot contain the key.

A TST is a 3-way tree in which each node holds a single character: the left subtree holds keys whose current character is less than the node’s character, the middle subtree holds keys whose current character is equal to it (matching then continues with the key’s next character), and the right subtree holds keys whose current character is greater.
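In code, a single TST node might look something like this minimal Java sketch (the class and field names are illustrative, not taken from any particular library):

```java
// A minimal sketch of a single TST node (class and field names are illustrative):
class TSTNode<V> {
    char c;            // the one character stored at this node
    V value;           // set only if a key ends at this node
    TSTNode<V> left;   // keys whose current character is less than c
    TSTNode<V> mid;    // keys whose current character equals c (continue with the next character)
    TSTNode<V> right;  // keys whose current character is greater than c
}
```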

A trie (which is officially pronounced ‘tree’ since it’s from the word retrieval, but practically everyone says ‘try’) is a data structure that stores not whole strings, but the characters in strings.
But a trie has a major problem: it requires lots of nodes. If we need to support all 26 English letters, each node has to store 26 pointers. And if we also need to support international characters, punctuation, or distinguish between lowercase and uppercase characters, the memory usage quickly becomes untenable.
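To see where the memory goes, here is a rough sketch of a classic array-based trie node restricted to lowercase English letters; in practice most of the 26 child slots stay null, and widening the alphabet multiplies the waste:

```java
// A classic array-based trie node, restricted to lowercase 'a'..'z'.
// Every node carries 26 child references, most of which stay null in practice.
class TrieNode<V> {
    static final int RADIX = 26;                   // lowercase English letters only
    V value;                                       // set only if a key ends at this node
    @SuppressWarnings("unchecked")
    TrieNode<V>[] children = new TrieNode[RADIX];  // one slot per possible next character
}
```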

To solve the problem of excessive per-node storage, we could consider using a different data structure in each node, such as a hash map. However, managing thousands and thousands of hash maps is generally not a good idea, and there are other problems as well.

In a hash table, we ‘hash’ (or reduce) the name we’re looking for into an integer value, and use that number to index into an array. This is a very fast method, but we lose all information about the name in doing so, and we can’t then ‘see’ the names in sorted order.

Another problem with hash tables is what happens when two names hash to the same value (known as a ‘collision’). There are several strategies for dealing with collisions, grouped into two main types: ‘probing’ and ‘chaining’. In either case, we end up falling back on a sequential set of comparisons to resolve the clash. And if the hash table gets too heavily loaded, search speed suffers and lookups degenerate into sequential scans.
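For comparison, a toy chaining hash table might look like the sketch below (the class name, the fixed bucket count of 16, and the use of String keys are my own assumptions for illustration); note how the key is reduced to an index and how a collision falls back on sequential comparisons:

```java
import java.util.LinkedList;

// A toy chaining hash table: the key is "hashed" down to a bucket index, so all
// ordering information is lost, and colliding keys degrade to a sequential scan.
class ToyChainedMap<V> {
    private static final int BUCKETS = 16;  // arbitrary fixed size for this sketch

    private static class Entry<V> {
        final String key;
        V value;
        Entry(String key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    private final LinkedList<Entry<V>>[] table = new LinkedList[BUCKETS];

    private int indexFor(String key) {
        return (key.hashCode() & 0x7fffffff) % BUCKETS;  // reduce the name to an array index
    }

    public void put(String key, V value) {
        int i = indexFor(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        for (Entry<V> e : table[i]) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        table[i].add(new Entry<>(key, value));  // collision: chain onto the bucket's list
    }

    public V get(String key) {
        LinkedList<Entry<V>> chain = table[indexFor(key)];
        if (chain == null) return null;
        for (Entry<V> e : chain) {              // collision handling: sequential comparisons
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```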

This is where a ternary search tree can help.

A ternary search tree solves the memory problem of tries in a more clever way. To avoid the memory occupied by unnecessary null pointers, each trie node is represented as a tree-within-a-tree rather than as an array: each non-null pointer in the trie node gets its own node in the ternary search tree.

Unlike a binary search tree, a ternary search tree doesn’t store a complete string at each node. Instead, it stores a single character, just like a trie. Traversing the tree is then a process of picking off characters from the search string: for each character, we compare it against the character in the current node. If it’s less than the node’s character, we take the left child; if it’s equal, the middle child; if it’s greater, the right child.
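Putting this together, a minimal TST with insert and lookup might look like the following sketch (class and method names are my own; keys are assumed to be non-empty strings):

```java
// A minimal ternary-search-tree sketch: each node holds one character plus
// left/mid/right links, and a value is attached to the node where its key ends.
class TST<V> {
    private Node root;

    private class Node {
        char c;                  // the character stored at this node
        Node left, mid, right;   // less-than, equal (next character), greater-than
        V value;                 // non-null only if a key ends here
    }

    public void put(String key, V value) {
        root = put(root, key, value, 0);
    }

    private Node put(Node node, String key, V value, int d) {
        char c = key.charAt(d);
        if (node == null) { node = new Node(); node.c = c; }  // create nodes only for characters actually seen
        if      (c < node.c)           node.left  = put(node.left,  key, value, d);
        else if (c > node.c)           node.right = put(node.right, key, value, d);
        else if (d < key.length() - 1) node.mid   = put(node.mid,   key, value, d + 1);
        else                           node.value = value;    // last character of the key: store the value
        return node;
    }

    public V get(String key) {
        Node node = get(root, key, 0);
        return node == null ? null : node.value;
    }

    private Node get(Node node, String key, int d) {
        if (node == null) return null;
        char c = key.charAt(d);
        if      (c < node.c)           return get(node.left,  key, d);      // less: go left, same character
        else if (c > node.c)           return get(node.right, key, d);      // greater: go right, same character
        else if (d < key.length() - 1) return get(node.mid,   key, d + 1);  // equal: go down, next character
        else                           return node;                         // equal and last character: found
    }
}
```

Notice that put creates a node only for characters that actually occur in some key, which is exactly where the memory savings over an array-based trie come from.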

Ternary search trees have some nice properties: the tree can be traversed in sorted order, partial (wildcard) matches can be implemented, and all keys within a given distance of a target can be retrieved. The storage requirements are higher than for a binary search tree but much lower than for a trie. Performance is comparable with a hash table, and a TST sometimes outperforms one; in particular, it can usually determine a miss faster than a hash table can.
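To illustrate the first two properties, the two hypothetical methods below could be added to the TST sketch above: an in-order style collection that yields keys in sorted (lexicographic) order, and a prefix query of the kind an autocomplete feature would use (the prefix is assumed to be non-empty):

```java
// Hypothetical additions to the TST sketch above. Visiting left, then mid, then right
// yields keys in lexicographic order; starting from the node that matches a prefix
// gives an autocomplete-style prefix query.

public Iterable<String> keysWithPrefix(String prefix) {
    java.util.List<String> results = new java.util.ArrayList<>();
    Node node = get(root, prefix, 0);               // node for the last character of the prefix
    if (node == null) return results;
    if (node.value != null) results.add(prefix);    // the prefix itself is a stored key
    collect(node.mid, new StringBuilder(prefix), results);
    return results;
}

private void collect(Node node, StringBuilder prefix, java.util.List<String> results) {
    if (node == null) return;
    collect(node.left, prefix, results);                      // smaller characters first
    prefix.append(node.c);
    if (node.value != null) results.add(prefix.toString());   // a key ends at this node
    collect(node.mid, prefix, results);                       // longer keys sharing this character
    prefix.deleteCharAt(prefix.length() - 1);
    collect(node.right, prefix, results);                     // larger characters last
}
```

A near-match query can be built in the same recursive style by allowing a bounded number of character mismatches while descending.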

I’ll write a follow-up post with an implementation of a ternary search tree for autocomplete soon.

One Comment

  1. Trevor Brown
    Posted February 1, 2012 at 12:31 pm | Permalink | Reply

    You could also just use an extensible array or linked list at each node… I suspect that the indirection of trees within trees will cause a lot of cache misses that will completely dominate the running time. That is, I suspect a trie with expandable arrays (with linear insertion/traversal time) at each node will still outperform the trie with 3-ary trees at each node, since the array memory will be contiguous and prefetched into the cache, and the 3-ary tree’s memory is most assuredly likely non-contiguous, which means it will cause cache misses. (One cache miss when dereferencing a pointer is estimated to cost over 1,000 cycles on modern hardware–which means doing this for several tree pointers will likely cost much more than manually reading the entire expandable array to determine which single pointer to follow.)
