Back to index

A suffix array is a sorted list of all the suffixes of a string. All string comparisons are done lexicographically, so compare the first characters, and if they're equal go to the next character. Instead of storing the entire suffix as a sequence of characters, we store just the beginning index. This is useful when looking for all the occurrences of a substring, since all suffixes beginning with the substring end up next to each other in the suffix array, and we can binary search to find them.

String: "bcababacc"
Suffix array: 9 2 4 6 3 5 0 8 1 7
                a a a b b b c c c
              ^ b b c a a c   a c
              e a a c b c a   b
              m b c   a c b   a
              p a c   c   a   b
              t c     c   b   a
              y c         a   c
                          c   c
              v           c
      
There are more applications of suffix arrays, but I'm going to write about the construction of suffix arrays, specifically the SA-IS algorithm, and how my awful implementation works.

I read through this set of stanford lecture notes a whole bunch of times until I think I can finally say that I understand what's going on, and hopefully I can explain it well.

To begin, we want to have a sentinel character at the end of the string. It makes life easier later on. The sentinel is defined to be the smallest character, and it is unique. We'll use '$' for this string since its ASCII value 0x24 is smaller than all the alphanumeric characters, it happens to be the regex end of line character, and it's what's used in the original SA-IS paper.

We assign types to the suffixes, where a suffix can be either S-type or L-type. A suffix starting at index i is S-type if it is the sentinel or it is smaller than the suffix starting at index i + 1. A suffix starting at index i is L-type if it is larger than the suffix starting at index i + 1. We want to build up an array types storing the type of each suffix.

Let's look at a way of implementing the function is_l_type

bool is_l_type(char *str, int i) {
  if (str[i] == '$') return false;
  if (str[i] < str[i + 1]) return false;
  if (str[i] > str[i + 1]) return true;
  return is_l_type(str, i + 1);
}
      
If we run types[i] = is_l_type(str, i) for each index i, we'd have O(n^2). But notice that is_l_type(str, i) depends on the output of is_l_type(str, i + 1). If we store types[i + 1] = is_l_type(str, i + 1), then when calculating is_l_type(str, i), we already have the result of is_l_type(str, index + 1) stored in types[i + 1]. So we can begin at the end of the array, starting with the sentinel as S-type, and iterate backwards to get all the types in O(n).

This is example of dynamic programming, a pretty simple example but an example none the less.

String: "bcababacc$"
Types:   SLSLSLSLLS
  1   3   5   7
 S L S L S L S L
0   2   4   6   8
                 L
                  9
    

Now that we have the types of the suffixes, we pick out Left-Most-S-type (LMS) suffixes, which are S-type suffixes that are preceded by an L-type suffix.

  1   3   5   7
 S L S L S L S L
0   2   4   6   8
    ^   ^   ^    L
                  9
                  ^
    

What's cool about these suffixes is that if we have them sorted, then we can figure out the order of the rest of the suffixes in O(n) time. This procedure is called an Induced Sort: that's what the "IS" in the algorithm name refers to. It works by something similar to a bucket sort/k-way merge.

What's also cool is that every time I try to explain this part of the algorithm, I confuse myself and anyone trying to listen along. So for now this is approximately where this article is going to trail off. Maybe some another time, I'll come back to writing this.

We begin by inducing the order of the L-type suffixes:

  _   _   _   _
 |1\ |3\ |5\ |7\
 S\L\S\L\S\L\S\L\
0  \2| \4| \6| \8\
    ^   ^   ^   \L\
                 \9|
                  ^
      
We start with all the lists we want to merge in a priority queue:
String: bcababacc$
7>8>9: $
  1>2: ababacc$
  3>4: abacc$
  5>6: acc$
Suffix array: []
      
We take list with the smallest suffix, and append it to the suffix array. Then, we get the next suffix in the list, and put it back into the priority queue.