Bring your tuxedo. It's a formal affair.

Level: Advanced Theory

This article describes the techniques of recognizing regular expressions. This will describe how to build non-deterministic finite state automata (NFA), convert them to deterministic finite state automata (DFA), minimize the DFA, and generate tables to be used to parse input.

If you want to learn more about these techniques, and to see a complete example of these concepts applied, I recommend: Compiler Design in C by Allen I. Holub. Particularly, chapter 2, "Input and Lexical Analysis."

Using regular expressions adds a level of formalism to your code which is unattainable using trial and error, write and debug, development. This formalism is a good friend to have when you're dancing with the deceptive mistress called Entropy. Sooner or later, some input will come along and give you problems. However, by building upon a strong theoretical foundation, you can rest assured that, once the theory is properly implemented, your program will behave correctly for all input.

The theory of interest here is the relationship between regular expressions and finite state machines. In 1956, Stephen Kleen proved that any regular set, defined by a regular expression, can be recognized by some finite state machine. The technique to convert a regular expression into an NFA was conceived by Ken Thomson of Bell Labs for the QED editor.

Review of Finite State Automata

A finite state machine can be defined somewhat formally as:

A set of states, S, one of which is the starting state
A set of input symbols, I
A set of output actions (or accepts)
A set of transitions which indicate that on input i(x) applied to state s(m) results in a transition to state s(n)

An FSA can be shown as a graph, where the states are shown as labeled nodes, and transitions from one state to another on some input symbol is shown as an arc from one node to another, with the arc labeled with the input symbol. Accept states are shown with a bold circle. The following graph represents an FSA which recognizes the word bit or the word bet.

Deterministic vs. Non-deterministic FSA

A Non-deterministic FSA can have multiple arcs with the same label, and thus the result of a transition is not a single state, but a set of states. A deterministic FSA will go to exactly one node, given any input symbol. Examples below will demonstrate this.

Regular Expressions

Regular expressions let us define classes of input strings. The rules can be described as:

Any single symbol in a language is itself a regular expression.
If A and B are regular expressions, so are AB, (A|B), and B*.

Concatenation, (AB), is an operation that combines two regular expressions into a new regular expression.

Union, (A|B), is an operation that defines a new set to consist of all the union of two or more sets.

Closure, B*, is an operation that allows a regular expression to be repeated zero or more times.

For example, the expression: Hello defines a class with exactly one item, the word Hello. We can also define the set of all strings that begin with the letter A followed by one or more Bs. This requires the definition of an operator: *, which indicates that the previous expression or sub expression may be repeated zero or more times. Thus the expression is: AB*.

It is useful to define a few more operators for practical use. One possible set is:

?: The proceeding regular expression is optional: colou?r
*: Repeat the preceding regular expression zero or more times: 10*.00
+: Repeat the preceding regular expression one or more times: Wow!+
( ): Grouping to create a regular expression comprised of sub expressions: ba(na)+
|: Accept any of a group of regular expressions: l(a|i|o|u)ck
\: Escape, to remove a character's special meaning as an operator: Who\?

Many regular expression compilers, such as lexx, or the one built into the UNIX grep utility, may define other operators, but they only provide an easier notation for complex expressions. The operators shown above are sufficient to demonstrate the theory.

Thompson's Construction - Building the NFA

Thompson's construction is named for Ken Thompson, mentioned above. It uses a simple construction for each fundamental regular expression operator. These structures can be nested in a way which parallels the nesting of the regular expression itself.

The simplest construction is one which recognizes a single character. It has only a start and end state.

A union of two structure looks like this:

In practice, an arc actually can hold a set of characters. This is more efficient than using a union structure for individual characters. It might be useful, for example, to allow a transition on any lowercase character, or any numeric digit, or simply any character.

There is also a new symbol, epsilon, which represents an immediate transition. If you enter state s(0) you are actually entering the state set {s(0), s(1), s(3)}. An entry into a state with an epsilon transition is the same as entering into every state reachable through all subsequent epsilon transitions.

Concatenation is constructed by linking two structures together with an epsilon transition. Actually, this wastes states and so the second form is actually used.

The following diagram shows the construction for the *, +, and ? operators.

To demonstrate how these work together, here is the diagram for the regular expression ((ch|r)an?t)+.

Subset Construction - Converting the NFA to the DFA

A deterministic finite state automaton, or DFA, has no epsilon transitions and all state transitions result in a single state, not a set of states rather than a set of states.

Converting from an NFA to DFA takes advantage of the fact that all states connected by epsilon transitions all represent one logical state. The set of all states reachable from an input set through epsilon transitions is called the epsilon-closure set for the input set.

The epsilon-closure sets for each node can be computed after the NFA is complete. For example, in NFA above, the epsilon-closure set for state s(11) is {s(1), s(3), s(6), s(11) }. Notice that a node is always in its own epsilon-closure set. Nodes that are accepting states will be show as bold. The epsilon-closure sets for all the states are shown below.

State   epsilon-closure
----------------------------------
  1      {s(1)}
  2      {s(2)}
  3      {s(3)}
  4      {s(4), s(7)}
  5      {s(5), s(7)}
  6      {s(1), s(3), s(6)}
  7      {s(7)}
  8      {s(8)}
  9      {s(8), s(9), s(12)}
 10      {s(10), s(12)}
 11      {s(1), s(3), s(6), s(11)}
 12      {s(12)}
 13      {s(6), s(13), s(14)}
 14      {s(14)}

The epsilon-closure function is actually applied to a set of states. The epsilon-closure for a set of nodes can be computed by creating a union of the epsilon-closure sets of all the individual nodes in the input set. In other words,

epsilon-closure({s(1), s(2), ...s(n)}) = epsilon-closure({s(1)}) U epsilon-closure({s(2)}) U ... U epsilon-closure({s(n)})

For example:

epsilon-closure({s(4), s(9)}) = {s(4), s(7), s(8), s(9), s(12)}

Another function needed to create the DFA, which also operates on a set of states, is called next. It computes the next states reachable on a particular input symbol from all the states in the given set.

For example:

next({s(6), s(11)}, c) = {s(2)}

The DFA is built by computing S(n)= epsilon-closure(next(S(m), i)), for each i in the input set on a given set S(m). Each new unique set created this way becomes a state in the DFA. If any NFA state within this set is an accepting state, the resulting DFA state will also be an accepting state.

The first state of the DFA is always the epsilon-closure of the first state.

epsilon-closure({s(11)}) = {s(1), s(3), s(6),s(11) } = d(1)

Now, for each input symbol, compute epsilon-closure(next(S(n), I))

epsilon-closure(next({s(1), s(3), s(6), s(11) }, c)) = {s(2)} = d(2)
epsilon-closure(next({s(1), s(3), s(6), s(11) }, h)) = { }
epsilon-closure(next({s(1), s(3), s(6), s(11) }, r)) = s(5), s(7)} = d(3)
epsilon-closure(next({s(1), s(3), s(6), s(11) }, a)) = { }
epsilon-closure(next({s(1), s(3), s(6), s(11) }, n)) = { }
epsilon-closure(next({s(1), s(3), s(6), s(11) }, t)) = { }

Let's examine the derivation of state d(3) in a little closer detail. The next function searches each state in its input set for the given transition, in this case, "r." The only state which has a transition on "r" is s(3) and its next state is s(5). The epsilon-closure set of s(5) is {s(5), s(7)}.

We can now examine the transitions out of state d(2), or in other words, epsilon-closure(next({s(2)}, i)) for each valid input symbol. From now on, the input symbols that result in the empty set will not be shown.

epsilon-closure(next({s(3)}, h)) = {s(4), s(7)} = d(4)

From d(3),

epsilon-closure(next({s(5), s(7)}, a)) = {s(8), s(9), s(12)} = d(5)

And from d(4),

epsilon-closure(next({s(4), s(7)}, a)) = {s(8), s(9), s(12)} = existing state d(5)

There are two valid input symbols for state d(5): "n", which is the transition character for s(8), and "t", which is the transition character for s(12).

epsilon-closure(next({s(8, )s(9), s(11)}, n)) = {s(10), s(12)} = d(6)
epsilon-closure(next({s(8), s(9), s(11)}, t)) = {s(1), s(3), s(6), s(13),s(14)} = d(7) (accepting state)

For state d(6) there is only one transition, on "t."

epsilon-closure(next({s(10), s(11)}, t)) = {s(1), s(3), s(6), s(13), s(14)} = existing state d(7)

Finally, d(7) gives us

epsilon-closure(next({s(1), s(3), s(6), s(13), s(14)}, c)) = {s(2)} = existing state d(2)
epsilon-closure(next({s(1), s(3), s(6), s(13), s(14)}, r)) = {s(5), s(7)} = existing state d(3)

We are finished because no new intermediate state sets have been created. This gives us the following DFA.

State Nfa States                   c    h    r    a    n    t   Accepting
-------------------------------------------------------------------------
d(1)  {s(1),s(3),s(6),s(11)}       d(2) -    d(3) -    -    -      no
d(2)  {s(2)}                       -    d(4) -    -    -    -      no
d(3)  {s(5), s(7)}                 -    -    -    d(5) -    -      no
d(4)  {s(4), s(7)}                 -    -    -    d(5) -    -      no
d(5)  {s(8), s(9), s(12)}          -    -    -    -    d(6) d(7)   no
d(6)  {s(10), s(12)}               -    -    -    -    -    d(7)   no
d(7)  {s(1),s(3),s(6),s(13),s(14)} d(1) -    d(2) -    -    -     yes

DFA Minimization

Once the DFA has been created, it can be minimized by removing redundant states. In figure 2, it can be easily seen that states d(3) and d(4) are identical. They are identical because their accepting conditions and transitions all match. However, minimization is not restricted only to finding identical states.

We want to create sets of DFA states such that all the states in set m(x) will transition to one of the states in m(y) for any given input character. For example, consider the very simple DFA:

State    A      B     Accept
----------------------------
 d(1)   d(3)    -       no
 d(2)   d(4)    -       no
 d(3)   d(2)   d(4)    yes
 d(4)   d(2)   d(3)    yes

We can create the following two sets:

m(1) = {d(1), d(2)}
m(2) = {d(3), d(4)}

Both states d(1) and d(2) are not accepting states. On input "A," they both transition to one of the states in m(2). Also, both have no valid transition on input character "B." Likewise, all states within m(2) transition to states within the same set on any given input. On input "A", both states d(3) and d(4) transition to states within m(1). On input "B," all states transition to states within m(2).

The minimization algorithm seeks to find such sets. It begins with one set for each unique accept condition. All non-accepts states are placed into a single set. Then, each set is examined with the following algorithm:

Create a new empty set, M
D(start )= First state in the set
D(next) = The next state in the set
While there is a D(next)
	{
	For each input character I(n)
		{
		D(x) = D(start)[I(n)]
		D(y) = D(next )[I(n)]
		If D(x) and D(y) are in different groups
			Move D(next) from the current group
			into the new group	
		}
	D(next) = the next state in the set
	}
If M != {}
	Add M
Else
	Discard M

Once all the new m states are created, it is important to go back to the beginning and continue until all m states are processed without creating any new ones.

To illustrate this, let's add a new expression to our previous example:

( ((ch|r)an?t)+ | rap )

The NFA shows that the two expressions are combined with an or construction and that there are now two accepting states.

The DFA looks like this.

Minimization begins by creating a minimized node for each different accept state. In the above DFA, there are two accepts, states d(8) and d(9). All the other nodes are not accepting states.

m(1) = {d(1), d(2), d(3), d(4), d(5), d(6), d(7), d(10)}
m(2) = {d(8)}
m(3) = {d(9)}

The Minimized DFA states are then handled one at a time. The first DFA state within the set becomes the representative of that set. Any states that can be distinguished from the first state are removed and placed into a new set which is added to the end.

Comparing d(1) with d(2) we see on the very first character that they are different: d(1) recognizes "c," but d(2) does not. In fact, no other state has a transition on both "c" and "d" to a non-transition state, and so all states but the first are removed and added to a new minimized state.

This leaves us with:

m(1) = {d(1)}
m(2) = {d(8)}
m(3) = {d(9)}
m(4) = {d(2), d(3), d(4), d(5), d(6), d(7), d(10)}

Both m(2) and m(3) have a single state each and cannot be further minimized.

With m(4), state d(2) becomes the representative. Because it is the only state with a transition on "h," it is is the only state left in m(4) and

m5 = {d(3), d(4), d(5), d(6), d(7), d(10)}

From this set, d(3), d(4), and d(10) are all similar. They all have a single transition on "a" to one of the states in m(5). d(3)[c] and d(10)[c] both transition to d(6) which is an element of m(5). d(4)[c] transitions to d(5) which is an element of m(5). This gives us

m(6) = {d(5), d(6), d(7)}

Following to the end, we will see that each of the following three are mutually exclusive leaving us with the new minimized states:

m(6) = {d(5)}
m(7) = {d(6)}
m(8) = {d(7)}

We're not done yet, however. Notice that d(3), d(4), and d(10) no longer result in a transition to the same set. They all transition on input character "a," however, d(3) goes to d(5) which is an element of m(6) while d(4) and d(10) go to d(6) which is an element of m(7). And so, the list is processed again. This time d(4) and d(10) are moved to a new minimized state m(9). If we did not do this, the we could mistakenly recognize the word "chap."

The list will be processed a third time, which will result in no new states, leaving us with the final result:

m(1) = {d(1)}
m(2) = {d(8)}
m(3) = {d(9)}
m(4) = {d(2)}
m(5) = {d(3)}
m(6) = {d(5)}
m(7) = {d(6)}
m(8) = {d(7)}
m(9) = {d(4), d(10)}

The first item in each set must be kept, but the rest can be thrown away along with the minimized states that were used for the algorithm. Once state d(10) is removed, we have our final minimized DFA.

And the final state table looks like this:

State   c    h    r    a    n    t    p   Accepting
----------------------------------------------
m(1)    m(2) -    m(3) -    -    -    -      no
m(2)    -    m(4) -    -    -    -    -      no
m(3)    -    -    -    m(5) -    -    -      no
m(4)    -    -    -    m(6) -    -    -      no
m(5)    -    -    -    -    m(7) m(8) m(9)   no
m(6)    -    -    -    -    m(7) m(8) -      no
m(7)    -    -    -    -    -    m(8) -      no
m(8)    m(2) -    m(4) -    -    -    -     yes
m(9)    -    -    -    -    -    -    -     yes

Try stepping through the table with the inputs: chant, rat, rap and ratchant. Implementation of the theory is left as an exercise ;-)

:^D