Recognizing languages

When we are given a particular string, and are asked to determine whether or not it is in a particular language, we will proceed as follows:

Memory requirements: one interesting aspect of a language is the amount of information we have to remember about a string while we are trying to determine whether or not it is a valid member of the language.

For the (1101)* language we only need to remember one of the following five things at any given point in time:

  1. the substring seen so far (in the left-to-right examination) has been of the form (1101)* -- this includes our initial state, when we haven't seen any characters yet, so (1101)^0 (the empty string) would be a valid interpretation, or
  2. the substring seen so far has been of the form (1101)*1, so may still turn out to be valid, or
  3. the substring seen so far has been of the form (1101)*11, so may still turn out to be valid, or
  4. the substring seen so far has been of the form (1101)*110, so may still turn out to be valid, or
  5. the substring seen so far is not valid for the language, and no continuation can make it valid.

By remembering which of these five cases currently applies, we can look at the next character in the string and decide whether the substring seen so far is still potentially valid for the language.

Using these five potential states, we could design an abstract machine capable of recognizing whether or not a particular string was in the language.
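As a rough illustration of such a machine, here is a minimal Python sketch of a recognizer for (1101)*. The state numbers 0 to 4 correspond to the five cases listed above; the names START, DEAD, ACCEPT, TRANSITIONS and in_language are made up for this example.

```python
# States mirror the five cases in the list above (numbering is illustrative):
#   0: seen (1101)*       1: seen (1101)*1      2: seen (1101)*11
#   3: seen (1101)*110    4: the string can no longer be valid (dead state)
START, DEAD = 0, 4
ACCEPT = {0}

# TRANSITIONS[state][char] -> next state; any pair not listed leads to DEAD.
TRANSITIONS = {
    0: {"1": 1},
    1: {"1": 2},
    2: {"0": 3},
    3: {"1": 0},
}

def in_language(s: str) -> bool:
    """Return True if s is in (1101)*, scanning left to right."""
    state = START
    for ch in s:
        # Only the current state is remembered, never the characters seen.
        state = TRANSITIONS.get(state, {}).get(ch, DEAD)
    return state in ACCEPT

print(in_language(""))          # True: zero repetitions of 1101
print(in_language("11011101"))  # True: two repetitions
print(in_language("110"))       # False: stops in case 4 of the list
```

Note that the function stores a single small integer no matter how long the input is, which is the point of the five-case argument.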

We could, theoretically, design such a machine that had one state for every different string in the language, but for an infinite language that would require an infinite number of things to remember (i.e. an infinite number of states).

A useful property of regular languages is that they are precisely the set of languages that can be recognized by such an abstract machine with a finite number of states -- i.e. for every regular language we can design a recognizer without having to remember every string in the language.

The abstract recognizing machines we are discussing will be termed finite automata, and are the subject of our next section.


Finite Automata/State Machines

Definitions and theorems

Transition functions
We will use § to represent the transition function of a finite automaton, taking a state and an input character to a new state. Thus §: Q × ∑ -> Q

Acceptance and rejection
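Let machine M = (Q, ∑, q0, A, §) be a finite automaton, where Q is the set of states, ∑ the alphabet, q0 the start state, and A the set of accept states. A string x in ∑* is accepted by M if §*(q0, x) is in A, where §* extends § from single characters to whole strings.

(I.e. by repeatedly applying the transition function to the characters of x, starting from q0, we are finally left in an accept state.)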

The language accepted by M, or the language recognized by M, is the set L(M) = { x ∈ ∑* | x is accepted by M }

If L is any language over ∑, L is accepted, or recognized, by M if and only if L = L(M)

(I.e. to accept (or recognize) a language L, an FA must accept all the strings in L and reject all the strings in L', the complement of L.)
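The acceptance definition can be sketched in a few lines of Python. This is only an illustration: the function names delta_star and accepts, the dictionary encoding of §, and the two-state "even number of 1s" machine are all assumptions made for the example, not part of the notes.

```python
def delta_star(delta, q, x):
    """Extended transition function: apply delta to each character of x in turn."""
    for ch in x:
        q = delta[(q, ch)]
    return q

def accepts(delta, q0, accept_states, x):
    """x is accepted iff the extended transition function ends in an accept state."""
    return delta_star(delta, q0, x) in accept_states

# Hypothetical example machine: strings over {0, 1} with an even number of 1s.
delta = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

print(accepts(delta, "even", {"even"}, "1001"))  # True: two 1s
print(accepts(delta, "even", {"even"}, "1000"))  # False: one 1
```

On the empty string the loop body never runs, so delta_star returns the state it was given, matching the usual base case of the definition.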

Theorem 3.1
A language L over the alphabet ∑ is regular if and only if there is an FA that accepts L.

(we'll prove this one later)

Distinguishing strings
The whole purpose of finite automata is to distinguish some strings from others.

From a practical viewpoint, we want two strings x and y to be treated differently (i.e. to be distinguishable with respect to a language L) if we can follow each of them with the same string z and wind up with one of xz, yz in L and the other in L'.

Example: over alphabet { 0, 1 }

Lemma 3.1
Suppose that L is a language over the alphabet ∑ and M = (Q, ∑, q0, A, §) is any FA recognizing L. If x and y are any two strings over ∑ for which §*(q0,x) = §*(q0,y) then x and y are indistinguishable with respect to L.

Proof:

Theorem 3.2
Suppose that L is a language over the alphabet ∑, and for some positive integer n there are n strings over ∑, any two of which are distinguishable with respect to L. Then there can be no FA recognizing L with fewer than n states.

Proof by contradiction:

This is important, in that if we can find n strings that are pairwise distinguishable, we know that any FA for the language must have at least n states.
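As an illustration of how this bound is used, the sketch below searches by brute force for a distinguishing suffix z for each pair drawn from five candidate strings, with respect to the (1101)* language. The candidate strings, the helper names in_L and distinguishing_suffix, and the search bound max_len are choices made for this example. A return value of None only means no suffix was found within the bound.

```python
import re
from itertools import combinations, product

def in_L(s):
    """Membership test for the language (1101)*."""
    return re.fullmatch(r"(1101)*", s) is not None

def distinguishing_suffix(x, y, max_len=4):
    """Return some z with exactly one of xz, yz in L, or None if none is found."""
    for n in range(max_len + 1):
        for z in map("".join, product("01", repeat=n)):
            if in_L(x + z) != in_L(y + z):
                return z
    return None

# One candidate string per case in the five-case list for (1101)*.
candidates = ["", "1", "11", "110", "0"]
for x, y in combinations(candidates, 2):
    print(repr(x), repr(y), "->", repr(distinguishing_suffix(x, y)))
```

Every pair gets a suffix, so these five strings are pairwise distinguishable and, by Theorem 3.2, any FA for (1101)* needs at least five states.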

Theorem 3.3
The language pal of palindromes (strings that read the same forwards as backwards) over the alphabet { 0, 1 } is not regular.

Proof:
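One standard way to argue this (a sketch, not necessarily the proof intended here) goes through Theorem 3.2: for any two distinct strings x and y of the same length n, the suffix z = reverse(x) makes xz a palindrome while yz is not. That gives 2^n pairwise distinguishable strings for every n, so no finite number of states can suffice. The Python below merely checks this claim for small n; the function names are made up for the example.

```python
from itertools import combinations, product

def is_pal(s):
    """True if s reads the same forwards and backwards."""
    return s == s[::-1]

def all_pairwise_distinguishable(n):
    """Check that reverse(x) distinguishes x from every other length-n string y."""
    strings = ["".join(bits) for bits in product("01", repeat=n)]
    for x, y in combinations(strings, 2):
        z = x[::-1]
        # x + z is always a palindrome; y + z must not be one.
        if not is_pal(x + z) or is_pal(y + z):
            return False
    return True

for n in range(1, 7):
    print(n, 2 ** n, all_pairwise_distinguishable(n))
```

The check is not a proof, but the reason it succeeds is short: reversing yz gives x followed by reverse(y), which equals yz only when x = y.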

Theorem 3.4
(Paraphrased: the regular languages are closed under union, intersection, and difference, and hence under complement, since L' = ∑* - L. We will show this by showing how to construct a recognizing FA from the FAs for the languages being operated on.)

Suppose that M1 = (Q1, ∑, q1, A1, §1) accepts L1, and M2 = (Q2, ∑, q2, A2, §2) accepts L2, with both machines over the same alphabet ∑.

Let M = (Q, ∑, q0, A, §) where Q = Q1 × Q2 and q0 = (q1,q2) and §((p,q),a) = (§1(p,a), §2(q,a)) for any p in Q1, q in Q2, and a in ∑.

Then

  1. If A = { (p,q) | p ∈ A1 or q ∈ A2 } then M accepts the language L1 ∪ L2
  2. If A = { (p,q) | p ∈ A1 and q ∈ A2 } then M accepts the language L1 ∩ L2
  3. If A = { (p,q) | p ∈ A1 and q ∉ A2 } then M accepts the language L1 - L2

Intuitive argument:

  • for any input string, our new machine M simply needs to be able to track which state the string would put us in for machine M1 and which state the string would put us in for machine M2
  • The three cases:
    • For union, we accept if either M1 or M2 is in an accept state.
    • For intersection, we accept if both M1 and M2 are in an accept state.
    • For difference, we accept if M1 is in an accept state and M2 is not in an accept state.
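The construction can be sketched in Python as follows. This is a minimal illustration under assumptions: machines are encoded as tuples (states, start, accept, delta) with delta a dictionary keyed by (state, symbol), and the function names product_fa and run, as well as the two small example machines m1 and m2, are invented for this sketch.

```python
from itertools import product

def product_fa(m1, m2, alphabet, mode):
    """Build the pair machine; mode selects how the accept set A is defined."""
    states1, start1, accept1, delta1 = m1
    states2, start2, accept2, delta2 = m2
    states = set(product(states1, states2))
    # Both components advance in lockstep on every input symbol.
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    keep = {
        "union": lambda p, q: p in accept1 or q in accept2,
        "intersection": lambda p, q: p in accept1 and q in accept2,
        "difference": lambda p, q: p in accept1 and q not in accept2,
    }[mode]
    accept = {(p, q) for (p, q) in states if keep(p, q)}
    return states, (start1, start2), accept, delta

def run(machine, x):
    """Return True if the machine accepts the string x."""
    _, q, accept, delta = machine
    for a in x:
        q = delta[(q, a)]
    return q in accept

# Hypothetical machines: m1 accepts an even number of 0s, m2 accepts strings ending in 1.
m1 = ({"e", "o"}, "e", {"e"},
      {("e", "0"): "o", ("e", "1"): "e", ("o", "0"): "e", ("o", "1"): "o"})
m2 = ({"n", "y"}, "n", {"y"},
      {("n", "0"): "n", ("n", "1"): "y", ("y", "0"): "n", ("y", "1"): "y"})

for mode in ("union", "intersection", "difference"):
    print(mode, run(product_fa(m1, m2, "01", mode), "0011"))
```

Only the accept set changes between the three modes; the states and transitions of the pair machine are built once in the same way.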

Example: Suppose that, over the alphabet { 0, 1 }, we have the two regular languages L1 = { x | 00 is not a substring of x } , L2 = { x | x ends with 01 } .

From the FAs which recognize L1 and L2, construct an FA which recognizes L1 - L2

  • State table for L1
    State  Symbol  Next State
    X      0       Y
    X      1       X
    Y      0       Z
    Y      1       X
    Z      0       Z
    Z      1       Z
    X is the start state and X, Y are the accept states.
  • State table for L2
    State  Symbol  Next State
    T      0       V
    T      1       T
    V      0       V
    V      1       W
    W      0       V
    W      1       T
    T is the start state and W is the accept state.
  • Constructing the state table for L1 - L2
    • There are nine potential states for the new machine, corresponding to the state combinations XT, XV, XW, YT, YV, YW, ZT, ZV, ZW
    • The start state for the new machine corresponds to the start pair from the two originals, i.e. XT
    • From this point, state XT, if we observe a 1 we would stay in state XT, while on a 0 we would move to state YV
    • From state YV on 0 we move to state ZV while on 1 we move to state XW
    • From state ZV on 0 we stay in state ZV while on 1 we move to state ZW
    • From state XW on 0 we move to state YV while on 1 we move to state XT
    • From state ZW on 0 we move to state ZV while on 1 we move to state ZT
    • From state ZT on 0 we move to state ZV while on 1 we stay in state ZT
    • Observe that there are three unused, or inaccessible, states (XV, YT, and YW) - these can be safely removed from the FA
    • The accept states are those that correspond to accept for L1 and reject for L2, i.e. XT and YV. (XV and YT would have been valid, but they were two of the unreachable states.)
      State  Symbol  Next State
      XT     0       YV
      XT     1       XT
      YV     0       ZV
      YV     1       XW
      XW     0       YV
      XW     1       XT
      ZV     0       ZV
      ZV     1       ZW
      ZW     0       ZV
      ZW     1       ZT
      ZT     0       ZV
      ZT     1       ZT
      Where XT is the start state and the accept states are XT and YV

      Note that for L1 ∪ L2 and L1 ∩ L2 the transition function/state tables are the same as this, but the set of accept states differs!

    • Simplification of the table for L1 - L2: we can observe that, once state ZV is reached the only possible result is rejection, so we can collapse states ZV, ZW, ZT into a single reject state, R, from which there is no escape...
      State  Symbol  Next State
      XT     0       YV
      XT     1       XT
      YV     0       R
      YV     1       XW
      XW     0       YV
      XW     1       XT
      R      0       R
      R      1       R
      Where XT is the start state and the accept states are XT and YV
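The worked example can be cross-checked with a short Python sketch that starts from the two state tables above, explores only the pair states reachable from XT, and compares the resulting machine against the direct definitions of L1 and L2. The dictionary encoding and the names D1, D2, reachable_product and accepts are assumptions made for this illustration.

```python
from collections import deque

# State tables for L1 (no 00 substring) and L2 (ends with 01), as given above.
D1 = {("X", "0"): "Y", ("X", "1"): "X", ("Y", "0"): "Z",
      ("Y", "1"): "X", ("Z", "0"): "Z", ("Z", "1"): "Z"}
A1 = {"X", "Y"}
D2 = {("T", "0"): "V", ("T", "1"): "T", ("V", "0"): "V",
      ("V", "1"): "W", ("W", "0"): "V", ("W", "1"): "T"}
A2 = {"W"}
START = ("X", "T")

def reachable_product():
    """Breadth-first search over pair states, starting from the start pair."""
    seen, table, queue = {START}, {}, deque([START])
    while queue:
        p, q = queue.popleft()
        for a in "01":
            nxt = (D1[(p, a)], D2[(q, a)])
            table[((p, q), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, table

states, table = reachable_product()
# Accept for L1 - L2: first component accepts, second component rejects.
accept = {(p, q) for (p, q) in states if p in A1 and q not in A2}

def accepts(x):
    """Run the reachable pair machine on x."""
    state = START
    for a in x:
        state = table[(state, a)]
    return state in accept

print(sorted(p + q for p, q in states))  # ['XT', 'XW', 'YV', 'ZT', 'ZV', 'ZW']
print(sorted(p + q for p, q in accept))  # ['XT', 'YV']
```

The search never visits XV, YT, or YW, which agrees with the three inaccessible states noted in the construction.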

Proof of theorem 3.4:

  • First, we create an inductive proof (left as an exercise for the reader) of the following:
      For any string x over the alphabet ∑, and any pair of states (p, q) in Q (so p is in Q1 and q is in Q2), §*((p,q),x) = (§1*(p,x), §2*(q,x))
  • A string, x, is accepted by M iff §*((q1,q2),x) is in A
  • By the formula above, this is true iff (§1*(q1,x), §2*(q2,x)) is in A
  • For the three cases of defining set A,
    1. If set A is defined as in case 1 (union) then this is the same as saying either §1*(q1,x) is in A1 or §2*(q2,x) is in A2
    2. If set A is defined as in case 2 (intersection) then this is the same as saying that both §1*(q1,x) is in A1 and §2*(q2,x) is in A2
    3. If set A is defined as in case 3 (difference) then this is the same as saying that both §1*(q1,x) is in A1 and §2*(q2,x) is not in A2

Practice: (based on exercises from text)