C SC 320 Lecture Notes: regular languages

Regular languages and regular expressions

Since we are interested in describing languages as sets of strings, and since these sets can be infinite in size, we need a compact mechanism for precisely describing the languages.

During the semester we will examine several description approaches of varying complexity. The first of these approaches is called regular expressions, and the languages it allows us to describe are called regular languages.

Every language that can be described by a regular expression is a regular language, and every regular language can be described by some regular expression.

The easiest way of describing the set of all regular languages is by describing the ways you can "build up" a regular language.

Definition: for any alphabet, ∑, the set of regular languages over ∑ is as follows:

1. The empty set (i.e. the language which has no strings in it) is a regular language.
   The regular expression describing that language is ø
2. The language which consists of exactly the null string, i.e. { ñ }, is a regular language.
   The regular expression describing that language is ñ
3. For each character in ∑, the language which consists of one string which is exactly that character is a regular language, i.e. ∀a ∈ ∑, { a } is a regular language.
   The regular expression describing that language is a
4. If L₁ and L₂ are regular languages, and r₁ and r₂ are their corresponding regular expressions, then
   - L₁ ∪ L₂ is a regular language, and its corresponding regular expression is (r₁ + r₂)
   - L₁L₂ is a regular language, and its corresponding regular expression is (r₁r₂)
   - L₁^* is a regular language, and its corresponding regular expression is (r₁^*)

Only those languages that can be obtained using statements 1-4 are regular languages over ∑
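
The inductive definition can be sketched in code. The following is a minimal illustration only, not part of the course material: the class names (Empty, Null, Sym, Union, Concat, Star) and the function language are hypothetical, and because a regular language may be infinite the function only lists the strings up to an assumed length bound max_len.

```python
from dataclasses import dataclass

# One (hypothetical) class per rule of the definition.
@dataclass(frozen=True)
class Empty:            # the regular expression ø
    pass

@dataclass(frozen=True)
class Null:             # the regular expression for the null string
    pass

@dataclass(frozen=True)
class Sym:              # a single character of the alphabet
    char: str

@dataclass(frozen=True)
class Union:            # (r1 + r2)
    left: object
    right: object

@dataclass(frozen=True)
class Concat:           # (r1 r2)
    left: object
    right: object

@dataclass(frozen=True)
class Star:             # (r1^*)
    inner: object

def language(r, max_len):
    """Return the set of strings of length <= max_len described by r."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Null):
        return {""}
    if isinstance(r, Sym):
        return {r.char} if max_len >= 1 else set()
    if isinstance(r, Union):
        return language(r.left, max_len) | language(r.right, max_len)
    if isinstance(r, Concat):
        left = language(r.left, max_len)
        right = language(r.right, max_len)
        return {x + y for x in left for y in right
                if len(x) + len(y) <= max_len}
    if isinstance(r, Star):
        base = language(r.inner, max_len)
        result = {""}       # zero repetitions always gives the null string
        frontier = {""}
        while frontier:     # keep appending until nothing new fits the bound
            frontier = {x + y for x in frontier for y in base
                        if len(x) + len(y) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")

# (a + b)^* a : all strings over { a, b } that end with a
ends_with_a = Concat(Star(Union(Sym("a"), Sym("b"))), Sym("a"))
print(sorted(language(ends_with_a, 2)))   # ['a', 'aa', 'ba']
```

Each branch of language mirrors one statement of the definition, so an expression built only from these six constructors describes a regular language by construction.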

Handy shortcuts and extra notation

Even though the above gives the formal definition of regular languages, there are two more notation items that make descriptions easier to read:

If L is a regular language and r is its corresponding regular expression, then L⁺ indicates the regular language formed by concatenating one or more strings from L, and the regular expression for this is r⁺
If L is a regular language and r is its corresponding regular expression, then Lⁿ indicates the regular language formed by concatenating exactly n strings from L, and the regular expression for this is rⁿ. (This holds for any integer n ≥ 0; L⁰ is { ñ }.)
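
Both shortcuts add convenience rather than power: r⁺ describes the same language as rr^*, and rⁿ is just r written n times. As a rough check, the sketch below uses Python's re module as a stand-in for the notation in these notes. That is an assumption of this example: Python writes union as | rather than +, writes rⁿ as r{n}, and its + means "one or more". The helper names all_strings and matched are hypothetical.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def matched(pattern, alphabet, max_len):
    """Set of strings (up to max_len) that the whole pattern describes."""
    compiled = re.compile(pattern)
    return {s for s in all_strings(alphabet, max_len)
            if compiled.fullmatch(s)}

# r+ versus r r*, with r = ab
print(matched("(ab)+", "ab", 6) == matched("(ab)(ab)*", "ab", 6))  # True

# r^2 versus r r, with r = (a + b)
print(sorted(matched("(a|b){2}", "ab", 4)))   # ['aa', 'ab', 'ba', 'bb']

# r^0 describes only the null string
print(matched("(ab){0}", "ab", 4))            # {''}
```

Agreement on all strings up to a bound is evidence rather than a proof, but here the identities also follow directly from the definitions of L⁺ and Lⁿ.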

Practice with regular languages and regular expressions

Observation: any finite set of strings over an alphabet is a regular language over that alphabet, since it can be described by a regular expression that simply lists all the strings.
For example, if our alphabet is { a, b, c }, then the set of all strings of length two is finite, and could be listed with the regular expression
(aa + ab + ac + ba + bb + bc + ca + cb + cc)
(Of course, a more compact representation would be something like (a + b + c)²)
Is the set of strings of even length a regular language?
Yes, since we can obtain it with the regular expression (∑²)^*
Is the set of strings of odd length a regular language?
Yes, since we can obtain any string in the language by adding a single character to some string of even length, and we know the even length strings form a regular language, i.e.: (∑²)^*∑
For the alphabet { 0, 1 }, is the set of all strings containing the substring 1001 a regular language?
Yes, since we can create the language with the following regular expression: (0 + 1)^* 1001 (0 + 1)^*
For the alphabet { 0, 1 }, is the set of strings of length at most 100 a regular language?
Yes, since we could represent it by enumerating all the strings of length at most 100, although a much more compact notation would be (0 + 1 + ñ)¹⁰⁰
For the alphabet { 0, 1 }, is the set of strings representing powers of two (expressed as binary integers) a regular language?
Yes, since we could represent the language with the regular expression 10^*
For the alphabet { a, b }, is the set of strings which contain no consecutive a's a regular language?
Yes, since we could represent the language with the regular expression b^*(ab⁺)^*(a+ñ)
Over the alphabet { a, b }, is the set of strings which contain at least as many a's as b's a regular language?
No, it is not. A proof of this will be considered in a later lecture, but there is no regular expression that can be used to describe this language.
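
A useful habit when practising is to test a proposed regular expression against a plain-language description of the language on every short string. The sketch below is one way to do that, again assuming Python's re syntax (| for union) in place of the notation used in these notes; the helper names all_strings, disagreements and is_power_of_two are hypothetical.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def disagreements(pattern, predicate, alphabet, max_len):
    """Strings on which the pattern and the description give different answers."""
    compiled = re.compile(pattern)
    return [s for s in all_strings(alphabet, max_len)
            if bool(compiled.fullmatch(s)) != bool(predicate(s))]

def is_power_of_two(s):
    # A binary integer with no leading zeros whose value has a single 1 bit.
    return s[:1] == "1" and (int(s, 2) & (int(s, 2) - 1)) == 0

# Even length: (∑²)^* with ∑ = { 0, 1 }
print(disagreements("((0|1)(0|1))*", lambda s: len(s) % 2 == 0, "01", 8))   # []

# Contains the substring 1001
print(disagreements("(0|1)*1001(0|1)*", lambda s: "1001" in s, "01", 8))    # []

# Powers of two written in binary
print(disagreements("10*", is_power_of_two, "01", 8))                       # []

# A wrong guess is caught quickly: (0 + 1)^* is not "even length"
print(disagreements("(0|1)*", lambda s: len(s) % 2 == 0, "01", 2)[0])       # 0
```

An empty list only shows agreement up to the chosen length. It cannot show that a language is not regular, which is why the last practice example needs a real proof.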

Some practice questions

For each of the following questions assume the alphabet is { 0, 1 }

Find a string not in the language described by the regular expression (0^*+1^*)(0^*+1^*)(0^*+1^*)
Find a string not in the language described by the regular expression 0^*(100^*)^*1^*
Simplify the following regular expression 01((01)^*01+(01)^*)+(01)^*
Give a simple description of the language described by the following regular expression 0^*1(0^*10^*1)^*0
Give a simple description of the language described by the following regular expression (0+1)^* (0⁺1⁺0⁺ + 1⁺0⁺1⁺) (0+1)^*
Give a regular expression for the language of all strings not ending with 01
Give a regular expression for the language of all strings in which each 0 is followed immediately by 11
Give a regular expression for the language of all strings containing both 11 and 010 as substrings

Sample solutions

Find a string not in the language described by the regular expression (0^*+1^*)(0^*+1^*)(0^*+1^*)
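e.g. 0101, or any other string that switches between 0's and 1's at least three times (each of the three factors can only supply one block of identical characters)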
Find a string not in the language described by the regular expression 0^*(100^*)^*1^*
e.g. 110, or any other string containing 110 as a substring (consecutive 1's are only allowed at the very end of the string)
Simplify the following regular expression 01((01)^*01+(01)^*)+(01)^*
i.e. (01)^*
Give a simple description of the language described by the following regular expression 0^*1(0^*10^*1)^*0
i.e. the language of all strings containing an odd number of 1's and ending with 10
Give a simple description of the language described by the following regular expression (0+1)^* (0⁺1⁺0⁺ + 1⁺0⁺1⁺) (0+1)^*
i.e. the language of all strings containing both 01 and 10 as substrings
Give a regular expression for the language of all strings not ending with 01
i.e. (0 + 1)^*(00 + 10 + 11) + 0 + 1 + ñ (the last three terms cover the strings too short to end with 01)
Give a regular expression for the language of all strings in which each 0 is followed immediately by 11
i.e. 1^*(011⁺)^*
Give a regular expression for the language of all strings containing both 11 and 010 as substrings
i.e. (0+1)^*11(0+1)^*010(0+1)^* + (0+1)^*010(0+1)^*11(0+1)^*
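
Answers to questions of the "find a string not in the language" and "simplify" kind can be explored mechanically on short strings. The sketch below is one possible approach and not part of the course material: it assumes Python's re syntax (| for union), and the helper names all_strings, shortest_non_member and same_language are hypothetical.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def shortest_non_member(pattern, alphabet, max_len):
    """First string (shortest first) not described by pattern, or None."""
    compiled = re.compile(pattern)
    for s in all_strings(alphabet, max_len):
        if not compiled.fullmatch(s):
            return s
    return None

def same_language(p, q, alphabet, max_len):
    """True if p and q describe the same strings up to length max_len."""
    cp, cq = re.compile(p), re.compile(q)
    return all(bool(cp.fullmatch(s)) == bool(cq.fullmatch(s))
               for s in all_strings(alphabet, max_len))

# (0^*+1^*)(0^*+1^*)(0^*+1^*)
print(shortest_non_member("(0*|1*)(0*|1*)(0*|1*)", "01", 6))   # 0101

# 0^*(100^*)^*1^*
print(shortest_non_member("0*(100*)*1*", "01", 6))             # 110

# 01((01)^*01+(01)^*)+(01)^* against the simplified form (01)^*
print(same_language("01((01)*01|(01)*)|(01)*", "(01)*", "01", 8))   # True
```

As with any bounded search, same_language returning True is supporting evidence for a simplification, not a proof of it; a returned counterexample, on the other hand, is conclusive.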