Regular Expressions

Introduction: The idea of using regular expressions to denote whole classes of strings is quite widespread. In many card games, the Joker card is called the “wild-card,” in that it can be used in lieu of any other card. In most operating systems, the wild-card is ∗that matches all file names—as in rm *.* or del *.*. Such expressions are practical examples of regular expressions in day-to-day use. Formally, we define regular expressions as follows. We use the inductive definition style in which a basis case and a list of inductive rules are provided, and define regular expressions to be the least set satisfying these rules:

∅is an RE denoting the empty language ∅.
Most computer-assisted tools have trouble accepting ∅directly, and so their syntax for ∅varies. It is written as {} in grail.Check with each tool to see how it encodes ∅. A command man regexp or man egrep issued on most Unix systems usually revealshow Unix encodes REs.
ε is an RE denoting the language {ε}. Again, most computer-assisted tools encode ε variously—as "" by grail, for example. In our drawings, we sometimes use e or epsilon in lieu of ε. Unix has several flavors of ε—those occurring within words, at the beginning, at the end, etc.6
a ∈Σ is an RE denoting the language {a}. Most tools directly encode this regular expression in ASCII – for example, a.
For REs r, r1 and r2, r1r2, r1 r2, r∗, and (r) are regular expressions. These regular expressions denote the following languages,with the indicated encodings in grail (we write L(x) to mean ‘languageof’):

RE r1r2 denotes L(r1) ◦ L(r2) encoded in grail as r1 r2

RE r1 r2 denotes L(r1) ∪L(r2) encoded in grail as r1 r2

RE r∗denotes (L(r))∗encoded in grail as r*

RE (r) denotes L(r) encoded in grail as (r)

The above regular operators form a basis set in the sense that we do not need any other operators such as complementation, reversal, intersection RE, etc., to build regular languages. All regular languages can be specified using only the above regular expression syntax. In practice, however, there are many languages that are very difficult to specify without the use of these additional operators, especially complementation (also known as negation).7 The reason for this difficulty is that the regular expression syntax, which only has ∗, ◦, and , allows us to build up the language from simpler languages, while often it would be much more natural to build down a language, say, by specifying two different (but larger) languages and intersecting them.

Illustration: In previous chapters we have seen how to write out languages using set-theoretic syntax. In particular, we can express language

Labc_k = {xabc | x ∈{0, 1}^k ∧a, b, c ∈{0, 1} ∧(a = 0 ∨b = 1)}

for various k as follows:

k = 0 : {ε} ◦ {00, 01, 11} ◦ { 0, 1},
k = 1 : {0, 1} ◦ {00, 01, 11} ◦ { 0, 1},
k = 2 : {0, 1}2 ◦ {00, 01, 11} ◦ { 0, 1},
. . .

We can encode these languages using regular expressions as follows:

k = 0 : ε ◦ (00 01 11) ◦ (0 1),
k = 1 : (0 1) ◦ (00 01 11) ◦ (0 1),
k = 2 : (0 1)(0 1) ◦ (00 01 11) ◦ (0 1),
. . .

In the grail syntax, these regular expressions can be encoded as:

k = 0 : ""(00 01 11)(0 1),
k = 1 : (0 1)(00 01 11)(0 1),
k = 2 : (0 1)(0 1)(00 01 11)(0 1),
. . .

Illustration Consider the language

Lk = {x0z | x ∈{0, 1}∗and z ∈{0, 1}k}.

For various k, we can express this language using regular expressions (in grail syntax) as follows:

k = 0: (0 1)*0,
k = 1: (0 1)*0(0 1),
k = 2: (0 1)*0(0 1)(0 1),
k = 3: (0 1)*0(0 1)(0 1)(0 1),
. . .

SKEDSOFT