Regular expressions


Index:

This reference contains the following sections:

Introduction

Matching specific characters

Special characters

Special characters in basic regular expressions

Special characters in extended regular expressions

Alternatives

Matching the beginning of a line

Matching the end of a line

Character sets

Matching possible occurrences

Matching repeating occurrences

Group

EscapeSequences

Regular expression summary

Examples

1 Introduction

This tutorial describes regular expressions, which are used in string pattern searches. Regular expressions are used in many products. If you are already familiar with regular expressions, you can skip to the regular expression summary to see how they are used within Kilowatt Software L.L.C.'s products.

Otherwise, the following sections introduce regular expression concepts one at a time. Try to gain a complete understanding of each expression type before proceeding to the next section, because eventually all of these concepts become intertwined.

What is a "regular expression"? A regular expression is a sequence of characters that specifies a pattern to locate within a text string. The regular expression is passed as a pattern argument to a search program. The program then analyzes one or more text strings to determine whether they contain the pattern. Strings that match the pattern are often printed. In other contexts, the offset and width of the matching text are determined for subsequent processing.

So, a "regular expression" is a pattern. A "generalized regular expression" is the same kind of pattern, and a "basic generalized regular expression" and an "extended generalized regular expression" are similar patterns as well.

What types of text search could you require?

Find a specific sequence of characters
Find two sequences of characters, with arbitrary text in between
Find sequences of characters within a set of possible values -- digits, words, zip codes, etc.
Find one sequence of characters, or another
Find sequences of characters, with optional segments -- long distance phone number prefix
Find repetitive sequences of characters
Find a sequence at the beginning or end of the text string

2 Matching specific characters

Within a pattern, the majority of characters represent themselves. A space within a pattern represents a space in the text.

a b c ... A B C ... 0 1 2 3 ... - , < > ...

Thus, you can search for genie by using the pattern:

genie

Or, you can search for a specific phone number by using the pattern:

555-1212
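As a minimal sketch of literal matching, the following uses Python's re module. Python is an assumption made for illustration only; the tutorial itself describes roo!™ and grep, and the sample lines are invented. It also shows how the offset and width of a match can be obtained.

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = [
    "the genie left the lamp",
    "call 555-1212 after noon",
    "nothing to see here",
]

# Ordinary characters match themselves; re.search scans the whole string.
genie_hits = [line for line in lines if re.search("genie", line)]
phone_hits = [line for line in lines if re.search("555-1212", line)]

# The match object reports where the matching text was found.
match = re.search("genie", lines[0])
offset = match.start()
width = match.end() - match.start()

print(genie_hits, phone_hits, offset, width)
```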

3 Summary of special characters (meta-characters)

Some characters are used to establish contextual meaning within the pattern. Fourteen characters are used to represent context. These special character values are sometimes referred to as meta-characters.

^ $ . [ ] ( ) + * ? { } | \

In addition, the hyphen (-) has a special meaning within a character set.

There are two types of regular expressions: basic and extended.

roo!™ uses extended regular expressions. Other programs, such as the UNIX® grep program, use basic regular expressions instead. An emulation of grep is available in Kilowatt Software L.L.C.'s Poof!™ product. Descriptions of basic regular expressions are provided here. The roo!™ Pattern built-in class supports only extended regular expressions.

3.1 Special characters in basic regular expressions

In basic regular expressions, parentheses and curly braces are special only when preceded by a backslash. This is because early versions of pattern search programs did not use parentheses or curly braces as special characters. Consequently, unescaped parentheses and curly braces represented the characters themselves in older pattern search scripts. So within a basic regular expression the special context values are:

^ $ . [ ] \( \) + * ? \{ \} | \

3.2 Special characters in extended regular expressions

In extended regular expressions parentheses and curly braces are NOT preceded by a backslash. When extended regular expressions were introduced, there was no need to worry about previously existing pattern search scripts. Consequently, within an extended regular expression parentheses and curly braces do not require a leading backslash.

^ $ . [ ] ( ) + * ? { } | \

If you compare this with the first set of special characters above, you will discover that they are the same.

4 Matching alternatives

Alternative values are matched by using a vertical bar between the possibilities. Some author names could be matched as follows:

Kipling|Hemingway|Twain

Note: the vertical bar can be used to separate alternative regular expressions as well.
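A short sketch of alternatives, again assuming Python's re module rather than roo!™. The author names are the ones used in the summary table of this tutorial, and the book titles are invented sample data. The second pattern shows the vertical bar separating two sub-expressions instead of two literal words.

```python
import re

# Alternatives between literal words.
authors = re.compile(r"Kipling|Hemingway|Twain")

# Hypothetical sample data.
titles = [
    "Kim by Rudyard Kipling",
    "Emma by Jane Austen",
    "Roughing It by Mark Twain",
]
matched = [title for title in titles if authors.search(title)]

# Alternatives between regular expressions: a run of digits OR a run of capitals.
runs = re.findall(r"[0-9]+|[A-Z]+", "AB12cd")

print(matched, runs)
```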

5 Matching any character -- period, or dot .

Often you will need to locate a line that has two segments separated by various intervening values. The period special character is used to represent any character value. An arbitrary intervening sequence uses the asterisk special character as well. The following pattern locates any line that contains the letters of the word 'special' followed by arbitrary intervening characters, and then the letters of the word 'character'. Because the intervening sequence may be empty, the pattern also locates a line containing the letters 'specialcharacter'.

special.*character

To locate a line with at least one character in between, and perhaps many more, the plus special character is used as follows:

special.+character
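The difference between the two patterns above can be checked with a small sketch. It assumes Python's re module (not the roo!™ Pattern class), and the sample strings are invented. Note that in Python the period does not match a newline unless a flag is set.

```python
import re

any_gap = re.compile(r"special.*character")   # zero or more characters in between
some_gap = re.compile(r"special.+character")  # at least one character in between

# Hypothetical sample strings.
samples = [
    "specialcharacter",
    "special character",
    "a special kind of character",
    "character special",
]

star_hits = [s for s in samples if any_gap.search(s)]
plus_hits = [s for s in samples if some_gap.search(s)]

print(star_hits)
print(plus_hits)
```

Only the first pattern accepts 'specialcharacter', because the plus sign requires at least one intervening character.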

6 Matching the beginning of a line -- caret ^

A leading caret is used to indicate the pattern must match text at the start of a line. The following locates Giraffe at the start of a line.

^Giraffe

7 Matching the end of a line -- dollar sign $

A trailing dollar sign is used to indicate the pattern must match text at the end of a line. The following locates Giraffe at the end of a line.

Giraffe$
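The caret and dollar sign anchors can be tried together in a brief sketch. Python's re module is assumed here for illustration, and the sample lines are invented. Combining both anchors in one pattern, as in the third search, requires the whole line to consist of the pattern.

```python
import re

# Hypothetical sample lines.
lines = [
    "Giraffe at the zoo",
    "I saw a Giraffe",
    "Giraffe",
    "The Giraffe ate leaves",
]

starts = [line for line in lines if re.search(r"^Giraffe", line)]
ends = [line for line in lines if re.search(r"Giraffe$", line)]
whole = [line for line in lines if re.search(r"^Giraffe$", line)]

print(starts, ends, whole)
```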

8 Character sets

8.1 Matching any character within a set -- square-brackets [ ]

A set of characters that can occur at a position is enclosed in square brackets. For example, the following is the set of lowercase letters.

[abcdefghijklmnopqrstuvwxyz]

8.2 Character set range -- hyphen between start and end character [a-z]

It would be inconvenient to type all of the lowercase letters, as shown above, whenever you want to match a lowercase letter. The hyphen character allows ranges of characters to be easily specified. Using the special hyphen character, lowercase letters are represented as:

[a-z]

Uppercase letters are represented as:

[A-Z]

All digits are represented as:

[0-9]

8.3 Matching any character NOT within a set -- square-brackets with leading caret [^...]

Suppose you want to match a character that is not within a set. This uses a similar expression, but with a caret '^' character placed immediately after the left square bracket. For example, common telephone numbers can use all of the uppercase letters except Q and Z. A character that is neither Q nor Z is matched by:

[^QZ]

Notice the caret is a special character within the context of a character set, only in the first position. The caret represents itself in all other positions, as shown in a subsequent example that shows a character set for all PC keyboard characters that are neither letters, nor digits.
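The character sets described so far can be explored with a small sketch, assuming Python's re module and an invented sample string. It also shows that a complemented set such as [^QZ] matches any character outside the set, including digits and punctuation, not only the remaining uppercase letters.

```python
import re

text = "Room 42B, Quiz Zone"  # hypothetical sample text

# re.findall returns every non-overlapping match; join them for easy reading.
lower = "".join(re.findall(r"[a-z]", text))
upper = "".join(re.findall(r"[A-Z]", text))
digits = "".join(re.findall(r"[0-9]", text))

# A complemented set matches anything that is NOT listed.
not_qz_letters = "".join(re.findall(r"[^QZ]", "QUIZ"))
not_qz_other = "".join(re.findall(r"[^QZ]", "Q-7"))

print(lower, upper, digits, not_qz_letters, not_qz_other)
```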

8.4 Character set range -- leading hyphen or right square bracket

Suppose in addition to lowercase letters, you would also like to permit a hyphen character, or a right square bracket, within the set of possible characters. This is done by putting the hyphen, or right square bracket, immediately after the opening left square bracket of the character set.

Add a leading hyphen character:

[-a-z]

Add a leading right-square bracket

[]a-z]

Add a leading hyphen character to a complemented set:

[^-a-z]

Add a leading right-square bracket to a complemented set:

[^]a-z]
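The four sets above can be compared side by side in a brief sketch. It assumes Python's re module, which happens to accept a leading hyphen or leading right square bracket in the same way; other engines may differ. The sample string is invented.

```python
import re

sample = "a-b]C"  # hypothetical sample containing a hyphen and a right square bracket

with_hyphen = "".join(re.findall(r"[-a-z]", sample))     # hyphen or lowercase letter
with_bracket = "".join(re.findall(r"[]a-z]", sample))    # ']' or lowercase letter
without_hyphen = "".join(re.findall(r"[^-a-z]", sample)) # anything except those
without_bracket = "".join(re.findall(r"[^]a-z]", sample))

print(with_hyphen, with_bracket, without_hyphen, without_bracket)
```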

8.5 Character set range -- escaping special characters

Suppose you want other special characters (e.g. a right square bracket ']', or a backslash) to be within your character set. You use the backslash before the special character:

[a-z\]\\]

The common punctuation characters are represented by the set:

[,.;:'"()\-]

All of the PC keyboard characters that are neither letters, nor digits, are represented by the set:

[ ,.;:'"\[\]{}`~!@#$%^&*()_+=|<>\-\b\t\n\r\\]

Notice the leading space above. The characters \b, \t, \n, and \r will be discussed in a subsequent section below.

Notice many of the characters above were not escaped, even though they are special regular expression characters. Within the context of a character set these characters are not special.

9 Matching possible occurrences

Some sequences can repeat in various ways.

9.1 Zero or more occurrences -- asterisk *

Patterns can contain sequences that appear multiple times, but possibly not at all. For example, proper names begin with an uppercase letter, and continue with a series of lowercase letters. Occasionally, the trailing lowercase letters are omitted. The special asterisk '*' character is used for sequences that occur zero or more times. The corresponding pattern for a name is:

[A-Z][a-z]*

Notice the asterisk binds to the element immediately to its left. In the example above it binds to the character set [a-z].

9.2 One or more occurrences -- plus sign +

Other patterns can require at least one occurrence of a sequence, and permit multiple additional occurrences as well. Words within a sentence are a series of one or more lowercase characters, with a possible hyphen. The special plus sign '+' character is used for sequences that occur one or more times. The corresponding pattern for a word is:

[a-z-]+

Notice the plus sign binds to the element immediately to its left. In the example above it binds to the character set [a-z-].

9.3 Zero or one occurrence -- question mark ?

Some patterns are optional. US postal codes begin with five digits and have an optional sequence of four digits and a separating hyphen. The special question mark '?' character is used for sequences that occur zero or one time. The corresponding pattern for a US postal code is:

[0-9][0-9][0-9][0-9][0-9](-[0-9][0-9][0-9][0-9])?

Notice the question mark binds to the element immediately to its left. In the example above it binds to the group (-[0-9][0-9][0-9][0-9]).
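The three occurrence operators of this section can be exercised with a short sketch. Python's re module is an assumption here, and the sample strings are invented. The postal code check uses fullmatch so that the entire string must fit the pattern.

```python
import re

name = re.compile(r"[A-Z][a-z]*")   # uppercase letter, then zero or more lowercase
word = re.compile(r"[a-z-]+")       # one or more lowercase letters or hyphens
zip_code = re.compile(r"[0-9][0-9][0-9][0-9][0-9](-[0-9][0-9][0-9][0-9])?")

names = name.findall("J Edgar Hoover")
words = word.findall("a well-known fact")

# The optional group may appear zero times or one time, but not partially.
short_ok = zip_code.fullmatch("03903") is not None
long_ok = zip_code.fullmatch("03903-0001") is not None
partial_ok = zip_code.fullmatch("03903-") is not None

print(names, words, short_ok, long_ok, partial_ok)
```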

9.4 Ranges of occurrences -- curly-braces { }

As seen in the postal code pattern above, some patterns repeat a precise number of times. Other patterns can repeat a minimum number of times, with more repetitions permitted. Other patterns can repeat between a minimum and a maximum number of times, inclusively. Curly braces are used to specify the number of repetitions.

9.4.1 Exact occurrence count -- {n}

Patterns that repeat a precise number of times have the number within curly braces.

Example: a five-digit US zip code is matched by:

[0-9]{5}

Notice the curly brace expression binds to the element immediately to its left. In the example above it binds to the character set [0-9].

A five-digit US zip code followed by an optional four-digit sequence is matched by:

[0-9]{5}(-[0-9]{4})?

Notice the second curly brace expression binds to the character set [0-9].

Notice the trailing question mark binds to the group (-[0-9]{4}). Consequently, the group can appear 0 or 1 time.

9.4.2 Minimum or more occurrence count -- {n,}

When the pattern repeats a minimum number of times, and then any number of times more, a comma follows the number between curly braces. Some system designers require passwords to be 8 characters or longer. This is represented as:

.{8,}

Notice the curly brace expression binds to the element immediately to its left. In the example above it binds to a period that represents any character.

9.4.3 Minimum to maximum occurrence count -- {m,n}

Some patterns repeat between a minimum and a maximum number of times. In this case a comma separates the two numbers within the curly braces. For example, a US telephone number consists of one or two 3-digit prefixes followed by a 4-digit number. Each prefix is followed by a separating hyphen. This is represented as:

([0-9]{3}-){1,2}[0-9]{4}

Notice the curly brace expression binds to the element immediately to its left. In the example above {1,2} binds to the group ([0-9]{3}-). Consequently, the group can appear 1 or 2 times. Thus, the pattern matches 555-1212 and 207-555-1212.
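The three curly brace forms can be checked with a brief sketch, assuming Python's re module. The helper name matches is hypothetical, and the sample strings other than the two phone numbers mentioned above are invented.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

ZIP = r"[0-9]{5}(-[0-9]{4})?"          # exact counts: {5} and {4}
PASSWORD = r".{8,}"                    # minimum count: 8 or more of any character
PHONE = r"([0-9]{3}-){1,2}[0-9]{4}"    # range: one or two 3-digit prefixes

results = {
    "zip short": matches(ZIP, "03903"),
    "zip long": matches(ZIP, "03903-0001"),
    "zip too few digits": matches(ZIP, "0390"),
    "password too short": matches(PASSWORD, "secret"),
    "password long enough": matches(PASSWORD, "longersecret"),
    "phone local": matches(PHONE, "555-1212"),
    "phone with area code": matches(PHONE, "207-555-1212"),
    "phone without hyphen": matches(PHONE, "5551212"),
}

for label, ok in results.items():
    print(label, ok)
```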

10 Group -- parentheses ( )

Patterns can contain sequences that are accessed collectively. A pair of parentheses surrounds the collection. Several examples of grouped sequences have already appeared in earlier sections. You will find that groups are commonly necessary for pattern searches.

Here we will consider proper names. Each name begins with an uppercase letter, and continues with a series of lower case letters. The corresponding pattern is:

[A-Z][a-z]+

Omitting the names of various celebrities, many individuals are addressed by a first and last name separated by a space. The corresponding pattern is:

([A-Z][a-z]+ ?){2}

Note, the above pattern allows a trailing space to follow the last name.

With the addition of an optional middle name, the pattern is:

([A-Z][a-z]+ ?){2,3}

The pattern above matches proper names consisting of 2 or 3 names.
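The name patterns above can be tried in a small sketch, assuming Python's re module and a few sample names chosen for illustration. One consequence worth noticing: because the space inside the group is optional, two capitalized words run together without a space also fit the pattern.

```python
import re

two_names = re.compile(r"([A-Z][a-z]+ ?){2}")
two_or_three = re.compile(r"([A-Z][a-z]+ ?){2,3}")

def fits(pattern: re.Pattern, text: str) -> bool:
    """True when the entire text matches (hypothetical helper)."""
    return pattern.fullmatch(text) is not None

first_last = fits(two_names, "Mark Twain")
trailing_space = fits(two_names, "Mark Twain ")   # trailing space is allowed
run_together = fits(two_names, "MarkTwain")       # the space is optional
single_name = fits(two_names, "Cher")             # only one name: no match
three_with_two = fits(two_names, "Ernest Miller Hemingway")
three_with_range = fits(two_or_three, "Ernest Miller Hemingway")

print(first_last, trailing_space, run_together,
      single_name, three_with_two, three_with_range)
```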

11 Matching special characters

What if you want to match one of the special characters? This is done by preceding the character with a backslash. The backslash character acts as an escape for the subsequent character's special meaning. It becomes the character itself within the pattern. Thus, a left square bracket is specified as:

\[

And a backslash character is specified with two backslashes:

\\
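Escaping can be sketched as follows, assuming Python's re module. The patterns \[ and \\ are written as raw strings so that Python itself does not interpret the backslashes. The re.escape function, which is specific to Python and not part of roo!™, escapes the special characters of a literal string automatically. The sample strings are invented.

```python
import re

# An escaped left square bracket and an escaped backslash.
bracket_at = re.search(r"\[", "list[0]").start()
backslash_at = re.search(r"\\", r"C:\temp").start()

# Without escaping, '+' and '?' keep their special meaning and the search fails.
literal = "1+1=2?"
sentence = "is 1+1=2? yes"
unescaped_found = re.search(literal, sentence) is not None
escaped_found = re.search(re.escape(literal), sentence) is not None

print(bracket_at, backslash_at, unescaped_found, escaped_found)
```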

12 Other escaped character sequences

12.1 Special character values

The following characters are represented by a backslash followed by a letter.

Sequence	Meaning	Hexadecimal
\b	backspace	x'08'
\e	ESCape	x'1b'
\f	formfeed	x'0c'
\n	newline	x'0a'
\r	carriage return	x'0d'
\s	space	x'20'
\t	tab	x'09'

12.2 Octal value

Octal values are specified by a backslash followed by 1 to 3 octal digits (0 to 7). For example, an ASCII ESCape character is represented in octal as follows:

\033

12.3 Hexadecimal value

Hexadecimal values are specified by a backslash followed by the letter 'x', and then 1 or 2 hexadecimal digits (0 to 9, a to f, or A to F). For example, an ASCII ESCape character is represented in hexadecimal as follows:

\x1B
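The octal and hexadecimal forms of the ESCape character can be compared in a short sketch. It assumes Python's re module, which accepts both \033 and \x1B inside a pattern; unlike the description above, Python requires exactly two hexadecimal digits after \x. The sample text is invented.

```python
import re

ESC = "\x1b"  # the ASCII ESCape character, decimal 27
text = "plain " + ESC + "[1mbold"  # hypothetical text containing an ESCape

# Raw strings pass the backslash sequences to the regular expression engine.
octal_hit = re.search(r"\033", text)
hex_hit = re.search(r"\x1B", text)

print(octal_hit.start(), hex_hit.start())
```

Both searches find the same character at the same offset.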

An ASCII caret character (decimal 94) is represented in hexadecimal as follows:

\x5E

12.4 Control character

Control character values are specified by a backslash followed by a caret '^', and then a single character value -- usually in the range A to Z. Control-Z has a value of decimal 26, and is represented as follows:

\^Z

13 Regular expression summary

Expression	Meaning	Examples, extended	Basic expression equivalents
c	Normal character	Radiance
.	Any character	Rad...ce
^	Start of line	^Radiance
$	End of line	radiance$
[aAbB]	Character set	[0123456789]
[m-n]	Character set sequence	[a-z] [a-zA-Z] [0-9]
[^aAbB]	Non-character set	[^0123456789]
[^m-n]	Non-character set sequence	[^0-9]
?	0 or 1 occurrence	[0-9]?
+	1 or more occurrences	.+
*	0 or more occurrences	[a-z]*
{n}	Exact count	[0-9]{5}	[0-9]\{5\}
{n,}	Minimum count or more	[0-9]{1,}	[0-9]\{1,\}
{min,max}	Minimum count up to maximum count	[0-9]{1,5}	[0-9]\{1,5\}
|	Alternative	Kipling|Hemingway|Twain
(expr)	Group	(Hydrogen|Carbon|Oxygen)	\(Hydrogen|Carbon|Oxygen\)
(expr)?	Optional group (0 or 1)	(Hydrogen|Carbon|Oxygen)?	\(Hydrogen|Carbon|Oxygen\)?
(expr)+	Repeating group (1 or more)	(Hydrogen|Carbon|Oxygen)+	\(Hydrogen|Carbon|Oxygen\)+
(expr)*	Repeating group (0 or more)	(Hydrogen|Carbon|Oxygen)*	\(Hydrogen|Carbon|Oxygen\)*
(expr){n}	Repeating group (exactly n)	(Hydrogen|Carbon|Oxygen){2}	\(Hydrogen|Carbon|Oxygen\)\{2\}
(expr){n,}	Repeating group (n or more)	(Hydrogen|Carbon|Oxygen){2,}	\(Hydrogen|Carbon|Oxygen\)\{2,\}
(expr){m,n}	Repeating group (m to n)	(Hydrogen|Carbon|Oxygen){2,5}	\(Hydrogen|Carbon|Oxygen\)\{2,5\}

14 Some examples

14.1 Matching a surname

A surname could be matched by the following regular expression:

[A-Z][a-z]*(-[A-Z][a-z]*)?

The above matches Smith, or Smith-Jones

14.2 Matching a formal name

Various formal names could be matched by the following regular expression:

(Mr\.|Mrs\.|Ms\.|Miss|Sir|Madame) ([A-Z][a-z]* )?[A-Z][a-z]*(-[A-Z][a-z]*)?

The above matches Sir Walter Raleigh, or Ms. Walker, or Mrs. James-Jones

An alternative pattern that matches the same formal names is:

(M(rs?|s)\.|Miss|Sir|Madame) ([A-Z][a-z]* )?[A-Z][a-z]*(-[A-Z][a-z]*)?

The above matches Sir Walter Raleigh, or Mrs. James-Jones
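The two formal name patterns can be compared in a brief sketch, assuming Python's re module. Both are written here with a trailing question mark so that the hyphenated part of the surname is optional; without it, names such as Sir Walter Raleigh would not match. The names Mr. Smith and Dr. Watson are extra test values added for illustration.

```python
import re

long_form = re.compile(
    r"(Mr\.|Mrs\.|Ms\.|Miss|Sir|Madame) ([A-Z][a-z]* )?[A-Z][a-z]*(-[A-Z][a-z]*)?"
)
short_form = re.compile(
    r"(M(rs?|s)\.|Miss|Sir|Madame) ([A-Z][a-z]* )?[A-Z][a-z]*(-[A-Z][a-z]*)?"
)

names = ["Sir Walter Raleigh", "Ms. Walker", "Mrs. James-Jones", "Mr. Smith", "Dr. Watson"]

# fullmatch requires the whole name to fit the pattern.
long_hits = [n for n in names if long_form.fullmatch(n)]
short_hits = [n for n in names if short_form.fullmatch(n)]

print(long_hits)
print(short_hits)
```

Both patterns accept the same names; the second simply factors the shared leading M out of Mr., Mrs., and Ms.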

14.3 Matching a social security number

US social security numbers can be matched by the following EXTENDED regular expression:

[0-9]{3}-[0-9]{2}-[0-9]{4}

14.4 Matching a zip code

US zip codes can be matched by the following EXTENDED regular expression:

[0-9]{5}(-[0-9]{4})?

The above matches 03903, or 03903-0001

14.5 Matching phone numbers

US phone numbers can be matched by the following EXTENDED regular expressions:

([0-9]{3}-)?[0-9]{3}-[0-9]{4}

([0-9]{3}-){1,2}[0-9]{4}

The above patterns match the following numbers:

555-1212

207-555-1212

The following pattern matches a phone number with an area code that optionally has a leading 1- before the area code.

((1-)?[0-9]{3}-)[0-9]{3}-[0-9]{4}
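The three phone number patterns of this section can be checked against a few sample numbers in a short sketch. Python's re module is assumed, and the helper name accepted is hypothetical. As written, the third pattern requires an area code, so a bare seven-digit number does not fit it.

```python
import re

OPTIONAL_AREA = r"([0-9]{3}-)?[0-9]{3}-[0-9]{4}"
ONE_OR_TWO_PREFIXES = r"([0-9]{3}-){1,2}[0-9]{4}"
LEADING_ONE = r"((1-)?[0-9]{3}-)[0-9]{3}-[0-9]{4}"

numbers = ["555-1212", "207-555-1212", "1-207-555-1212"]

def accepted(pattern: str) -> list:
    """Return the sample numbers that match the pattern in full (hypothetical helper)."""
    return [n for n in numbers if re.fullmatch(pattern, n)]

optional_area_hits = accepted(OPTIONAL_AREA)
prefix_hits = accepted(ONE_OR_TWO_PREFIXES)
leading_one_hits = accepted(LEADING_ONE)

print(optional_area_hits, prefix_hits, leading_one_hits)
```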

Last updated on: 18 Jun 2004