This tutorial describes regular expressions that are used in string pattern searches. Regular expressions are commonly used in many products. If you are already familiar with regular expressions you can skip to the regular expression summary to analyze how these are used within Kilowatt Software L.L.C.'s products.
The remaining sections introduce regular expression concepts one at a time. Try to gain a complete understanding of each expression type before proceeding to the next section, because these concepts eventually become intricately intertwined.
What is a "regular expression"? A regular expression is a sequence of characters that specifies a pattern to locate within a text string. The regular expression is passed as a pattern argument to a search program. The program then analyzes one or more text strings to determine whether they contain the pattern. Strings that match the pattern are often printed. In other contexts, the offset and width of the matching text are determined for subsequent processing.
So, a "regular expression" is a pattern. A "generalized regular expression" is the same kind of pattern, and a "basic generalized regular expression" or an "extended generalized regular expression" is a similar kind of pattern as well.
What types of text search could you require?
Within a pattern, the majority of characters represent themselves. A space within a pattern represents a space in the text.
Thus, you can search for genie by using the pattern:
genie
Or, you can search for a specific phone number by using the pattern:
555-1212
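The two literal patterns above can be tried with a minimal sketch. It assumes Python's `re` module as the search program, which the tutorial itself does not name, and the sample lines are invented for illustration.

```python
import re

# Sample text lines, invented for illustration.
lines = [
    "the genie left the lamp",
    "call 555-1212 for information",
    "nothing of interest here",
]

# Every character in these two patterns represents itself.
genie_hits = [line for line in lines if re.search("genie", line)]
phone_hits = [line for line in lines if re.search("555-1212", line)]

print(genie_hits)
print(phone_hits)
```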
Some characters are used to establish contextual meaning within the pattern. Fourteen characters are used to represent context. These special character values are sometimes referred to as meta-characters.
In addition, the hyphen (-) has a special meaning within a character set.
There are two types of regular expressions -- basic & extended.
In basic regular expressions parentheses and curly braces are preceded by a backslash. This is because early versions of pattern search programs did not use parentheses or curly braces as special characters. Consequently, parentheses and curly braces represented the characters themselves in older pattern search scripts. So within a basic regular expression the special context values are
In extended regular expressions parentheses and curly braces are NOT preceded by a backslash. When extended regular expressions were introduced, there was no need to worry about previously existing pattern search scripts. Consequently, within an extended regular expression parentheses and curly braces do not require a leading backslash.
If you compare this with the first set of special characters above, you will discover that they are the same.
Alternative values are matched by using a vertical bar between the possibilities. Some author names could be matched as follows:
Note: the vertical bar can be used to separate alternative regular expressions as well.
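As a hedged illustration of alternatives, the sketch below uses the author pattern that appears in the summary table, `Kipling|Hemingway|Twain`. Python's `re` module and the sample strings are assumptions made for this example only.

```python
import re

# Alternative literal values, separated by vertical bars.
authors = re.compile(r"Kipling|Hemingway|Twain")

samples = ["Mark Twain", "Rudyard Kipling", "Jane Austen"]
matched = [s for s in samples if authors.search(s)]
print(matched)

# Each alternative can itself be a regular expression.
mixed = re.compile(r"[Tt]wain|Hem.*way")
print(bool(mixed.search("twain")), bool(mixed.search("Hemingway")))
```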
Often you will need to locate a line that has two segments separated by various intervening values. The period special character is used to represent any character value. An arbitrary intervening sequence uses the asterisk special character as well. The following pattern will locate any line that contains the letters of the word 'special' followed by arbitrary intervening characters, and then the letters of the word 'character'. The pattern will also locate a line containing the letters 'specialcharacter' as well.
To locate a line with at least one letter in between, and perhaps many more, the plus special character is used as follows:
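The difference between the two forms can be checked with a small sketch. The patterns `special.*character` and `special.+character` are reconstructed from the descriptions above, and Python's `re` module is an assumed stand-in for the search program.

```python
import re

zero_or_more = re.compile(r"special.*character")  # any intervening text, possibly none
one_or_more = re.compile(r"special.+character")   # at least one intervening character

samples = [
    "a special kind of character",
    "specialcharacter",
    "character, then special",
]
for text in samples:
    print(text, bool(zero_or_more.search(text)), bool(one_or_more.search(text)))
```

Only the first form accepts 'specialcharacter', because the plus sign requires at least one character between the two words.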
A leading caret is used to indicate the pattern must match text at the start of a line. The following locates Giraffe at the start of a line.
A trailing dollar sign is used to indicate the pattern must match text at the end of a line. The following locates Giraffe at the end of a line.
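A short sketch of both anchors follows, assuming the patterns `^Giraffe` and `Giraffe$` and using Python's `re` module as an illustrative search program. The sample lines are invented.

```python
import re

at_start = re.compile(r"^Giraffe")  # must match at the start of the line
at_end = re.compile(r"Giraffe$")    # must match at the end of the line

starts = bool(at_start.search("Giraffe at the zoo"))
not_starts = bool(at_start.search("A tall Giraffe"))
ends = bool(at_end.search("A tall Giraffe"))
not_ends = bool(at_end.search("Giraffe at the zoo"))
print(starts, not_starts, ends, not_ends)
```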
A set of characters that can occur at a position are enclosed in square brackets. For example, the following is the set of lowercase letters.
It would be inconvenient to type all of the lowercase letters, as shown above, whenever you want to match a lowercase letter. The hyphen character allows ranges of characters to be easily specified. Using the special hyphen character, lowercase letters are represented as:
Uppercase letters are represented as:
All digits are represented as:
Suppose you want to match a character that is not within a set. This uses a similar expression, however a caret '^' character is placed immediately after the left square bracket. For example, common telephone numbers can use all of the uppercase letters except Q and Z. The corresponding pattern in this case is:
Notice the caret is a special character within the context of a character set, only in the first position. The caret represents itself in all other positions, as shown in a subsequent example that shows a character set for all PC keyboard characters that are neither letters, nor digits.
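The character sets described above can be exercised as follows. This is a sketch under assumptions: the sets `[a-z]`, `[A-Z]`, and `[0-9]` come from the summary table, while `[^QZ]` is one plausible reading of the telephone example (note that it also accepts characters that are not letters). Python's `re` module is used for illustration.

```python
import re

lowercase = re.compile(r"[a-z]")
uppercase = re.compile(r"[A-Z]")
digit = re.compile(r"[0-9]")
not_q_or_z = re.compile(r"[^QZ]")    # leading caret complements the set
caret_in_set = re.compile(r"[a^]")   # caret not in first position is literal

print(bool(lowercase.fullmatch("q")), bool(uppercase.fullmatch("q")))
print(bool(digit.fullmatch("7")))
print(bool(not_q_or_z.fullmatch("A")), bool(not_q_or_z.fullmatch("Q")))
print(bool(caret_in_set.fullmatch("^")))
```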
Suppose in addition to lowercase letters, you would also like to permit a hyphen character, or a right square bracket, within the set of possible characters. This is done by putting the hyphen, or right square bracket, immediately after the opening left square bracket of the character set.
Add a leading hyphen character:
Add a leading right square bracket:
Add a leading hyphen character to a complemented set:
Add a leading right-square bracket to a complemented set:
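The four cases above are sketched below. The patterns `[-a-z]`, `[]a-z]`, `[^-a-z]`, and `[^]a-z]` are reconstructions from the text, and Python's `re` module, which follows the same leading-position convention, is an assumed test vehicle.

```python
import re

with_hyphen = re.compile(r"[-a-z]")        # hyphen first: a literal hyphen
with_bracket = re.compile(r"[]a-z]")       # right bracket first: a literal ]
without_hyphen = re.compile(r"[^-a-z]")    # complemented set excluding hyphen
without_bracket = re.compile(r"[^]a-z]")   # complemented set excluding ]

print(bool(with_hyphen.fullmatch("-")), bool(with_bracket.fullmatch("]")))
print(bool(without_hyphen.fullmatch("-")), bool(without_bracket.fullmatch("]")))
print(bool(without_bracket.fullmatch("A")))
```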
Suppose you want other special characters (e.g. a right square bracket ']', or a backslash) to be within your character set. You use the backslash before the special character:
The common punctuation characters are represented by the set:
All of the PC keyboard characters that are neither letters, nor digits, are represented by the set:
Notice the leading space above. The characters \b, \t, \n, and \r will be discussed in a subsequent section below.
Notice many of the characters above were not escaped, even though they are special regular expression characters. Within the context of a character set these characters are not special.
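Both points, escaping inside a set and the loss of special meaning inside a set, can be seen in a small sketch. The set `[a-z\]\\]` is a hypothetical example rather than the tutorial's original, and Python's `re` module is assumed.

```python
import re

# Lowercase letters, plus an escaped right bracket and an escaped backslash.
escaped_set = re.compile(r"[a-z\]\\]")

# Inside a set, these otherwise special characters stand for themselves.
plain_set = re.compile(r"[.*+?]")

print(bool(escaped_set.fullmatch("]")), bool(escaped_set.fullmatch("\\")))
print(bool(plain_set.fullmatch("*")), bool(plain_set.fullmatch("a")))
```

The period inside `[.*+?]` matches only a literal period, so the letter 'a' is rejected.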
Some sequences can repeat in various ways.
Patterns can contain sequences that appear multiple times, but possibly not at all. For example, proper names begin with an uppercase letter, and continue with a series of lowercase letters. Occasionally, the trailing lowercase letters are omitted. The special asterisk '*' character is used for sequences that occur zero or more times. The corresponding pattern for a name is:
Notice the asterisk binds to the element immediately to its left. In the example above it binds to the character set [a-z].
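A sketch of the name pattern follows, assuming it is `[A-Z][a-z]*` as the description implies. Python's `re` module stands in for the search program.

```python
import re

# One uppercase letter, then zero or more lowercase letters.
name = re.compile(r"[A-Z][a-z]*")

full_name = bool(name.fullmatch("Smith"))
initial_only = bool(name.fullmatch("S"))   # the asterisk permits zero lowercase letters
no_capital = bool(name.fullmatch("smith"))
print(full_name, initial_only, no_capital)
```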
Other patterns can require at least one occurrence of a sequence, and permit multiple additional occurrences as well. Words within a sentence are a series of one or more lowercase characters, with a possible hyphen. The special plus '+' character is used for sequences that occur one or more times. The corresponding pattern for a word is:
Notice the plus sign binds to the element immediately to its left. In the example above it binds to the character set [a-z-].
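The word pattern, taken here to be `[a-z-]+` from the character set named above, can be tried on a sample sentence. Python's `re` module and the sentence are assumptions for illustration.

```python
import re

# One or more lowercase letters or hyphens.
word = re.compile(r"[a-z-]+")

words = word.findall("a well-known fact")
print(words)  # ['a', 'well-known', 'fact']
```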
Some patterns are optional. US postal codes begin with five digits and have an optional sequence of four digits and a separating hyphen. The special question mark '?' character is used for sequences that occur zero or one time. The corresponding pattern for a US postal code is:
Notice the question mark binds to the element immediately to its left. In the example above it binds to the group (-[0-9][0-9][0-9][0-9]).
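The optional group can be demonstrated with a sketch. The full pattern is reconstructed from the group quoted above, and Python's `re` module is an assumed search program.

```python
import re

# Five digits, then an optional group: a hyphen and four more digits.
postal = re.compile(r"[0-9][0-9][0-9][0-9][0-9](-[0-9][0-9][0-9][0-9])?")

five_digit = bool(postal.fullmatch("03903"))
nine_digit = bool(postal.fullmatch("03903-0001"))
truncated = bool(postal.fullmatch("03903-01"))
print(five_digit, nine_digit, truncated)
```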
As seen in the postal code pattern above, some patterns repeat a precise number of times. Other patterns can repeat a minimum number of times, with more repetitions permitted. Other patterns can repeat between a minimum and a maximum number of times, inclusively. Curly braces are used to specify the number of repetitions.
Patterns that repeat a precise number of times, have the number within curly braces.
Example: a five-digit US zip code is matched by:
Notice the curly brace expression binds to the element immediately to its left. In the example above it binds to the character set [0-9].
A five-digit US zip code followed by an optional four-digit sequence is matched by:
Notice the second curly brace expression binds to the character set [0-9].
Notice the trailing question mark binds to the group (-[0-9]{4}). Consequently, the group can appear 0 or 1 time.
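The exact-count form can be compared with the longer spelling used earlier. This sketch assumes the patterns `[0-9]{5}` and `[0-9]{5}(-[0-9]{4})?`, consistent with the group quoted above, and uses Python's `re` module for illustration.

```python
import re

zip5 = re.compile(r"[0-9]{5}")
zip9 = re.compile(r"[0-9]{5}(-[0-9]{4})?")
long_form = re.compile(r"[0-9][0-9][0-9][0-9][0-9](-[0-9][0-9][0-9][0-9])?")

samples = ["03903", "03903-0001", "0390", "03903-000"]
# The counted form and the spelled-out form accept the same strings.
same = all(
    bool(zip9.fullmatch(s)) == bool(long_form.fullmatch(s)) for s in samples
)
accepted = [s for s in samples if zip9.fullmatch(s)]
print(same, accepted)
```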
When the pattern repeats a minimum number of times, and then any number of times more, a comma follows the number between curly braces. Some system designers require passwords to be 8 characters or longer. This is represented as:
Notice the curly brace expression binds to the element immediately to its left. In the example above it binds to a period that represents any character.
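A minimum-count check might look like the sketch below, assuming the password pattern is `.{8,}`. Python's `re` module and the sample passwords are illustrative assumptions.

```python
import re

# Any character, repeated eight or more times.
long_enough = re.compile(r".{8,}")

too_short = bool(long_enough.fullmatch("secret"))
exactly_eight = bool(long_enough.fullmatch("12345678"))
longer = bool(long_enough.fullmatch("correct horse"))
print(too_short, exactly_eight, longer)
```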
Some patterns repeat between a minimum and a maximum number of times. In this case a comma separates the two numbers in curly braces. For example, a US telephone number consists of one or two 3-digit prefixes followed by a 4-digit number. Each prefix is followed by a separating hyphen. This is represented as:
Notice the curly brace expression binds to the element immediately to its left. In the example above {1,2} binds to the group ([0-9]{3}-). Consequently, the group can appear 1 or 2 times. Thus, the pattern matches 555-1212 and 207-555-1212.
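The two phone numbers mentioned above can be checked with a sketch. The pattern `([0-9]{3}-){1,2}[0-9]{4}` is assembled from the group quoted in the text, and Python's `re` module is assumed as the search program.

```python
import re

# One or two 3-digit prefixes, each followed by a hyphen, then four digits.
phone = re.compile(r"([0-9]{3}-){1,2}[0-9]{4}")

local = bool(phone.fullmatch("555-1212"))
with_area = bool(phone.fullmatch("207-555-1212"))
no_prefix = bool(phone.fullmatch("1212"))
three_prefixes = bool(phone.fullmatch("001-207-555-1212"))
print(local, with_area, no_prefix, three_prefixes)
```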
Patterns can contain sequences that are accessed collectively. A pair of parentheses surrounds the collection. Several examples of grouped sequences have already appeared in earlier sections. You will find that groups are commonly necessary for pattern searches.
Here we will consider proper names. Each name begins with an uppercase letter, and continues with a series of lower case letters. The corresponding pattern is:
Omitting the names of various celebrities, many individuals are addressed by a first and last name separated by a space. The corresponding pattern is:
Note, the above pattern allows a trailing space to follow the last name.
With the addition of an optional middle name, the pattern is:
The pattern above matches proper names consisting of 2 or 3 names.
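One plausible reconstruction of the name patterns is sketched below. The grouped form `([A-Z][a-z]* ?){2,3}` is an assumption chosen to fit the description (two or three names, with a trailing space allowed); the tutorial's original pattern is not shown here. Python's `re` module is used for illustration.

```python
import re

# A name followed by an optional space, repeated two or three times.
# Hypothetical reconstruction; the original pattern may differ.
proper_name = re.compile(r"([A-Z][a-z]* ?){2,3}")

two_names = bool(proper_name.fullmatch("John Smith"))
three_names = bool(proper_name.fullmatch("John Quincy Adams"))
trailing_space = bool(proper_name.fullmatch("John Smith "))
single_name = bool(proper_name.fullmatch("Cher"))
print(two_names, three_names, trailing_space, single_name)
```

Because the space inside the group is optional, this form also accepts run-together names such as 'JohnSmith'; a stricter pattern would require the space between names.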
What if you want to match one of the special characters? This is done by preceding the character with a backslash. The backslash character acts as an escape for the subsequent character's special meaning. It becomes the character itself within the pattern. Thus, a left square bracket is specified as:
And a backslash character is specified with two backslashes:
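The two escapes described above, `\[` and `\\`, behave as follows in a minimal sketch. Python's `re` module is assumed, and raw strings (the `r` prefix) are used so that Python itself does not consume the backslashes.

```python
import re

left_bracket = re.compile(r"\[")   # a literal left square bracket
backslash = re.compile(r"\\")      # a literal backslash

found_bracket = left_bracket.search("items[0]")
found_backslash = backslash.search("C:\\temp")  # the text is C:\temp

print(found_bracket.start(), found_backslash.start())
```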
The following characters are represented by a backslash followed by a letter.
Sequence | Meaning | Hexadecimal |
---|---|---|
\b | backspace | x'08' |
\e | ESCape | x'1b' |
\f | formfeed | x'0c' |
\n | newline | x'0a' |
\r | carriage return | x'0d' |
\s | space | x'20' |
\t | tab | x'09' |
Octal values are specified by a backslash followed by 1 to 3 octal digits (0 to 7). For example, an ASCII ESCape character is represented in octal as follows:
Hexadecimal values are specified by a backslash followed by the letter 'x', and then 1 or 2 hexadecimal digits (0 to 9, a to f, or A to F). For example, an ASCII ESCape character is represented in hexadecimal as follows:
An ASCII caret character is represented in hexadecimal as follows:
Control character values are specified by a backslash followed by a caret '^', and then a single character value -- usually in the range A to Z. Control-Z has a value of decimal 26, and is represented as follows:
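The numeric escapes can be verified with a sketch that assumes the forms `\033`, `\x1b`, and `\x5e`, which follow from the rules above. Python's `re` module is used for illustration; it accepts the octal and hexadecimal forms but not the `\e` or `\^Z` notations, so Control-Z is written here by its hexadecimal value instead.

```python
import re

ESC = chr(27)  # ASCII ESCape: hexadecimal 1b, octal 033

octal_match = re.fullmatch(r"\033", ESC)
hex_match = re.fullmatch(r"\x1b", ESC)
caret_match = re.fullmatch(r"\x5e", "^")

# Control-Z is decimal 26: the code of 'Z' (90) minus 64.
ctrl_z = chr(ord("Z") - 64)
ctrl_z_match = re.fullmatch(r"\x1a", ctrl_z)

print(bool(octal_match), bool(hex_match), bool(caret_match), bool(ctrl_z_match))
```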
Expression | Meaning | Examples, extended | Basic expression equivalents |
---|---|---|---|
c | Normal character | Radiance | |
. | Any character | Rad...ce | |
^ | Start of line | ^Radiance | |
$ | End of line | radiance$ | |
[aAbB] | Character set | [0123456789] | |
[m-n] | Character set sequence | [a-z] [a-zA-Z] [0-9] | |
[^aAbB] | Non-character set | [^0123456789] | |
[^m-n] | Non-character set sequence | [^0-9] | |
? | 0 or 1 occurrence | [0-9]? | |
+ | 1 or more occurrences | .+ | |
* | 0 or more occurrences | [a-z]* | |
{n} | Exact count | [0-9]{5} | [0-9]\{5\} |
{n,} | Minimum count or more | [0-9]{1,} | [0-9]\{1,\} |
{min,max} | Minimum count up to maximum count | [0-9]{1,5} | [0-9]\{1,5\} |
\| | Alternative | Kipling\|Hemingway\|Twain | |
(expr) | Group | (Hydrogen\|Carbon\|Oxygen) | \(Hydrogen\|Carbon\|Oxygen\) |
(expr)? | Optional group (0 or 1) | (Hydrogen\|Carbon\|Oxygen)? | \(Hydrogen\|Carbon\|Oxygen\)? |
(expr)+ | Repeating group (1 or more) | (Hydrogen\|Carbon\|Oxygen)+ | \(Hydrogen\|Carbon\|Oxygen\)+ |
(expr)* | Repeating group (0 or more) | (Hydrogen\|Carbon\|Oxygen)* | \(Hydrogen\|Carbon\|Oxygen\)* |
(expr){n} | Repeating group (exactly n) | (Hydrogen\|Carbon\|Oxygen){2} | \(Hydrogen\|Carbon\|Oxygen\)\{2\} |
(expr){n,} | Repeating group (n or more) | (Hydrogen\|Carbon\|Oxygen){2,} | \(Hydrogen\|Carbon\|Oxygen\)\{2,\} |
(expr){m,n} | Repeating group (m to n) | (Hydrogen\|Carbon\|Oxygen){2,5} | \(Hydrogen\|Carbon\|Oxygen\)\{2,5\} |
A surname could be matched by the following regular expression:
The above matches Smith, or Smith-Jones
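A surname pattern with this behavior could be sketched as below. The expression `[A-Z][a-z]+(-[A-Z][a-z]+)?` is a hypothetical reconstruction that fits the two examples given; the tutorial's own pattern may differ. Python's `re` module is assumed.

```python
import re

# A capitalized name, optionally followed by a hyphen and a second name.
# Hypothetical reconstruction, not the tutorial's original pattern.
surname = re.compile(r"[A-Z][a-z]+(-[A-Z][a-z]+)?")

single = bool(surname.fullmatch("Smith"))
hyphenated = bool(surname.fullmatch("Smith-Jones"))
dangling = bool(surname.fullmatch("Smith-"))
print(single, hyphenated, dangling)
```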
Various formal names could be matched by the following regular expression:
The above matches Sir Walter Raleigh, or Ms. Walker, or Mrs. James-Jones
An alternative pattern that matches the same formal names is:
The above matches Sir Walter Raleigh, or Mrs. James-Jones
US social security numbers can be matched by the following EXTENDED regular expression:
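A sketch of such an expression follows, assuming the usual written form of three digits, two digits, and four digits separated by hyphens. The pattern `[0-9]{3}-[0-9]{2}-[0-9]{4}` and the sample numbers are illustrative assumptions, checked here with Python's `re` module.

```python
import re

# Three digits, a hyphen, two digits, a hyphen, four digits.
ssn = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

well_formed = bool(ssn.fullmatch("123-45-6789"))
wrong_grouping = bool(ssn.fullmatch("123-456-789"))
print(well_formed, wrong_grouping)
```

In a basic regular expression the curly braces would each be preceded by a backslash, as the summary table shows.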
US zip codes can be matched by the following EXTENDED regular expression:
The above matches 03903, or 03903-0001
US phone numbers can be matched by the following EXTENDED regular expressions:
The above patterns match the following numbers
The following pattern matches a phone number that optionally has a leading 1- before the area code.
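One way to express this is to place an optional group in front of the earlier phone pattern. The expression `(1-)?([0-9]{3}-){1,2}[0-9]{4}` is a sketch under that assumption, not necessarily the tutorial's original, and Python's `re` module is used to try it.

```python
import re

# Optional leading "1-", then one or two 3-digit prefixes, then four digits.
# Hypothetical reconstruction built from the earlier phone pattern.
phone = re.compile(r"(1-)?([0-9]{3}-){1,2}[0-9]{4}")

with_one = bool(phone.fullmatch("1-207-555-1212"))
without_one = bool(phone.fullmatch("207-555-1212"))
wrong_lead = bool(phone.fullmatch("2-207-555-1212"))
print(with_one, without_one, wrong_lead)
```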