REGREPL statement

Purpose

Scan a string for a matching "wildcard" or regular expression, and replace it with a new value.

Syntax

REGREPL mask$ IN main$ WITH repl$ [AT start&] TO iPos&, newmain$

Remarks

REGREPL scans main$ for a matching regular expression specified in mask$.  If a match is made, REGREPL replaces the matched text with the contents of repl$, and assigns the new text to newmain$.  Additionally, REGREPL sets iPos& to reflect the character position immediately following the matched text in newmain$, so the operation can be repeated, if desired.

If no matching expression is found, iPos& will be set to zero, and newmain$ receives a direct copy of main$.  In either case, main$ remains unchanged.
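As a minimal sketch of the two outcomes described above, the following fragment runs one search that matches and one that does not.  The sample strings are made up for illustration, and the variable names simply mirror those used in the Example section.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Hypothetical sample data, chosen only for this illustration
  a$ = "The cat sat on the mat"
  c$ = "dog"

  ' Case 1: the mask occurs in a$, so the match is replaced in the copy
  b$ = "cat"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "The dog sat on the mat", position& is nonzero
  ' a$ itself is left unchanged

  ' Case 2: the mask does not occur in a$
  b$ = "bird"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: position& -> 0, d$ -> a copy of a$

END FUNCTION
```

Testing position& for zero is therefore enough to tell whether a replacement took place.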

mask$ may contain literal characters and metacharacters (wildcards) to form the regular expression, and repl$ may only contain literal characters and tags specified by \##.  Each tag from \01 through \99 is replaced by the text actually matched for that tag.  \00 is replaced by the entire matched text.
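The tag substitution described above can be sketched as follows.  The name string is a hypothetical example; the results in the comments are what the rules above lead one to expect, not captured output.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Hypothetical sample data
  a$ = "Doe, John"

  ' Two tags: \01 is the word before the comma, \02 the word after it
  b$ = "([a-z]+), ([a-z]+)"

  ' Swap the two tagged words
  c$ = "\02 \01"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "John Doe"

  ' \00 stands for the entire matched text
  c$ = "<\00>"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "<Doe, John>"

END FUNCTION
```

The mask relies on the default case-insensitive matching, so [a-z] also accepts the capital letters in the sample.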

If specified, the search begins at the character position start& in main$; however, start& must be between 1 and the length of main$.  If start& is less than 1, the start& parameter is ignored.

While more than one match may be found in a particular target string, REGREPL first selects the match or matches that start at the leftmost possible position, then returns the longest of those.  Use the \s special escape operator to force the shortest match instead (see below).
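To illustrate the difference between the default longest match and the \s shortest match, this sketch reuses the mask and target string given in the \s entry of the escape list.  The replacement text "X" is an arbitrary choice for the illustration.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  a$ = "abcdabcabc"
  c$ = "X"                 ' arbitrary replacement text

  ' Default: the longest match from the leftmost position (all 10 characters)
  b$ = "abc.*abc"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "X"

  ' With \s at the start of the mask: the shortest match (7 characters)
  b$ = "\sabc.*abc"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "Xabc"

END FUNCTION
```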

The ^ and $ operators match either the actual start/end of the string or the previous/next embedded line-delimiter characters (CHR$(13,10) or $CRLF) in main$.  This enables REGREPL to treat the main$ string as containing a set of "logical lines" of text.  In this situation, the start& character position plays a crucial role in identifying which logical line is examined by REGREPL.
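The role of start& in selecting a logical line might be sketched like this.  The three-line string is hypothetical, and INSTR is used only as one convenient way to find a starting position inside the second line.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Hypothetical three-line sample
  a$ = "alpha" + $CRLF + "beta" + $CRLF + "gamma"
  b$ = "[a-z]+$"           ' a run of letters that ends a logical line
  c$ = "X"

  ' Without AT, the search starts at position 1, inside the first line
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "X" + $CRLF + "beta" + $CRLF + "gamma"

  ' With AT, the search starts inside the second logical line
  start& = INSTR(a$, "beta")
  REGREPL b$ IN a$ WITH c$ AT start& TO position&, d$
  ' Expected: d$ -> "alpha" + $CRLF + "X" + $CRLF + "gamma"

END FUNCTION
```

In both cases the $CRLF pair that ends the line is expected to remain in place, since the delimiter itself is not part of the matched text.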

By default, search expressions are assumed to be case-insensitive, so capitalization is ignored.
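A short sketch of the default behavior, contrasted with the \c case-sensitive operator from the escape list, using a made-up sample string:

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Hypothetical sample data
  a$ = "Say HELLO"
  c$ = "goodbye"

  ' Default: capitalization is ignored, so "hello" matches "HELLO"
  b$ = "hello"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "Say goodbye"

  ' With \c the search is case-sensitive, so no match is found
  b$ = "\chello"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: position& -> 0, d$ -> "Say HELLO"

END FUNCTION
```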

mask$

The regular (wildcard) expression specified in mask$ may contain a combination of standard text characters and/or the metacharacters which are defined as follows:

char

Definition

.

(period) Matches any character, except the end-of-line.

^

(caret) Matches the actual beginning-of-line position or the preceding line-delimiter character pair (CHR$(13,10) or $CRLF), as taken from the start& character position.  The line-delimiter characters themselves are not replaced by repl$.  (Also see [^] below for usage within a character class definition.)

$

(dollar) Matches the end-of-line position, which may be either the first line-delimiter character pair (CHR$(13,10) or $CRLF) that is encountered in the search to the right of the start& position, or the actual end of the main$ string, whichever occurs first.  The line-delimiter characters themselves are not replaced by repl$.

|

(stile) Specifies alternation (the OR operator), so that an expression on either side can match.  Precedence is from left-to-right, as encountered in the expression.

?

(question mark) Specifies that zero or one match of the preceding sub-pattern is allowed.  Cannot be used with a Tag.

+

(plus) Specifies that one or more matches of the preceding sub-pattern are allowed.  Cannot be used with a Tag.

*

(asterisk) Specifies that zero or more matches of the preceding sub-pattern are allowed.  Cannot be used with a Tag.
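As an illustration of the repetition operators above, the following sketch applies ? and + to single literal characters.  The sample strings are hypothetical, and the results in the comments follow from the definitions above rather than from captured output.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' ? : the "u" may be present or absent
  a$ = "The colour red"
  b$ = "colou?r"
  c$ = "hue"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "The hue red"

  ' + : one or more "o" characters; the longest run is taken by default
  a$ = "foo bar"
  b$ = "o+"
  c$ = "0"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "f0 bar"

END FUNCTION
```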

Character classes

[ ]

(square brackets) Identifies a user-defined class of characters, any of which will match:  [abc] will match a, b, or c.  Only three special metacharacters are recognized within a class definition, the caret (^) for complemented characters, the hyphen (-) for a range of characters, or one of the following \ backslash escape sequences:

 

\\  \-  \]  \e  \f  \n  \q  \r  \t  \v  \x##

 

Any other use of a backslash within a class definition yields an undefined operation that should be avoided.

[-]

(hyphen) The hyphen identifies a range of characters to match.  For example, [a-f] will match a, b, c, d, e, or f.

 

Characters in an individual range must occur in the natural order as they appear in the character set.  For example, [f-a] will match nothing.

 

Lists of characters, and one or more ranges of characters, may be intermixed in a single class definition.  The start and end of a range may be specified by a literal character, or one of the \ backslash escape sequences:

 

\\  \-  \]  \e  \f  \n  \q  \r  \t  \v  \x##

 

Any other use of a backslash within a class definition yields an undefined operation.

 

Multiple ranges in a class are valid.  For example, [a-d2-5] matches a, b, c, d, 2, 3, 4, or 5.

 

When the hyphen is escaped, it is treated as a literal.  For example, [a\-c] is a list, not a range, and matches a, -, or c due to the \ backslash escape sequence.

[^]

(caret) When the caret appears as the first item in a class definition, it identifies a complemented class of characters, which will not match.  For example, [^abc] matches any character except a, b, or c.

 

A range can also be specified for the complemented class.  For example, [^a-z] matches any character except a through z.

 

A caret located in any position other than the first is treated as a literal character.
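The class definitions above might be exercised as in this sketch, which uses a range and a complemented class on a made-up, all-lowercase part label.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Hypothetical sample data
  a$ = "part a-17"
  c$ = "#"

  ' A range: the first run of digits
  b$ = "[0-9]+"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "part a-#"

  ' A complemented class: the first run of characters that are
  ' neither a letter nor a space, which here is "-17"
  b$ = "[^a-z ]+"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "part a#"

END FUNCTION
```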

Tags/sub-patterns

( )

(parentheses) Parentheses are used to match a Tag, or sub-pattern, within the full search pattern, and remember the match.  The matched sub-pattern can be retrieved later in the mask, or in a replace operation, with \01 through \99, based upon the left-to-right position of the opening parentheses.

 

Parentheses may also be used to force precedence of evaluation with the alternation operator.  For example, "(Begin)|(End)File" would match either "BeginFile" or "EndFile", but without the Tag designations, "Begin|EndFile" would only match either "BeginndFile" or "BegiEndFile".

 

Note: Parentheses may not be used with ? + * as any match repetition could cause the tag value to be ambiguous.  To match repeated expressions, use parentheses followed by \01*.
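The two uses of parentheses described above, grouping for alternation and repetition through \01*, could be sketched as follows.  The target strings are hypothetical, and the expected results are inferred from the descriptions above rather than taken from a test run.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Alternation between two tagged sub-patterns, followed by "File"
  a$ = "Call EndFile now"
  b$ = "(Begin)|(End)File"
  c$ = "CloseFile"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "Call CloseFile now"

  ' Repetition of a tagged sub-pattern: tag it once, then repeat it with \01*
  a$ = "xxabababyy"
  b$ = "(ab)\01*"
  c$ = "-"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "xx-yy" (the longest run of "ab" is replaced)

END FUNCTION
```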

Escaped characters

\

(backslash) The escape operator (single-character quote).  The following character will be treated as a literal value rather than being interpreted as a special character.  Note that the character following the backslash must actually be a special character, as follows:

\b

A word boundary.  The start or end of a word, where a word is defined as one or more characters, each of which is an alphabetic character (A-Z or a-z), a numeric character (0-9), or an underscore.  For example, "abc_123" is considered a single word and "abc-123" is considered two words.

\c

Case-sensitive search.  Without the \c operator, the default is to ignore case when matching.  Unlike some other implementations of regular expressions, case-insensitivity is recognized in all operations, even a range of characters such as "[6-Z]".  The \c operator may appear at any position in the mask.

\e

Escape character: CHR$(27) or $ESC.

\f

Formfeed character: CHR$(12) or $FF.

\n

Linefeed (or newline) character: CHR$(10) or $LF.

\q

Double-quote mark ("): CHR$(34) or $DQ.  \q is included for ease of inclusion within a literal string.  For example: "\qHello\q".

\r

Carriage-return character: CHR$(13) or $CR.

\s

Shortest match character: The \s flag causes the shortest matching string to be returned, rather than the longest (the default).  For example, when searching for the mask "abc.*abc" in "abcdabcabc", the default setting would match 10 characters starting at position 1.  With the \s switch set, it matches 7 characters starting at position 1.  This option may cause a slight increase in processing time.  The \s flag must appear at the beginning of the mask string.

\t

Horizontal tab character: CHR$(9) or $TAB.

\v

Vertical tab character: CHR$(11) or $VT.

\x##

Hex character code: Indicates that an ASCII code follows, given by two hexadecimal digits.  For example, \xFF = CHR$(&HFF) (which is equivalent to CHR$(255)).  ## must be in the range 00 through FF (0 through 255 decimal).

\##

Tag number: Evaluated as the characters matched by tag number ## (where ## is in the range 01 through 99, in decimal).  Tags are implicitly numbered from 01 through 99, based upon the left-to-right position of the left parenthesis.  "(...)w\01" would match "abcwabc" or "456w456".

Tags cannot be forward-referenced - that is, if a reference is made to any Tag that is not yet defined, a non-match is presumed.
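As a sketch of how the escape sequences above combine in practice, the fragment here uses a tag reference inside the mask and a \t escape.  The sample strings are hypothetical, and the square brackets in the replacement text are plain literal characters.

```powerbasic
#COMPILE EXE

FUNCTION PBMAIN () AS LONG

  ' Back-reference: three characters, a "w", then the same three again
  a$ = "see abcwabc here"
  b$ = "(...)w\01"
  c$ = "[\01]"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "see [abc] here"

  ' \t matches a horizontal tab in the target string
  a$ = "a" + $TAB + "b"
  b$ = "\t"
  c$ = ", "
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' Expected: d$ -> "a, b"

END FUNCTION
```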

Restrictions

To maximize performance, avoid overuse of the *, + and ? metacharacters.

See also

REGEXPR

Example

#COMPILE EXE

FUNCTION PBMAIN
  ' Replace the user name of an e-mail address, keeping the domain (tag 02)
  a$ = "please email support@powerbasic.com"
  b$ = "([a-z0-9._/+-]+)(@[a-z0-9.-]+)"
  c$ = "sales\02"
  REGREPL b$ IN a$ WITH c$ TO position&, d$
  ' d$ -> "please email sales@powerbasic.com"

  ' Append ".0" to the digit that ends each logical line,
  ' repeating from the returned position until no match remains
  a$ = "Line 1" + $CRLF + "Line 2" + $CRLF
  b$ = "([0-9])$"
  c$ = "\01.0"
  position& = 1
  DO
    REGREPL b$ IN a$ WITH c$ AT position& TO position&, a$
  LOOP WHILE position&
  ' a$ -> "Line 1.0" + $CRLF + "Line 2.0" + $CRLF
END FUNCTION