Regular expressions are a powerful tool for pattern matching, that is, for finding one string within another. Here are some basic pattern-matching rules.
Any single character matches itself, unless it is a metacharacter with a special meaning described below. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a \ (e.g., \. matches a ., not any character; \\ matches a \). A series of characters matches that series of characters in the target string, so the pattern blurfl would match blurfl in the target string.
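The escaping rule above can be illustrated with a short sketch. It uses Python's re module as a stand-in engine (an assumption; the engine this document describes may differ in details), and the variable names are only illustrative.

```python
import re

# An unescaped dot is a metacharacter; an escaped dot is a literal period.
any_char = re.compile(r"a.c")
literal_dot = re.compile(r"a\.c")

print(bool(any_char.search("abc")))       # True: . matches b
print(bool(literal_dot.search("abc")))    # False: no literal period in the target
print(bool(literal_dot.search("a.c")))    # True

# A doubled backslash in the pattern matches one literal backslash.
backslash = re.search(r"\\", "C:\\temp")
print(backslash is not None)              # True

# A plain series of characters matches itself inside a longer string.
found = re.search(r"blurfl", "xx blurfl yy")
print(found.group())                      # blurfl
```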
You can specify a character class by enclosing a list of characters in [], which will match any one character from the list. If the first character after the [ is ^, the class matches any character not in the list. Within a list, the - character specifies a range, so that a-z represents all characters between a and z, inclusive. If you want either - or ] itself to be a member of a class, put it at the start of the list (possibly after a ^), or escape it with a backslash. The - is also taken literally when it is at the end of the list, just before the closing ]. (The following all specify the same class of three characters: [-az], [az-], and [a\-z].) Also, if you try to use one of the character classes \w, \W, \s, \S, \d, or \D as an endpoint of a range, no range is formed and the - is understood literally.
Note also that ranges are not very portable between character sets, and even within a single character set they may produce results you did not expect. A sound principle is to use only ranges that begin and end on letters of the same case ([a-e], [A-E]) or on digits ([0-9]). Anything else is unsafe. If in doubt, spell out the character set in full.
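As a rough check of the three equivalent spellings of the class, here is a small sketch using Python's re module as a stand-in engine (an assumption; other engines may treat edge cases differently). The sample text is made up for illustration.

```python
import re

text = "a-z b"  # hypothetical sample containing a, -, z, a space, and b

# Three spellings of the same three-character class.
spellings = [r"[-az]", r"[az-]", r"[a\-z]"]
results = [re.findall(spelling, text) for spelling in spellings]
for spelling, result in zip(spellings, results):
    print(spelling, result)            # each prints ['a', '-', 'z']

# A leading ^ negates the class; the - right after it is still literal.
negated = re.findall(r"[^-az]", text)
print(negated)                         # [' ', 'b']

# Between two letters, - forms a range.
in_range = re.findall(r"[a-e]", "abcxyz")
print(in_range)                        # ['a', 'b', 'c']
```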
Characters may be specified using a metacharacter syntax: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc. More generally, \nnn, where nnn is a string of octal digits, matches the character whose coded character set value is nnn. Similarly, \xnn, where nn are hexadecimal digits, matches the character whose numeric value is nn. The expression \cx matches the character control-x. Finally, the . metacharacter matches any character except \n (unless you use /s).
You can specify a series of alternatives for a pattern using | to separate them, so that fee|fie|foe will match any of fee, fie, or foe in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ((, [, or the beginning of the pattern) up to the first |, and the last alternative contains everything from the last | to the next pattern delimiter. That's why it's common practice to include alternatives in parentheses: to minimize confusion about where they start and end.
Within a pattern, you may designate subpatterns for later reference by enclosing them in parentheses, and you may refer back to the nth subpattern later in the pattern using the metacharacter \n. Subpatterns are numbered based on the left to right order of their opening parenthesis. A backreference matches whatever actually matched the subpattern in the string being examined, not the rules for that subpattern. Therefore, (0|0x)\d*\s\1\d* will match 0x1234 0x4321, but not 0x1234 01234, because subpattern 1 matched 0x, even though the rule 0|0x could potentially match the leading 0 in the second number.
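The backreference example above can be tried directly. This sketch assumes Python's re module, whose backreference syntax happens to match the \n form used here; the variable names are illustrative only.

```python
import re

pattern = re.compile(r"(0|0x)\d*\s\1\d*")

# Subpattern 1 ends up matching 0x, so \1 must also match 0x.
good = pattern.search("0x1234 0x4321")
print(good.group(0))   # 0x1234 0x4321
print(good.group(1))   # 0x

# The second number starts with 0 but not 0x, so \1 cannot match.
bad = pattern.search("0x1234 01234")
print(bad)             # None
```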
Metacharacters
Below is the list of metacharacters used in Regular Expressions.
Escape sequences
| \xnn | Character with hex code nn |
| \t | Tab character |
| \n | Newline |
| \r | Carriage return |
| \f | Form feed |
| \a | Alarm (bell) |
| \e | Escape |
Examples:
| foo\x20bar | Matches foo bar (note the space in the middle) |
| \tfoobar | Matches foobar preceded by a tab |
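The two escape-sequence examples can be checked with a short sketch. It assumes Python's re module as the engine; note that not every escape in the table is accepted by every engine, so only \x, the octal form, and \t are used here.

```python
import re

space_hex = re.search(r"foo\x20bar", "foo bar")     # \x20 is a space
space_octal = re.search(r"foo\040bar", "foo bar")   # octal 040 is the same character
tab_ok = re.search(r"\tfoobar", "\tfoobar")         # target starts with a real tab
tab_missing = re.search(r"\tfoobar", "foobar")      # no tab in the target

print(space_hex is not None)     # True
print(space_octal is not None)   # True
print(tab_ok is not None)        # True
print(tab_missing is not None)   # False
```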
Line separators
| ^ | Beginning of the line |
| $ | End of the line |
| \A | Beginning of text |
| \Z | End of text |
| . | Any character |
Examples:
| ^foobar | Matches string foobar only if it's at the beginning of the line |
| foobar$ | Matches string foobar only if it's at the end of the line |
| ^foobar$ | Matches string foobar only if it is the entire line |
| foob.r | Matches strings like foobar, foobbr, foob1r |
The ^ metacharacter by default is only guaranteed to match at the beginning of the string/text, the $ metacharacter only at the end. Embedded line separators will not be matched by ^ or $.
The . metacharacter by default matches any character, but if you turn off the modifier /s, then . won't match embedded line separators.
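To see how ^ and $ behave with embedded line separators, here is a minimal sketch in Python's re module (an assumption; its re.MULTILINE flag plays the role of the m modifier). The behaviour of . is engine dependent and is left out of this sketch.

```python
import re

text = "foobar\nbarfoo"  # two lines in one string

# By default ^ and $ only anchor at the start and end of the whole text.
start_default = re.findall(r"^barfoo", text)
end_default = re.findall(r"foobar$", text)
print(start_default, end_default)   # [] []

# In multi-line mode they also anchor at embedded line separators.
start_multi = re.findall(r"^barfoo", text, re.MULTILINE)
end_multi = re.findall(r"foobar$", text, re.MULTILINE)
print(start_multi, end_multi)       # ['barfoo'] ['foobar']

# \A always means the beginning of the text, even in multi-line mode.
text_start = re.findall(r"\Abarfoo", text, re.MULTILINE)
print(text_start)                   # []
```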
Predefined character classes
| \w | Any alphanumeric character (letter or digit), including _ |
| \W | Any non-alphanumeric character (neither letter, digit, nor _) |
| \d | Any numeric character (digit) |
| \D | Any non-numeric character |
| \s | Any space character (same as [ \t\n\r\f]) |
| \S | Any non-space character |
You may use \w, \d and \s within custom character classes.
Examples:
| foob\dr | Matches strings foob1r, foob6r, but not foobar, foobbr |
| foob[\w\s]r | Matches strings foobar, foob r, foobbr, foob2r, but not foob=r |
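A quick way to try the predefined classes, both alone and inside a custom class, is sketched below using Python's re module (an assumption; fullmatch is used so that the whole candidate string must match).

```python
import re

digit = re.compile(r"foob\dr")
word_or_space = re.compile(r"foob[\w\s]r")

# \d accepts digits only.
digit_hits = [bool(digit.fullmatch(s)) for s in ("foob1r", "foob6r", "foobar", "foobbr")]
print(digit_hits)    # [True, True, False, False]

# [\w\s] accepts a word character or whitespace, but not punctuation such as =.
class_hits = [bool(word_or_space.fullmatch(s)) for s in ("foobar", "foob r", "foobbr", "foob=r")]
print(class_hits)    # [True, True, True, False]
```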
Word boundaries
| \b | Matches a word boundary |
| \B | Matches a non-word boundary |
A word boundary \b is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
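The definition of a word boundary is easier to see on a concrete string. This sketch assumes Python's re module and a made-up sample text; it prints the positions where each pattern matches.

```python
import re

text = "cat concat cats"  # hypothetical sample

# cat as a whole word: a boundary on both sides.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
print(whole_word)    # [0]

# cat at the start of a word: a boundary before it only.
word_start = [m.start() for m in re.finditer(r"\bcat", text)]
print(word_start)    # [0, 11]

# cat preceded by a word character, i.e. no boundary before it.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]
print(inside_word)   # [7]
```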
Iterators
Any item of a regular expression may be followed by another kind of metacharacter, an iterator. Iterators let you specify the number of occurrences of the preceding character, metacharacter, or subexpression.
| * | Zero or more. Greedy, similar to {0,} |
| + | One or more. Greedy, similar to {1,} |
| ? | Zero or one. Greedy, similar to {0,1} |
| {n} | Exactly n times. Greedy |
| {n,} | At least n times. Greedy |
| {n,m} | At least n but not more than m times. Greedy |
| *? | Zero or more. Non-greedy, similar to {0,}? |
| +? | One or more. Non-greedy, similar to {1,}? |
| ?? | Zero or one. Non-greedy, similar to {0,1}? |
| {n}? | Exactly n times. Non-greedy |
| {n,}? | At least n times. Non-greedy |
| {n,m}? | At least n but not more than m times. Non-greedy |
Thus, the numbers in curly brackets of the form {n,m} specify the minimum number of times n and the maximum number of times m to match the item. The form {n} is equivalent to {n,n} and matches exactly n times. The form {n,} matches n or more times. There is no limit on the size of n or m, but large numbers take more memory and slow down execution of the regular expression.
If a curly bracket occurs in any other context, it is treated as a regular character.
Examples:
| foob.*r | Matches strings foobar, foobalkjdflkj9r and foobr |
| foob.+r | Matches strings foobar, foobalkjdflkj9r, but not foobr |
| foob.?r | Matches strings foobar, foobbr, foobr but not foobalkj9r |
| fooba{2}r | Matches the string foobaar |
| fooba{2,}r | Matches strings foobaar, foobaaar, foobaaaar, etc. |
| fooba{2,3}r | Matches strings foobaar, or foobaaar, but not foobaaaar |
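The iterator examples above can be verified with a small helper. This is a sketch assuming Python's re module; the helper name full is hypothetical and simply reports whether a pattern matches an entire string.

```python
import re

def full(pattern: str, candidate: str) -> bool:
    """Return True when the pattern matches the whole candidate string."""
    return re.fullmatch(pattern, candidate) is not None

print(full(r"foob.*r", "foobr"))            # True: * allows zero characters
print(full(r"foob.+r", "foobr"))            # False: + needs at least one
print(full(r"foob.?r", "foobalkj9r"))       # False: ? allows at most one
print(full(r"fooba{2}r", "foobaar"))        # True: exactly two a's
print(full(r"fooba{2,}r", "foobaaaar"))     # True: two or more a's
print(full(r"fooba{2,3}r", "foobaaaar"))    # False: four a's exceed the maximum
```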
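A short explanation of greediness: a greedy iterator takes as many repetitions as possible, a non-greedy one as few as possible. For example, applied to the string abbbbc, b+ returns bbbb, b+? returns b, b{2,3} returns bbb, and b{2,3}? returns bb. Tried at the first b, b* takes bbbb and b*? takes the empty string; because b* and b*? can match zero characters, a search from the start of the string succeeds with an empty match before the a.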
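Greedy and non-greedy iterators can be compared side by side. In this sketch (Python's re module is an assumption, and take is a hypothetical helper) every pattern is tried at the position of the first b in abbbbc, so that patterns able to match zero characters are compared at the same spot.

```python
import re

text = "abbbbc"
first_b = text.index("b")  # position 1

def take(pattern: str) -> str:
    """Return what the pattern consumes when tried at the first b."""
    return re.compile(pattern).match(text, first_b).group()

print(repr(take(r"b+")))        # 'bbbb'
print(repr(take(r"b*")))        # 'bbbb'
print(repr(take(r"b+?")))       # 'b'
print(repr(take(r"b*?")))       # ''
print(repr(take(r"b{2,3}")))    # 'bbb'
print(repr(take(r"b{2,3}?")))   # 'bb'
```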
You can switch all iterators into Non-greedy mode (see the modifier /g).
Alternatives
You can specify a series of alternatives for a pattern using | to separate them, so that fee|fie|foe will match any of fee, fie, or foe in the target string. The first alternative includes everything from the last pattern delimiter ((, [, or the beginning of the pattern) up to the first |, and the last alternative contains everything from the last | to the next pattern delimiter. For this reason, it's common practice to include alternatives in parentheses, to minimize confusion about where they start and end.
Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen. This means that alternatives are not necessarily greedy. For example, when matching foo|foot against barefoot, only the foo part will match, as that is the first alternative tried, and it successfully matches the target string.
Also remember that | is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].
Examples:
| foo(bar|foo) | Matches strings foobar or foofoo |
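The left-to-right rule and the literal meaning of | inside square brackets can both be observed in a few lines. This sketch assumes Python's re module, which follows the same ordering rule for alternatives.

```python
import re

# The first alternative that lets the whole pattern match wins.
first_wins = re.search(r"foo|foot", "barefoot").group()
print(first_wins)    # foo

# Reordering the alternatives changes the result.
reordered = re.search(r"foot|foo", "barefoot").group()
print(reordered)     # foot

# Inside a character class, | is just another member of the set.
members = sorted(set(re.findall(r"[fee|fie|foe]", "fee|fie|foe")))
print(members)       # ['e', 'f', 'i', 'o', '|']

# Parentheses limit how far the alternatives extend.
grouped = re.fullmatch(r"foo(bar|foo)", "foofoo")
print(grouped is not None)   # True
```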
Subexpressions
The bracketing construct ( ... ) may also be used to define regular expression subexpressions. Subexpressions are numbered from left to right in order of their opening parenthesis. The first subexpression has number 1 (the entire regular expression match has number 0).
Examples:
| (foobar){8,10} | Matches strings which contain 8, 9 or 10 consecutive instances of foobar |
| foob([0-9]|a+)r | Matches foob0r, foob1r, foobar, foobaar, foobaaar, etc. |
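Subexpression numbering can be inspected on a match object. The sketch below assumes Python's re module, where group(0) is the whole match and group(n) is subexpression n; the nested pattern and its sample input are made up for illustration.

```python
import re

m = re.fullmatch(r"foob([0-9]|a+)r", "foobaar")
print(m.group(0))    # foobaar (the entire match)
print(m.group(1))    # aa (subexpression 1)

# Numbering follows the order of the opening parentheses, outermost first.
nested = re.search(r"((\d+)-(\d+))", "range 10-20")
print(nested.group(1), nested.group(2), nested.group(3))   # 10-20 10 20

# An iterator after a subexpression repeats the whole subexpression.
repeated = re.fullmatch(r"(foobar){8,10}", "foobar" * 9)
print(repeated is not None)   # True
```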
Backreferences
Metacharacters \1 through \9 are interpreted as backreferences. \n matches whatever subexpression number n previously matched.
Examples:
| (.)\1+ | Matches aaaa and cc |
| (.+)\1+ | Matches abab and 123123 |
| (['"]?)(\d+)\1 | Matches "13" (in double quotes), or '4' (in single quotes) or 77 (without quotes), etc. |
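These backreference examples can be run as written. The sketch assumes Python's re module; the surrounding sample strings are hypothetical.

```python
import re

# A character followed by one or more copies of itself.
run = re.search(r"(.)\1+", "xaaaay").group()
print(run)           # aaaa

# A sequence followed by one or more copies of itself.
doubled = re.search(r"(.+)\1+", "abab").group()
print(doubled)       # abab

# The closing quote must be the same as the opening quote (or both absent).
quoted = re.compile(r"""(['"]?)(\d+)\1""")
numbers = [quoted.fullmatch(s).group(2) for s in ('"13"', "'4'", "77")]
print(numbers)       # ['13', '4', '77']

mismatched = quoted.fullmatch("\"13'")
print(mismatched)    # None
```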
Modifiers
Modifiers allow you to change the behaviour of a regular expression. Modifiers can be embedded within the regular expression itself using the (?...) construct.
| i | Do case-insensitive pattern matching. |
| m | Treat string as multiple lines. This changes ^ and $ from matching only at the very beginning or end of the text to the beginning or end of any line anywhere within the text. |
| s | Treat string as single line. This changes . to match any character whatsoever, even line separators, which it normally would not match. |
| g | Switching it off switches all iterators to non-greedy mode (this modifier is on by default). So, if modifier /g is off, + works as +?, * as *?, and so on. |
| x | Extend your pattern's legibility by permitting whitespace and comments (see explanation below). |
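Embedded modifiers can be tried with the (?...) construct. This sketch assumes Python's re module, which supports i, m, s and x in that form; it has no g modifier, and its defaults (for example, s is off unless requested) may differ from the engine described here.

```python
import re

# (?i): case-insensitive matching.
case_insensitive = re.search(r"(?i)foobar", "FooBar")
print(case_insensitive is not None)   # True

# (?m): ^ matches at the start of every line.
line_starts = re.findall(r"(?m)^b\w+", "foo\nbar\nbaz")
print(line_starts)                    # ['bar', 'baz']

# (?s): . also matches a line separator.
dot_all = re.search(r"(?s)a.b", "a\nb")
print(dot_all is not None)            # True
```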
The modifier /x itself needs a little more explanation. It tells the regular expression engine to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment. This means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), you will have to either escape them or encode them using octal or hexadecimal escapes. Taken together, these features go a long way towards making the text of regular expressions more readable.
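A pattern written in extended mode might look like the sketch below. It assumes Python's re module, where the re.VERBOSE flag corresponds to the x modifier; the pattern and the sample text are made up for illustration.

```python
import re

# Whitespace and # comments in the pattern are ignored in extended mode.
verbose = re.compile(r"""
    (\d+)    # integer part
    \.       # a literal dot
    (\d+)    # fractional part
    [ ]      # a literal space, protected by a character class
    \#       # a literal hash sign, protected by a backslash
""", re.VERBOSE)

# The same pattern written compactly, without extended mode.
compact = re.compile(r"(\d+)\.(\d+) #")

sample = "price 12.50 #"
verbose_groups = verbose.search(sample).groups()
compact_groups = compact.search(sample).groups()
print(verbose_groups)                     # ('12', '50')
print(verbose_groups == compact_groups)   # True
```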