Extreme Picture Finder online documentation

Regular Expressions

A Regular Expression is a powerful tool for pattern matching, that is, for finding one string within another. Here are some basic pattern-matching rules.

Any single character matches itself, unless it is a metacharacter with a special meaning described below. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a \ (e.g., \. matches a ., not any character; \\ matches a \). A series of characters matches that series of characters in the target string, so the pattern blurfl would match blurfl in the target string.

You can specify a character class by enclosing a list of characters in [], which will match any one character from the list. If the first character after the [ is ^, the class matches any character not in the list. Within a list, the - character specifies a range, so that a-z represents all characters between a and z, inclusive. If you want either - or ] itself to be a member of a class, put it at the start of the list (possibly after a ^), or escape it with a backslash. - is also taken literally when it is at the end of the list, just before the closing ]. (The following all specify the same class of three characters: [-az], [az-], and [a\-z].) Also, if you try to use one of the character classes \w, \W, \s, \S, \d, or \D as an endpoint of a range, no range is formed and the - is understood literally.
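The character class rules above can be tried out in a short sketch. It uses Python's re module as a stand-in, which is an assumption: the program's own regular expression engine may differ in details. The sample strings and variable names are made up for illustration.

```python
import re

# Three spellings of the same three-character class: '-', 'a' and 'z'.
patterns = [r"[-az]", r"[az-]", r"[a\-z]"]
sample = "a-z and b"

# Collect every character of the sample that each class matches.
results = [re.findall(p, sample) for p in patterns]
all_same = all(r == results[0] for r in results)

# A leading ^ negates the class: any character except '-', 'a' and 'z'.
negated = re.findall(r"[^-az]", "a-zb")

print(results[0])  # ['a', '-', 'z', 'a']
print(all_same)    # True
print(negated)     # ['b']
```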

Note also that ranges are not portable between character sets, and even within one character set they may produce results you did not expect. A sound principle is to use only ranges that begin and end at letters of the same case ([a-e], [A-E]) or at digits ([0-9]). Anything else is unsafe. If in doubt, spell out the character set in full.

Characters may be specified using a metacharacter syntax: \n matches a newline, \t a tab, \r a carriage return, \f a form feed, etc. More generally, \nnn, where nnn is a string of octal digits, matches the character whose coded character set value is nnn. Similarly, \xnn, where nn are hexadecimal digits, matches the character whose numeric value is nn. The expression \cx matches the character control-x. Finally, the . metacharacter matches any character except \n (unless you use /s).

You can specify a series of alternatives for a pattern using | to separate them, so that fee|fie|foe will match any of fee, fie, or foe in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ((, [, or the beginning of the pattern) up to the first |, and the last alternative contains everything from the last | to the next pattern delimiter. That's why it's common practice to include alternatives in parentheses: to minimize confusion about where they start and end.

Within a pattern, you may designate subpatterns for later reference by enclosing them in parentheses, and you may refer back to the nth subpattern later in the pattern using the metacharacter \n. Subpatterns are numbered based on the left to right order of their opening parenthesis. A backreference matches whatever actually matched the subpattern in the string being examined, not the rules for that subpattern. Therefore, (0|0x)\d*\s\1\d* will match 0x1234 0x4321, but not 0x1234 01234, because subpattern 1 matched 0x, even though the rule 0|0x could potentially match the leading 0 in the second number.
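The point that a backreference repeats the matched text, not the rule, can be checked with a small sketch. Python's re module is used here only as an illustrative stand-in, and the program's own engine may behave differently in other respects. The variable names are hypothetical.

```python
import re

# Subpattern 1 is (0|0x); \1 must repeat whatever text it actually matched.
pattern = re.compile(r"(0|0x)\d*\s\1\d*")

# Both numbers start with 0x, so \1 (which holds "0x") matches the second prefix.
same_prefix = pattern.search("0x1234 0x4321")

# The second number starts with a plain 0, so \1 (still "0x") cannot match.
mixed_prefix = pattern.search("0x1234 01234")

print(same_prefix.group(0))  # 0x1234 0x4321
print(same_prefix.group(1))  # 0x
print(mixed_prefix)          # None
```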

Metacharacters

Below is the list of metacharacters used in Regular Expressions.

Escape sequences

\xnn Character with hex code nn
\t Tab character
\n Newline
\r Carriage return
\f Form feed
\a Alarm (bell)
\e Escape

Examples:

foo\x20bar Matches foo bar (note the space in the middle)
\tfoobar Matches foobar preceded by a tab
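A minimal sketch of these escape sequences, written with Python's re module as an assumed stand-in for the program's engine. Only \xnn, \t and backslash-escaped metacharacters are exercised, since support for the other sequences varies between engines.

```python
import re

# \x20 is the hex escape for a space, so the pattern matches "foo bar".
space_match = re.fullmatch(r"foo\x20bar", "foo bar")

# \t matches a tab character, not a space.
tab_match = re.search(r"\tfoobar", "\tfoobar")
space_instead_of_tab = re.search(r"\tfoobar", " foobar")

# A backslash makes a metacharacter literal: \. is a dot, \\ is a backslash.
literal_dot = re.fullmatch(r"a\.b", "a.b")
dot_against_x = re.fullmatch(r"a\.b", "axb")
literal_backslash = re.fullmatch(r"a\\b", "a\\b")

print(bool(space_match), bool(tab_match), bool(space_instead_of_tab))
# True True False
print(bool(literal_dot), bool(dot_against_x), bool(literal_backslash))
# True False True
```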

Line separators

^ Beginning of the line
$ End of the line
\A Beginning of text
\Z End of text
. Any character

Examples:

^foobar Matches string foobar only if it's at the beginning of line
foobar$ Matches string foobar only if it's at the end of line
^foobar$ Matches string foobar only if it's the only string in line
foob.r Matches strings like foobar, foobbr, foob1r

The ^ metacharacter by default is only guaranteed to match at the beginning of the string/text, the $ metacharacter only at the end. Embedded line separators will not be matched by ^ or $.
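The default behaviour of ^ and $ can be compared with multi-line matching in a short sketch. It assumes Python's re module, where re.MULTILINE plays the role of a multi-line modifier. The program's own engine may use different defaults, so treat this as an illustration only.

```python
import re

text = "foobar\nbarfoo\nfoobar"

# By default ^ anchors only at the very start of the text.
default_hits = [m.start() for m in re.finditer(r"^foobar", text)]

# In multi-line mode ^ also matches right after each line separator.
multiline_hits = [m.start() for m in re.finditer(r"^foobar", text, re.MULTILINE)]

# ^...$ together select a line that consists of exactly this string.
whole_line = re.findall(r"^barfoo$", text, re.MULTILINE)

print(default_hits)    # [0]
print(multiline_hits)  # [0, 14]
print(whole_line)      # ['barfoo']
```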

The . metacharacter by default matches any character, but if you turn off the modifier /s, then . won't match embedded line separators.

Predefined character classes

\w Any letter, digit, or underscore (_)
\W Any character that is not a letter, digit, or underscore
\d Any numeric character (digit)
\D Any non-numeric character
\s Any space character (same as [ \t\n\r\f])
\S Any non-space character

You may use \w, \d and \s within custom character classes.

Examples:

foob\dr Matches strings foob1r, foob6r, but not foobar, foobbr
foob[\w\s]r Matches strings foobar, foob r, foobbr, foob2r (a digit is also a \w character), but not foob=r
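The predefined classes can be checked against a few sample strings. This sketch assumes Python's re module, which may differ in details from the program's own engine. The sample strings, including foob_r, are chosen here for illustration.

```python
import re

digit = re.compile(r"foob\dr")
word_or_space = re.compile(r"foob[\w\s]r")

# \d accepts only a digit in the fifth position.
digit_results = {s: bool(digit.fullmatch(s))
                 for s in ["foob1r", "foob6r", "foobar", "foobbr"]}

# [\w\s] accepts a letter, digit, underscore or whitespace character.
class_results = {s: bool(word_or_space.fullmatch(s))
                 for s in ["foobar", "foob r", "foob_r", "foob=r"]}

print(digit_results)
# {'foob1r': True, 'foob6r': True, 'foobar': False, 'foobbr': False}
print(class_results)
# {'foobar': True, 'foob r': True, 'foob_r': True, 'foob=r': False}
```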

Word boundaries

\b Matches a word boundary
\B Matches a non-word boundary

A word boundary \b is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
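A small sketch of \b and \B, assuming Python's re module as a stand-in for the program's engine. The sample text is made up. Each list holds the positions at which the pattern matched.

```python
import re

text = "cat concat cats"

# \bcat\b: "cat" as a whole word only.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \bcat: "cat" at the start of a word ("cat" and "cats").
prefix = [m.start() for m in re.finditer(r"\bcat", text)]

# \Bcat: "cat" that does not start a word (inside "concat").
inside = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_word)  # [0]
print(prefix)      # [0, 11]
print(inside)      # [7]
```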

Iterators

Any item of a regular expression may be followed by another kind of metacharacter: an iterator. Iterators specify the number of occurrences of the preceding character, metacharacter or subexpression.

* Zero or more. Greedy, similar to {0,}
+ One or more. Greedy, similar to {1,}
? Zero or one. Greedy, similar to {0,1}
{n} Exactly n times. Greedy
{n,} At least n times. Greedy
{n,m} At least n but not more than m times. Greedy
*? Zero or more. Non-greedy, similar to {0,}?
+? One or more. Non-greedy, similar to {1,}?
?? Zero or one. Non-greedy, similar to {0,1}?
{n}? Exactly n times. Non-greedy
{n,}? At least n times. Non-greedy
{n,m}? At least n but not more than m times. Non-greedy

Thus, the digits in curly brackets of the form {n,m} specify the minimum number of times n and the maximum number of times m to match the item. The form {n} is equivalent to {n,n} and matches exactly n times. The form {n,} matches n or more times. There is no limit to the size of n or m, but large numbers take more memory and slow down execution of the Regular Expression.

If a curly bracket occurs in any other context, it is treated as a regular character.

Examples:

foob.*r Matches strings foobar, foobalkjdflkj9r and foobr
foob.+r Matches strings foobar, foobalkjdflkj9r, but not foobr
foob.?r Matches strings foobar, foobbr, foobr but not foobalkj9r
fooba{2}r Matches the string foobaar
fooba{2,}r Matches strings foobaar, foobaaar, foobaaaar, etc.
fooba{2,3}r Matches strings foobaar, or foobaaar, but not foobaaaar

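A short explanation of greediness: a greedy iterator takes as many characters as possible, a non-greedy one takes as few as possible. For example, applied to the string abbbbc, b+ returns bbbb, b+? returns b, b{2,3} returns bbb, and b{2,3}? returns bb. b*? returns an empty string, and so does b* when the search starts at the beginning of the string, because zero occurrences of b already match in front of the a.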
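Greedy and non-greedy iterators can be compared side by side in a short sketch. It assumes Python's re module, which has no equivalent of the /g modifier, so the ? suffix is used explicitly. The HTML-like sample string is an illustration added here.

```python
import re

s = "abbbbc"

greedy_plus = re.search(r"b+", s).group()        # as many b as possible
lazy_plus = re.search(r"b+?", s).group()         # as few b as possible
greedy_range = re.search(r"b{2,3}", s).group()   # up to three b
lazy_range = re.search(r"b{2,3}?", s).group()    # stops at two b
lazy_star = re.search(r"b*?", s).group()         # zero b is already enough

# A typical practical case: matching up to the last or the first '>'.
tagged = "<b>bold</b>"
greedy_tag = re.search(r"<.+>", tagged).group()
lazy_tag = re.search(r"<.+?>", tagged).group()

print(greedy_plus, lazy_plus, greedy_range, lazy_range, repr(lazy_star))
# bbbb b bbb bb ''
print(greedy_tag, lazy_tag)
# <b>bold</b> <b>
```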

You can switch all iterators into Non-greedy mode (see the modifier /g).

Alternatives

You can specify a series of alternatives for a pattern using | to separate them, so that fee|fie|foe will match any of fee, fie, or foe in the target string. The first alternative includes everything from the last pattern delimiter ((, [, or the beginning of the pattern) up to the first |, and the last alternative contains everything from the last | to the next pattern delimiter. For this reason, it's common practice to include alternatives in parentheses, to minimize confusion about where they start and end.

Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen. This means that alternatives are not necessarily greedy. For example, when matching foo|foot against barefoot, only the foo part will match, as that is the first alternative tried, and it successfully matches the target string.

Also remember that | is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].

Examples:

foo(bar|foo) Matches strings foobar or foofoo
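The left-to-right choice between alternatives, and the literal meaning of | inside square brackets, can be seen in a minimal sketch. It assumes Python's re module as a stand-in for the program's engine. The sample strings are illustrative.

```python
import re

# The first alternative that lets the whole pattern match wins.
first_wins = re.search(r"foo|foot", "barefoot").group()
reordered = re.search(r"foot|foo", "barefoot").group()

# Inside [] the | is an ordinary character, so this is the class [feio|].
in_brackets = re.findall(r"[fee|fie|foe]", "f|x e")

# Parentheses make the scope of the alternatives explicit.
grouped = bool(re.fullmatch(r"foo(bar|foo)", "foofoo"))

print(first_wins)   # foo
print(reordered)    # foot
print(in_brackets)  # ['f', '|', 'e']
print(grouped)      # True
```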

Subexpressions

The bracketing construct ( ... ) may also be used to define Regular Expression subexpressions. Subexpressions are numbered from left to right in order of their opening parenthesis. The first subexpression has number 1 (the entire Regular Expression match has number 0).

Examples:

(foobar){8,10} Matches strings which contain 8, 9 or 10 consecutive instances of foobar
foob([0-9]|a+)r Matches foob0r, foob1r, foobar, foobaar, foobaaar, etc.
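Subexpression numbering can be inspected with a short sketch. It assumes Python's re module, where group(0) is the entire match and group(n) is subexpression n. The program's own engine may expose the numbers differently. The sample strings are made up.

```python
import re

# Subexpression 1 captures either one digit or a run of 'a'.
m = re.search(r"foob([0-9]|a+)r", "xx foobaaar yy")
entire_match = m.group(0)
first_sub = m.group(1)

# Numbering follows the opening parentheses from left to right.
nested = re.search(r"((\d+)-(\d+))", "range 10-20").groups()

# An iterator after a subexpression repeats the whole subexpression.
nine_times = re.fullmatch(r"(foobar){8,10}", "foobar" * 9) is not None
seven_times = re.fullmatch(r"(foobar){8,10}", "foobar" * 7) is not None

print(entire_match, first_sub)  # foobaaar aaa
print(nested)                   # ('10-20', '10', '20')
print(nine_times, seven_times)  # True False
```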

Backreferences

Metacharacters \1 through \9 are interpreted as backreferences. \n matches previously matched subexpression #n.

Examples:

(.)\1+ Matches aaaa and cc
(.+)\1+ Matches abab and 123123
(['"]?)(\d+)\1 Matches "13" (in double quotes), or '4' (in single quotes) or 77 (without quotes), etc
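The quote-matching example can be sketched as follows, assuming Python's re module as a stand-in for the program's engine. The mismatched sample string is added here to show what the backreference rejects.

```python
import re

# Subexpression 1 is an optional opening quote; \1 requires the same closing quote.
quoted = re.compile(r"""(['"]?)(\d+)\1""")

samples = ['"13"', "'4'", "77"]
numbers = [quoted.fullmatch(s).group(2) for s in samples]

# An opening double quote closed by a single quote is not accepted.
mismatched = quoted.fullmatch("\"13'")

# (.+)\1+ finds a unit that is repeated at least once more.
repeated_unit = re.fullmatch(r"(.+)\1+", "123123").group(1)

print(numbers)        # ['13', '4', '77']
print(mismatched)     # None
print(repeated_unit)  # 123
```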

Modifiers

Modifiers allow you to change the behaviour of a Regular Expression. Modifiers can be embedded within the Regular Expression itself using the (?...) construct.

i Do case-insensitive pattern matching.
m Treat string as multiple lines. This changes ^ and $ from matching only at the very beginning or end of the text to the beginning or end of any line anywhere within the text.
s Treat string as single line. This changes . to match any character whatsoever, even line separators, which it normally would not match.
g Controls greediness and is on by default. Switching it off puts all iterators into non-greedy mode, so that + works as +?, * as *?, and so on.
x Extend your pattern's legibility by permitting whitespace and comments (see explanation below).
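Embedded modifiers can be illustrated with a minimal sketch. It assumes Python's re module, which accepts (?i), (?m) and (?s) at the start of a pattern but has no counterpart to the g modifier. Its defaults may also differ from those of the program's own engine.

```python
import re

# (?i): case-insensitive matching.
case_insensitive = re.search(r"(?i)foobar", "FooBar") is not None
case_sensitive = re.search(r"foobar", "FooBar") is not None

# (?m): ^ and $ match at the start and end of every line.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?s): . also matches a line separator (in Python it does not by default).
dot_default = re.search(r"a.b", "a\nb") is not None
dot_single_line = re.search(r"(?s)a.b", "a\nb") is not None

print(case_insensitive, case_sensitive)  # True False
print(lines)                             # ['one', 'two']
print(dot_default, dot_single_line)      # False True
```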

The modifier /x itself needs a little more explanation. It tells the Regular Expression to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your Regular Expression into (slightly) more readable parts. The # character is also treated as a metacharacter that introduces a comment. This means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), you have to either escape them or encode them using octal or hexadecimal escapes. Taken together, these features go a long way towards making the text of a regular expression more readable.
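A sketch of an extended pattern, assuming Python's re module, where the inline (?x) flag corresponds to the x modifier. The date pattern and the sample strings are hypothetical and serve only to show the layout.

```python
import re

# Whitespace and # comments are ignored, so the pattern can be laid out in parts.
date = re.compile(r"""(?x)
    (\d{4})   # year
    -         # literal hyphen
    (\d{2})   # month
    -
    (\d{2})   # day
""")
parts = date.search("released 2024-03-15").groups()

# A real space must be escaped (or placed inside a character class).
spaced = re.compile(r"(?x) foo \  bar")
space_ok = spaced.fullmatch("foo bar") is not None

# A real # must be escaped; an unescaped # starts a comment.
hashed = re.compile(r"(?x) item \# \d+")
unescaped = re.compile(r"(?x) item # \d+")
hash_ok = hashed.fullmatch("item#7") is not None
comment_swallowed = unescaped.fullmatch("item") is not None

print(parts)                        # ('2024', '03', '15')
print(space_ok)                     # True
print(hash_ok, comment_swallowed)   # True True
```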