Why Do We Care About This?
A regular expression is a "formula" for matching strings that follow some
pattern, so that a program can operate on a
subject character string.
Text in HTML, log files, text files containing data, etc.
is parsed in order
to validate correct formatting,
to extract substrings, or
to replace content.
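The three uses named above (validate, extract, replace) can be sketched with Python's standard re module. The log line and the patterns here are made up for illustration and are not taken from any particular tool.

```python
import re

# A made-up log line used only for this illustration.
log_line = "2024-03-15 ERROR disk /dev/sda1 at 91% capacity"

# Validate: does the line begin with a date shaped like YYYY-MM-DD?
is_valid = re.match(r"\d{4}-\d{2}-\d{2} ", log_line) is not None

# Extract: collect every run of digits that is followed by a percent sign.
percentages = re.findall(r"\d+%", log_line)

# Replace: mask the device name.
masked = re.sub(r"/dev/\w+", "/dev/***", log_line)

print(is_valid)
print(percentages)
print(masked)
```

Each task maps to a different function: match tests the start of the text, findall returns all non-overlapping hits, and sub returns a new string with the hits rewritten.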
The Perl ("Practical Extraction and Report Language") language has become popular partly because of its extensive support for regular expressions.
Perl allows you to embed regular expressions in file tests, control loops, output formats, and everything else.
http://www.wikiwand.com/en/Regular_expression
The term regular expression is often abbreviated as "regex" or "regexes" in plural.
Regular-Expressions.info
Steve Ramsay's Guide to Regular Expressions
Learning to Use Regular Expressions, by David Mertz
also discusses advanced Regular Expression Extensions such as Non-greedy quantifiers, backreferences,
and lookahead assertions.
Rx Cookbook at ActiveState has contributions from several people.
Regexp Power Part I (June 06, 2003) and
Part II (July 01, 2003) by Simon Cozens
Steve Mansour's A Tao of Regular Expressions
compares differences in expressions for various tools.
Five Habits for Successful Regular Expressions
by Tony Stubblebine describes how you can test regular expressions in PHP, Perl, and Python.
Regular Expression Pocket Reference
(O'Reilly, August 2003)
by Tony Stubblebine
provides a concise "memory jogger" that you won't be embarrassed to carry around.
Teach Yourself Regular Expressions in 10 Minutes
(Sams; February 28, 2004)
by Ben Forta
Beginning Regular Expressions
(Wrox Press, 2005)
by Andrew Watt
Try It Now
TIP:
The easiest way to learn this is to take a hands-on approach and try some patterns.
Test and debug regular expressions using these tools.
Download or clone
RegexExplained and see it used by its author, @LeaVerou, in the
VIDEO:
/Reg(exp){2}lained/: Demystifying Regular Expressions,
presented live at the O'Reilly Fluent conference in May 2012.
RegexPal.com parses JavaScript on a web page.
Use the
Regex Coach to graphically experiment with (Perl-compatible) regular expressions interactively.
Dr. Edmund Weitz wrote this for use on Windows and Linux systems to show how Common Lisp
can be practical using the LispWorks IDE and cross-platform CAPI toolkit.
Regular Expression Tester parses within ASP.NET.
$40 RegExBuddy is a Windows program.
Engines
Beware that time and vendor competitive urges have engendered
several versions of regular expressions:
- The historical Simple Basic Regular Expression (BRE) notation,
described as part of the regexp() function in the XSH specification, which provides backward compatibility but may be withdrawn from a future specification set.
- The GNU operating system's regex package, which is available via ftp at
ftp.gnu.org.
- Perl, Python, Emacs, Tcl, and .NET use a
backtracking
regular expression matcher that incorporates a
traditional Nondeterministic Finite Automaton (NFA) engine.
The standardized POSIX NFA engine is slower.
- Utility programs initially developed for Unix -- awk, egrep, and lex --
use a faster, but more limited,
pure regular expression Deterministic Finite Automaton (DFA) engine.
- The Extended Regular Expressions (ERE) version complies with the internationalized
ISO/IEC 9945-2:1993 standard. It matches based on the bit pattern used for encoding the character, not on the graphic representation of the character (which may represent more than one bit pattern).
- Microsoft's .NET Framework regular expressions
are said to be compatible with Perl 5 regular expressions, but
include features not yet seen in other implementations,
such as right-to-left matching and on-the-fly compilation.
Parsing C/C++ style comments is a little more complex when you have to take into account string embedding, escaping, and line continuation.
For example, the match routine of the C language library
accepts strings that are interpreted as regular expressions.
Regex Patterns
Instead of custom-written code (looping through each line and invoking sub-string functions),
regex methods take a pattern of characters that directs their searching and matching.
This video shows how files containing different date formats can't be parsed
using the sub-string function alone, which is a dangerously blunt tool.
Patterns comprise two basic character types available from a standard keyboard
(not using Greek alphas, lambdas, etc. like mathematicians do):
- Literal (normal) text characters such as 0 thru 9 or a thru z; and
- Metacharacters, which specify filtering.
Together these enable a powerful, flexible, and efficient method for processing text.
However, their compactness makes them easier to create than to read.
JOKE:
Some call regex expressions "ASCII puke" because they look like a jumble of
letters and numbers.
The Kleene Star * (Wild Card) Metacharacter
The development of regular expressions is traced back to the work of
Kleene (some pronounce it like "clean knee") -- Stephen Cole Kleene (1909-1994), an American mathematician and theoretical computer scientist at Princeton and U. of Wisconsin-Madison.
For this reason, the "*" wildcard character used in computer searches is formally known as a "Kleene star."
Text-manipulation tools on the Unix platform, including
the ed and vi text editors and the grep file search utility, made use of his notation for “the algebra of regular sets.”
The operation the star denotes, zero or more repetitions of the preceding expression, is formally known as a "Kleene closure".
Basic Metacharacters
There are 12 of them.
Metacharacter | Operator Name | Matches | Example regular expression |
. | period | any single character except NUL. | r.t would match the strings rat, rut, r t, but not root (two o's) nor the Rot in Rotten (upper case R). |
* | Kleene star, asterisk, wildcard | zero or more occurrences of the character immediately preceding. | .* means match any number of any characters. |
$ | dollar currency anchor | end of a line. | weasel$ would match the end of the string "He's a weasel" but not the string "They are a bunch of weasels." To match a literal $ when it is the last operator of a regular expression or immediately follows a right parenthesis, it must be preceded by a backslash \. |
^ | circumflex or caret anchor | beginning of a string/line. | ^When in would match the beginning of the string "When in the course of human events" but would not match "What and When in the". |
[ ] [c1-c2] [^c1-c2] | square brackets | any one of the characters between the brackets. | r[aou]t matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression [0-9] means match any digit. Multiple ranges can be specified as well. The regular expression [A-Za-z] means match any upper or lower case letter. To match any character except those in the range, the complement range, use the caret as the first character after the opening bracket. For example, the expression [^269A-Z] matches any character except 2, 6, 9, and upper case letters. |
[^c1-c2] | caret within square brackets | the complement range -- any character except those in the range following the caret as the first character after the opening bracket. | [^269A-Z] will match any character except 2, 6, 9, and upper case letters. To match a literal ^ when it is the first operator of a regular expression or the first character inside brackets, it must be preceded by a backslash. |
\ | backslash | This is the quoting character; use it to treat the following character as an ordinary character. | \$ is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression \. is used to match the period character rather than any single character. Operators inside brackets do not need to be preceded by a backslash. |
\< \> | backslash and angle brackets | the beginning (\<) or end (\>) of a word. | \<the matches "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications. |
\( \) | backslash and parentheses | the expression between \( and \) as a group. | Also saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as \1 through \9. |
| | pipe (alternation) | either of two alternatives (logical OR). | (him|her) matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications. |
+ | plus sign | one or more occurrences of the character or regular expression immediately preceding. | 9+ matches 9, 99, or 999. NOTE: this metacharacter is not supported by all applications. |
\{i\} \{i,j\} | braces | a specific number of instances, or a number of instances within a range, of the preceding character. | A[0-9]\{3\} will match "A" followed by exactly 3 digits. That is, it will match A123 but not A12; in A1234 it matches only the leading A123 unless the pattern is anchored. [0-9]\{4,6\} matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is supported by Robot's C-VU language but not by all applications. |
? | question mark | 0 or 1 occurrence of the character or regular expression immediately preceding. | ? is equivalent to {0,1}. NOTE: this metacharacter is supported by IBM/Rational Robot's C-VU language but not by all applications. Question marks are also used to specify non-greedy quantifiers. For example, "/A[A-Z]*?B/" means "match an A, followed by only as many capital letters as are needed to find a B." |
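Several of the table's examples can be tried directly. The sketch below uses Python's re module, which is an assumption of this illustration rather than the notation the table documents: Python writes groups and counts as ( ) and { } without backslashes, and uses \b where the table uses \< for a word boundary. The helper name found is made up for the example.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# Each entry: (pattern, subject text), taken from the table's examples.
checks = [
    (r"r.t", "rat"),
    (r"r.t", "root"),
    (r"weasel$", "He's a weasel"),
    (r"weasel$", "They are a bunch of weasels."),
    (r"^When in", "What and When in the"),
    (r"r[aou]t", "ret"),
    (r"(him|her)", "it belongs to her"),
    (r"\bthe", "otherwise"),
]
for pattern, text in checks:
    print(f"{pattern!r:14} {text!r:34} {found(pattern, text)}")

# Greedy versus non-greedy: *? stops at the first B it can reach.
greedy = re.search(r"A[A-Z]*B", "AXXBYYB").group()
lazy = re.search(r"A[A-Z]*?B", "AXXBYYB").group()
print(greedy, lazy)
```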
In addition, VU regular expressions can include
ASCII control characters
in the range 0 to 7F hex (0 to 127 decimal).
VU regular expressions process only the ASCII character set and do not process Unicode (UTF-8).
Backward Slash Extended MetaCharacters
One of the ways people get confused by regular expressions is the use of the backslash \
character.
For an analogy that you may already know, in Windows command line terminals,
people use dir *.txt /s to look for text files in subdirectories.
The asterisk or star character is a wildcard. The /s specifies processing of sub-folders.
With regex, the same file name matching would be specified by .*\.txt,
with a backslash in front of the dot before txt to escape it,
since the dot has another meaning within regex expressions.
The dot character . is used in regex to represent any one character.
Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.
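To see why the dot needs escaping, and what a backreference does, here is a small sketch using Python's re module. The file names and the sentence are invented for the illustration.

```python
import re

# Invented file names for the illustration.
names = ["notes.txt", "notes_txt", "report.txt.bak"]

# Unescaped dot: "." matches any character, so "notes_txt" slips through.
loose = [n for n in names if re.fullmatch(r".*.txt", n)]

# Escaped dot: "\." matches only a literal period.
strict = [n for n in names if re.fullmatch(r".*\.txt", n)]

# Backreference: \1 must repeat exactly what group 1 matched,
# which makes it a simple detector for doubled words.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best of times")

print(loose)
print(strict)
print(doubled.group(1))
```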
Extended
Like C and Java programs, regex patterns use \ as an escape character to denote use of special characters as plain text.
These additional escape sequences are recognized within Ruby regex:
\A | Beginning of a string |
\b | Word boundary |
\B | Non-word boundary |
\d | Digit, same as [0-9] |
\D | Non-digit |
\s | Whitespace [ \t\r\n\f\v] |
\S | Non-whitespace |
\w | Word character |
\W | Non-word character |
\z | End of a string |
\Z | End of string, before a trailing newline |
[10:00] To specify digits (numbers) [0-9]:
\d
[10:48] To specify letters, numbers, and underscore, use shortcut:
\w
[14:34] To match CSS color specifications containing 3 or 6 hex digits,
such as #abc, #f00, #BADA55, #C0FE56
/^#([a-f\d]{3}){1,2}$/i.test(str);
This matches a letter between a and f or a digit exactly three times ({3}), with that group occurring once or twice ({1,2}). The i flag makes the match case-insensitive.
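The same check can be written in Python as a rough equivalent of the JavaScript test call. The flag re.IGNORECASE stands in for the trailing i, the group is written with both parentheses, and the list of candidates is made up for the illustration.

```python
import re

# Three hex digits, occurring once or twice, after a leading "#".
HEX_COLOR = re.compile(r"^#([a-f\d]{3}){1,2}$", re.IGNORECASE)

candidates = ["#abc", "#f00", "#BADA55", "#C0FE56", "#abcd", "abc", "#ggg"]
results = {c: HEX_COLOR.search(c) is not None for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

Grouping the three digits and repeating the group is what rules out 4- and 5-digit values, which a plain {3,6} count would accept.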
Double Backslash Regex in LoadRunner
The double backslash is required in C language programs invoking
regex because both C and regex "consume" a backslash as an escape character.
LoadRunner has this function which creates a parameter named "selected_value":
char *str = " ... the html text here ...";
web_reg_save_param_regexp(
"ParamName=selected_value",
"RegExp=<select name=\"Regulatory Code_0\"[\\s\\S]*?<option .*? selected>(.*?)</option>",
LAST );
The [\\s\\S] means match any whitespace or any non-whitespace character, that is,
any character including a newline (because no Perl-like "s" modifier is available).
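Both ideas, backslash doubling and [\s\S]*? as an "any character, as few as possible" filler, can be tried outside LoadRunner. This sketch uses Python's re module and an invented HTML fragment; it does not call any LoadRunner function.

```python
import re

# In an ordinary string literal the backslash must be doubled, as in C ...
doubled = "[\\s\\S]*?"
# ... while a Python raw string passes backslashes through untouched.
raw = r"[\s\S]*?"
print(doubled == raw)

# Invented HTML fragment; the select element spans several lines.
html = (
    '<select name="Regulatory Code_0">\n'
    '<option value="1">Alpha</option>\n'
    '<option value="2" selected>Beta</option>\n'
    '</select>'
)

# Same pattern as in the LoadRunner call, with single backslashes.
pattern = r'<select name="Regulatory Code_0"[\s\S]*?<option .*? selected>(.*?)</option>'
match = re.search(pattern, html)
selected_value = match.group(1) if match else None
print(selected_value)
```

Because "." does not cross a newline here, the first option line cannot satisfy " selected>", so the lazy [\s\S]*? keeps expanding until it reaches the option that is actually selected.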
Introduced with VuGen 12 is a new function:
char *str = " ... the html text here ...";
lr_save_param_regexp(str, strlen(str),
"RegExp=... the regex here ...",
"ResultParam=selected_value",
LAST);
General Rules
WinRunner TSL
regular expressions have the following characteristics:
- The concatenation of single-character operators matches the concatenation of the characters individually matched by each of the single-character operators.
- Parentheses () can be used within a regular expression for grouping single-character operators. A group of single-character operators can be used anywhere one single-character operator can be used - for example, as the operand of the * operator.
- Parentheses and the following non-ordinary operators have special meanings in regular expressions. They must be preceded by a backslash if they are to represent themselves:
Examples of Regular Expressions
This regular expression matches any day of the week:
((Mon)|(Tues)|(Wednes)|(Thurs)|(Fri)|(Satur)|(Sun))day
This matches simple dates with 1 or 2 digits for the month, 1 or 2 digits for the day, and either 2 or 4 digits for the year.
Matches: [4/5/91], [04/5/1991], [4/05/89]
Non-Matches: [4/5/1]
((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))
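The pattern has no anchors, so whether a string "matches" depends on whether the engine is asked to find the pattern somewhere in the text or to cover the whole text. A short sketch with Python's re module shows the difference; the extra sample 4/5/123 is added for this illustration.

```python
import re

DATE = re.compile(r"((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))")

samples = ["4/5/91", "04/5/1991", "4/05/89", "4/5/1", "4/5/123"]
found_anywhere = {s: DATE.search(s) is not None for s in samples}
covers_whole = {s: DATE.fullmatch(s) is not None for s in samples}

for s in samples:
    print(f"{s:10} search={found_anywhere[s]} fullmatch={covers_whole[s]}")
```

For validation, covering the whole string (or adding ^ and $ to the pattern) is usually what is wanted; a plain search accepts 4/5/123 because 4/5/12 occurs inside it.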
This identifies valid 24-hour times in the format hh:mm:
/((?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9])/
Validate a number between 1 and 255, such as an IP octet:
^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$
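A numeric-range pattern like this is easy to get subtly wrong, so it is worth checking it against plain integer comparison. The sketch below, using Python's re module, does that for every number from 0 to 999; the upper bound of the sweep is an arbitrary choice for the illustration.

```python
import re

OCTET = re.compile(r"^([1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])$")

def regex_accepts(n):
    """True if the decimal text of n matches the pattern."""
    return OCTET.fullmatch(str(n)) is not None

# Compare the regex verdict with ordinary arithmetic for each value.
disagreements = [n for n in range(0, 1000)
                 if regex_accepts(n) != (1 <= n <= 255)]

print(disagreements)
print(regex_accepts(255), regex_accepts(256), OCTET.fullmatch("01") is not None)
```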
This breaks down a Uniform Resource Identifier (URI) into its component parts.
(from
ActiveState quoting Appendix B of IETF RFC 2396)
my $uri = "http://www.ics.uci.edu/pub/ietf/uri/#Related";
print "$1, $2, $3, $4, $5, $6, $7, $8, $9" if
$uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};
$1 = http:
$2 = http (the scheme)
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu (the authority)
$5 = /pub/ietf/uri/ (the path)
$6 =
$7 = (the query)
$8 = #Related
$9 = Related (the fragment)
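The same breakdown can be reproduced outside Perl. This is a sketch with Python's re module using the identical pattern; note that a group which takes no part in the match comes back as None rather than as an empty string.

```python
import re

URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = URI.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")

# Groups 2, 4, 5, 7 and 9 carry the five named components.
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)

print(scheme)
print(authority)
print(path)
print(query)
print(fragment)
```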
Validate an e-mail address, including one whose domain is a bracketed IP address in the form [255.255.255.255].
Of course, the best way to test an e-mail address is to send e-mail to it:
^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((\[(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))\]))|((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*))$
Validates date in the US m/d/y format from 1/1/1600 - 12/31/9999. The days are validated for the given month and year. Leap years are validated for all 4 digits years from 1600-9999, and all 2 digits years except 00 since it could be any century (1900, 2000, 2100). Days and months must be 1 or 2 digits and may have leading zeros. Years must be 2 or 4 digit years. 4 digit years must be between 1600 and 9999. Date separator may be a slash (/), dash (-), or period (.)
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
Validate passwords to be at least 4 characters, no more than 8 characters, and must include at least one upper case letter, one lower case letter, and one numeric digit.
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$
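Each (?=...) lookahead checks one requirement from the start of the string without consuming any characters, and the final .{4,8} enforces the length. A brief sketch with Python's re module, using invented sample passwords:

```python
import re

PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$")

# Invented samples for the illustration.
samples = ["Abc1", "Passw0rd", "abc1", "ABC1", "Abcd", "Ab1", "Passw0rd1"]
verdicts = {s: PASSWORD.match(s) is not None for s in samples}

for s, ok in verdicts.items():
    print(f"{s:10} {ok}")
```

Because the lookaheads are independent, the three character classes may appear in any order within the password.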
Validate major credit card numbers from Visa (length 16, prefix 4), Mastercard (length 16, prefix 51-55), Discover (length 16, prefix 6011), American Express (length 15, prefix 34 or 37). All 16 digit formats accept optional hyphens (-) between each group of four digits.
^(((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$
This uses extended grep syntax to match a valid MAC address, such as [01:23:45:67:89:ab], [01:23:45:67:89:AB], [fE:dC:bA:98:76:54], with colons separating octets. It rejects strings that are too short or too long, or that contain invalid characters, such as [01:23:45:67:89:ab:cd], [01:23:45:67:89:Az], [01:23:45:56:]. It accepts mixed-case hexadecimal.
^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$
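The listed matches and non-matches can be confirmed with a few lines of Python; this sketch uses the re module rather than grep, and the pattern is copied unchanged.

```python
import re

MAC = re.compile(r"^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$")

good = ["01:23:45:67:89:ab", "01:23:45:67:89:AB", "fE:dC:bA:98:76:54"]
bad = ["01:23:45:67:89:ab:cd", "01:23:45:67:89:Az", "01:23:45:56:"]

accepted = [s for s in good + bad if MAC.match(s)]
print(accepted)
```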
This matches the name of any state in the United States:
[ACF-IK-PR-W][a-y]{2,4}[a-y][CDIJMVY]?[a-z]{0,7}
But you probably use a drop-down list rather than making people type them out.
This Perl script
(from Craig Berry) uses a pattern to validate British Royal Mail codes used in the UK.
Each code has 2 parts, outward and inward; the letters of the inward (second) part never include any character in "CIKMOV," a restriction this script does not check.
use strict;
my @patterns = ('AN NAA', 'ANN NAA', 'AAN NAA', 'AANN NAA',
                'ANA NAA', 'AANA NAA', 'AAA NAA');
foreach (@patterns) {
    s/A/[A-Z]/g;
    s/N/\\d/g;
    s/ /\\s?/g;
}
my $re = join '|', @patterns;
while (<>) {
    print /^(?:$re)$/o ? "valid\n" : "invalid\n";
}
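The technique in the script, expanding short layout templates into one alternation, ports directly to other languages. Below is a sketch in Python; the names TEMPLATES, TOKEN, to_regex and is_valid are made up for this illustration, and like the Perl version it checks only the letter/digit layout.

```python
import re

# A = any capital letter, N = any digit, space = optional whitespace.
TEMPLATES = ["AN NAA", "ANN NAA", "AAN NAA", "AANN NAA",
             "ANA NAA", "AANA NAA", "AAA NAA"]
TOKEN = {"A": "[A-Z]", "N": r"\d", " ": r"\s?"}

def to_regex(template):
    """Expand one layout template into a regex fragment."""
    return "".join(TOKEN[ch] for ch in template)

# Join the expanded templates into a single alternation.
POSTCODE = re.compile("|".join(to_regex(t) for t in TEMPLATES))

def is_valid(code):
    """True if the whole string fits one of the layouts."""
    return POSTCODE.fullmatch(code) is not None

for code in ["M1 1AA", "SW1A 1AA", "EC1A1BB", "12345", "sw1a 1aa"]:
    print(f"{code:10} {'valid' if is_valid(code) else 'invalid'}")
```

Mapping each template character through a dictionary avoids the ordering concern in the substitution approach, where a later replacement could in principle rewrite text inserted by an earlier one.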
Alternately, the RegEx:
(AB|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GU|H|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|MK|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|Y|ZE)([1-9]|[1-9][0-9]) [1-9][A-Z]{2}
The RegEx for verifying Canadian postal codes:
This matches any hexadecimal number of 1 to 4 digits, with a decimal value in the range 0 to 65535:
Error Recovery with Regular Expressions
If a VU regular expression contains an error, when you run a suite, TestManager writes the message to stderr output prefixed with the following header:
sqa7vui#xxx: fatal orig type error: tname: sname, line lineno
where #xxx identifies the user ID (not present if 0), fatal signifies that error recovery is not possible (otherwise not present), orig specifies the error origination (user, system, server, or program), and type specifies the general error category (initialization, argument parsing, script initialization, or runtime). If the error occurred during execution of a script (run-time category), tname specifies the name of the script being executed when the error occurred, sname specifies the name of the VU source file that contains the VU statement causing the error, and lineno specifies the line number of this VU statement in the source file. Note that the source file information will not be available if the script's source cross-reference section has been stripped.
If a run-time error occurs due to an improper regular expression pattern in the match library function, a diagnostic message of the following form follows the header:
Regular Expression Error = errno
where errno is an error code which indicates the type of regular expression error. The following table lists the possible errno values and explains each.
errno | Explanation |
2 | Illegal assignment form. Character after )$ must be a digit. Example: "([0-9]+)$x" |
3 | Illegal character inside braces. Expecting a digit. Example: "x{1,z}" |
11 | Exceeded maximum allowable assignments. Only $0 through $9 are valid. Example: "([0-9]+)$10" |
30 | Missing operand to a range operator (? {m,n} + *). Example: "?a" |
31 | Range operators (? {m,n} + *) must not immediately follow a left parenthesis. Example: "(?b)" |
32 | Two consecutive range operators (? {m,n} + *) are not allowed. Example: "[0-9]+?" |
34 | Range operators (? {m,n} + *) must not immediately follow an assignment operation. Example: "([0-9]+)$0{1-4}" Correction: "(([0-9]+)$0){1-4}" |
36 | Range level exceeds 254. Example: "[0-9]{1-255}" |
39 | Range nesting depth exceeded maximum of 18 during matching of subject string. |
41 | Pattern must have non-zero length. Example: "" |
42 | Call nesting depth exceeded 80 during matching of subject string. |
44 | Extra comma not allowed within braces. Example: "[0-9]{3,4,}" |
46 | Lower range parameter exceeds upper range parameter. Example: "[0-9]{4,3}" |
49 | '\0' not allowed within brackets, or missing right bracket. Example: "[\0]" or "[0-9" |
55 | Parenthesis nesting depth exceeds maximum of 18. Example: "(((((((((((((((((((x)))))))))))))))))))" |
56 | Unbalanced parentheses. More right parentheses than left parentheses. Example: "([0-9]+)$1)" |
57 | Program error. Please report. |
70 | Program error. Please report. |
90 | Unbalanced parentheses. More left parentheses than right parentheses. Example: "(([0-9]+)$1" |
91 | Program error. Please report. |
100 | Program error. Please report. |
C# Coding Example
using System.Text.RegularExpressions;
This provides the Regex constructor, which
instantiates a Regex object:
var regex = new Regex( pattern );
Use the Match method defined within Regex on the subject text
to generate a match object:
var match = regex.Match( subject );
See what came back:
Console.WriteLine( match.Success );
This code would go inside code to define a command-line program named MatchTest.exe:
CREDITS:
Portions ©Copyright 1996-2014 Wilson Mar. All rights reserved.