regldg regular expression capabilities

Current version: 1.0.0

Regular Expression Capabilities

The following is a comprehensive list of the regular expression constructs that regldg understands.

Individual characters
Meta-characters
Meta-character classes
Groupings
Alternations
Backreferences
Character classes
Quantifiers

Individual characters

You can enter individual characters in several ways. regldg operates on characters in the ASCII and extended ASCII range, values 0 through 255.

Regular Expression	Meaning	Example	Produces
p (any printable character)	p (that printable character)	p	p
\a	Bell character (ASCII 7)	\a	[BEL]
\b	Backspace character (ASCII 8)	\b	[BS]
\t	Horizontal tab character (ASCII 9)	\t	[HT]
\n	Newline character (ASCII 10)	\n	[NL]
\v	Vertical tab character (ASCII 11)	\v	[VT]
\f	Form feed character (ASCII 12)	\f	[FF]
\r	Carriage return character (ASCII 13)	\r	[CR]
\e	Escape character (ASCII 27)	\e	[ESC]
\zNNN	A character specified by the ASCII code NNN (decimal). NNN can be 1, 2, or 3 digits, less than 256.	\z49	1
\z{NNN}	A character specified by the ASCII code NNN (decimal). NNN can be 1, 2, or 3 digits, less than 256. The { and } help to avoid confusion. See note below.	\z{119}	w
\oNNN	A character specified by the ASCII code NNN (octal). NNN can be 1, 2, or 3 digits, less than 400 (octal).	\o072	:
\o{NNN}	A character specified by the ASCII code NNN (octal). NNN can be 1, 2, or 3 digits, less than 400 (octal). The { and } help to avoid confusion. See note below.	\o{12}	[NL]
\xNN	A character specified by the ASCII code NN (hexadecimal). NN can be 1 or 2 digits, no greater than FF (hexadecimal).	\x5D	]
\x{NN}	A character specified by the ASCII code NN (hexadecimal). NN can be 1 or 2 digits, no greater than FF (hexadecimal). The { and } help to avoid confusion. See note below.	\x{26}	&

Possible confusion with numerically-specified characters

Numerically-specified characters (using the constructs \zNNN, \oNNN, and \xNN) are a source of possible confusion. Consider the regular expression \z1234. Does it mean \z1 234, \z12 34, or \z123 4? Who's to say? regldg will interpret it as the last case, because it continues to build a numerically-specified character until the limit of its type is reached. Here, a decimal numerically-specified character can use up to three digits, and since they were available, it used all three. To avoid possible confusion, use the { and } characters to tell regldg exactly which digits to use in your numerically-specified characters.
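The greedy rule described above can be modeled in a few lines of Python. This is only a sketch of the behavior as documented here, not regldg's own code: the function name expand_numeric_escapes is hypothetical, and rejecting codes above 255 is an assumption, since this page does not say how regldg reacts to them.

```python
import re

# Hypothetical model of the documented rule; not taken from regldg itself.
_BASE = {"z": 10, "o": 8, "x": 16}            # decimal, octal, hexadecimal
_MAX_DIGITS = {"z": 3, "o": 3, "x": 2}        # longest digit run per construct
_DIGITS = {"z": "0-9", "o": "0-7", "x": "0-9A-Fa-f"}


def expand_numeric_escapes(pattern: str) -> str:
    """Replace \\zNNN, \\oNNN, \\xNN (plain or braced) with the character they name."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            kind = pattern[i + 1]
            if kind in _BASE:
                rest = pattern[i + 2:]
                digits = f"[{_DIGITS[kind]}]{{1,{_MAX_DIGITS[kind]}}}"
                # Braced form first: the braces fix exactly which digits are used.
                # Otherwise take as many digits as the type allows (greedy).
                m = re.match(r"\{(" + digits + r")\}", rest) or re.match("(" + digits + ")", rest)
                if m:
                    code = int(m.group(1), _BASE[kind])
                    if code > 255:
                        # Assumption: out-of-range codes are treated as errors.
                        raise ValueError(f"character code {code} is above 255")
                    out.append(chr(code))
                    i += 2 + m.end()
                    continue
            # Any other escape (including \\) is copied through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        out.append(pattern[i])
        i += 1
    return "".join(out)


print(repr(expand_numeric_escapes(r"\z1234")))    # '{4': \z123 followed by a literal 4
print(repr(expand_numeric_escapes(r"\z{12}34")))  # form feed (code 12) followed by 34
```

The braced form is tried first so that \z{12}34 never absorbs the trailing digits, which is the disambiguation the note recommends.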

Meta-characters

Certain characters have two meanings in regular expressions. On their own, they do not mean what they look like. See the other sections on this page for each meta-character's special meaning. To use a meta-character's printed meaning, just put a \ before it (this is called "escaping" it). A list of these characters is as follows:

Meta-characters which must be escaped

\ | * ?

+ . ( )

[ ] { }

An example regular expression is 1+1. This does not mean 1+1 as it looks, because the + is a quantifier (see the section Quantifiers below). To make 1+1, you must escape the + in the regular expression, making the proper regular expression 1\+1.
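When a literal string has to be turned into a regldg pattern by a script, the escaping rule can be applied mechanically. The following is a minimal sketch, assuming the twelve meta-characters listed above are the complete set that needs escaping outside a character class; escape_literal is a hypothetical helper name, not something regldg provides.

```python
# The meta-characters listed in this section: \ | * ? + . ( ) [ ] { }
META_CHARACTERS = set("\\|*?+.(){}[]")


def escape_literal(text: str) -> str:
    """Put a backslash before every meta-character so the text is taken literally."""
    return "".join("\\" + ch if ch in META_CHARACTERS else ch for ch in text)


print(escape_literal("1+1"))  # 1\+1
```

Inside a character class the rules differ (see the section Character classes), so this helper is meant only for text outside of [ and ].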

Meta-character classes

regldg understands the basic meta-character classes of Perl. In regldg, however, meta-character classes are subject to the constraints of the character universe and the strictness of checking the character universe. For more information about the character universe, see character universes.

Meta-character class	Characters included	Description
.	Any character in the current character universe (including \n)
\d	0123456789	Digits
\D	Any character in the current character universe, excluding the members of \d
\s	[SPACE][HT][VT][NL][FF]	Whitespaces
\S	Any character in the current character universe, excluding the members of \s
\w	ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_	Alphanumerics and _
\W	Any character in the current character universe, excluding the members of \w
\u{1}	ABCDEFGHIJKLMNOPQRSTUVWXYZ	Uppercase letters
\u{2}	abcdefghijklmnopqrstuvwxyz	Lowercase letters
\u{4}	0123456789	Digits
\u{8}	!@#$%^&*	Shift-with-numbers
\u{16}	;`:'[SPACE],".?_	Punctuation
\u{32}	(){}[]	Closures
\u{64}	~\/|	Others
\u{128}	+-=<>	Math

\u{NNN} NNN is a number in decimal between 0 and 255, representing the sum of the pre-defined universe character class numbers. The resulting character class will be the union of all the included pre-defined character universes.

Example: \u{233}
233 = 128 + 64 + 32 + 8 + 1
So, \u{233} will be the union of \u{1}, \u{8}, \u{32}, \u{64}, and \u{128}

\U{NNN} NNN is a number in decimal between 0 and 255, representing the sum of the pre-defined universe character class numbers. The resulting character class will be any character in the current universe, excluding the members of the union of all the included pre-defined universe character classes.

Example: \U{189}
189 = 128 + 32 + 16 + 8 + 4 + 1
So, \U{189} will be any character in the current character universe, excluding the members of the union of \u{1}, \u{4}, \u{8}, \u{16}, \u{32}, and \u{128}
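The arithmetic behind \u{NNN} is a bit mask: each pre-defined class owns one power of two, and NNN selects the classes whose bit is set. The sketch below reproduces the two worked examples. The names UNIVERSE_CLASSES, components, and universe_class are hypothetical, and the character lists are copied from the table above rather than from regldg's source.

```python
import string

# Pre-defined universe character classes, keyed by their number (from the table above).
UNIVERSE_CLASSES = {
    1: string.ascii_uppercase,
    2: string.ascii_lowercase,
    4: string.digits,
    8: "!@#$%^&*",
    16: ";`:' ,\".?_",
    32: "(){}[]",
    64: "~\\/|",
    128: "+-=<>",
}


def components(n: int) -> list:
    """Return the class numbers whose sum is n."""
    if not 0 <= n <= 255:
        raise ValueError("NNN must be between 0 and 255")
    return [bit for bit in UNIVERSE_CLASSES if n & bit]


def universe_class(n: int) -> set:
    """Return the union of the classes selected by n, as \\u{n} is described above."""
    return {ch for bit in components(n) for ch in UNIVERSE_CLASSES[bit]}


print(components(233))  # [1, 8, 32, 64, 128]
print(components(189))  # [1, 4, 8, 16, 32, 128]
```

\U{NNN} is then the current character universe with universe_class(NNN) removed.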

Groupings

Groupings, nested groupings, and backreferences to the groupings are supported. Grouping characters together helps clarify alternations, and allows earlier patterns to be repeated (using backreferences and quantifiers) within a single word of output.

> regldg -m 35 "(firstpart)anotherpart(second(third)part)"
firstpartanotherpartsecondthirdpart

Alternations

Alternations allow you to use "this" or "that".

> regldg "ab|cd"
ab
cd

Alternations are often used with groupings when there are things in the regular expression which are not to be involved in the "this" or "that" game.

> regldg "fla(t|pper)"
flat
flapper

regldg can also use multiple alternations to use "this" or "that" or "that" or "that" or "that".

> regldg "(spl|th|fl|r)at"
splat
that
flat
rat

Backreferences

Backreferences are placeholders used to repeat an earlier grouping in the same pattern. Groupings are numbered by their starting ( and can be referred to only after they have been closed with a ).

> regldg -us 19 -m 46 "(Pat|Grandma) went to school today in \1's car\."
Pat went to school today in Pat's car.
Grandma went to school today in Grandma's car.

> regldg -m 9 -us 19 "(a(b)c) \1 \2"
abc abc b

regldg includes an alternative method to use backreferences. Instead of \1 to mean a backreference to grouping 1, you can use \!{1}. This will completely avoid the ambiguity of whether it is a backreference or an octally-specified character. (This is, of course, as long as you know this syntax. Otherwise, you might be completely confused as to what it is!) In action:

> regldg -m 9 -us 19 "(a(b)c) \\!{1} \\!{2}"
abc abc b

Note the doubled backslashes: these were required for me to enter this regex in tcsh. To avoid this problem, you could use the command line option --file=- and enter the regex directly into the program instead.
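Python's re module numbers groups by their opening parenthesis in the same way, so it can serve as an independent check that a generated word really belongs to a pattern that uses \1-style backreferences. This is only a sketch: is_generated_by is a hypothetical helper, and the \!{1} form is specific to regldg and is not understood by re.

```python
import re


def is_generated_by(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


# Group 1 is (a(b)c) and group 2 is the nested (b).
print(is_generated_by(r"(a(b)c) \1 \2", "abc abc b"))  # True
print(is_generated_by(r"(a(b)c) \1 \2", "abc b abc"))  # False

sentence = r"(Pat|Grandma) went to school today in \1's car\."
print(is_generated_by(sentence, "Pat went to school today in Pat's car."))      # True
print(is_generated_by(sentence, "Pat went to school today in Grandma's car."))  # False
```

The last check fails because \1 repeats whatever the grouping produced in that same word, not any alternative of the grouping.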

Character classes

Character classes represent all possible characters for a single location.

> regldg "[ab][cd]"
ac
bc
ad
bd

Some meta-characters don't need to be escaped while in character classes. These are (, *, +, ?, {, [, |, ), and }. The \ and . characters definitely need to be escaped in a character class. The range character - and the end-character-class character ] must be escaped unless they are the only character in the character class.

> regldg -uc 0 "[(*+?{[|)}\\\-\]\.]"
(
*
+
?
{
[
|
)
}
\
-
]
.

regldg is also capable of negated character classes, that is, character classes starting with the ^ character. A negated character class represents all characters in the current character universe, except those explicitly written in the negated character class.

> regldg -us 2 "[^abcde]"
f
g
h
i
j
k
l
m
n
o
p
q
...
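A negated character class is a set difference against the current character universe. The sketch below assumes that -us 2 limits the universe to the lowercase letters of \u{2}, which is inferred from the output above rather than stated on this page; negated_class is a hypothetical name.

```python
import string


def negated_class(excluded: str, universe: str) -> list:
    """Return the universe characters not listed in the class, in universe order."""
    return [ch for ch in universe if ch not in excluded]


# Assumption: the universe selected by -us 2 is a-z.
words = negated_class("abcde", string.ascii_lowercase)
print(words[:4])   # ['f', 'g', 'h', 'i']
print(len(words))  # 21
```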

[-] and []] are both handled correctly: [-] is a character class containing only a - character, and []] is a character class containing only a ] character. Both are actually silly, because a one-element character class could instead be just that character. In any other character class, the - and ] characters are meta-characters, and need to be escaped.

Quantifiers

Quantifiers allow you to write a character, character class, meta-character class or group once, and have it occur a specified (possibly variable) number of times.

Quantifier	Meaning
*	The previous character, character class, meta-character class or group occurs between 0 and unlimited times (inclusive). (Unlimited is controlled by the maximum word length of the program.)
+	The previous character, character class, meta-character class or group occurs between 1 and unlimited times (inclusive). (Unlimited is controlled by the maximum word length of the program.)
?	The previous character, character class, meta-character class or group occurs 0 or 1 time.
{2}	The previous character, character class, meta-character class or group occurs 2 times.
{1,3}	The previous character, character class, meta-character class or group occurs between 1 and 3 times (inclusive).
{4,}	The previous character, character class, meta-character class or group occurs between 4 and unlimited times (inclusive). (Unlimited is controlled by the maximum word length of the program.)

It is assumed that you have a good understanding of the usage of these items. A very important example, however, is when a quantifier acts on groups containing alternations. In a single word of output, any number of sides of the alternation could be used. The regular expression (a|b){2} will produce aa and bb, and also ab and ba. Shown in an explicit example:

> regldg "(ab|cd){2}"
abab
cdab
abcd
cdcd
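The listing above is a Cartesian product: each repetition of the group chooses its side of the alternation independently. A minimal sketch with itertools.product follows. The function name repeat_group is hypothetical, and reversing each tuple only mirrors the order shown above, in which the first repetition varies fastest; this page does not say whether regldg guarantees that order.

```python
from itertools import product


def repeat_group(alternatives: list, count: int) -> list:
    """Expand (alt1|alt2|...){count} into every word it can produce."""
    words = []
    for combo in product(alternatives, repeat=count):
        # product() varies the last slot fastest; reverse so the first slot does.
        words.append("".join(reversed(combo)))
    return words


print(repeat_group(["ab", "cd"], 2))  # ['abab', 'cdab', 'abcd', 'cdcd']
```

With n alternatives and a quantifier of {k}, the expansion holds n to the power k words, so counts grow quickly for longer repetitions.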