Perl is a scripting language known for its powerful regular expression engine. Many other applications and languages (for example Apache and PHP) therefore use PCRE (Perl Compatible Regular Expressions), a library that implements Perl-compatible regular expression syntax. This page explains some useful features available in Perl and PCRE. For the official reference of the Perl regular expression syntax, see the Perl regular expressions manpage (perlre).

Character class shorthands

Some character classes are used so often that Perl provides shorthands for them.

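Shorthand   Meaning              Equivalent
\d          Digit                [0-9]
\D          Non-digit            [^0-9]
\s          Whitespace           [ \t\r\n]
\S          Non-whitespace       [^ \t\r\n]
\w          Word character       [a-zA-Z0-9_]
\W          Non-word character   [^a-zA-Z0-9_]

These equivalents are approximations for ASCII text: \s also matches a form feed (and, in recent Perl versions, a vertical tab), and on Unicode strings \d, \s and \w match additional characters.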
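
To see the shorthands next to their bracket equivalents, here is a minimal sketch. It assumes Python's re module, which follows Perl-style syntax but is not Perl or PCRE itself; the sample strings are made up for illustration. The re.ASCII flag limits \d, \s and \w to ASCII so that they stay close to the bracket forms listed above.

```python
import re

# Hypothetical sample input, chosen only to exercise the shorthands.
text = "Order 66: ship_to = 'Mars Base'\t(2 crates)"

# re.ASCII keeps \d, \s and \w restricted to ASCII characters.
digits = re.findall(r"\d+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
no_space = re.sub(r"\s+", "", "a b\tc\nd", flags=re.ASCII)

# On this input the shorthand and the explicit class find the same matches.
same_digits = digits == re.findall(r"[0-9]+", text)
same_words = words == re.findall(r"[a-zA-Z0-9_]+", text)

print(digits)    # ['66', '2']
print(no_space)  # abcd
```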

Character shorthands

Some characters have shorthands that make them human-readable and easy to type on a keyboard.

Shorthand   Meaning
\t          Tab character
\r          Carriage return
\n          Newline

Anchors

In the reference we only saw the ^ and $ anchors, which match the beginning and end of a line. Anchors match positions rather than characters.

Perl provides a really useful anchor: \b, which matches a word boundary (the beginning or end of a word). Take the following string: "This is a test". If we want to match the word "is" in that string, we cannot simply use the regexp is, because that also matches the "is" at the end of "This". But \bis\b matches only the word "is" and not the end of "This".
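
The difference can be checked with a short sketch, assuming Python's re module as a stand-in for a Perl-style engine (\b behaves the same way on this input). It lists the start position of every match with and without the anchors.

```python
import re

text = "This is a test"

# Without anchors, "is" is found twice: inside "This" and as the word "is".
plain = [m.start() for m in re.finditer(r"is", text)]

# With \b on both sides, only the standalone word matches.
bounded = [m.start() for m in re.finditer(r"\bis\b", text)]

print(plain)    # [2, 5]
print(bounded)  # [5]
```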

Substitution

We already know that regular expressions can match patterns. Often, though, you will want to replace the match with something else. We call this substitution.

The generic way to write a substitution looks like this: s/match/replace/, although other tools might use a different interface. For example: s/c[^au]t/cat/ changes "cxt" and "cyt" to "cat", but it leaves "cut" unchanged, because the character class [^au] excludes "u" and so the match part of the statement does not match.
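
As a sketch of the same substitution through a different interface, the following assumes Python's re module, where re.sub takes the pattern and the replacement as separate arguments. The helper name fix_cat is hypothetical. Passing count=1 mirrors the plain s/// form, which replaces only the first match.

```python
import re

def fix_cat(word):
    # Rough Python spelling of s/c[^au]t/cat/ (count=1: first match only).
    return re.sub(r"c[^au]t", "cat", word, count=1)

results = {w: fix_cat(w) for w in ["cxt", "cyt", "cut", "cat"]}
print(results)
# {'cxt': 'cat', 'cyt': 'cat', 'cut': 'cut', 'cat': 'cat'}
```

"cut" and "cat" come back unchanged because [^au] rejects their middle letter, so no match takes place.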

Variables

We already saw that parentheses can be used for grouping with quantifiers and for limiting the scope of alternation. They also have a very useful side effect: the text matched by each pair of parentheses is stored in a variable, starting with $1, then $2, and so on. For example: s/(\w+)\.docx?/$1.txt/ will replace "foo.doc" with "foo.txt", and it will also replace "bar_baz.docx" with "bar_baz.txt". Note that the dot is escaped as \. so that it matches a literal period rather than any character.
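
A minimal sketch of captured groups, assuming Python's re module: there the replacement string refers to the first group as \1 instead of $1, and a match object exposes the groups through group(n). The helper name to_txt is hypothetical, and the dot before the extension is escaped so that it matches a literal period.

```python
import re

def to_txt(filename):
    # Rough equivalent of s/(\w+)\.docx?/$1.txt/ ; Python writes $1 as \1.
    return re.sub(r"(\w+)\.docx?", r"\1.txt", filename, count=1)

print(to_txt("foo.doc"))       # foo.txt
print(to_txt("bar_baz.docx"))  # bar_baz.txt

# Groups are numbered by their opening parenthesis, left to right.
m = re.search(r"(\w+)\.(docx?)", "bar_baz.docx")
base, ext = m.group(1), m.group(2)
print(base, ext)               # bar_baz docx
```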

Modifiers

It is possible to extend the substitute statement with modifiers. One of the most common modifiers is "i", which makes the matching case-insensitive. For example: s/usa/USA/i will change "usa" to "USA", but it will also change "Usa" or even "UsA" to "USA".

Another very common modifier is "g", which stands for global. It means that matching does not stop after the first match. Combined with the substitute statement, it replaces every occurrence of the match in the line.

As you may have guessed, it is also possible to combine modifiers. For example: s/[a-z]/X/gi will turn every alphabetic character (both upper and lower case) into an "X".
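
The effect of each modifier can be sketched as follows, assuming Python's re module, which has no trailing modifiers: flags=re.IGNORECASE plays the role of "i", and re.sub is global by default, so count=1 is what imitates the absence of "g". The sample strings are made up for illustration.

```python
import re

line = "usa, Usa and UsA"

# s/usa/USA/i : case-insensitive, first match only.
first_only = re.sub(r"usa", "USA", line, count=1, flags=re.IGNORECASE)

# s/usa/USA/gi : case-insensitive, every match.
every_one = re.sub(r"usa", "USA", line, flags=re.IGNORECASE)

# s/[a-z]/X/gi : every letter, upper or lower case, becomes "X".
masked = re.sub(r"[a-z]", "X", "Perl 5", flags=re.IGNORECASE)

print(first_only)  # USA, Usa and UsA
print(every_one)   # USA, USA and USA
print(masked)      # XXXX 5
```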

Not every tool or language uses the s/match/replace/modifiers syntax. Most languages provide modifiers through parameters or separate functions, and matching and replacing are often separate functions as well.
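
As one illustration of such an interface, here is a sketch assuming Python's re module (other languages differ in the details). The modifier is passed as a flag when the pattern is compiled, and matching, collecting all matches and replacing are separate methods on the compiled pattern. The sample text is made up.

```python
import re

# The flag passed at compile time plays the role of the trailing "i".
pattern = re.compile(r"\bperl\b", flags=re.IGNORECASE)
text = "Perl and PCRE: perl-compatible, not superperl."

found = pattern.search(text) is not None  # matching only
all_hits = pattern.findall(text)          # every match, similar to "g"
replaced = pattern.sub("Perl", text)      # substitution of every match

print(all_hits)  # ['Perl', 'perl']
print(replaced)  # Perl and PCRE: Perl-compatible, not superperl.
```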

Created by Ruud Jansen.