Literals

Literals are all characters other than the metacharacters [ ( ) { \ ^ $ . | ? + * and they represent normal text characters. So the regular expression cat matches "cat" and the "cat" inside "concatenate", but it doesn't match "CAT", because matching is case-sensitive by default.

If you need one of the metacharacters literally, escape it with a backslash (\). To match "1+1=2", your regexp should look like 1\+1=2. In other words, you have to escape the "+", because it's a metacharacter in regular expressions.
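As a minimal sketch of the two points above, here is how literals and escaping behave in Python's re module. Python is an assumption made for illustration only; the variable names are made up for this example.

```python
import re

# re.search reports a match anywhere inside the string.
cat_in_cat = re.search(r"cat", "cat") is not None
cat_in_concatenate = re.search(r"cat", "concatenate") is not None
cat_in_upper = re.search(r"cat", "CAT") is not None  # case-sensitive by default

# Escaped plus: the "+" is matched as a literal character.
escaped_ok = re.search(r"1\+1=2", "1+1=2") is not None
# Unescaped plus: "1+" means "one or more 1s", so the text "1+1=2" is not matched.
unescaped_ok = re.search(r"1+1=2", "1+1=2") is not None

print(cat_in_cat, cat_in_concatenate, cat_in_upper)  # True True False
print(escaped_ok, unescaped_ok)                      # True False
```

The raw-string prefix (r"...") keeps Python from interpreting the backslash itself, so the regexp engine receives 1\+1=2 unchanged.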

Start and end of line

The metacharacters "^" and "$" match the start and the end of a line, respectively. As we've seen in the Literals section, cat matches "cat" and "concatenate". However, ^cat only matches "cat" at the beginning of a line, so it will match "cat" and "cats", but it will not match "concatenate".

On the other hand, cat$ matches lines that end with "cat". So it will match "cat" and "I would like to have a cat", but it will not match "I like cats".

Combining these metacharacters, ^cat$ matches a line with only "cat" on it, and ^$ matches an empty line.
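The anchor examples can be checked with a short Python sketch. It assumes each string stands for one line of input; the helper lines_matching is a hypothetical name used only here.

```python
import re

def lines_matching(pattern, lines):
    """Return the lines in which the pattern is found."""
    return [line for line in lines if re.search(pattern, line)]

lines = ["cat", "cats", "concatenate",
         "I would like to have a cat", "I like cats", ""]

starts_with_cat = lines_matching(r"^cat", lines)
ends_with_cat = lines_matching(r"cat$", lines)
only_cat = lines_matching(r"^cat$", lines)
empty_lines = lines_matching(r"^$", lines)

print(starts_with_cat)  # ['cat', 'cats']
print(ends_with_cat)    # ['cat', 'I would like to have a cat']
print(only_cat)         # ['cat']
print(empty_lines)      # ['']
```

When a single Python string holds several lines, the re.MULTILINE flag is needed for "^" and "$" to match at every line boundary rather than only at the ends of the whole string.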

The dot

A dot is a metacharacter that matches any single character (in most regexp flavors, any character except a line break). So . will match "a", "7", and "!". The regexp c.t matches "cat" and "cut", but also "12c3t45" and even "c.t" itself.
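A small sketch of the dot in Python's re module (chosen here for illustration). The sample list is made up, and the last two checks show behavior specific to Python's defaults.

```python
import re

samples = ["cat", "cut", "12c3t45", "c.t", "ct", "coat"]

# The dot stands for exactly one character between "c" and "t".
dot_matches = [s for s in samples if re.search(r"c.t", s)]
# An escaped dot only matches a literal ".".
literal_dot_matches = [s for s in samples if re.search(r"c\.t", s)]

# In Python the dot does not match a newline unless re.DOTALL is given.
newline_default = re.search(r"c.t", "c\nt") is not None
newline_dotall = re.search(r"c.t", "c\nt", re.DOTALL) is not None

print(dot_matches)          # ['cat', 'cut', '12c3t45', 'c.t']
print(literal_dot_matches)  # ['c.t']
print(newline_default, newline_dotall)  # False True
```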

Character classes

Characters between square brackets indicate a character class. A character class matches any of the characters contained within it. For example, c[au]t matches "cat" and "cut", and nothing else.

It is also possible to specify a range of characters within a character class. The regexp [a-z] will match any lowercase letter. So c[a-z]t will match "cat" and "crt", but it will not match "c4t" or "cAt".
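The two character classes so far can be tried out as follows. This is a sketch using Python's re module; re.fullmatch is used so that the whole word has to fit the pattern, and the word list is made up for the example.

```python
import re

words = ["cat", "cut", "crt", "c4t", "cAt"]

# c[au]t: the middle character must be "a" or "u".
a_or_u = [w for w in words if re.fullmatch(r"c[au]t", w)]
# c[a-z]t: the middle character must be a lowercase letter.
lowercase = [w for w in words if re.fullmatch(r"c[a-z]t", w)]

print(a_or_u)     # ['cat', 'cut']
print(lowercase)  # ['cat', 'cut', 'crt']
```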

Character classes have their own rules. Metacharacters outside a character class are not necessarily metacharacters inside a character class. The only metacharacters inside a character class are a backslash (\), caret (^), closing bracket (]) and a dash (-). So within a character class, only these characters have to be escaped with a backslash.

As we've seen, a dash (-) can be used to specify a range. So the dash is a metacharacter, but only within a character class. If you want to use a literal dash within a character class, you should escape it with a backslash, except when the dash is the first or last character of the character class. So the regexps [a\-z], [az-] and [-az] are equivalent: each matches any one of the three characters "a", "z" and "-".

When a character class starts with a caret, it negates the given list of characters and matches any character that is not listed. For example, c[^u]t will match "cat", but it will not match "cut". It won't match "ct" either, which is a commonly made mistake: a negated class still has to match exactly one character, so "any character that isn't listed" does not mean "no character at all".
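The dash and caret rules can be illustrated with a brief Python sketch (Python's re module is assumed; the variable names are hypothetical).

```python
import re

# Three spellings of the same class: a literal "a", "z" or "-".
dash_patterns = [r"[a\-z]", r"[az-]", r"[-az]"]
dash_results = [re.findall(p, "a-z-m") for p in dash_patterns]

# A negated class matches one character that is NOT listed.
words = ["cat", "cut", "c4t", "ct"]
not_u = [w for w in words if re.fullmatch(r"c[^u]t", w)]

print(dash_results)  # [['a', '-', 'z', '-'], ['a', '-', 'z', '-'], ['a', '-', 'z', '-']]
print(not_u)         # ['cat', 'c4t']
```

Note that "ct" is rejected: the negated class has to consume one character between "c" and "t".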

Alternation

With a vertical bar (|) it is possible to combine expressions and match any of those separate subexpressions. We already saw that it is possible to match both "cat" and "cut" with a character class like c[au]t. The exact same expression with the use of alternation could be cat|cut, but could also be c(a|u)t. The parentheses limit the scope of the alternation.
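To see that the three spellings behave alike, and what happens without parentheses, here is a sketch in Python's re module. The word list and variable names are made up for the example.

```python
import re

words = ["cat", "cut", "cot"]
patterns = [r"c[au]t", r"cat|cut", r"c(a|u)t"]

# Each pattern accepts exactly the same words.
accepted = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

# Without parentheses the bar splits the whole expression:
# ca|ut means "ca" or "ut", not "c, then a or u, then t".
unscoped_cat = re.fullmatch(r"ca|ut", "cat") is not None
unscoped_ca = re.fullmatch(r"ca|ut", "ca") is not None

print(accepted)
print(unscoped_cat, unscoped_ca)  # False True
```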

Quantifiers

Quantifiers are among the most important metacharacters in regular expressions. You can use them to match a certain number of occurrences of an item.

The question mark (?)

The question mark means that the preceding item occurs zero or one times. This means the item is optional. For example, colou?r matches both "color" and "colour".

You can also use parentheses in combination with the question mark. For example, Jan(uary)? matches both "Jan" and "January".
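A short check of the optional item, sketched in Python's re module with re.fullmatch so the whole word must fit. The candidate lists are made up.

```python
import re

# The "u" is optional, but at most one is allowed.
spellings = ["color", "colour", "colouur"]
colour_ok = [s for s in spellings if re.fullmatch(r"colou?r", s)]

# The whole group "uary" is optional, not just part of it.
months = ["Jan", "January", "Janu"]
month_ok = [m for m in months if re.fullmatch(r"Jan(uary)?", m)]

print(colour_ok)  # ['color', 'colour']
print(month_ok)   # ['Jan', 'January']
```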

The plus (+)

The plus means that the preceding item occurs one or more times. So ca+t matches "cat" and "caaat" but it doesn't match "ct".

The asterisk (*)

The asterisk means that the preceding item occurs zero or more times. When we replace the plus with an asterisk in the last example, giving ca*t, we get a regular expression that will match "cat" and "caaat", but also "ct".
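The difference between the plus and the asterisk comes down to whether zero occurrences are allowed. A minimal sketch in Python's re module (an assumption made for illustration):

```python
import re

words = ["ct", "cat", "caaat"]

# ca+t needs at least one "a"; ca*t also accepts none.
plus_ok = [w for w in words if re.fullmatch(r"ca+t", w)]
star_ok = [w for w in words if re.fullmatch(r"ca*t", w)]

print(plus_ok)  # ['cat', 'caaat']
print(star_ok)  # ['ct', 'cat', 'caaat']
```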

You will often see the snippet .* in a regular expression. It matches any sequence of characters, including an empty one, because it literally means zero or more characters of any kind.
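A quick sketch of this snippet in Python's re module. The line-break behavior shown in the last check is Python's default and may differ in other flavors.

```python
import re

# .* accepts the empty string and any text on a single line.
matches_empty = re.fullmatch(r".*", "") is not None
matches_text = re.fullmatch(r".*", "any text, 123!") is not None

# By default the dot stops at a line break, so two lines do not fit.
matches_two_lines = re.fullmatch(r".*", "two\nlines") is not None

print(matches_empty, matches_text, matches_two_lines)  # True True False
```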

The interval

With braces ({}), it is possible to specify an interval: a minimum and a maximum number of occurrences. For example, ca{3,5}t matches "caaat", "caaaat" and "caaaaat". The regexp w{3,3}\.google\.com matches "www.google.com", the address of that well-known search engine, but can be shortened to w{3}\.google\.com. It is even possible to leave out the maximum, as in ca{5,}t, which matches "caaaaat" and any longer run of a's.
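The interval forms can be compared by counting how many a's each one accepts. This is a sketch in Python's re module; the helper name accepted_counts is hypothetical.

```python
import re

def accepted_counts(pattern, counts):
    """Return the numbers n for which "c" + n times "a" + "t" fits the pattern."""
    return [n for n in counts if re.fullmatch(pattern, "c" + "a" * n + "t")]

counts = range(2, 8)  # try 2 up to and including 7 a's

three_to_five = accepted_counts(r"ca{3,5}t", counts)
five_or_more = accepted_counts(r"ca{5,}t", counts)

# {3,3} and {3} mean the same thing: exactly three.
long_form = re.fullmatch(r"w{3,3}\.google\.com", "www.google.com") is not None
short_form = re.fullmatch(r"w{3}\.google\.com", "www.google.com") is not None

print(three_to_five)  # [3, 4, 5]
print(five_or_more)   # [5, 6, 7]
print(long_form, short_form)  # True True
```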

It is possible to group items for use with quantifiers, as we have seen with the regular expression Jan(uary)?. The same thing is also possible with all other quantifiers. For example, [0-9]+(-[0-9]+){2} matches dates like "2012-12-31" and "01-01-2000".
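The grouped date pattern can be sketched in Python's re module as follows. The candidate strings are made up; note that the pattern only checks the shape (three runs of digits separated by dashes), not whether the date is valid.

```python
import re

date_shape = re.compile(r"[0-9]+(-[0-9]+){2}")

candidates = ["2012-12-31", "01-01-2000", "2012-12", "99-99-99"]

# fullmatch requires the whole string to have the digits-dash-digits-dash-digits shape.
shaped_like_date = [c for c in candidates if date_shape.fullmatch(c)]

print(shaped_like_date)  # ['2012-12-31', '01-01-2000', '99-99-99']
```

"2012-12" is rejected because the group (-[0-9]+) has to occur exactly twice, while "99-99-99" is accepted even though it is not a real date.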

Created by Ruud Jansen.