
10. Regular expressions (Regex)

Regular expressions are powerful. In this chapter, we look at two applications:

Pattern matching (10.1)
Find & replace (10.2)

At the end, some discussion of the special function tr

10.1 Pattern matching

10.1.1 Matching literal text with the =~ operator

$str = "My name is Reinier!";
$result = ($str =~ m/Reinier/);
print("$result\n"); # Output: 1

So, you can write:

$str = "My name is Reinier!";
if ($str =~ m/Reinier/) {
  print("The name 'Reinier' was found!\n");
}

... in a more concise way, omitting the character 'm' (of 'matching') and using the postfix form:

$str = "My name is Reinier!";
print("The name 'Reinier' was found!") if ($str =~ /Reinier/);

... using the special variable $_

$_= "My name is Reinier!";
print("The name 'Reinier' was found!\n") if (/Reinier/);

...implicitly in a foreach loop

@names = qw(Sebastian Daniel Floor Reinier);
foreach (@names) {
  print("The name 'Reinier' was found!\n") if (/Reinier/);
}

The negated variant of the =~ operator is !~

$str = "My name is Reinier!";
print("The name 'Barbara' was not found!") if ($str !~ /Barbara/);

Alternation with the | (OR) operator is also useful:

$str = "My name is Sebastian!";
print("The name 'Reinier' or 'Sebastian' was found!") if ($str =~ /Reinier|Sebastian/);

The special variable $& contains of course the value Sebastian.

Notice the following:

$str = "Reinier";
print("The name 'Reinier' or 'Beinier' was found!") if ($str =~ /[BR]einier/); # [BR] is a so-called character class: see 10.1.5

Regular expressions do not have an && (AND) equivalent. A workaround is the so-called 'lookahead anchor' (notice that the metacharacter \b restricts the match to whole words):

$str = "Sebastian Daniel Florence";
$str =~ /(.*(?=\bSebastian\b).*(?=\bDaniel\b))/ ? (print("True\n")) : (print("False\n"));

(?=regex) is a test that consumes no characters. The code .*(?=\bSebastian\b) means "match .* (zero or more characters), but only if this match is followed by the whole word 'Sebastian'".
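
The pattern above only succeeds when 'Sebastian' occurs before 'Daniel'. As a minimal sketch of an order-independent AND (the names $both and @results are made up for this illustration), two lookaheads can be anchored at the start of the string so that each one scans the whole string on its own:

```perl
use strict;
use warnings;

# Each lookahead starts at the beginning of the string and scans all of it,
# so both words must be present, in any order.
my $both = qr/^(?=.*\bSebastian\b)(?=.*\bDaniel\b)/;

my @results;
foreach my $str ("Sebastian Daniel Florence", "Daniel and Sebastian", "Sebastian only") {
    push(@results, ($str =~ $both) ? "True" : "False");
}
print("@results\n"); # Output: True True False
```

Because a lookahead consumes nothing, the second test starts from the same position as the first one.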

Grouping is useful. Notice that in the following example the dot ('.') matches any character and the '*' character means 'zero or more of the preceding character'.

$_ = "ball is blue";
($obj, $color) = /(.*)\sis\s(.*)/; # no use of =~

The pattern matches anything (as a group), then whitespace (\s), the word 'is', whitespace and then anything (as a group). The two grouped matches are assigned to the list ($obj, $color), i.e. $obj contains 'ball' and $color contains 'blue'. Notice that when no variable is bound with =~, the match is applied to the special variable $_.
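
A hedged variation on the grouping example, assuming you also want to know whether the match succeeded at all: in list context a failed match returns an empty list, so the assignment itself can serve as the condition. The stricter pattern (\S+ instead of .*) and the array @parsed are choices made for this sketch only.

```perl
use strict;
use warnings;

my @parsed;
foreach my $line ("ball is blue", "no verb here") {
    # In list context a failed match returns an empty list, so the
    # assignment doubles as the success test.
    if (my ($obj, $color) = $line =~ /^(\S+)\s+is\s+(\S+)$/) {
        push(@parsed, "$obj=$color");
    }
    else {
        push(@parsed, "no match");
    }
}
print(join(", ", @parsed) . "\n"); # Output: ball=blue, no match
```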

10.1.2 Case sensitive

Regular expressions are case sensitive. So 'reinier' will not match 'Reinier'. Use the i flag to make regular expressions case insensitive.

$str = "My name is Reinier!";
print("The name 'reinier' was found!") if ($str =~ /reinier/i);

10.1.3 Global match

Regular expressions return only the first match. Of course, there are ways to return all matches. The easiest one is to use the g flag (global).

$str = "My name is Reinier Reinier!";
print("The name 'Reinier' was found " . scalar (@all_matches) . " times!") if (@all_matches = ($str =~ /Reinier/g));
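
The g flag can also be used in scalar context, where each evaluation continues after the previous match. The following sketch (the array @offsets is an illustrative name) walks through the matches one by one in a while loop and records where each one starts:

```perl
use strict;
use warnings;

my $str = "My name is Reinier Reinier!";
my @offsets;
while ($str =~ /Reinier/g) {
    # pos() is the offset just past the current match;
    # subtracting the length of the match ($&) gives its start.
    push(@offsets, pos($str) - length($&));
}
print("Found at offsets: @offsets\n"); # Output: Found at offsets: 11 19
```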

10.1.3 Matching any character

The . character matches any character. Notice that in the next code the special variable $& contains the match.

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /ci./g ); # match all words containing the characters ci followed by another character 
}
print("@matches\n"); # output: cit cia 

__DATA__
acute
city
local
social
twice

This output is probably not very useful: one would like the complete words. Simply add the '+' character after the '.' character. It means 'one or more of the preceding character or character class', which is the same as {1,}. These additions that say how many things to match are called quantifiers.

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /ci.+/g ); # match all words containing the characters ci followed by one or more character 
}
print("@matches\n"); # output: city cial 

__DATA__
acute
city
local
social
twice

The result 'city' is ok, but 'cial' isn't. To match 'social', put the dot ('.') character (1) followed by the '*' character in front of 'ci'. Like '+', '*' is a quantifier; it means 'zero or more of the preceding character or character class', which is the same as {0,}. The dot between // refers to any character (except the newline character \n).

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /.*ci.+/g ); # match all words containing the characters ci followed by one or more character and preceded by zero or more character. 
}
print("@matches\n"); # output: city precise social

__DATA__
acute
city
local
precise
social
twice

To match the literal character '.', you have to use '\.': the character '.' needs to be 'escaped'.
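
When the text to match sits in a variable, escaping by hand is easy to forget. A small sketch, assuming the sample strings below: the \Q...\E sequence escapes every metacharacter in the interpolated value.

```perl
use strict;
use warnings;

my $literal    = "3.14";
my @candidates = ("3.14", "3x14");

my @unescaped = grep { /^$literal$/ } @candidates;     # '.' acts as a metacharacter
my @escaped   = grep { /^\Q$literal\E$/ } @candidates; # '.' is matched literally

print("unescaped: @unescaped\n"); # Output: unescaped: 3.14 3x14
print("escaped: @escaped\n");     # Output: escaped: 3.14
```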

10.1.4 Start with and/or end with

To match all words that start with 'ci', add the caret character ^ at the beginning of the regular expression:

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /^ci.+/g ); # match all words that start with the characters ci followed by one or more character.
}
print("@matches\n"); # output: city

__DATA__
acute
city
local
precise
social
twice

To match all words that end with 'ce', add the $ character at the end of the regular expression:

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /.*ce$/g ); # match all words that end with the characters ce
}
print("@matches\n"); # output: twice

__DATA__
acute
city
local
precise
social
twice

To match all words that start with 's' and end with 'al', use both the ^ and the $ character in the regular expression:

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /^s.*al$/g ); # match all words that begin with the character 's' and end with the characters 'al'
}
print("@matches\n"); # output: signal social

__DATA__
acute
city
local
precise
signal
social
twice
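
If you only need to select the matching words, and not the matched text in $&, the anchors can be combined with grep. This is an alternative sketch rather than the approach used above; the word list is copied from the examples and the array names are illustrative.

```perl
use strict;
use warnings;

my @words = qw(acute city local precise signal social twice);

my @starts_ci = grep { /^ci/ } @words;     # ^ anchors at the start
my @ends_ce   = grep { /ce$/ } @words;     # $ anchors at the end
my @s_to_al   = grep { /^s.*al$/ } @words; # both anchors together

print("@starts_ci\n"); # Output: city
print("@ends_ce\n");   # Output: twice
print("@s_to_al\n");   # Output: signal social
```

Since grep returns the whole element, no .* or .+ is needed to pull in the rest of the word.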

10.1.5 Matching using a character class

A character class (cf. 10.1.6) defines a set of characters, any one of which can occur at that position in the input string. [0123456789] (abbreviated as [0-9]), [a-zA-Z] and [a-z0-9] are a few examples. In the next code, 'c[ae]' refers to the character pairs 'ca' and 'ce'.

@matches = ();
while(<DATA>){
 chomp;
 push (@matches,$&) if ( $_ =~ /.*c[ae].*/g ); # match all words that contain the characters 'ca' or 'ce'
}
print("@matches\n"); # output: local twice

__DATA__
acute
city
local
precise
signal
social
twice

The character ^ at the start of a class negates all its characters or ranges. [0-9] matches any digit, [^0-9] matches any character that is not a digit. [ae] matches the character 'a' or 'e', [^ae] matches any character that is not 'a' or 'e'.
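
A short sketch of a negated class in use, with made-up sample tokens. Combined with the anchors from 10.1.4, [^0-9] selects strings that contain no digit at all:

```perl
use strict;
use warnings;

my @tokens = qw(abc a1c 123 x-y);

my @no_digits   = grep { /^[^0-9]+$/ } @tokens; # every character is a non-digit
my @only_digits = grep { /^[0-9]+$/ } @tokens;  # every character is a digit

print("@no_digits\n");   # Output: abc x-y
print("@only_digits\n"); # Output: 123
```

Note the two different roles of ^ here: outside the brackets it is the start anchor, inside the brackets it negates the class.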

10.1.6 Subexpressions

With a character class you specify a set of characters, either by listing them or as a range with a hyphen, e.g. [abc], [0-9], [a-zA-Z0-9]. Adding a quantifier such as *, +, {0,}, {1,}, {1} or {2,3} says how many occurrences of the elements of the character class are allowed; e.g. [0-9]{3} matches exactly 3 digits, and [a-d]{1,2} matches one or two of the characters a, b, c or d. The character '?' means 'zero or one' of the previous character or character class; it is the same as {0,1}. So 12[a-d]? matches 12a but also 12.
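
To see the quantifiers side by side, here is a small sketch that tests four made-up strings against three anchored patterns. The anchors ^ and $ are added so that the whole string has to fit the pattern.

```perl
use strict;
use warnings;

my @report;
foreach my $s ("12", "12a", "12ab", "12abc") {
    my $optional = ($s =~ /^12[a-d]?$/)     ? "yes" : "no"; # zero or one letter
    my $one_two  = ($s =~ /^12[a-d]{1,2}$/) ? "yes" : "no"; # one or two letters
    my $exactly2 = ($s =~ /^12[a-d]{2}$/)   ? "yes" : "no"; # exactly two letters
    push(@report, join(" ", $s, $optional, $one_two, $exactly2));
}
print("$_\n") foreach (@report);
# Output (string, ?, {1,2}, {2}):
# 12 yes no no
# 12a yes yes no
# 12ab no yes yes
# 12abc no no no
```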

Some character classes have short cuts (metacharacters):

class	meaning	metacharacter
[0-9]	digit	\d
[^0-9]	non digit	\D
[_a-zA-Z0-9]	word	\w
[^_a-zA-Z0-9]	not word	\W
[ \r\t\n\f]	space	\s
[^ \r\t\n\f]	not space	\S

Notice that inside a double-quoted Perl string you sometimes need an extra backslash: instead of \d you should write \\d. Now a few examples:

$str = "012.1";
$pattern = "[0-9]{3}";
($str =~ /$pattern/ ) ? (print("Match! $&\n")) : (print("No match...\n")); # output: Match! 012

$str = "012.1";
$pattern = "[0-9]{4}";
($str =~ /$pattern/ ) ? (print("Match! $&\n")) : (print("No match...\n")); # output: No match...

$str = "012.1";
$pattern = "[0-9.]{4}";
($str =~ /$pattern/ ) ? (print("Match! $&\n")) : (print("No match...\n")); # output: Match! 012.

$str = "012.1";
$pattern = "^[0-9]{3}\$";
($str =~ /$pattern/ ) ? (print("Match! $&\n")) : (print("No match...\n")); # output: No match...

$str = "012.1";
$pattern = '[0-9]{3}\.[0-9]{1}'; # Notice that the dot is escaped! Single quotes keep the backslash; in double quotes "\." would collapse to a plain "."
($str =~ /$pattern/ ) ? (print("Match! $&\n")) : (print("No match...\n")); # output: Match! 012.1
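
The quoting of a pattern stored in a variable matters for backslashes. The following sketch (variable names are illustrative) compares three ways of storing the same pattern; qr// is offered here as an alternative that avoids string quoting altogether.

```perl
use strict;
use warnings;

my $single   = '\d{3}';   # single quotes: the backslash is kept as written
my $double   = "\\d{3}";  # double quotes: the backslash must be doubled
my $compiled = qr/\d{3}/; # qr// compiles a regex directly, no string quoting involved

my $same = ($single eq $double) ? "identical" : "different";
print("The two strings are $same\n"); # Output: The two strings are identical

my @hits = grep { "abc123" =~ $_ } ($single, $double, $compiled);
print(scalar(@hits) . " of 3 patterns matched\n"); # Output: 3 of 3 patterns matched
```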

Match whole words by the metacharacter \b

@m_arr = ();
$str = "Wally read the wallpaper on a wall!";
print("@m_arr" . "\n") if (@m_arr = $str =~ m/\bwall\b/gi); # output: wall

@m_arr = ();
$str = "Wall y read the wallpaper on a wall!";
print("@m_arr" . "\n") if (@m_arr = $str =~ m/\bwall\b/gi); # output: Wall wall

10.1.7 Backreferences

You can refer back to the text matched by a subexpression in parentheses with the metasequences \1, \2, etc. In addition, the variables $1, $2, $3 etc. contain the matched values.

$str = "197 197";
$pattern = '(\d{3}) \1'; # match a three digit pattern and its repetition
if ($str =~ /$pattern/) {
  print("Match! Number: $1 was found twice.\n");
}
else {
  print("No match...\n");
}

Another example:

$str = "19:07:55";
$pattern = '(\d{2}):(\d{2}):(\d{2})'; 
if ($str =~ /$pattern/ ) {
  ($hour, $minutes, $seconds) = ($1, $2, $3);
  print("Match! Time: $hour:$minutes:$seconds\n"); # the special variable $& contains of course 19:07:55
} 
else {
  print("No match...\n");
}
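
With several groups, numbered variables become hard to keep apart. As a hedged alternative, assuming Perl 5.10 or later, groups can be given names with (?<name>...) and read back from the special hash %+. The group names and the $total calculation are invented for this sketch.

```perl
use strict;
use warnings;

my $str = "19:07:55";
my $total;
if ($str =~ /^(?<hour>\d{2}):(?<min>\d{2}):(?<sec>\d{2})$/) {
    # %+ maps each group name to the text it captured
    $total = $+{hour} * 3600 + $+{min} * 60 + $+{sec};
    print("$total seconds since midnight\n"); # Output: 68875 seconds since midnight
}
else {
    print("No match...\n");
}
```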

10.2 Find & replace

I've omitted the character 'm' in the patterns above. However, it makes sense to use it: in case of substitution, you cannot omit the character 's'.

$str = "Explore the versatility of Windows operating system.";
$old = "Windows";
$new = "Linux";

$str =~ s/$old/$new/;
print("$str\n");

$str = "Da, da, da!";
print("$str\n") if ($str =~ s/da/do/); # output: Da, do, da!

$str = "Da, da, da!";
print("$str\n") if ($str =~ s/da/do/g); # output: Da, do, do!

$str = "Da, da, da!";
print(ucfirst($str) . "\n") if ($str =~ s/da/do/gi); # output: Do, do, do!

$str = "          trim_string                ";
$str =~ s/^\s+//;
$str =~ s/\s+$//;
print("|$str|\n"); # output: |trim_string|

It can be written a bit more concisely using the | operator:

$str = "          trim_string                ";
$str =~ s/(^\s+|\s+$)//g;
print("|$str|\n"); # output: |trim_string|
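
If trimming is needed in more than one place, the substitution can be wrapped in a subroutine. A minimal sketch; the name trim is an assumption, Perl has no built-in function of that name in the versions this tutorial targets.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a trimmed copy and leaves the argument untouched.
sub trim {
    my ($s) = @_;         # work on a copy of the argument
    $s =~ s/^\s+|\s+$//g; # strip leading and trailing whitespace
    return $s;
}

my $raw   = "   trim_string   ";
my $clean = trim($raw);
print("|$clean|\n"); # Output: |trim_string|
print("|$raw|\n");   # Output: |   trim_string   |
```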

The 'lookahead anchor' mentioned in 10.1.1 is useful for replacing specific characters followed by a specific pattern. In the next example, --- should only be replaced if it is followed by ### (notice that ### is not included in the match).

$str = "---##---#---###";
$str =~ s/---(?=###)/xxx/g;
print("$str\n"); # output: ---##---#xxx###

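There is also a negative lookahead. If you want to replace specific characters that are not followed by a specific pattern, use (?!regex). (A 'lookbehind anchor', written (?<=regex) or (?<!regex), tests the text before the match instead.)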

$str = "---##---#---###";
$str =~ s/---(?!###)/xxx/g;
print("$str\n"); # output: xxx##xxx#---###
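
For comparison, a sketch of the lookbehind forms (?<=regex) and (?<!regex), which test what comes before the match rather than what follows it. The same sample string is reused; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "---##---#---###";

# Replace --- only when it is preceded by a '#'.
(my $after_hash = $str) =~ s/(?<=#)---/xxx/g;
print("$after_hash\n"); # Output: ---##xxx#xxx###

# Replace --- only when it is NOT preceded by a '#'.
(my $not_after_hash = $str) =~ s/(?<!#)---/xxx/g;
print("$not_after_hash\n"); # Output: xxx##---#---###
```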

10.3 Regex: how-to

How to use a regex modifier dynamically?
Match example:

$mod = "i"; # case insensitive 
$pattern = "(?$mod)reinier";
$str = "My name is Reinier reinier!";
print("The name 'reinier' was found! " . scalar(@all) . " times \n") if (@all = $str =~ /$pattern/g);
 # The name 'reinier' was found! 2 times

$mod = "i"; # case insensitive 
$pattern = "rei((?$mod)N)ier"; # case insensitive on a selection, here the character N 
$str = "My name is reinier reiNier!";
print("The name 'reinier' was found! " . scalar(@all) . " times \n") if (@all = $str =~ /$pattern/g);
 # The name 'reinier' was found! 2 times

Substitution example:

$mod = "i"; # case insensitive 
$str = "Da, da, da!";
$pattern = "(?$mod:da)";
print("$str\n") if ($str =~ s/$pattern/do/g); # output: do, do, do!
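
When the modifier comes from outside the program, it may be worth checking it before it is pasted into a pattern. The following is a sketch under that assumption; build_pattern is a hypothetical helper, and restricting the modifiers to i, m, s and x is a choice made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps $text in a (?flags:...) group built at run time.
sub build_pattern {
    my ($text, $mods) = @_;
    die("Unsupported modifier(s): $mods\n") unless ($mods =~ /^[imsx]*$/);
    my $source = "(?" . $mods . ":" . $text . ")";
    return qr/$source/;
}

my $str = "My name is Reinier reinier!";
my $ci  = build_pattern("reinier", "i"); # case insensitive
my $cs  = build_pattern("reinier", "");  # no modifier: case sensitive

my @ci_hits = ($str =~ /$ci/g);
my @cs_hits = ($str =~ /$cs/g);
print(scalar(@ci_hits) . " " . scalar(@cs_hits) . "\n"); # Output: 2 1
```

An empty modifier string yields (?:reinier), a plain non-capturing group, so the same helper covers both cases.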

Footnotes
(1) The dot character is a so-called metacharacter, a character that has a special meaning during pattern processing. The dot character refers to any character (except \n). If you want to use these metacharacters literally, you have to escape them. So, if you want a literal dot within a regular expression, use \. etc. You'll find some examples of metacharacters in the table of 10.1.6 Subexpressions.