Softpanorama
May the source be with you, but remember the KISS principle ;-)


Introduction to Perl 5.10 for Unix System Administrators

(Perl 5.10 without excessive complexity)

by Dr Nikolai Bezroukov


5.2. Overview of Perl regular expressions


The Hello World Example

As mentioned before, regular expressions are a language inside the language. Regex should be viewed as a separate language that has no direct connection to Perl: it is used in many other languages (Python, PHP, Java) in almost the same form, just with different syntactic sugar. Still, Perl was the first language to introduce "close binding" of regex and the language per se, a feature that was later more or less successfully copied by Python, Tcl and other languages. The level of integration of the regular expression language into the main language is also higher in Perl than in any alternative scripting language. But because regex is a different language, some problems arise. For example, the Perl debugger cannot step through a regular expression.

The Perl regular expression engine evolves gradually. The latest significant changes were introduced in version 5.10; they make it more powerful and less prone to errors. This is the minimal version of Perl recommended for any serious text parsing work.

As regular expressions (regex for short) are a new language, using the famous "Hello world" program as the first example seems appropriate. As a remnant of the shell/AWK legacy, a regular expression is lexically a special type of literal (similar to a double-quoted literal).

It is usually (but not necessarily) enclosed in slashes. In a matching operation the source string (where matching occurs) is specified on the left side of the special =~ operator (the binding operator), while the regex is on the right side.

The simplest case is searching for a substring in a string, as the built-in function index does. The following expression is true if the string Hello appears anywhere in the variable $sentence.

$sentence = "Hello world";
if ($sentence =~ /Hello/) {...} # true if the string Hello appears in the variable $sentence

Regular expressions (also called regex or RE) are case sensitive, so if we assign to $sentence the same string but in lower case

$sentence = "hello world";

then the above match will fail.

The operator !~ can be used for a non-match. For example

$sentence !~ /Hello/

is true if the string Hello does not appear in $sentence.
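The points above can be checked with a minimal sketch like the following. It assumes nothing beyond the sample strings from the text; the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $sentence = "Hello world";
my $lower    = "hello world";

# =~ is true when the pattern is found anywhere in the string
my $by_regex = ($sentence =~ /Hello/) ? 1 : 0;

# index returns the position of the substring, or -1 if it is absent
my $by_index = (index($sentence, "Hello") >= 0) ? 1 : 0;

# Matching is case sensitive, so /Hello/ does not match the lowercase string
my $lower_match = ($lower =~ /Hello/) ? 1 : 0;

# !~ is the negation of =~
my $lower_nomatch = ($lower !~ /Hello/) ? 1 : 0;

print "regex: $by_regex, index: $by_index, ",
      "lowercase match: $lower_match, lowercase non-match: $lower_nomatch\n";
```

For a fixed substring, index and a regex give the same answer; the regex becomes worthwhile once the pattern is more than a literal string.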

Alternatively you can use delimiters other than slashes, for example m{...} or a precompiled qr(...) pattern. That's very important if your regex contains a lot of slashes:

$url !~ qr(/cygdrive/f/public_html)
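As a rough illustration of why the choice of delimiters matters, the sketch below matches the same path three ways. The sample value of $url is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample value
my $url = "/cygdrive/f/public_html/index.html";

# With slashes as delimiters every slash in the pattern must be escaped
my $with_slashes = ($url =~ /\/cygdrive\/f\/public_html/) ? 1 : 0;

# With m{...} the slashes need no escaping
my $with_braces = ($url =~ m{/cygdrive/f/public_html}) ? 1 : 0;

# qr(...) builds a compiled pattern that can be stored in a variable and reused
my $pattern = qr(/cygdrive/f/public_html);
my $with_qr = ($url =~ $pattern) ? 1 : 0;

print "slashes: $with_slashes, braces: $with_braces, qr: $with_qr\n";
```

All three forms match the same text; the second and third are simply easier to read.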

Two types of regex

There are two main uses for regular expressions in Perl: matching (the m// operator) and substitution (the s/// operator).

Regular expressions in Perl operate against strings. No arrays on the left-hand side of a matching statement, please.

Two Binding Operators (=~ and !~)

As we mentioned before, $_ is the default operand for regular expressions. But the string against which to perform the match or substitution can be specified explicitly with the operator =~ and its negation !~. For example:

$my_string = "The graph has many leaves";
$result = $my_string;   # work on a copy so that the initial string is preserved
if ( $result =~ m/graph/ ) {
   print("The source string contains the word 'graph'.\n");
   $result =~ s/graph/tree/;
   print "Replaced with 'tree'\n";
}
print("Initial string: '$my_string'.\nThe result is '$result'\n");

In this example each of the regular expression operators applies to an explicitly named variable ($result) instead of $_.
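To make the difference between the default operand and an explicit binding concrete, here is a small sketch. The two sample lines and the array names are assumptions made for illustration.

```perl
use strict;
use warnings;

my @lines = ("The graph has many leaves", "No match here");
my @hits_default;
my @hits_explicit;

# Without a binding operator the match is applied to $_
for (@lines) {
    push @hits_default, $_ if /graph/;
}

# With =~ the match is applied to the named variable
for my $line (@lines) {
    push @hits_explicit, $line if $line =~ /graph/;
}

print scalar(@hits_default), " ", scalar(@hits_explicit), "\n";
```

Both loops select the same line; the explicit form is usually easier to follow once a script grows.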

Success and Failure of Matching

The result of a match or a substitution can be stored in a scalar variable. This gives us a way to determine whether the operation succeeded:

@test_array=("The graph has many leaves",
             "Fallen leaves, so many leaves on the ground.");
foreach $test (@test_array) {
   $match = ($test =~ m/leaves/);
   print("Result of match of word 'leaves' in string '$test' is $match\n");
}

This program displays the following:

Result of match of word 'leaves' in string 'The graph has many leaves' is 1
Result of match of word 'leaves' in string 'Fallen leaves, so many leaves on the ground.' is 1

The other useful feature of this example is that it shows how to obtain the return values of the regular expression operators. If a subsequent action depends on the value of a changed variable, you should always check whether the operation succeeded or failed, because way too often regular expressions behave differently than their creators expect.

In scalar context the match operator returns 1 if the match succeeded and an empty string (false) if it failed. It does not return the number of matches, which is why the second string above, with two occurrences of the word, still produces 1. The substitution operator, in contrast, returns the number of substitutions made.
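The return values can be inspected directly. The sketch below reuses one of the sample strings; the /g modifier, which is not covered in this section, is used here only to show how a count of matches can be obtained.

```perl
use strict;
use warnings;

my $text = "Fallen leaves, so many leaves on the ground.";

# Scalar context: 1 on success, an empty string on failure
my $ok   = ($text =~ m/leaves/);
my $fail = ($text =~ m/graph/);

# List context with /g: one element per match, so the count is available
my @all   = ($text =~ m/leaves/g);
my $count = scalar @all;

# s///g returns the number of substitutions made
my $copy     = $text;
my $replaced = ($copy =~ s/leaves/needles/g);

print "ok='$ok' fail='$fail' count=$count replaced=$replaced\n";
print "$copy\n";
```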

We could use a conditional to check whether the match was successful:

$sentence = "Disneyworld in Orlando";
if ($sentence =~ /world/){
   print "there is a substring 'world' somewhere in the sentence: $sentence\n";
}

Sometimes it's easier to test the special variable $_, especially if you need to test each input string in the input loop. In this case you can write something like:

while (<>) { # read lines such as "Hello world" from the input stream
   chomp;    # remove the trailing newline so it is not printed inside the quotes
   if (/world/) {
      print "There is a word 'world' in the sentence '$_'\n";
   }
}

As we have already seen, the $_ variable is the default operand for many Perl built-in functions and operators (tr, split, etc.).

Regular Expressions Metacharacters

The problem with regex metacharacters is that there are plenty of them. They give a sophisticated user a lot of power and at the same time make regular expressions appear very complicated, at least at the very beginning.

It's best to build up your skills slowly: creating a complex regex can be considered a kind of art form (like solving a puzzle or a chess problem). Please pay special attention to non-greedy (lazy) quantifiers, as they are simpler to use and less prone to errors.

It makes a lot of sense to debug a complex regular expression first in a special test script, feeding it sample strings and observing the output.
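A test script of this kind can be very small. In the sketch below both the pattern under test and the sample strings are hypothetical; replace them with your own.

```perl
use strict;
use warnings;

# Hypothetical pattern under test and sample strings chosen for illustration
my $regex   = qr/^\d+\.\d+\.\d+\.\d+$/;
my @samples = ("10.0.0.1", "10.0.0", "host.example.com", "192.168.1.255");

my %result;
for my $sample (@samples) {
    # Record and print the outcome for every sample string
    $result{$sample} = ($sample =~ $regex) ? "match" : "no match";
    printf "%-20s %s\n", $sample, $result{$sample};
}
```

Keeping the samples in one array makes it easy to add the strings that surprised you in production and rerun the script after every change to the pattern.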


There are three types of metacharacters: regular metacharacters, anchors, and quantifiers. Each type is described below.

Because they are used as metacharacters, the characters $, |, [ ], { }, ( ), \, ^, / and several others (such as ., *, + and ?) must be preceded by a backslash when they should be matched literally, for example:

\|		# Vertical bar
\[		# An open square bracket
\)		# A closing parenthesis
\*		# An asterisk
\^		# A caret symbol
\/		# A slash
\\		# A backslash

For example:

$ip_addr=~/\d+\.\d+\.\d+\.\d+/; # dot character should be escaped
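The effect of a forgotten backslash is easy to demonstrate. The sketch below uses made-up sample strings; it also shows \Q...\E and the built-in quotemeta function, which escape metacharacters inside an interpolated variable.

```perl
use strict;
use warnings;

my $ip_addr = "192.168.1.10";
my $bogus   = "192x168x1x10";    # hypothetical string that is not an IP address

# Escaped dots match only literal dots, so the bogus string is rejected
my $strict = ($bogus =~ /^\d+\.\d+\.\d+\.\d+$/) ? 1 : 0;

# Unescaped dots match any character, so the bogus string is wrongly accepted
my $loose = ($bogus =~ /^\d+.\d+.\d+.\d+$/) ? 1 : 0;

# \Q...\E escapes the metacharacters of an interpolated variable
my $literal = ("host 192.168.1.10 up" =~ /\Q$ip_addr\E/) ? 1 : 0;

# quotemeta returns the escaped string itself
my $escaped = quotemeta($ip_addr);

print "strict: $strict, loose: $loose, literal: $literal, escaped: $escaped\n";
```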

Regular metacharacters

Regular metacharacters are special characters that represent some class of symbols. Each one consumes one character from the string when it matches (with a quantifier it can consume more, or none). In other words, a regular metacharacter 'eats' characters of the class it represents. A good example is . (dot), which matches any character. Among the most common regular metacharacters are:

  1. . -- matches any single character except a newline. The /s modifier makes . match a newline too.

  2. \d -- matches a digit. Equivalent to [0-9]
  3. \w -- matches a word character (underscore is counted as a word character here). Equivalent to [a-zA-Z_0-9] for ASCII text
  4. \s -- matches a whitespace character (space, tab, newline). Equivalent to [ \t\n\r\f]
  5. Classes. Classes can be called "definable metacharacters". A class is a group of characters in square brackets, given as a set or as ranges: a - (minus) between two characters means "between", and a ^ right after [ means "not". For example, [a-z] matches any lowercase Latin letter and [^0-9] matches any character that is not a digit.

If you use a capital letter instead of a lowercase letter, the meaning of the metacharacter is reversed: \D matches any character that is not a digit, \W any non-word character, and \S any non-whitespace character.
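The following sketch applies several of these metacharacters to one made-up log record, capturing the first run of characters that each of them matches.

```perl
use strict;
use warnings;

# Hypothetical sample record
my $record = "user_42 logged in at 10:15";

my ($digits)   = $record =~ /(\d+)/;      # first run of digits
my ($word)     = $record =~ /(\w+)/;      # first run of word characters
my ($nondigit) = $record =~ /(\D+)/;      # first run of non-digits
my ($lower)    = $record =~ /([a-z]+)/;   # user-defined class: lowercase letters
my ($notspace) = $record =~ /([^ ]+)/;    # negated class: anything but a space

print "digits=$digits word=$word nondigit=$nondigit ",
      "lower=$lower notspace=$notspace\n";
```

Note that \w+ stops at the space, while [a-z]+ stops earlier, at the underscore, because the underscore is a word character but not a lowercase letter.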

Anchors

Anchors are metacharacters that serve as markers and never consume characters from the string. An anchor does not match a character; it matches a condition: no character needs to be present, only some logical condition at this place in the string needs to be true. Anchors just tell the regex engine whether a match may occur at the current position. The two most common anchors are ^, which matches at the beginning of the string, and $, which matches at the end of the string (or just before a trailing newline).
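A short sketch of the two anchors at work; the sample line is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample line
my $line = "error: disk full";

my $starts   = ($line =~ /^error/) ? 1 : 0;   # 'error' at the start of the string
my $ends     = ($line =~ /full$/)  ? 1 : 0;   # 'full' at the end of the string
my $mid_only = ($line =~ /^disk/)  ? 1 : 0;   # 'disk' is present, but not at the start
my $empty    = (""    =~ /^$/)     ? 1 : 0;   # both anchors with nothing between them

# $ also matches just before a trailing newline
my $with_nl  = ("disk full\n" =~ /full$/) ? 1 : 0;

print "starts=$starts ends=$ends mid_only=$mid_only empty=$empty with_nl=$with_nl\n";
```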

Quantifiers

Perl has three groups of quantifiers (which are also metacharacters, but they affect the interpretation of the preceding character or group). The most important ones form three groups with two members in each, one greedy and the other non-greedy (lazy): * and *?, + and +?, ? and ??.

Non-greedy quantifiers are newer but easier to understand, as they correspond to a search for the first occurrence of the substring that follows them, while greedy quantifiers correspond to a search for its last occurrence. That's the key difference. We will discuss non-greedy quantifiers in the next section: More Complex Perl Regular Expressions
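The difference between the two kinds is easiest to see on a string in which the closing substring occurs twice. The sample text below is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text with two tagged fragments
my $text = "<b>one</b> and <b>two</b>";

# Greedy: .* runs as far as possible, up to the last </b>
my ($greedy) = $text =~ /<b>(.*)<\/b>/;

# Lazy: .*? stops at the first </b> that allows a match
my ($lazy) = $text =~ /<b>(.*?)<\/b>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```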

For example:

$sentence="Hello world"; 
if ($sentence =~ /^\w+/) { # true if the sentence starts with a word like "Hello"  
   print "The string $sentence starts with a word\n";
}

The full list includes 12 quantifiers:

Maximal     Minimal     Allowed range
(greedy)    (lazy)
{n,m}       {n,m}?      Must occur at least n times but no more than m times
{n,}        {n,}?       Must occur at least n times
{n}         {n}?        Must match exactly n times
*           *?          0 or more times (same as {0,})
+           +?          1 or more times (same as {1,})
?           ??          0 or 1 time (same as {0,1})
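The counted quantifiers can be tried on a plain run of digits. This is only a sketch; the sample string is arbitrary.

```perl
use strict;
use warnings;

my $digits = "1234567";

my ($exact)      = $digits =~ /(\d{3})/;      # exactly three digits
my ($range)      = $digits =~ /(\d{2,5})/;    # greedy: as many as allowed, five
my ($range_lazy) = $digits =~ /(\d{2,5}?)/;   # lazy: as few as allowed, two
my ($at_least)   = $digits =~ /(\d{4,})/;     # four or more: takes all seven

print "exact=$exact range=$range range_lazy=$range_lazy at_least=$at_least\n";
```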

We will discuss additional quantifiers later

Examples of simple regex

It's probably best to build up your use of regular expressions slowly, from the simplest cases to more complex ones. You are always better off starting with simple expressions, making sure that they work, and then adding more complex elements one by one. Unless you have a couple of years of experience with regex, do not even try to construct a complex regex in a single step.

Here are a few examples:

$a = '404 - - ';
$a =~ /40\d/; # matches 400, 401, 403, etc.

Here we took a fragment of an HTTP log record and tried to match the return code. Note that you can match any part of the integer, not only the whole integer. A similar idea works for real numbers, but generally real numbers have a much more complex syntax:

$target='simple real number: 22.33';
$target=~/\d+\.\d*/;

Note: the regex /\d+\.\d*/ isn't general enough to match all the real numbers permissible in Perl or any other programming language. This is actually a pretty difficult problem, given all of the formats that programming languages usually support, and here regular expressions are of limited use: a lexical analyzer is a better tool.
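The limits of this pattern show up as soon as it is fed a few other numeric formats. In the sketch below the pattern is anchored with ^ and $ so that the whole candidate must match; the list of candidates is an assumption made for illustration.

```perl
use strict;
use warnings;

# The pattern from the text, anchored for testing
my $simple_real = qr/^\d+\.\d*$/;

my %accepted;
for my $candidate ("22.33", "0.", ".5", "1e10", "-3.14") {
    $accepted{$candidate} = ($candidate =~ $simple_real) ? 1 : 0;
    print "$candidate => $accepted{$candidate}\n";
}
```

Only the first two candidates are accepted, although Perl itself treats all five as numbers.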

Now let's try to match words. The simplest regular expression that matches a single word is \w+. Here are a couple of examples:

$target='hello world';
$target =~ m{(\w+)\s+(\w+)}; # detecting two words separated by white space
$target='A = b';
$target =~ /(\w+)\s*=\s*(\w+)/; # another way to ignore white space in matching
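The parentheses in these two patterns capture the matched words. A minimal sketch of how the captured values might be retrieved; the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $target = 'hello world';
my ($first, $second);
if ($target =~ m{(\w+)\s+(\w+)}) {
    # After a successful match the captured groups are available as $1 and $2
    ($first, $second) = ($1, $2);
}

$target = 'A = b';
# In list context the match returns the captured groups directly
my ($name, $value) = $target =~ /(\w+)\s*=\s*(\w+)/;

print "first=$first second=$second name=$name value=$value\n";
```

Always test that the match succeeded before using $1 and $2: after a failed match they keep the values from the previous successful one.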

Here are more examples of simple regular expressions that might be reused in other contexts:

/t.t/            # t followed by any character (except newline) followed by t
/^131/           # 131 at the beginning of a line
/0$/             # 0 at the end of a line
/\.txt$/         # .txt at the end of a line
/^newfile\.\w*$/ # newfile. followed by zero or more word characters
                 # This will match newfile.txt, newfile.pl, newfile., etc.
/^.*marker/      # head of the string up to and including the word "marker"
/marker.*$/      # tail of the string starting from the word "marker" till the end (up to newline)
/^$/             # An empty line

Several additional examples:

/0/                  # zero: "0"
/0*/                 # zero or more zeros
/0+/                 # one or more zeros
/0*0/                # same as above
/\d/                 # any digit but only one
/\d+/                # any unsigned integer
/\d+\.\d*/           # a subset of real numbers. Please note that 0. is a real number
/\d+\.\d+\.\d+\.\d+/ # IP addresses (no control of the number of digits, so 1000.1000.1000.1000 would match this regex)
/\d+\.\d+\.\d+\.255/ # IP addresses ending with 255
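One way to get the missing control over the digits is to limit each part to three digits in the regex and to check the numeric range in ordinary Perl code. The helper name is_ipv4 below is hypothetical, and this is a sketch rather than a complete validator.

```perl
use strict;
use warnings;

# Hypothetical helper: accepts dotted quads whose four parts are all in 0..255
sub is_ipv4 {
    my ($candidate) = @_;

    # The regex checks the shape: four groups of one to three digits
    return 0 unless $candidate =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;

    # Plain Perl checks the numeric range of each captured group
    for my $octet ($1, $2, $3, $4) {
        return 0 if $octet > 255;
    }
    return 1;
}

for my $candidate ("192.168.1.255", "1000.1000.1000.1000", "256.1.1.1", "10.0.0") {
    print "$candidate => ", is_ipv4($candidate), "\n";
}
```

Splitting the work this way keeps the regex simple; a pure-regex range check is possible but much harder to read.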

Tips:




Etc

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes.   If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner. 



Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.


Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...


Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: January 26, 2017