Introduction to Perl

PERL

- is an acronym for Practical Extraction and Report Language;
- is a freely available program developed by Larry Wall;
- has been ported to Unix, DOS, Macintosh and others, so it is close to being a universal scripting language;
- is a blend of C, the Unix shell languages (sh and csh), and the notation of the Unix awk program.

AN EXAMPLE PERL PROGRAM

The program reads as below. Note that the #! characters on the first line must be in columns one and two. The remainder of the first line must give the location of the executable file for the perl program. (If the script file is flagged as executable, the Unix system will automatically invoke perl when you use the name of the perl script as a command.)

    #!/public/bin/perl

    chdir("/usr/man/man1") || die "cannot cd to /usr/man/man1";

    # loop over all files in directory /usr/man/man1
    while ($filename = <*>) {

	if (!open(MANFILE, "$filename")) {
	    print STDERR "could not read $filename\n";
	    next;
	}

	# read the file, looking for a line reading:
	#		.SH NAME
	while (<MANFILE>) {
	    if (/^\.SH  *NAME\b/) {
		last; # exit the loop
	    }
	}

	if ($_) {
	    # we found that line, read the following line
	    $_ = <MANFILE>;
	    if ($_) {
		s/\\//g; # remove backslashes in $_
		print $_;
	    } else {
		# It's an error if we are at the end of file
		print STDERR "$filename: eof found after .SH NAME\n";
	    }
	} else {
	    print STDERR "$filename: no .SH NAME line found\n";
	}
	close(MANFILE);
    }

Our example program visits a system directory named /usr/man/man1. In that directory, there are files containing unformatted on-line manual information. For example, there may be a file named ls.1v, and this contains the on-line documentation for the ls command. The file is meant to be input to the nroff or troff text formatting packages on Unix. (The groff program is an equivalent to nroff and troff.) Our Perl script reads each file looking for a line which is exactly as follows.

	.SH NAME

(where the period is in column one). This line appears in each description to introduce a one-line synopsis of the command. For example, on SunOS, the ls.1v file contains the two lines

	.SH NAME
	ls \- list the contents of a directory

Our Perl script will output that line (after removing the backslash character).

The net result is that we obtain synopses of all the Unix commands for which documentation is available.

Here is a quick run through of the program logic:

We change our directory into /usr/man/man1 where the on-line documentation for commands normally resides. If we cannot change to that directory, we print an error message and quit.
The construct <*> matches all the filenames in the current directory, and delivers the next filename to us each time we use that construct.
We assign the next filename into the variable $filename. (When there are no more filenames, the <*> construct delivers a special undef value; this value is considered equivalent to False and causes the loop to terminate.)
We open the file for input. If we cannot open the file, we print an error message and go back to the top of the loop for the next file.
We enter a loop, reading one line from the file on each iteration. The construct <MANFILE> implicitly reads the next line (as a string value) into the variable named $_. If no lines remain to be read, the undef value is assigned to $_.
Inside the while loop, we use regular expression pattern matching to check the contents of $_. If $_ matches the pattern ^\.SH  *NAME\b then we exit the loop.
(The ^ forces a match at the beginning of the line; the \. matches a literal period; the pattern contains two spaces before the *, so one space is required and the * allows zero or more repetitions of the second; the \b construct insists that we are at a word boundary.)
If we didn't hit the end of the file, we read the next line into variable $_. If we successfully read that line, we use the statement
```
	s/\\//g;
```
to remove all occurrences of a backslash in variable $_. (Without the g, only the first occurrence would be changed.)
And then we print the line in $_. In fact, the statement could be abbreviated to just:
```
	print;
```
because $_ will be supplied as a default argument if none is given in the program. (And this default behaviour applies to many Perl operations.)
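
The same search logic can be tried without access to /usr/man/man1. The following is a minimal sketch, not the author's program: it assumes the lines of one manual page are held in an array (the contents of @lines and the names $seen_name and $synopsis are invented for illustration) and applies the same pattern and substitution.

```perl
# Lines of a hypothetical manual page, held in an array instead of a file.
@lines = (
    ".TH LS 1\n",
    ".SH NAME\n",
    "ls \\- list the contents of a directory\n",
    ".SH SYNOPSIS\n",
);

$seen_name = 0;       # becomes 1 once the .SH NAME line has gone by
$synopsis  = "";
foreach (@lines) {               # each element is placed in $_ in turn
    if ($seen_name) {
        $synopsis = $_;          # the line after .SH NAME
        $synopsis =~ s/\\//g;    # remove backslashes from the copy
        last;
    }
    if (/^\.SH  *NAME\b/) {      # pattern is matched against $_
        $seen_name = 1;
    }
}

if ($synopsis) {
    print $synopsis;
} else {
    print STDERR "no synopsis found\n";
}
```

Copying $_ into $synopsis before the substitution leaves the array itself unchanged.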

SCALAR DATATYPES AND CONSTANTS

The scalar (simple) datatypes are numbers and strings.
There is no distinction between integers and floats; they are all stored internally as double-length floats.

Examples of numeric constants:
```
	0   99   -123   2.34   -6.3e23   24.34E-4
```
There are two forms of strings. If single quotes are used, the only escape combinations are \' and \\. The string can contain any characters including linefeeds. If double quotes are used, all the usual C escape codes may be used (plus some more) and, as in csh, variable names may get expanded.

Examples of string constants:
```
	'hi'    'don\'t'   'the backslash is \\, so there!'
	'line one
	line two'

	"line one\nline two"   "a doublequote is \"!"
```
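
The difference between the two quoting forms is easiest to see side by side. The following is a small sketch; the variable names are chosen here for illustration, and length() is a standard Perl function that is not otherwise covered in these notes.

```perl
# Single quotes keep the text literally; double quotes expand
# escape codes and variable names.
$name = "Larry";

$single = 'Hello, $name\n';   # no expansion: holds $name and \n as typed
$double = "Hello, $name\n";   # $name is replaced, \n becomes a linefeed

print $single, "\n";
print $double;

# length() counts characters, so the two strings differ in length.
print length($single), " versus ", length($double), "\n";
```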

OPERATORS

The arithmetic operators are:
```
	+   -   *   /   %   **
```
(% is modulus, as in C; ** is exponentiation.)
Numeric comparison operators are
```
	<   <=   ==   >=   >   !=
```
String comparison operators are
```
	lt   le   eq   ge   gt   ne
```
String operators are
```
	.     x
```
The dot operator concatenates two strings. E.g.
```
	"abc" . "def"
```
has the same value as "abcdef". The x operator performs replication. E.g.
```
	"abc" x 3
```
produces the string value "abcabcabc".
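
Choosing between the numeric and the string comparison operators matters, because the same two values can compare differently. A short sketch with illustrative variable names (the ?: conditional operator is borrowed from C):

```perl
# The same two values compared numerically and as strings.
$p = "10";
$q = "10.0";

$num_equal = ($p == $q) ? "yes" : "no";   # numeric: 10 equals 10.0
$str_equal = ($p eq $q) ? "yes" : "no";   # string: "10" differs from "10.0"
print "numerically equal: $num_equal\n";
print "equal as strings:  $str_equal\n";

# String ordering goes character by character, so "9" comes after "10".
$str_less = ("9" lt "10") ? "yes" : "no";
$num_less = (9 < 10) ? "yes" : "no";
print "\"9\" lt \"10\": $str_less;  9 < 10: $num_less\n";

# Replication and concatenation used together.
$rule = ("-" x 10) . "\n";
print $rule;
```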

VARIABLES

There are scalar (simple) variables whose names begin with $.
There are array variables whose names begin with @.
There are associative array variables whose names begin with %.

The character after the $, @ or % must be a letter or an underscore. The remainder of the name may be composed of letters, digits and underscores.

Names are case sensitive.

Examples of variable names are:

	$Counter   $i   $a_silly_name_37X
	@Array1   @A_B_C
	%Map_Number_37

ASSIGNMENT OPERATORS, etc

These are modelled after C.

The assignment operator is =. E.g.

	$X = $Y + 1;
	@Array1 = ( 0, "Hi!", -4.5, "Bye!" );

Arithmetic assignment operators are (similarly to C):

	+=   -=   *=   /=   %=   **=
	++   --

E.g.

	$X += 10;   $Y = ++$X;   $Z--;

String assignment operators are:

	.=   x=

Examples:

	$A .= "abc";	## same as $A = $A . "abc";
	$B x= 3;	## same as $B = $B x 3;

The operator chop() chops the last character off a string variable. Examples:

	$A = "abcdef";
	chop($A);  	## sets $A to be "abcde"
	$X = chop($A);  ## sets $X = "e", $A = "abcd"
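
The assignment operators can be traced step by step. This is a small sketch with illustrative values; the comments give the value of each variable after the statement runs.

```perl
$X = 5;
$X += 10;            # $X is now 15
$Y = ++$X;           # prefix: $X is incremented first, so both are 16
$Z = $X--;           # postfix: $Z gets 16, then $X drops back to 15

$A = "ab";
$A .= "cd";          # "abcd"
$A x= 2;             # "abcdabcd"
$last = chop($A);    # removes and returns the final "d"

print "X=$X Y=$Y Z=$Z A=$A last=$last\n";
```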

OPERATIONS ON ARRAYS

Arrays can be combined in expressions. For example:

	@A1 = ( 0, "A", -4.5 );
	@A2 = ( "First", @A1, "Last" );

Arrays are indexed using the usual notation. But the array name must be prefixed with $ (not @) to indicate a scalar value. Arrays have, by default, zero origin indexing as in C. Example:
```
	@A1 = ( 0, "A", -4.5 );
	$i = $A1[1];	## assigns "A" to $i
	$A1[1] = 3.3;	## replaces "A" with 3.3
```
If an entire array is assigned to a scalar variable, it is the array length that is assigned. E.g.
```
	@A1 = ( 0, "A", -4.5 );
	$Len = @A1;	## set $Len = 3
```
If an array value is assigned to a list of variables, elements from the front of the array are extracted. E.g.
```
	@A1 = ( 0, "A", -4.5 );
	($V0, $V1) = @A1;	## sets $V0 = 0, $V1 = "A"
	($X) = @A1;		## sets $X = 0
```

Slices of arrays may be extracted by providing a list of index values. E.g.

	@A1 = (10, 20, 30, 40, 50, 60);
	@A2 = @A1[2..4];	## sets @A2 = (30, 40, 50);
	@A3 = @A1[4,0,2];	## sets @A3 = (50, 10, 30);

Arrays are often used to implement stacks. Hence push() and pop() operators are provided.

	push(@A, $val);		## same as
				##	@A = (@A, $val);

	$X = pop(@A);		## removes last value from @A
				## and assigns it to $X

Alternatively, arrays may be used to implement queues. The unshift() and shift() operations insert at and remove from the front of an array; for first-in, first-out behaviour, pair shift() with push() (or unshift() with pop()).

	unshift(@Q, $val);	## same as
				##	@Q = ($val, @Q);

	$x = shift(@Q);		## same as
				##	$x = $Q[0];
				##	@Q = @Q[1..@Q-1];
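
The four operations differ only in which end of the array they touch. The sketch below uses illustrative array names; in it, the queue adds with push() and removes with shift() so that values leave in the order they arrived.

```perl
# A stack: push and pop both work at the end of the array.
@stack = ();
push(@stack, "a");
push(@stack, "b");
push(@stack, "c");
$top = pop(@stack);          # "c": last in, first out

# A queue: add at the end with push, remove from the front with shift.
@queue = ();
push(@queue, "a");
push(@queue, "b");
push(@queue, "c");
$first = shift(@queue);      # "a": first in, first out

# unshift places a value at the front.
unshift(@queue, "z");        # @queue is now ("z", "b", "c")

print "popped $top, shifted $first\n";
print "stack: @stack\n";     # an array inside double quotes is
print "queue: @queue\n";     # printed with spaces between elements
```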

Array operations include reverse(), sort() and chop(). E.g.

	@A1 = (1.1, 5.5, 2.2, 6.6, 3.3, 4.4);
	@A2 = reverse @A1;	## sets @A2 = (4.4,3.3, ... 1.1)
	@A3 = sort @A1;		## sets @A3 = (1.1,2.2, ... 6.6)

Applying chop() to an array is the same as applying chop() to each element.
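
One point worth knowing about sort(): by default it compares elements as strings, which happens to give the expected order for the values above but not for numbers with different digit counts. The sketch below uses a comparison block with the <=> operator and the special variables $a and $b; these are standard Perl 5 features that are not covered elsewhere in these notes.

```perl
@N = (10, 9, 100, 1);

@by_string  = sort @N;                  # compares elements as strings
@by_number  = sort { $a <=> $b } @N;    # compares elements as numbers
@descending = reverse @by_number;

print "string order:  @by_string\n";
print "numeric order: @by_number\n";
print "descending:    @descending\n";
```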

CONTROL STRUCTURES

The control structures are similar to C except that the keyword continue is renamed to next and break becomes last. There are also some additions, which appear in the following list.

A block of sequentially executed statements ...
{ stmt₁; stmt₂; ... stmt_M; }
If and If-then-else statements ...
if ( expr ) { stmt₁; ... stmt_M; }
if ( expr ) { stmt₁; ... stmt_M; } else { stmt₁; ... stmt_N; }
if ( expr1 ) { stmt₁; ... stmt_M; } elsif ( expr2 ) { stmt₁; ... stmt_N; } ... else { stmt₁; ... stmt_P; }
Unless statement (is equivalent to an if statement with a negated test) ...
unless ( expr ) { stmt₁; ... stmt_M; }
While loop ...
while ( expr ) { stmt₁; ... stmt_M; }
Until loop (like a while loop with a negated test) ...
until ( expr ) { stmt₁; ... stmt_M; }
For loop (just like in C) ...
for( init_expr; test_expr; incr_expr ) { stmt₁; ... stmt_M; }
Foreach loop (same as in csh script language) ...
foreach $Var (@ListOfValues) { stmt₁; ... stmt_M; }
The next statement ...
causes the next iteration of the loop to start immediately. I.e., control jumps to the top of the loop, as with a C continue statement.
The last statement ...
causes Perl to exit the closest containing loop, as with a C break statement.
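
Several of these structures can be seen working together in a short sketch. The data and variable names are made up for illustration.

```perl
# Sum the odd numbers in a list, stopping at the first negative value.
@values = (3, 8, 5, 12, 7, -1, 9);
$sum = 0;

foreach $v (@values) {
    if ($v < 0) {
        last;                # leave the loop entirely
    }
    unless ($v % 2) {
        next;                # even value: go on to the next element
    }
    $sum += $v;
}
print "sum of odd values = $sum\n";

# Count down with until: the body runs while the test is false.
$n = 3;
$countdown = "";
until ($n == 0) {
    $countdown .= "$n ";
    $n--;
}
print $countdown, "liftoff\n";
```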

INPUT-OUTPUT AND FILE OPERATIONS

Files are accessed using File Handles.
A Perl program has 3 predefined file handles named STDIN, STDOUT, and STDERR. They correspond to the standard input, standard output and standard error output streams of Unix, respectively.
Given a file handle for an input stream, say STDIN, we can read one line from the file using the <STDIN> notation. E.g.
```
	$Line = <STDIN>;
```
reads the next line, as a string value, into variable $Line.

If no more input remains, $Line is assigned the special value undef.

The input line includes the linefeed character ("\n"). Thus,
```
	$Line = <STDIN>;  chop($Line);
```
is common usage.
If the <STDIN> notation is used alone, the input line is implicitly assigned to the variable $_. Thus,
```
	while (<STDIN>) {
	    print $_;
	}
```
copies all the standard input to the standard output stream.
Output is sent by the print operator, by default, to the standard output stream. Thus:
```
	print "Variable X = ", $X, " Y = ", $Y, "\n"
```
sends five values in succession to the output.
An array value may also be sent to the output stream. E.g.,
```
	print @A;
```
A handle for an output file may be supplied immediately after the print operator. Thus,
```
	print STDERR "Unrecoverable error!\n"
```
The operator die prints to the standard error stream and then terminates the program. Example:
```
	die "Unrecoverable error!\n"
```
Formatted printing using the same format controls as in C is available with the printf operator. For example:
```
	printf "Pi = %10.8f; e = %10.8f\n", 4 * atan2(1, 1), exp(1);
```
A disk file may be opened for input as in:
```
	open(INPUTFILE,"/home/sue/project");
```
The open function returns a false value (undef) if the file cannot be opened.
A disk file may be opened for output as in:
```
	open(OUTPUTFILE,">results.txt");
```
(The leading '>' indicates output; if the first two characters are '>>', the file is opened in append mode.)
After use, a file may be closed by using the close function. E.g.
```
	close(INPUTFILE);
```
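
Putting the file operations together, the sketch below writes a small file, appends to it, and reads it back. The file name perl_intro_demo.txt is a hypothetical scratch name created in the current directory, and unlink() (a standard Perl function not covered in these notes) removes it at the end.

```perl
$file = "perl_intro_demo.txt";    # hypothetical scratch file name

# Write three lines.
open(OUT, ">$file") || die "cannot write $file";
print OUT "first\n", "second\n", "third\n";
close(OUT);

# Append one more.
open(OUT, ">>$file") || die "cannot append to $file";
print OUT "fourth\n";
close(OUT);

# Read the file back, numbering each line.
open(IN, $file) || die "cannot read $file";
$count = 0;
$copy = "";
while ($line = <IN>) {
    chop($line);                  # drop the trailing linefeed
    $count++;
    $copy .= "$count: $line\n";
}
close(IN);
print $copy;

unlink($file);                    # remove the scratch file
```

Testing the result of each open() with || die, as the example program at the start does, avoids reading from or writing to a handle that was never opened.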

PATTERN MATCHING

Pattern expressions are pervasive in professional computing (and especially in Unix):
- editor find and replace
- filename "wildcards"
- translator tools such as Yacc and Lex
Mastering pattern expressions is a skill that will pay off again and again in your career.

Regular expressions

Frequently the pattern expressions are based on regular expressions. There is a large and beautiful body of mathematics underlying regular expressions. There are efficient and well-understood algorithms for finding regular expression matches in strings.
The pattern expressions in Perl are more complex than regular expressions. Nonetheless, a good understanding of regular expressions helps a lot in understanding Perl pattern expressions.
Regular expressions are based on a set of characters, for example the ASCII characters. There are a small number of rules for creating regular expressions:
1. Every character is a regular expression.
2. Concatenation: If X and Y are regular expressions then so is XY.
3. Alternation: If X and Y are regular expressions then so is X+Y.
4. Repetition: If X is a regular expression then so is X*.
As in most conventional mathematical expressions, parentheses may be used for grouping.
There are a small number of rules for determining which strings match a regular expression:
1. A regular expression consisting of a single character matches that character.
2. If X matches string S0 and Y matches string S1 then XY matches the concatenation of S0 and S1.
3. If X matches string S0 and Y matches string S1 then X+Y matches S0 and also matches S1.
4. If X matches string S0 then X* matches zero or more occurrences of S0 in succession; more generally, X* matches any concatenation of strings that X matches.
Examples:
- The regular expression x matches "x".
- The regular expression xy matches "xy".
- The regular expression x+y matches "x" and "y".
- The regular expression x* matches "", "x", "xx", "xxx", and many (an infinite number) of other strings.
- The regular expression (x+y+z)* matches "x", "y", "z", "xx", "zzyyyzx", and many others.
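
These examples can be checked in Perl, with two differences in notation: Perl writes alternation as X|Y rather than X+Y, and a Perl pattern is satisfied by a match anywhere inside the string. The sketch below therefore adds ^ and $ so that the whole string has to match; the test strings are chosen for illustration.

```perl
# The formal expression (x+y+z)* is written (x|y|z)* in Perl.
# ^ and $ anchor the pattern to the start and end of the string.
@tests = ("", "x", "zzyyyzx", "xyw", "abc");
$matched = "";
foreach $s (@tests) {
    if ($s =~ /^(x|y|z)*$/) {
        print "\"$s\" matches\n";
        $matched .= "[$s]";       # remember which strings matched
    } else {
        print "\"$s\" does not match\n";
    }
}
```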

Perl patterns

Perl has many pattern operators; most are abbreviations for longer regular expressions.
The =~ operator may be used to match a string value on the left with a pattern on the right. For example:
```
	if ($Line =~ /^$/) { print "empty line\n"; }
```
performs a match against the pattern ^$ (with the same meaning as in the Unix programs grep and egrep). The slash characters delimit the pattern; between them, special symbols such as ^ and $ take on their pattern-matching meanings.
If a regular expression is used by itself, pattern matching against the content of the $_ variable takes place. E.g.
```
	if ( /^$/ ) { $emptyLineCnt++; }
```
Substitutions can be made as part of pattern matching. For example:
```
	$Line =~ s/man/person/;
```
changes the first occurrence of man in $Line to person.
Substitutions can be made implicitly to the $_ variable. The following is a complete statement:
```
	s/man/person/;  ## edit the $_ variable
```
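
The effect of the g modifier, and of leaving out the =~ operator, can be seen in this short sketch (the sample strings are illustrative):

```perl
$once = "a man and another man";
$all  = $once;

$once =~ s/man/person/;      # first occurrence only
$all  =~ s/man/person/g;     # every occurrence

print "$once\n";
print "$all\n";

# With no =~, the substitution edits $_.
$_ = "ls \\- list the contents of a directory";
s/\\//g;                     # remove every backslash
print "$_\n";
```
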
The regular expression operators are much as in the egrep program on Unix but with additions.

^
matches at the beginning of a string
$
matches at the end of a string

\b
matches at a word boundary
\B
matches anywhere but at a word boundary

{m,n}
match previous element at least m times, but no more than n times
{m,}
match previous element at least m times
{n}
match previous element exactly n times
*
match previous element 0 or more times
+
match previous element 1 or more times
?
match previous element 0 or 1 times

.
matches any character except linefeed
[abc]
matches any one of the listed characters
[^abc]
matches any character except those listed

(...)
parentheses may enclose subpatterns

\n
matches a linefeed
\t
matches a tab
\r
matches a carriage return
\f
matches a formfeed
\d
matches any digit; same as [0-9]
\D
matches any non-digit; same as [^0-9]
\w
matches any alphanumeric; same as [a-zA-Z0-9_]
\W
matches any non-alphanumeric character
\s
matches any space character; same as [ \t\n\r\f]
\S
matches any non-space character

\1
matches the same text as the first parenthesized subpattern
\2
similarly for 2nd parenthesized subpattern, etc.

\023
matches character with octal code 023 (etc.)
\x7f
matches character with hex code 0X7f (etc.)
\cD
matches the control-D combination (etc.)

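Frequently, a pattern will match more than one substring in a string. In that case, the leftmost position where a match is found is chosen and, beginning at that position, the repetition operators (*, +, ? and {m,n}) match as much text as they can while still letting the rest of the pattern succeed. (Alternatives written with | are tried from left to right, so the first alternative that leads to a match is used, even if a later one would match more text.)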

Example program

The program below illustrates the pattern operators. Note that, after a successful match, $& contains the matched substring, $` contains the portion of the original string preceding the match, and $' contains the portion of the original string following the match. Thus

	$` . $& . $'

is always the same as the original string.

The program:

	#!/public/bin/perl

	@a = (  'a',    'ba',           'a*',
		'aa*',  '(aa)*',        '\b',
		' .\*', '(a|b|c)',      '(a|b|c)* (c|b|a)*'
	);

	$s0 = "abc aaa x*";

	foreach $s (@a) {
		if ($s0 =~ /$s/) {
			print "$s\n\t!$`!$&!$'!\n";
		} else {
			print "$s\n\tno match\n";
		}
	}

Exact output:

	a
		!!a!bc aaa x*!
	ba
		no match
	a*
		!!a!bc aaa x*!
	aa*
		!!a!bc aaa x*!
	(aa)*
		!!!abc aaa x*!
	\b
		!!!abc aaa x*!
	 .\*
		!abc aaa! x*!!
	(a|b|c)
		!!a!bc aaa x*!
	(a|b|c)* (c|b|a)*
		!!abc aaa! x*!
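
Two further points can be checked in isolation: the leftmost match takes priority over a longer match further right, and \1 must repeat the text of the first parenthesized subpattern. The sketch below reuses the string from the program above; the other names and the sample sentence are illustrative.

```perl
$s0 = "abc aaa x*";

# a+ could match "aaa", but the single "a" at the start is further left.
if ($s0 =~ /a+/) {
    $leftmost = $&;
    print "a+ matched \"$leftmost\"\n";
}

# Requiring at least two a's rules out the start of the string;
# the repetition then takes all three a's, not just two.
if ($s0 =~ /a{2,}/) {
    $before  = $`;
    $longest = $&;
    print "a{2,} matched \"$longest\" after \"$before\"\n";
}

# \1 must repeat whatever the first parenthesized subpattern matched.
$text = "this is is a test";
if ($text =~ /\b(\w+) \1\b/) {
    $doubled = $&;
    print "repeated word: \"$doubled\"\n";
}
```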

SUBROUTINES

Subroutines are invoked in a manner similar to the following examples:

	&foo( $arg1, "arg2", 3 );  ## 3 args passed

	&bar( @argArray );  ## each array element is an argument

	&doSomething();  ## no arguments passed

A subroutine is declared in the program using a notation like:
```
	sub foo {  ...  }
```
Arguments are passed in an array named @_. E.g., our foo subroutine could receive its three arguments like this:
```
	sub foo {
	    ($v1, $v2, $v3) = @_;
	    ...
	}
```

A variable number of arguments is easy to process too:

	sub foo {
	    foreach (@_) {
	        ## $_ holds the next argument to process
	        ...
	    }
	    ...
	}

Variables used inside a subroutine are, by default, global. (Our first versions of foo, above, modify global variables named $v1, $v2 and $v3.)
Local variables can be used as follows:
```
	sub foo {
	    local($v1, $v2, $v3) = @_;
	    ...
	}
```
and additional local variables may be created by executing a statement like this:
```
	local($temp);	## create variable $temp
```
The new variables are initialized to the undef value. The variables disappear when the subroutine returns.

The last expression evaluated in a subroutine is returned as a function result. E.g.

	sub maxOfMany {
	    local($result) = pop(@_);  ## remove last argument
	    foreach (@_) {
		if ($_ > $result) {
		    $result = $_;
		}
	    }
	    $result;  ## return the result
	}

And we can invoke that subroutine as in this example:

	$m = &maxOfMany( -10, 3, 15, -6, 4 );
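
The effect of local() can be observed by giving a global variable the same name as the one used inside the subroutine. In the sketch below, maxOfMany is the subroutine from above; countArgs and the initial value of $result are additions made for illustration.

```perl
$result = "global value";      # a global that happens to share a name

sub maxOfMany {
    local($result) = pop(@_);  ## remove last argument
    foreach (@_) {
        if ($_ > $result) {
            $result = $_;
        }
    }
    $result;  ## return the result
}

sub countArgs {
    $count = @_;               # no local(): $count is a global
    $count;
}

$m = &maxOfMany( -10, 3, 15, -6, 4 );
$n = &countArgs( "a", "b", "c" );

print "max = $m, argument count = $n\n";
print "\$result is still \"$result\"\n";   # restored when maxOfMany returned
print "\$count is now $count\n";           # set inside countArgs
```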