CSCI 330 Lab 4: scanning and parsing

The primary focus of lab4 is to practice the use of regular expressions and context free grammars in the tokenizing (scanning) and parsing of programming languages.

The lab can be obtained and submitted through our usual git approach (repo name 'lab4' this time), and the starting repo contains a number of files and subdirectories:

Overview

For lab 4 you'll be editing and submitting lex and yacc files used to build a simple parser, lab4x, for a language to be described below. The lex/yacc files are compiled through a makefile using lex and yacc on the cubs/pups/tuxworld (details below).

The starting lab4.lex file (and hence a current build of lab4x) recognizes most, but not all, of the needed tokens, and the starting lab4.yacc file (and hence a current build of lab4x) recognizes a subset of the needed grammar.

You'll be expanding the set of token regular expressions in lab4.lex and the grammar rules in lab4.yacc so that a new build of lab4x can parse the full language.

For details on lex and yacc themselves, see the notes for the week of Mar. 3-7 in the lecture resources.

Objectives

Lab 4 is meant to give you practice with regular expressions, context free grammars, and their use in syntax checking source code using tools like lex/flex and yacc/bison.

You will be using lex regular expressions to describe the tokens in a language, and yacc CFG rules to describe the syntax of the language.

A makefile is provided to allow you to compile your lex and yacc files into a tool that can be used to check for syntax errors in user source code, by feeding the file content into the lab4x executable as standard input (./lab4x < someprog.yar), e.g.
make
./lab4x < valid/basic.yar     # tries parsing the basic.yar code example
The lab4x executable processes the specified file content, and displays (limited) error messages if it detects any syntax errors, and at the end it displays a final "Parsing complete." The error messages, if any, are typically accompanied by a row and column number indicating where in the source file the parser recognized it was in an error state, and an indication of what statement type it last entered and where (to the best of its knowledge).


The language to be parsed: yarrgh

The starting lab4.lex and lab4.yacc files have been provided for you, covering a subset of the full yarrgh tokenizing/parsing rules.

Informally, the language follows pretty typical language formats: The full official rules for the language are listed and discussed in the table below, with code samples provided later on this page.

yarrgh token types, comments, whitespace, file extensions

By convention, yarrgh files generally use the .yar extension.

Comments begin with a # and continue to the end of the line (newline: not Windows compatible), and must be stripped out by the tokenizer

Extra whitespace between tokens is ignored.

The supported token types are:
  • keywords: program open close def integer real set read write repeat until if else
  • (infix) math operators:   +   -   *   /   ^   //   %
  • (infix) comparison operators:   =   !=   <   <=   >   >=
  • parentheses:   (   )
  • the semicolon:   ;
  • integer literals: one or more digits
  • real literals: one or more digits then optionally a decimal point and another one or more digits (e.g. valid numeric literals include things like 99, 00123.4500, or 1.234, but you cannot have a number that begins or ends with a decimal point)
  • identifiers: start with an alphabetic character (upper or lowercase) and can be followed by any number of by any number of digits and/or upper or lowercase alphabetic characters (e.g. aAB933eAx2 would be a valid identifier)

yarrgh grammar rules For lab4, a yarrgh program has the following form:

  • a program: consists of the keyword program followed by a block

  • a block: consists of the keyword open followed by one or more statements followed by the keyword close

  • a statement: can be a variable declaration, an input statement, an output statement, an assignment statement, or a repeat loop

  • variable declarations : consist of the keyword def followed by an identifier then a type specifier (keywords integer or real) then a semicolon

  • input statements : consist of the keyword read followed by an identifier followed by a semicolon

  • output statements : consist of the keyword write followed by an expression followed by a semicolon

  • assignment statements : consist of the keyword set followed by an identifier followed by an expression followed by a semicolon

  • repeat loops : has the form:
    repeat block until condition semicolon
    where the condition has the form

  • conditions: have the form
    ( expression comparisonoperator expression )

  • if/else statements: has the form:
    if condition block else block

  • comparison operators: are any of the following

  • math operators: are any of the following
    + - * / // % ^
    where // is integer division, % is modulo, and ^ is exponent/power
    math operators can be parenthesized, e.g. (x + y) * (10 + a), and follow the usual precedence and associativity rules

  • expressions: use the infix math operators (above), applied to variables, literals, and (sub)expressions


Declaration, type, and scope rules

Note that we won't be carrying out type and declaration checking in lab4, but they will be a part of lab5.
  • Variables must be declared before they can be used.
  • All variables are locally scoped, and accessible anywhere after their point of declaration within the current block.
  • Integer and real values can be compared, and integers can freely be assigned to reals, but warnings are generated when a real is assigned to an integer (This will include passing a real to an integer parameter once we introduce functions in lab5.)

Code sample

The first sample program below is meant to show a good mix of language features, while the second program is a stripped down example that will work with the starter lab4.lex and lab4.yacc files.

# program 1: your finished lab4x should accept this, assuming Dave didn't screw up the example

program
open

def x integer ;
def foo real;

set x 125;
            read foo;
write 5 * (x + foo);

def sum real;
set sum 0.0;
repeat open
   def y real;
   read y;
   if (y >= x) open
      write y;
   close
   else open
      write x;
   close
   set sum (sum + x + y);
close until (x < foo);

close


# program 2, works with lab4 starter code, but is missing the program token # that would be needed for valid final yarrgh code # my grate code open def x integer ; set x 12 ; close

Requirements (fixing what's missing in the starter .lex and .yacc files)

Fix lab4.lex and lab4.yacc to properly handle the language elements they don't currently support.
The starter lab4.lex and .yacc only handle: You will need to add support for all the other language token types and features, and expand/fix some of the features.

Before submitting, ensure the lab4.lex and lab4.yacc files compile cleanly (no error/warning messages) when using "make".

Error messages

With respect to error/debugging messages (yyerror and other printfs), do your best to have your error messages (when tokenizing/parsing fails) describe the location/context of the failure. This can be challenging, but one plausible approach is to track the last successfully-parsed item and display that when parsing fails - letting the programmer know the point the parser successfully reached.

Context sensitive checking is not required

Eventually a parser would need to account for things like checking that a variable is declared before use, is in-scope at the point of use, that all the operands to an operator have compatible types, etc, but that's beyond the scope of the parsing we'll do in lab4 (parts of it will be in lab5).

For lab4 we just want to make sure the source code follows the token and CFG rules above.

(Note that correct precedence and associativity doesn't require context sensitive checking for this language, and should be handled correctly.)


Advice:

As with most forms of programming, be sure to add/test tokens and rules in small increments - debugging lex/yacc can be very challenging if you try to add too much at once.

lex and yacc reminders

Each token type specified in the .lex file needs a corresponding entry in the .yacc file on the tokens line:
  %token<long> INTLIT INTKEY ...etc...
and each non-terminal type needs an entry on the types line:
  %type<long> program statements ...etc...

To help with the effectiveness of your error messages, col (in the .lex file) needs to be consistently adjusted by the number of characters in the token that was read.


Testing

To compile/test your lab:

The repo starts with two subdirectories, valid and broken, each containing some test files for use with the starter lab4x.

As you work through your lex and yacc code, I strongly recommend augmenting these with more test cases tailored specifically to whichever feature you're currently working on.

lab4d and y.output

The makefile builds a second executable, lab4d, that is compiled with the -g options if you want to run it through a debugger. That compilation also uses yacc's -v option to produce a file called y.output with information on the state machine constructed by yacc (which can be helpful in debugging some yacc errors).