CSCI 439 lab 2

CSCI 439: Lab 2 (Fall 2022)

The lab2 repo is now available to fork/clone, following the same processes as lab1, e.g.
ssh csci fork csci439/lab2 csci439/$USER/lab2

In this lab we'll start building and using a symbol table, still with lex and yacc, to check context sensitive information as a program is being parsed. Our checks will essentially address the bonus material from lab 1: identifying variables that are used before being declared or are declared multiple times.

A sample language, .lex, and .yacc file are provided with the lab, where the .lex and .yacc contain partial implementations for a suitable symbol table. The language provided currently supports a sequence of variable declarations (variables declared of type int or float), followed by a sequence of assignment statements (variable = value;). A valid program might thus look like

x int ;
foo float ;
pi float ;
something int ;
something = 3 ;
pi = 3.14 ;
foo = pi ;
x = something ;

The core objective of the lab exercise is to complete the implementation so it correctly flags errors involving use-before-declaration, errors involving multiple-declaration, and type-mismatch errors (while still accepting valid programs of course).

The secondary objective is to add support a sequence of one or more blocks, each of the form { declarations assignments } with scope recognition: so that variables are accessible within the block in which they are declared but are not accessible outside that block. This means a variable can be declared multiple times, as long as they are in different blocks.

What's provided so far:

The lex file:

Tokenization is complete for the basic version, but does not yet support the bracket tokens for blocks.
The symbol table (and underlying struct) have been declared, along with insert and lookup functions.
The insert and lookup functions have been commented but the bodies have not been implemented.

The yacc file:

The grammar is complete for the basic version, but does not yet support the blocks version (just supports a series of declarations followed by a series of assignments)
The variable declaration rule has been completed so that it correctly calls lookup to check for redeclaration of variables, and calls insert if the variable has not yet been declared.
The grammar rules for values that are integer/float literals have been correctly completed to set the datatype of the value nonterminal.
The grammar rule for values that are variables has been commented but not completed.
The action grammar rule (i.e. assignment statements) has been commented but not completed.

Building and using a symbol table with lex/yacc

Part 1: storing data about variables, tokens and nonterminals

Ultimately we'll build a table to keep track of information we know about variables encountered within the program, e.g. their names and types. To do so, we might choose to declare a struct type to hold that information, e.g. within the .lex C declarations we might specify

 struct symentry {
    int datatype;
    char varname[128];
  };

%union { struct iteminf { char name[128]; int datatype; } info; }

%token<struct iteminf> INT FLOAT ASSIGN SEMI VARNAME FLTLIT INTLIT
%type<struct iteminf> program declarations declare datatype actions action value

Part 2: building the symbol table and insert/lookup functions

As we see variable declarations, we'll want to store their name and type. One simple way to do this is to create an array of our symentry structs and keep track of how many items we've stored in it so far. This would also go in the C declarations in our .lex, e.g.:

 struct symentry SymTable[128];
 int numEntries = 0;

 void insert(char name[], int dtype);
 int lookup(char name[]);

void insert(char name[], int dtype)
{
   /* suggested actions:
    *    if numEntries already too big then error, symbol table is full
    *    otherwise
    *       numEntries gives us the index of the next spot to fill in
    *          within the symbol table
    *       set the varname and datatype fields
    *          (remember to use strcpy to copy name into the varname field)
    *       and increment numEntries
    */
}

int lookup(char name[])
{
   /* suggested actions:
    *    for each position in the symbol table < numEntries
    *        compare its varname field to name
    *        (remember to use strcmp for comparisons)
    *        returning the current position if a match is found
    *    if not match was found then return -1
    */
   return -1;
}

Part 3: storing available data while tokenizing

When lex encounters a variable name while tokenizing, it can use the available yytext variable to grab the actual text of the variable name and then store it in that token's struct, through another available variable: yylval.

{Alpha}+   { col+=strlen(yytext); strcpy(yylval.info.name, yytext); return(VARNAME); }

Similarly, when we encounter an integer literal or float literal while tokenizing we can set its datatype field:

{Digit}+[.]{Digit}+  { col+=strlen(yytext); yylval.info.datatype = 1; return(FLTLIT); }
{Digit}+             { col+=strlen(yytext); yylval.info.datatype = 0; return(INTLIT); }

Part 4: accessing token/nonterminal data within grammar rules

Within the .yacc file, when a grammar rule is being applied we have access to the info struct fields for each element of the rule, using the notation $<info.fieldname>i where i is either $ (for the nonterminal on the left hand side of the rule) or a digit representing the position of the token/non-terminal on the right hand side.

Consider the variable declaration grammar rule below:

declare: VARNAME datatype SEMI
   {
        /* needs to check that the variable hasn't been previously declared,
         * and (if ok) record the variable's declaration and type,
         * call yyerror with suitable message if anything is incorrect
         */

        /* calling lookup to get the variable's position from the symbol table,
         *    which will give back -1 if not found (which is what we want here)
         *    or the variable's position in the symbol table array (if previously declared)
         */
        int dtype = lookup($<info.name>1);

        /* either generate an error message or carry out the insert */
        if (dtype != -1) {
           char errmsg[128];
           strcpy(errmsg, "Error, redeclaring variable: ");
           strcat(errmsg, $<info.name>1);
           yyerror(errmsg);
        } else {
           insert($<info.name>1, $<info.datatype>2);
        }
   }
   ;

value: INTLIT
   {
       $<info.datatype>$ = 0;
   }
   ;

value: FLTLIT
   {
       $<info.datatype>$ = 1;
   }
   ;

The lab exercise

As stated earlier, the primary part of the lab exercise (worth 70% of the lab mark) is to complete the lex/yacc code so it correctly identifies when variables are used without having been declared, when variables are declared multiple times, and when type mismatches occur (ints and floats are to be treated as incompatible types).

To do so:
(i) in lab2.lex you'll need to complete the insert and lookup functions,
(ii) in lab2.yacc you'll need to complete the C components of the "value: VARNAME" and "action" grammar rules.

The secondary objective of the lab exercise (worth 30% of the lab mark) is to add blocks and basic scope handling, so that variables are only valid in the block in which they are declared, but variables of the same name can be declared as long as they are in different blocks. Valid programs might thus look like:

{
   abc float ;
   i int ;
   j int ;
   abc = 1.1 ;
   j = 10 ;
   i = j ;
}
{
   x int ;
   i int ;
   x =  6 ;
   i = x ;
}
etc

To do so you will need to

update the token set to include { and }
update the grammar rules to expect a sequence of blocks (each composed of declarations followed by actions)
update the symbol table and insert/lookup functions to include/check scope information
update the yacc calls to insert/lookup

One possible approach would be to give each block a unique integer id using a seperate global variable (just increments at the start of each new scope) and including that with the variable information in the symbol table.