CSCI 330 Lab 5: symbol tables and context-sensitive checking

For lab5, we'll expand our lab4 solutions to include elements of context-sensitive checking using a symbol table of our own design.

The lab can be obtained and submitted through our usual git approach (repo name 'lab5' this time), and the starting repo contains the files for a sample solution to lab4. You're welcome to replace these with your own lab4 solution code or to modify them as you see fit.

Known issues in lab4 sample solution
This is here as a just-in-case: if anyone spots a bug in the lab4 sample solution, let me know and I'll share it here along with a suggested fix.

Overview

In this lab we're expanding our parser for yarrgh to do some basic context-sensitive checking:
  1. making sure variables are declared before use and are accessible in the current scope,
  2. producing warnings if the code tries to assign a real number to an integer variable.


Recommended Solution Approach

We have several key sets of information to keep track of as the program is parsed:
(i) which scopes are currently active?
(ii) what data type (integer or real) does each expression and subexpression evaluate to?
(iii) what variable names/types are declared/accessible in each scope?
We'll consider each of these individually and provide a workable approach for each (other approaches are certainly viable).


(i) which scopes are currently active?

We'd like to keep track of which scopes are active, adding new scopes as we enter them and removing them as we leave.

We can give each scope a unique integer identifier (e.g. starting with 1 as the global scope and incrementing each time a new scope is first entered). Perhaps have a global int variable to keep track of either the most recent or the next id value to be assigned.

A stack of ints is then the ideal mechanism for keeping track of the scopes that are currently open: each time you open a new scope you push its id onto the stack, and when the scope is finally closed you pop the top scope off the stack.

Thus in the C code declarations at the top of either your .lex or .yacc file you can declare a simple stack of ints and functions for pop/push/top/isempty and maybe one to initialize the stack, along with your global variable for the next scope id value to use, e.g.
int nextID;        /* id to be assigned to next scope */
int scopes[128];   /* array of scopes */
int numScopes;     /* how many scopes are currently on the stack */
void initScopes(); /* init nextID, numScopes, and scopes array */
int pushScope();   /* push and increment nextID, return 0 on fail or numScopes otherwise */
int popScope();    /* pop and return top scope id, 0 if fails */
int topScope();    /* return top scope id, 0 if fails */

We can call initScopes from the main routine. In the yacc rule for blocks, where block -> open statements close, we can insert a block of C to do the push operation after the open, and a block of C to do the pop operation after the close (along with error checking).
block: OPENKEY { if (pushScope() == 0) { /* deal with error: push failed, e.g. maybe stack is full */ } }
       statements
       CLOSEKEY { if (popScope() == 0) { /* deal with error: pop failed somehow */ } }
       SEMI ;


(ii) what data type does each expression and subexpression evaluate to?

Here things get a little trickier: we need to keep track of extra information associated with each token and nonterminal as parsing proceeds, things such as the actual variable name of an identifier or the actual value of an integer/real literal, as well as the type of an expression or subexpression.

Lex/yacc support specifying the kind of information that can be stored for tokens/nonterminals through a %union line in the .yacc file, along with a tag in the token/type lines, e.g.
%union { long info; }
%token<info> OPENKEY CLOSEKEY etc
%type<info> program statements etc
Now, rather than just a single long, we'll allow storage of an entire struct for each token and nonterminal, e.g.
%union { struct itemInfo { char content[128]; int dtype; } info; }
%token<info> OPENKEY CLOSEKEY etc
%type<info> program statements etc
This specifies that each token and nonterminal gets an info.content field (a char array to hold the text of an identifier or literal) and an info.dtype field (e.g. using 0 for unknown, 1 for integer, 2 for real).

Setting fields in the lex file

We can get at these fields for a specific token inside the lex file through a variable named yylval, e.g. yylval.info.content or yylval.info.dtype.

If you recall, in the lex file we also had access to a variable named yytext, containing a copy of the text pattern that was just read.

If we wanted to copy yytext into yylval.info.content for an identifier we could thus use
{Alpha}{Alnum}* { strncpy(yylval.info.content, yytext, 127); yylval.info.content[127] = '\0'; return(IDENT); }
(the explicit '\0' guarantees the copied string is null-terminated even if yytext is too long to fit).

Most type decisions will be made in the .yacc code, but in theory we could associate types with specific tokens too, e.g. if we wanted to associate type integer with the keyword integer
"integer" { yylval.info.dtype = 1; return(INTKEY); }

Setting/accessing fields in the yacc file

In the yacc file we can also either set or lookup the info associated with any token or nonterminal in a rule, but the syntax is a little unusual.

We use the syntax $<info.fieldname>POS to access a specific item's information, where fieldname is dtype or content (in our example) and POS is either the $ symbol (again) or an integer 1,2,3, etc.

$$ refers to that field in the nonterminal on the left hand side of the rule, while $<info.fieldname>1 refers to the first token or nonterminal on the right hand side, 2 refers to the second, 3 refers to the third, etc.

Consider the rule as an example:
something: AAA foo ICK ;
$<info.content>$ refers to something's text content, $<info.dtype>$ to its type
$<info.content>1 refers to AAA's text content, $<info.dtype>1 to its type
$<info.content>2 refers to foo's text content, $<info.dtype>2 to its type
etc

As parsing of the right hand side of the rule progresses, we can trust that the preceding elements have been fully analyzed and their content/dtype fields correctly set, so we can use those to do our current context sensitive checking and to set the content and dtype values for the thing on the left hand side of the rule.

Considering the rule above, we could do something like
something: AAA foo ICK {
      /* look at AAA and ICK's dtypes and, based on them, set something's dtype */
   };
At the bottom layers of the parsing rules the actions are relatively simple (here again assuming dtype 1 means integer and dtype 2 means real).
value: INTLIT { $<info.dtype>$ = 1; }
value: REALLIT { $<info.dtype>$ = 2; }
Similarly for type specifiers, if we had typespec -> INTKEY or REALKEY then we could record the correct type for the typespec nonterminal using
typespec: INTKEY { $<info.dtype>$ = 1; }
typespec: REALKEY { $<info.dtype>$ = 2; }
For an expression, we can look at the types of its subexpressions and base our decision on those, e.g.
addexpr: addexpr addop multexpr {
      /* if the RHS addexpr and multexpr have the same type then use that for the LHS addexpr
         otherwise use real as the type */
       if ($<info.dtype>1 == $<info.dtype>3)
          $<info.dtype>$ = $<info.dtype>1;
       else $<info.dtype>$ = 2;
   };
(The above assumes that every subexpression would have dtype 1 or 2: the logic would be a little more complex if there are other possible types to worry about.)

Logic like the above can be used for everything except a rule like
value: IDENT {*****};
Here, the ***** would need to lookup the identifier in our symbol table to find out if there actually is a variable with that name in scope and to find out what its type is. We'll leave that discussion to part (iii) below.

One last complication with the positions in $<info.whatever>N: if you have interleaved C chunks in your rule those count as 'positions', e.g.
    thing: aaa bbb { C code stuff } ccc ;
ccc is in position 4 NOT position 3.


(iii) what variable names/types are declared/accessible in each scope?

This is where we want to keep track of all the declared variables, their scopes, and their types, storing them in a way that makes it easy(ish) to look up the information when we need it.

This is where the symbol table concept comes in: a table tracking everything we want to know about everything that has been declared, with insert and lookup functionality.

Suppose that for every declared variable we want to remember its name, declared type, and the scope in which it was declared.

We could define a struct representing that information (at the top of either your .lex or your .yacc), then have an array of these structs representing the 'table' of all the variables declared so far, with functions to insert new entries into the table and to search the table for existing entries.

We would call insert in the variable declaration rule, and would use the lookup when an identifier is used as a value, i.e. in the value: IDENT ; rule.

In terms of the necessary declarations up top, one possible setup would be:
 /* storage for a single symbol table entry */
struct SymEntry {
   char name[128]; /* symbol name */
   int scopeId;    /* declaration scope */
   int typeId;     /* 1 for integer, 2 for real */
};
struct SymEntry SymbolTbl[128];  /* just an array with space for 128 (or whatever) entries */
int numSyms;       /* number of symbols currently in table */

void initSyms();   /* initializes numSyms for SymbolTbl */
int insert(char nm[], int sc, int typ); /* insert new entry, return 1 if ok, 0 otherwise */
struct SymEntry* lookup(char nm[]); /* return ptr to entry of innermost name match in active scopes, null if not found */
The initSyms would be called in main to set up the table (maybe just sets numSyms to 0).

The insert is pretty straightforward: if the table is already full then return an error (0), otherwise fill in the next SymbolTbl array position with the passed data (remember to use strncpy for the name), increment numSyms and return true (1).

The lookup is a bit trickier: we want to search from the innermost current scope to the outermost, returning a pointer to the first variable found with a matching name in an active scope. This means searching backwards through the scopes stack, e.g.
for s = numScopes-1 downto 0:
    currentScope = scopes[s]
    for e = 0 to numSyms-1:
        if SymbolTbl[e]'s scopeId equals currentScope and its name matches the parameter nm
           return the address of SymbolTbl[e]
return null if we never found a match

Using lookup and insert in our grammar rules

Within our grammar rules we would make appropriate use of insert and lookup, e.g.
vardecl: DEFKEY IDENT typespec SEMI {
     /* $<info.content>2 has the name of the variable being declared
      * $<info.dtype>3 has the 1 or 2 for the desired type
      * topScope() tells us what scope we're currently in
      *
      * we can call lookup(the variable name) to see if there's already a variable with that name,
      *    (lookup returns a pointer to the variable's entry),
      *    and it's an error if that variable's declared scope is the exact same as our current scope
      * as long as that isn't the case, we can simply call insert,
      *    passing the name, scope, and data type
      * (and checking the return type to make sure insert didn't fail)
      */
   };

value: IDENT {
     /* call lookup on the variable, if it returns null then it's an error
      *    (undeclared variable), otherwise look up the ->typeId to find out
      *    what data type to assign to value ($<info.dtype>$)
      */
   };

assign: SETKEY IDENT expr SEMI {
     /* call lookup on the identifier, if it returns null then it's an error
      *    (undeclared variable), otherwise get the variable's type from the ->typeId
      *    in the returned pointer and compare that with the info.dtype for expr
      *    just in case the variable is an int and the expr is a real
      *    (in which case print a warning but keep going)
      */
   };
Aside from the usual foray into debugging and testing, that should take care of everything we need to get type checking and variable declaration/scope checking working for a basic language!