The lab2 repo is now available to fork/clone, following
the same processes as lab1, e.g.
ssh csci fork csci439/lab2 csci439/$USER/lab2 |
In this lab we'll start building and using a symbol table, still with lex and yacc, to check context sensitive information as a program is being parsed. Our checks will essentially address the bonus material from lab 1: identifying variables that are used before being declared or are declared multiple times.
A sample language, .lex, and .yacc file are provided with the lab, where the .lex and .yacc contain partial implementations for a suitable symbol table. The language provided currently supports a sequence of variable declarations (variables declared of type int or float), followed by a sequence of assignment statements (variable = value;). A valid program might thus look like
x int ; foo float ; pi float ; something int ; something = 3 ; pi = 3.14 ; foo = pi ; x = something ;
The core objective of the lab exercise is to complete the implementation so it correctly flags errors involving use-before-declaration, errors involving multiple-declaration, and type-mismatch errors (while still accepting valid programs of course).
The secondary objective is to add support a sequence of one or more blocks, each of the form { declarations assignments } with scope recognition: so that variables are accessible within the block in which they are declared but are not accessible outside that block. This means a variable can be declared multiple times, as long as they are in different blocks.
What's provided so far:
The yacc file:
Part 1: storing data about variables, tokens and nonterminals
Ultimately we'll build a table to keep track of information we know about variables encountered within the program, e.g. their names and types. To do so, we might choose to declare a struct type to hold that information, e.g. within the .lex C declarations we might specify
struct symentry { int datatype; char varname[128]; };The lex and yacc files can also set and check extra information associated with each individual token and nonterminal as it tokenizes/parses, possibly using a similar struct defined within the .yacc file. This appears within the %union portion of the yacc declarations, as we might actually have different structs to associate different kinds of information with different types of token/nonterminal. For lab2 we'll just use one struct type and associate it with everything:
%union { struct iteminf { char name[128]; int datatype; } info; } %token<struct iteminf> INT FLOAT ASSIGN SEMI VARNAME FLTLIT INTLIT %type<struct iteminf> program declarations declare datatype actions action value
Part 2: building the symbol table and insert/lookup functions
As we see variable declarations, we'll want to store their name and type. One simple way to do this is to create an array of our symentry structs and keep track of how many items we've stored in it so far. This would also go in the C declarations in our .lex, e.g.:
struct symentry SymTable[128]; int numEntries = 0;Our .yacc will be a lot simpler if we provide functions to insert new entries into the table and to lookup existing entries. These can be prototyped in the C declarations in the .lex and implemented in the bottom C section of the .lex, e.g. in the declarations:
void insert(char name[], int dtype); int lookup(char name[]);and in the implementation:
void insert(char name[], int dtype) { /* suggested actions: * if numEntries already too big then error, symbol table is full * otherwise * numEntries gives us the index of the next spot to fill in * within the symbol table * set the varname and datatype fields * (remember to use strcpy to copy name into the varname field) * and increment numEntries */ } int lookup(char name[]) { /* suggested actions: * for each position in the symbol table < numEntries * compare its varname field to name * (remember to use strcmp for comparisons) * returning the current position if a match is found * if not match was found then return -1 */ return -1; }Note that I've assumed we'll use a simple integer value to identify the different possible data types we're tracking, say -1 for unknown, 0 for int, 1 for float.
Part 3: storing available data while tokenizing
When lex encounters a variable name while tokenizing, it can use the available yytext variable to grab the actual text of the variable name and then store it in that token's struct, through another available variable: yylval.
{Alpha}+ { col+=strlen(yytext); strcpy(yylval.info.name, yytext); return(VARNAME); }Note the three layers to access the name field of the info struct: yylval.info.name. The yylval data for that token will now be available when we use it in the yacc rules (yytext will not).
Similarly, when we encounter an integer literal or float literal while tokenizing we can set its datatype field:
{Digit}+[.]{Digit}+ { col+=strlen(yytext); yylval.info.datatype = 1; return(FLTLIT); } {Digit}+ { col+=strlen(yytext); yylval.info.datatype = 0; return(INTLIT); }
Part 4: accessing token/nonterminal data within grammar rules
Within the .yacc file, when a grammar rule is being applied we have access to the info struct fields for each element of the rule, using the notation $<info.fieldname>i where i is either $ (for the nonterminal on the left hand side of the rule) or a digit representing the position of the token/non-terminal on the right hand side.
Consider the variable declaration grammar rule below:
declare: VARNAME datatype SEMI { /* needs to check that the variable hasn't been previously declared, * and (if ok) record the variable's declaration and type, * call yyerror with suitable message if anything is incorrect */ /* calling lookup to get the variable's position from the symbol table, * which will give back -1 if not found (which is what we want here) * or the variable's position in the symbol table array (if previously declared) */ int dtype = lookup($<info.name>1); /* either generate an error message or carry out the insert */ if (dtype != -1) { char errmsg[128]; strcpy(errmsg, "Error, redeclaring variable: "); strcat(errmsg, $<info.name>1); yyerror(errmsg); } else { insert($<info.name>1, $<info.datatype>2); } } ;Similarly, within the value rules for integer literals and float literals we can set the datatype for the value nonterminal:
value: INTLIT { $<info.datatype>$ = 0; } ; value: FLTLIT { $<info.datatype>$ = 1; } ;
To do so:
(i) in lab2.lex you'll need to complete the insert and lookup functions,
(ii) in lab2.yacc you'll need to complete the C components of the
"value: VARNAME" and "action" grammar rules.
The secondary objective of the lab exercise (worth 30% of the lab mark) is to add blocks and basic scope handling, so that variables are only valid in the block in which they are declared, but variables of the same name can be declared as long as they are in different blocks. Valid programs might thus look like:
{ abc float ; i int ; j int ; abc = 1.1 ; j = 10 ; i = j ; } { x int ; i int ; x = 6 ; i = x ; } etc
To do so you will need to
One possible approach would be to give each block a unique integer id using a seperate global variable (just increments at the start of each new scope) and including that with the variable information in the symbol table.