CSCI 439 lab 3

CSCI 439: Lab 3 (Fall 2022): handcrafted tokenizers and parsers

CORRECTION:
In the original push, the parser.h file in the parser subdirectory was missing Nterm_def from the TNTtype list. (This was fixed in a push around 2:30pm on the 19th.)
CORRECTION TO THE CORRECTION:
In the correction I added Nterm_def to the end of the list of nonterminals, but left LastNTerm=Nterm_asgn instead of changing it to LastNTerm=Nterm_def. That is fixed in the latest push.
ADDITIONAL CORRECTION:
In parser.c, Nterm_def was also missing from the conv2str function (so it would come out as "unknown" in the output).

There are two parts to lab 3: the first involves completing a handcrafted tokenizer for a language, and the second involves completing a handcrafted parser for the same language. (The two parts are independent of one another: there are no code ties between the scanner and parser for this lab.)

As usual, a csci439 git repository (lab3) has been created and contains starter code for the lab. Starter code for the tokenizer can be found in the tokenizer subdirectory in the repo, while starter code for the parser can be found in the parser subdirectory.

The rules for the language that will be scanned/parsed are as follows:

The token rules:
   VAR    [a-z]+
   EQ     =
   TERM   ;
   PRINT  print
   INT    int
   REAL   real
   INUM   [0-9]+
   RNUM   [0-9]+[.][0-9]+
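
As a rough illustration of how the token rules above can be turned into code, here is a minimal sketch of a hand-written scanner function. It is not the starter code's readToken: the names TokKind, T_VAR, nextToken, and so on are hypothetical, and the real definitions in tokens.h and tokens.c may look quite different. The sketch also assumes that the keywords print, int, and real take priority over VAR when the whole word matches, and that the longest possible match is taken (so "printx" is a VAR and "1.5" is a single RNUM).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds for this sketch only. */
typedef enum { T_VAR, T_EQ, T_TERM, T_PRINT, T_INT, T_REAL,
               T_INUM, T_RNUM, T_EOF, T_ERROR } TokKind;

/* Matches exactly the [a-z] class used by the VAR rule. */
static int isLowerAz(char c) { return c >= 'a' && c <= 'z'; }

/* Scan one token starting at *src, copy its text into lexeme
   (capacity cap, which must be at least 1), advance *src past the
   token, and return its kind. */
TokKind nextToken(const char **src, char *lexeme, size_t cap)
{
    const char *p = *src;
    while (isspace((unsigned char)*p)) p++;      /* skip whitespace */
    const char *start = p;
    TokKind kind;

    if (*p == '\0') {
        kind = T_EOF;
    } else if (*p == '=') {
        p++; kind = T_EQ;
    } else if (*p == ';') {
        p++; kind = T_TERM;
    } else if (isLowerAz(*p)) {
        /* Take the longest run of letters, then check for keywords. */
        while (isLowerAz(*p)) p++;
        size_t n = (size_t)(p - start);
        if (n == 5 && strncmp(start, "print", 5) == 0)     kind = T_PRINT;
        else if (n == 3 && strncmp(start, "int", 3) == 0)  kind = T_INT;
        else if (n == 4 && strncmp(start, "real", 4) == 0) kind = T_REAL;
        else                                               kind = T_VAR;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
        kind = T_INUM;
        /* A '.' followed by at least one digit upgrades this to RNUM. */
        if (p[0] == '.' && isdigit((unsigned char)p[1])) {
            p++;
            while (isdigit((unsigned char)*p)) p++;
            kind = T_RNUM;
        }
    } else {
        p++; kind = T_ERROR;                     /* unrecognized character */
    }

    /* Copy the token text, truncating if the buffer is too small. */
    size_t len = (size_t)(p - start);
    if (len >= cap) len = cap - 1;
    memcpy(lexeme, start, len);
    lexeme[len] = '\0';
    *src = p;
    return kind;
}
```

Calling nextToken repeatedly on the string "int x; x = 1.5;" would yield the kinds T_INT, T_VAR, T_TERM, T_VAR, T_EQ, T_RNUM, T_TERM, and then T_EOF.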

The CFG rules:
   prog --> deflist stmtlist
   deflist --> def deflist | nil
   stmtlist --> stmt stmtlist | nil
   stmt --> VAR EQ asgn TERM
          | PRINT VAR TERM
   def --> INT VAR TERM
          | REAL VAR TERM
   asgn --> VAR EQ asgn
          | INUM
          | RNUM
          | VAR
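
To show how the CFG rules above might drive a stack-based parser, here is a minimal sketch that keeps a stack of grammar symbols, matches terminals against the input, and replaces each nonterminal with one of its right-hand sides. It is only an illustration under assumptions: the names Sym, S_VAR, N_PROG, parse, and so on are hypothetical and are not the TNTtype values or the main routine from the starter parser.c. Note that two asgn alternatives begin with VAR, so this sketch peeks one token further ahead (looking for EQ) to choose between them; other approaches are possible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical symbols: terminals first, then nonterminals. */
typedef enum { S_VAR, S_EQ, S_TERM, S_PRINT, S_INT, S_REAL, S_INUM, S_RNUM,
               S_END,                      /* end-of-input marker */
               N_PROG, N_DEFLIST, N_STMTLIST, N_STMT, N_DEF, N_ASGN } Sym;

#define STACK_MAX 256
static Sym stack[STACK_MAX];
static int top;                            /* number of symbols on the stack */

static bool push(Sym s)
{
    if (top >= STACK_MAX) return false;
    stack[top++] = s;
    return true;
}

/* Push a right-hand side in reverse so its first symbol ends up on top. */
static bool pushRhs(const Sym *rhs, int n)
{
    for (int i = n - 1; i >= 0; i--)
        if (!push(rhs[i])) return false;
    return true;
}

/* Returns true if toks (which must end with S_END) is a valid program. */
bool parse(const Sym *toks)
{
    static const Sym prog[]    = { N_DEFLIST, N_STMTLIST };
    static const Sym defs[]    = { N_DEF, N_DEFLIST };
    static const Sym stmts[]   = { N_STMT, N_STMTLIST };
    static const Sym assign[]  = { S_VAR, S_EQ, N_ASGN, S_TERM };
    static const Sym print[]   = { S_PRINT, S_VAR, S_TERM };
    static const Sym defInt[]  = { S_INT, S_VAR, S_TERM };
    static const Sym defReal[] = { S_REAL, S_VAR, S_TERM };
    static const Sym chain[]   = { S_VAR, S_EQ, N_ASGN };
    size_t pos = 0;

    top = 0;
    push(S_END);
    push(N_PROG);
    while (top > 0) {
        Sym x = stack[--top];              /* symbol to process */
        Sym la = toks[pos];                /* current lookahead token */
        bool ok = true;

        if (x < N_PROG) {                  /* terminal: must match input */
            if (x != la) return false;
            if (x != S_END) pos++;
            continue;
        }
        switch (x) {                       /* nonterminal: pick a rule */
        case N_PROG:
            ok = pushRhs(prog, 2);
            break;
        case N_DEFLIST:                    /* anything else means nil */
            if (la == S_INT || la == S_REAL) ok = pushRhs(defs, 2);
            break;
        case N_STMTLIST:                   /* anything else means nil */
            if (la == S_VAR || la == S_PRINT) ok = pushRhs(stmts, 2);
            break;
        case N_STMT:
            if (la == S_VAR)        ok = pushRhs(assign, 4);
            else if (la == S_PRINT) ok = pushRhs(print, 3);
            else                    ok = false;
            break;
        case N_DEF:
            if (la == S_INT)       ok = pushRhs(defInt, 3);
            else if (la == S_REAL) ok = pushRhs(defReal, 3);
            else                   ok = false;
            break;
        case N_ASGN:                       /* second-token peek on VAR */
            if (la == S_VAR && toks[pos + 1] == S_EQ)
                ok = pushRhs(chain, 3);
            else if (la == S_VAR || la == S_INUM || la == S_RNUM)
                ok = push(la);
            else
                ok = false;
            break;
        default:
            ok = false;
        }
        if (!ok) return false;
    }
    return true;
}
```

For example, the token sequence INT VAR TERM VAR EQ VAR EQ INUM TERM PRINT VAR TERM (followed by the end marker) is accepted, while VAR EQ TERM is rejected because asgn cannot be empty.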

The tokenizer is based on the tokenizer shown here and discussed in this YouTube video. We'll discuss the tokenizing portion of the lab in the lab sessions of Oct. 17-19.

The parser is based on the parser shown here and discussed in this YouTube video. We'll discuss the parsing portion of the lab in the lab sessions of Oct. 26-28.

For both parts of the lab exercise, the program infrastructure is complete; what is left to the student is to finish the code segments for the actual language tokenizing and parsing logic.

In the tokenizer, the section left to complete is within the readToken function in the tokens.c file, and is clearly marked with a comment. The current version of readToken only recognizes the EQ token. The student is meant to complete this so it correctly tokenizes programs written in our target language.
In the parser, the section left to complete is within the switch statement in the main routine in the parser.c file, and is clearly marked with a comment. The current version only handles the initial program state (replacing the starting prog nonterminal on the stack). The student is meant to complete this so it correctly parses programs written in our target language. (The token sequence for a sample test program is already present in the tokens.h file.)

Feel free to add additional functions as desired, and to restructure readToken in tokens.c or the main routine in parser.c if you find them getting clumsy to work with.

Bonus marks:
10% bonus marks if your tokenizer also reports both the content and the token type for each token, e.g.
x(VAR), int(INT), ;(TERM), 1.34(RNUM), 204(INUM)
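
One possible way to produce output in that content(TYPE) style is sketched below. The TokenInfo struct and the formatTokens function are hypothetical names made up for this illustration (they are not part of the starter code), and the type names are simply passed in as strings, using the names from the token rules (so ";" is labelled TERM).

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical pairing of a token's text with its type name. */
typedef struct {
    const char *text;
    const char *type;
} TokenInfo;

/* Write "text(TYPE), text(TYPE), ..." into out (capacity cap).
   Returns 0 on success, or -1 if the buffer is too small. */
int formatTokens(char *out, size_t cap, const TokenInfo *toks, size_t n)
{
    size_t used = 0;

    if (cap == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        /* Separate entries with ", " after the first one. */
        int w = snprintf(out + used, cap - used, "%s%s(%s)",
                         i ? ", " : "", toks[i].text, toks[i].type);
        if (w < 0 || (size_t)w >= cap - used) return -1;
        used += (size_t)w;
    }
    return 0;
}
```

In a real tokenizer the type string would more likely come from a lookup on the token's enum value, and each entry could be printed as soon as its token is read rather than collected into a buffer first.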