The goal of this exercise is to introduce the basics of program translation. We start by defining a simple, assembly-like, target language. This language provides very simple operations and primitive control structures.
After we define the target language, we define a simple programming language -- the input language. The language includes input and output statements, assignment statements, expressions written using infix notation and integer variables with implicit declarations.
The program you write for this assignment will translate programs written in the input language into the target language. Source code for an interpreter for the target language is provided so that you can actually run the programs you have compiled. If you spot a significant problem with the code (i.e., a bug, or a particularly ugly piece of code), send me mail (maccabe@cs.unm.edu) and you'll earn a bonus point provided that you carefully document the problem.
The target language is line oriented, i.e., each instruction is written on a single line and each line has at most one instruction. Blank lines and lines in which the first non-space character is ';' are ignored. Instruction lines consist of an optional instruction label (a label is an alphanumeric string followed by a colon, ':'); followed by an operation; followed by zero, one, or two operands. Additional text after the last operand is ignored; however, this text should be preceded by a semicolon (i.e., it should look like a comment).
Operation | Operand 1 | Operand 2 | Meaning |
---|---|---|---|
input | store label | | input --> store[opnd1] |
output | store label | | store[opnd1] --> output |
copy | store label | store label | store[opnd1] --> store[opnd2] |
set | value | store label | opnd1 --> store[opnd2] |
mult | store label | store label | store[opnd1] * store[opnd2] --> store[opnd2] |
div | store label | store label | store[opnd1] / store[opnd2] --> store[opnd2] |
add | store label | store label | store[opnd1] + store[opnd2] --> store[opnd2] |
sub | store label | store label | store[opnd1] - store[opnd2] --> store[opnd2] |
eq | store label | code label | if( store[opnd1] == 0 ) opnd2 --> PC |
ne | store label | code label | if( store[opnd1] != 0 ) opnd2 --> PC |
lt | store label | code label | if( store[opnd1] < 0 ) opnd2 --> PC |
le | store label | code label | if( store[opnd1] <= 0 ) opnd2 --> PC |
ge | store label | code label | if( store[opnd1] >= 0 ) opnd2 --> PC |
gt | store label | code label | if( store[opnd1] > 0 ) opnd2 --> PC |
goto | code label | | opnd1 --> PC |
nop | | | no operation |
stop | | | terminates execution |
end | | | marks the end of the program (translation time) |
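The table can be read as the specification of a small fetch-and-execute loop. The following is a minimal sketch of such a loop in C++, assuming that each instruction has already been split into an operation and up to two operands and that code labels have been resolved to instruction indices. It is not the interpreter provided with this assignment: the names `Instr` and `run` are hypothetical, and error cases such as division by zero are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One decoded instruction. Operands are kept as text; unused ones are empty.
// (Hypothetical layout, not taken from the provided interpreter.)
struct Instr {
    std::string op, opnd1, opnd2;
};

// Executes `code` and returns every value written by `output`.
// `labels` maps each code label to the index of the instruction it marks;
// `in` supplies the values consumed by `input`, in order.
std::vector<long> run(const std::vector<Instr>& code,
                      const std::map<std::string, std::size_t>& labels,
                      const std::vector<long>& in) {
    std::map<std::string, long> store;  // store label -> integer cell
    std::vector<long> out;
    std::size_t next_in = 0;
    std::size_t pc = 0;                 // index of the next instruction

    while (pc < code.size()) {
        const Instr& i = code[pc++];
        const std::string& op = i.op;
        // Conditional branches test store[opnd1] against zero.
        auto jump_if = [&](bool cond) { if (cond) pc = labels.at(i.opnd2); };

        if (op == "input") {
            assert(next_in < in.size());  // sketch: caller must supply enough input
            store[i.opnd1] = in[next_in++];
        }
        else if (op == "output") out.push_back(store[i.opnd1]);
        else if (op == "copy") store[i.opnd2] = store[i.opnd1];
        else if (op == "set")  store[i.opnd2] = std::stol(i.opnd1);
        else if (op == "mult") store[i.opnd2] = store[i.opnd1] * store[i.opnd2];
        else if (op == "div")  store[i.opnd2] = store[i.opnd1] / store[i.opnd2];
        else if (op == "add")  store[i.opnd2] = store[i.opnd1] + store[i.opnd2];
        else if (op == "sub")  store[i.opnd2] = store[i.opnd1] - store[i.opnd2];
        else if (op == "eq")   jump_if(store[i.opnd1] == 0);
        else if (op == "ne")   jump_if(store[i.opnd1] != 0);
        else if (op == "lt")   jump_if(store[i.opnd1] <  0);
        else if (op == "le")   jump_if(store[i.opnd1] <= 0);
        else if (op == "ge")   jump_if(store[i.opnd1] >= 0);
        else if (op == "gt")   jump_if(store[i.opnd1] >  0);
        else if (op == "goto") pc = labels.at(i.opnd1);
        else if (op == "stop" || op == "end") break;
        // "nop" (and, in this sketch, any unrecognized operation) does nothing.
    }
    return out;
}
```

Note that the two-operand arithmetic instructions overwrite their second operand, so `sub a b` leaves `store[a] - store[b]` in `b`.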
The following sample program reads two numbers and computes the product of the numbers using addition:

    ;;
    ;; a sample program -- read two integers and produce the product
    ;; using addition only
    ;;
            input   a
            input   b
            set     0 prod
            copy    a t1
            ge      t1 l1       ; make sure t1 is non-negative
            set     0 t2
            sub     t2 t1
    l1:     nop
    l2:     nop
            eq      t1 l3       ; quit when we get to zero
            add     b prod
            set     -1 t3       ; decrement t1
            add     t3 t1
            goto    l2
    l3:     nop
            ge      a l4        ; negate the result if a < 0
            set     0 t2
            sub     t2 prod
    l4:     nop
            output  prod
            stop
            end
    -4 5

Programs written in the input language consist of a sequence of statements. The language provides five types of statements: read, write, assignment, if, and while. All variables are of type integer and are implicitly declared by their first use.
The following is a grammar for the input language:

    <goal>            :: <stmt list> eof
    <stmt list>       :: <stmt>
                       | <stmt> <stmt list>
    <stmt>            :: <read stmt> | <write stmt> | <assign stmt> | <if stmt> | <while stmt>
    <read stmt>       :: "read" <operand> ';'
    <write stmt>      :: "write" <expr> ';'
    <assign stmt>     :: <operand> '=' <expr> ';'
    <if stmt>         :: "if" <expr> "then" <stmt list> "end"
                       | "if" <expr> "then" <stmt list> "else" <stmt list> "end"
    <while stmt>      :: "while" <expr> "do" <stmt list> "end"
    <expr>            :: <operand>
                       | <expr> <binary operator> <expr>
                       | <unary operator> <expr>
                       | '(' <expr> ')'
    <operand>         :: <letter or digit>
                       | <letter or digit> <operand>
    <binary operator> :: '*' | '/' | '+' | '-' | "==" | "!=" | "<" | "<=" | ">=" | ">"
    <unary operator>  :: '-'
    <letter or digit> :: 'a' | 'A' | 'b' | 'B' | ... '8' | '9'

The following program computes the product of its two inputs (again, using addition).

    read x;
    read y;
    prod = 0;
    temp = x;
    if temp < 0 then
        temp = -temp;
    end
    while temp != 0 do
        temp = temp - 1;
        prod = prod + y;
    end
    if x < 0 then
        prod = -prod;
    end
    write prod;

The grammar given for the input language has two significant problems. First, it is ambiguous. Second, it cannot be used directly in a recursive descent parser. We will correct these shortcomings in this section.
We start by addressing the ambiguity problem. The ambiguity problem occurs in the recognition of expressions. Consider the expression:
a * b + c
This expression can be recognized by two parse trees:

                                                     <expr>
                             _________________________/ | \______
                            /                           |        \
                           /                            |         \
                     <expr>                             |          \
             _________/ | \__________                   |           \
            /           |            \                  |            \
           /            |             \                 |             \
    <operand>   <binary operator>   <operand>   <binary operator>   <operand>
        |               |               |               |               |
        a               *               b               +               c

                     <expr>
              ________/ | \_______________________
             /          |                         \
            /           |                          \
           /            |                            <expr>
          /             |                  ___________/ | \_________
         /              |                 /             |           \
        /               |                /              |            \
    <operand>   <binary operator>   <operand>   <binary operator>   <operand>
        |               |               |               |               |
        a               *               b               +               c

The following expression grammar resolves the ambiguities in the original grammar by introducing traditional associativity and precedence rules:

    <expr>     :: <expr> <rel op> <term> | <term>
    <term>     :: <term> <add op> <factor> | <factor>
    <factor>   :: <factor> <mult op> <primary> | <primary>
    <primary>  :: <operand> | <unary op> <expr> | '(' <expr> ')'
    <rel op>   :: "==" | "!=" | "<" | "<=" | ">=" | ">"
    <add op>   :: '+' | '-'
    <mult op>  :: '*' | '/'
    <unary op> :: '-'

This grammar yields the following parse tree for the expression given earlier:

                                         <expr>
                                            |
                                         <term>
                         _________________/ | \
                        /                   |  \
                       /                    |   \
                 <term>                     |    \
                    |                       |     \
                <factor>                    |      |
              ____/ | \__                   |      |
             /      |    \                  |       \
            /       |     \                 |        \
    <factor>        |      \                |      <factor>
        |           |       \               |          |
    <primary>       |       <primary>       |      <primary>
        |           |           |           |          |
    <operand>   <mult op>   <operand>   <add op>   <operand>
        |           |           |           |          |
        a           *           b           +          c

The problem with this grammar is that it does not lend itself to predictive parsing. The problem can be seen by considering how you know whether to expand <expr> to <term> or to <expr> <rel op> <term>. When you can see the whole expression, the decision is relatively easy; however, when you have a limited lookahead, it becomes impossible to make this decision in the general case. We avoid this problem by altering the grammar one final time.

    <expr>     :: <term> <term'>
    <term'>    :: epsilon | <rel op> <term> <term'>
    <term>     :: <factor> <factor'>
    <factor'>  :: epsilon | <add op> <factor> <factor'>
    <factor>   :: <primary> <primary'>
    <primary'> :: epsilon | <mult op> <primary> <primary'>
    <primary>  :: <operand> | <unary op> <expr> | '(' <expr> ')'

Compilers usually break the recognition process into two stages: lexical analysis and parsing. Lexical analysis is the process of recognizing simple structures in the code called "tokens," which include punctuation marks and words. Parsing is the process of recognizing the larger structures defined by the grammar.
For this project, your lexical analyzer should recognize the following tokens: semicolon, left parenthesis, right parenthesis, equal (==), not equal (!=), less than (<), less or equal (<=), greater or equal (>=), greater than (>), plus (+), minus (-), asterisk (*), slash (/), assign (=), the reserved words ("while", "do", "if", "then", "else", "end", "read", "write"), operands, and eof.
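As an illustration of the token list above, here is a minimal sketch of a lexical analyzer that turns a source string into a vector of tokens. It is only one possible design, not a required interface. The six relational token names follow the parser fragment later in this handout; every other name (`Tok`, `tokenize`, `TOK_SEMI`, `TOK_OPERAND`, `TOK_ERROR`, and so on) is an assumption made for the example. Following the grammar, any run of letters and digits that is not a reserved word is treated as an operand.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum TokKind {
    TOK_SEMI, TOK_LPAREN, TOK_RPAREN,
    TOK_EQOP, TOK_NEOP, TOK_LTOP, TOK_LEOP, TOK_GEOP, TOK_GTOP,
    TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH, TOK_ASSIGN,
    TOK_WHILE, TOK_DO, TOK_IF, TOK_THEN, TOK_ELSE, TOK_END, TOK_READ, TOK_WRITE,
    TOK_OPERAND, TOK_EOF,
    TOK_ERROR   // hypothetical: marks a character that starts no token
};

struct Tok {
    TokKind kind;
    std::string text;
};

std::vector<Tok> tokenize(const std::string& src) {
    static const std::map<std::string, TokKind> reserved = {
        {"while", TOK_WHILE}, {"do", TOK_DO}, {"if", TOK_IF}, {"then", TOK_THEN},
        {"else", TOK_ELSE}, {"end", TOK_END}, {"read", TOK_READ}, {"write", TOK_WRITE},
    };
    std::vector<Tok> toks;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }

        if (std::isalnum(c)) {  // reserved word or operand
            std::size_t start = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            assert(i > start);
            std::string word = src.substr(start, i - start);
            auto it = reserved.find(word);
            toks.push_back({it != reserved.end() ? it->second : TOK_OPERAND, word});
            continue;
        }

        // Punctuation: one character, or two when followed by '='.
        bool eq_next = i + 1 < src.size() && src[i + 1] == '=';
        TokKind kind = TOK_ERROR;
        std::size_t len = 1;
        switch (c) {
            case ';': kind = TOK_SEMI;   break;
            case '(': kind = TOK_LPAREN; break;
            case ')': kind = TOK_RPAREN; break;
            case '+': kind = TOK_PLUS;   break;
            case '-': kind = TOK_MINUS;  break;
            case '*': kind = TOK_STAR;   break;
            case '/': kind = TOK_SLASH;  break;
            case '=': if (eq_next) { kind = TOK_EQOP; len = 2; } else kind = TOK_ASSIGN; break;
            case '!': if (eq_next) { kind = TOK_NEOP; len = 2; } break;
            case '<': if (eq_next) { kind = TOK_LEOP; len = 2; } else kind = TOK_LTOP; break;
            case '>': if (eq_next) { kind = TOK_GEOP; len = 2; } else kind = TOK_GTOP; break;
            default: break;  // stays TOK_ERROR
        }
        toks.push_back({kind, src.substr(i, len)});
        i += len;
    }
    toks.push_back({TOK_EOF, ""});
    return toks;
}
```

The one subtle point is the one-character lookahead: `=`, `<`, and `>` each begin two different tokens, so the lexer must check the next character before deciding which one it has seen.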
Using the final expression grammar, along with the other parts of the first grammar, parsing is relatively straightforward. The following code fragment illustrates the recognition of an expression:

    void read_term() {
        read_fact();        // read_fact() and read_fact_() are not shown
        read_fact_();
    }

    void read_term_() {
        Token t = lex.peek();
        switch( t.kind() ) {
        case TOK_EQOP:
        case TOK_NEOP:
        case TOK_LTOP:
        case TOK_LEOP:
        case TOK_GEOP:
        case TOK_GTOP:
            lex.get();      // consume the token
            read_term();
            read_term_();
            break;
        default:
            ;               // epsilon
        }
    }

    void read_expr() {
        read_term();
        read_term_();
    }

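For completeness, the following self-contained sketch extends the same pattern to every nonterminal of the final expression grammar. It only recognizes expressions; it generates no code. The `Lexer` class here is a stand-in that replays a prepared token sequence and returns token kinds directly, and the names `ExprParser`, `read_prim`, `read_prim_`, and `recognizes`, along with the non-relational token names, are assumptions made for this example rather than part of any provided code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

enum TokKind {
    TOK_OPERAND, TOK_LPAREN, TOK_RPAREN,
    TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH,
    TOK_EQOP, TOK_NEOP, TOK_LTOP, TOK_LEOP, TOK_GEOP, TOK_GTOP,
    TOK_EOF
};

// Stand-in lexer: hands out a prepared token sequence, then TOK_EOF forever.
class Lexer {
public:
    explicit Lexer(std::vector<TokKind> toks) : toks_(std::move(toks)) {}
    TokKind peek() const {
        assert(pos_ <= toks_.size());
        return pos_ < toks_.size() ? toks_[pos_] : TOK_EOF;
    }
    TokKind get() {
        TokKind k = peek();
        if (pos_ < toks_.size()) ++pos_;
        return k;
    }
private:
    std::vector<TokKind> toks_;
    std::size_t pos_ = 0;
};

// One function per nonterminal; the primed nonterminals carry a trailing '_'.
struct ExprParser {
    Lexer& lex;

    void read_expr() { read_term(); read_term_(); }

    void read_term_() {
        switch (lex.peek()) {
            case TOK_EQOP: case TOK_NEOP: case TOK_LTOP:
            case TOK_LEOP: case TOK_GEOP: case TOK_GTOP:
                lex.get(); read_term(); read_term_(); break;
            default: break;  // epsilon
        }
    }

    void read_term() { read_fact(); read_fact_(); }

    void read_fact_() {
        switch (lex.peek()) {
            case TOK_PLUS: case TOK_MINUS:
                lex.get(); read_fact(); read_fact_(); break;
            default: break;  // epsilon
        }
    }

    void read_fact() { read_prim(); read_prim_(); }

    void read_prim_() {
        switch (lex.peek()) {
            case TOK_STAR: case TOK_SLASH:
                lex.get(); read_prim(); read_prim_(); break;
            default: break;  // epsilon
        }
    }

    void read_prim() {
        switch (lex.get()) {
            case TOK_OPERAND: break;
            case TOK_MINUS:   read_expr(); break;  // <unary op> <expr>
            case TOK_LPAREN:
                read_expr();
                if (lex.get() != TOK_RPAREN) throw std::runtime_error("expected ')'");
                break;
            default: throw std::runtime_error("expected operand, '-' or '('");
        }
    }
};

// True if the whole token sequence forms exactly one <expr>.
bool recognizes(std::vector<TokKind> toks) {
    Lexer lex(std::move(toks));
    ExprParser p{lex};
    try { p.read_expr(); } catch (const std::runtime_error&) { return false; }
    return lex.peek() == TOK_EOF;
}
```

Each function decides what to do by looking at a single token, which is what makes the rewritten grammar suitable for predictive parsing. A call such as `recognizes({TOK_OPERAND, TOK_STAR, TOK_OPERAND, TOK_PLUS, TOK_OPERAND})` corresponds to the earlier example `a * b + c`.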