The goal of this exercise is to introduce the basics of program translation. We start by defining a simple, assembly-like, target language. This language provides very simple operations and primitive control structures.
After we define the target language, we define a simple programming language -- the input language. The language includes input and output statements, assignment statements, expressions written using infix notation and integer variables with implicit declarations.
The program you write for this assignment will translate programs written in the input language into the target language. Source code for an interpreter for the target language is provided so that you can actually run the programs you have compiled. If you spot a significant problem with the code (i.e., a bug, or a particularly ugly piece of code), send me mail (maccabe@cs.unm.edu) and you'll earn a bonus point provided that you carefully document the problem.
The target language is line oriented, i.e., each instruction is written on a single line and each line has at most one instruction. Blank lines and lines in which the first non-space character is ';' are ignored. Instruction lines consist of an optional instruction label (an alphanumeric string followed by a colon, ':'), followed by an operation, followed by zero, one, or two operands. Additional text after the last operand is ignored; however, this text should be preceded by a semicolon (i.e., it should look like a comment).
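As a minimal sketch of this line format (not the code of the provided interpreter), the following C++ function splits one line into its parts. The `Instr` record and the `parse_line` name are hypothetical. The sketch assumes that the colon follows the label without intervening space and that any ';' starts a comment.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one parsed instruction line.
struct Instr {
    std::string label;                  // empty when the line has no label
    std::string op;                     // empty for blank and comment-only lines
    std::vector<std::string> operands;  // zero, one, or two entries
};

// Split one source line into label, operation, and operands.
Instr parse_line(const std::string& line) {
    // Assumption: everything from the first ';' onward is a comment.
    std::string code = line.substr(0, line.find(';'));
    std::istringstream in(code);
    Instr result;
    std::string word;
    if (!(in >> word)) return result;        // blank or comment-only line
    if (word.back() == ':') {                // leading label such as "l1:"
        result.label = word.substr(0, word.size() - 1);
        if (!(in >> word)) return result;    // label with nothing after it
    }
    result.op = word;
    // Keep at most two operands; anything further is ignored.
    while (result.operands.size() < 2 && in >> word)
        result.operands.push_back(word);
    assert(result.operands.size() <= 2);     // invariant of the line format
    return result;
}
```

For example, `parse_line("set -1 t3 ; decrement t1")` yields the operation `set` with the operands `-1` and `t3` and an empty label. The sketch does not check that an operation receives the number of operands it expects.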
| Operation | Operand 1 | Operand 2 | Meaning |
|---|---|---|---|
| input | store label | | input --> store[opnd1] |
| output | store label | | store[opnd1] --> output |
| copy | store label | store label | store[opnd1] --> store[opnd2] |
| set | value | store label | opnd1 --> store[opnd2] |
| mult | store label | store label | store[opnd1] * store[opnd2] --> store[opnd2] |
| div | store label | store label | store[opnd1] / store[opnd2] --> store[opnd2] |
| add | store label | store label | store[opnd1] + store[opnd2] --> store[opnd2] |
| sub | store label | store label | store[opnd1] - store[opnd2] --> store[opnd2] |
| eq | store label | code label | if( store[opnd1] == 0 ) opnd2 --> PC |
| ne | store label | code label | if( store[opnd1] != 0 ) opnd2 --> PC |
| lt | store label | code label | if( store[opnd1] < 0 ) opnd2 --> PC |
| le | store label | code label | if( store[opnd1] <= 0 ) opnd2 --> PC |
| ge | store label | code label | if( store[opnd1] >= 0 ) opnd2 --> PC |
| gt | store label | code label | if( store[opnd1] > 0 ) opnd2 --> PC |
| goto | code label | | opnd1 --> PC |
| nop | | | no operation |
| stop | | | terminates execution |
| end | | | marks the end of the program (translation time) |
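To make the operand order in the table concrete, here is a small sketch of an execution loop in C++. It is an illustration only, not the provided interpreter. The `Instr` layout, the `Store` type, and the `run` function are hypothetical. The sketch assumes that code labels have already been resolved to instruction indices, that store cells hold `long` values initialized to zero, and that division truncates as in C++. It omits `input`, `output`, and `end`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory instruction with resolved branch targets.
struct Instr {
    std::string op;
    std::string a;       // opnd1: a store label, or the literal for "set"
    std::string b;       // opnd2: a store label (unused by branches)
    std::size_t target;  // index of the branch target (branches and goto)
};

using Store = std::map<std::string, long>;

// Run until "stop" or the end of the program and return the final store.
Store run(const std::vector<Instr>& prog, Store store = {}) {
    std::size_t pc = 0;
    while (pc < prog.size()) {
        const Instr& i = prog[pc++];  // by default, continue with the next line
        if (i.op == "stop") break;
        if (i.op == "nop") continue;
        if (i.op == "goto") { pc = i.target; continue; }
        if (i.op == "set") { store[i.b] = std::stol(i.a); continue; }

        long x = store[i.a];          // store[opnd1]
        bool branch = true, taken = false;
        if      (i.op == "eq") taken = (x == 0);
        else if (i.op == "ne") taken = (x != 0);
        else if (i.op == "lt") taken = (x <  0);
        else if (i.op == "le") taken = (x <= 0);
        else if (i.op == "ge") taken = (x >= 0);
        else if (i.op == "gt") taken = (x >  0);
        else branch = false;
        if (branch) { if (taken) pc = i.target; continue; }

        long& y = store[i.b];         // store[opnd2], which is also the destination
        if      (i.op == "copy") y = x;
        else if (i.op == "add")  y = x + y;
        else if (i.op == "sub")  y = x - y;
        else if (i.op == "mult") y = x * y;
        else if (i.op == "div")  { assert(y != 0); y = x / y; }
        else assert(false && "operation not covered by this sketch");
    }
    return store;
}
```

Note that `sub` and `div` compute store[opnd1] minus (or divided by) store[opnd2], so the sequence `set 0 t2` followed by `sub t2 x` negates `x`.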
The following sample program reads two numbers and computes the product of the numbers using addition:
;;
;; a sample program -- read two integers and produce the product
;; using addition only
;;
input a
input b
set 0 prod
copy a t1
ge t1 l1 ; make sure t1 is non-negative
set 0 t2
sub t2 t1
l1: nop
l2: nop
eq t1 l3 ; quit when we get to zero
add b prod
set -1 t3 ; decrement t1
add t3 t1
goto l2
l3: nop
ge a l4 ; negate the result if a < 0
set 0 t2
sub t2 prod
l4: nop
output prod
stop
end
-4
5
Programs written in the input language consist of a sequence of statements. The language provides five types of statements: read, write, assignment, if, and while. All variables are of type integer and are implicitly declared by their first use.
The following is a grammar for the input language:
<goal> :: <stmt list> eof
<stmt list> :: <stmt> | <stmt> <stmt list>
<stmt> :: <read stmt> | <write stmt> | <assign stmt> | <if stmt> | <while stmt>
<read stmt> :: "read" <operand> ';'
<write stmt> :: "write" <expr> ';'
<assign stmt> :: <operand> '=' <expr> ';'
<if stmt> :: "if" <expr> "then" <stmt list> "end"
           | "if" <expr> "then" <stmt list> "else" <stmt list> "end"
<while stmt> :: "while" <expr> "do" <stmt list> "end"
<expr> :: <operand>
        | <expr> <binary operator> <expr>
        | <unary operator> <expr>
        | '(' <expr> ')'
<operand> :: <letter or digit>
           | <letter or digit> <operand>
<binary operator> :: '*' | '/' | '+' | '-' | "==" | "!=" | "<" | "<=" | ">=" | ">"
<unary operator> :: '-'
<letter or digit> :: 'a' | 'A' | 'b' | 'B' | ... | '8' | '9'
The following program computes the product of its two inputs (again, using addition).
read x;
read y;
prod = 0;
temp = x;
if temp < 0 then
    temp = -temp;
end
while temp != 0 do
    temp = temp - 1;
    prod = prod + y;
end
if x < 0 then
    prod = -prod;
end
write prod;
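To connect the two languages: every two-operand arithmetic instruction of the target language leaves its result in its second operand, so a statement such as `prod = prod + y;` can be translated by first copying the right operand into a temporary. The following C++ sketch shows one possible way to emit such code. The function names `emit_binary` and `emit_assign` and the temporary name are hypothetical, and the operands are assumed to be variables already held in the store (a numeric literal would first need a `set` into a temporary of its own).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Emit target code that leaves `left OP right` in the store cell `temp`,
// where op is one of "add", "sub", "mult", "div".
std::vector<std::string> emit_binary(const std::string& op,
                                     const std::string& left,
                                     const std::string& right,
                                     const std::string& temp) {
    // Copying into temp must not overwrite the left operand.
    assert(temp != left);
    return {
        "copy " + right + " " + temp,  // temp = right
        op + " " + left + " " + temp   // temp = left OP temp
    };
}

// Emit target code for the statement `dest = left OP right;`.
std::vector<std::string> emit_assign(const std::string& dest,
                                     const std::string& op,
                                     const std::string& left,
                                     const std::string& right,
                                     const std::string& temp) {
    std::vector<std::string> code = emit_binary(op, left, right, temp);
    code.push_back("copy " + temp + " " + dest);  // dest = temp
    return code;
}
```

Under these assumptions, `emit_assign("prod", "add", "prod", "y", "t1")` produces the three lines `copy y t1`, `add prod t1`, and `copy t1 prod`. The same pattern works for `sub` and `div` because the target instructions compute store[opnd1] OP store[opnd2].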
The grammar given for the input language has two significant problems. First, it is ambiguous. Second, it cannot be used directly in a recursive descent parser. We will correct these shortcomings in this section.
We start by addressing the ambiguity problem. The ambiguity problem occurs in the recognition of expressions. Consider the expression:
a * b + c
This expression can be recognized by two parse trees. The first groups a * b before adding c; the second groups b + c before multiplying by a:
                                               <expr>
                          ______________________/ | \______
                         /                        |        \
                        /                         |         \
                   <expr>                         |          \
           _________/ | \__________               |           \
          /           |            \              |            \
         /            |             \             |             \
    <operand> <binary operator> <operand> <binary operator> <operand>
        |             |             |             |             |
        a             *             b             +             c

                   <expr>
                ____/ | \_______________________
               /      |                         \
              /       |                          \
             /        |                        <expr>
            /         |                 ________/ | \_________
           /          |                /          |           \
          /           |               /           |            \
    <operand> <binary operator> <operand> <binary operator> <operand>
        |             |             |             |             |
        a             *             b             +             c
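The ambiguity matters because the two trees denote different computations. The following C++ sketch builds both trees by hand and evaluates them. The `Node` type and the helper functions `leaf`, `node`, and `eval` are hypothetical and serve only to illustrate the point.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical expression-tree node: an operator with two children, or a leaf.
struct Node {
    char op;                         // '*' or '+'; 0 marks a leaf
    long value;                      // meaningful only for leaves
    std::unique_ptr<Node> lhs, rhs;  // null for leaves
};

std::unique_ptr<Node> leaf(long v) {
    return std::unique_ptr<Node>(new Node{0, v, nullptr, nullptr});
}

std::unique_ptr<Node> node(char op, std::unique_ptr<Node> l,
                           std::unique_ptr<Node> r) {
    return std::unique_ptr<Node>(new Node{op, 0, std::move(l), std::move(r)});
}

// Evaluate a tree bottom-up; the shape of the tree alone fixes the grouping.
long eval(const Node& n) {
    if (n.op == 0) return n.value;
    assert(n.lhs && n.rhs);          // interior nodes have two children
    long l = eval(*n.lhs);
    long r = eval(*n.rhs);
    return n.op == '*' ? l * r : l + r;
}
```

With a = 2, b = 3, and c = 4, the first tree, `node('+', node('*', leaf(2), leaf(3)), leaf(4))`, evaluates to 10, while the second, `node('*', leaf(2), node('+', leaf(3), leaf(4)))`, evaluates to 14.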
The following expression grammar resolves the ambiguities in the original grammar by introducing traditional associativity and precedence rules:
<expr> :: <expr> <rel op> <term>
        | <term>
<term> :: <term> <add op> <factor>
        | <factor>
<factor> :: <factor> <mult op> <primary>
          | <primary>
<primary> :: <operand> | <unary op> <expr> | '(' <expr> ')'
<rel op> :: "==" | "!=" | "<" | "<=" | ">=" | ">"
<add op> :: '+' | '-'
<mult op> :: '*' | '/'
<unary op> :: '-'
This grammar yields the following parse tree for the expression given earlier:
                                  <expr>
                                     |
                                  <term>
                    _______________/ | \
                   /                 |  \
                  /                  |   \
               <term>                |    \
                  |                  |     \
              <factor>               |      |
            ____/ | \__              |      |
           /      |    \             |       \
          /       |     \            |        \
    <factor>      |      \           |     <factor>
        |         |       \          |         |
    <primary>     |     <primary>    |     <primary>
        |         |         |        |         |
    <operand> <mult op> <operand> <add op> <operand>
        |         |         |        |         |
        a         *         b        +         c
This grammar, however, does not lend itself to predictive parsing, because its productions are left recursive. The problem can be seen by considering how you would decide whether to expand <expr> to <term> or to <expr> <rel op> <term>. When you can see the whole expression, the decision is relatively easy; with a limited lookahead, however, it becomes impossible to make this decision in the general case. We avoid this problem by altering the grammar one final time, replacing each left-recursive production with a primed nonterminal that either derives the empty string (epsilon) or consumes one more operator and operand:
<expr> :: <term><term'>
<term'> :: epsilon | <rel op><term><term'>
<term> :: <factor><factor'>
<factor'> :: epsilon | <add op><factor><factor'>
<factor> :: <primary><primary'>
<primary'> :: epsilon | <mult op><primary><primary'>
<primary> :: <operand> | <unary op> <expr> | '(' <expr> ')'
Compilers usually break the recognition process into two stages: lexical analysis and parsing. Lexical analysis is the process of recognizing simple structures in the code called "tokens," which include punctuation marks and words. Parsing is the process of recognizing the larger structures defined by the grammar.
For this project, your lexical analyzer should recognize the following tokens: semicolon, left parenthesis, right parenthesis, equal (==), not equal (!=), less than (<), less or equal (<=), greater or equal (>=), greater than (>), plus (+), minus (-), asterisk (*), slash (/), assign (=), the reserved words ("while", "do", "if", "then", "else", "end", "read", "write"), operands, and eof.
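The token list above can be turned into a scanner in many ways; the following C++ function is one minimal sketch, not a required design. The relational token names follow the parser fragment in this handout (`TOK_EQOP` and so on); all other names, including `TOK_ERROR`, the `Token` record, and `tokenize`, are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum TokKind {
    TOK_SEMI, TOK_LPAREN, TOK_RPAREN,
    TOK_EQOP, TOK_NEOP, TOK_LTOP, TOK_LEOP, TOK_GEOP, TOK_GTOP,
    TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH, TOK_ASSIGN,
    TOK_WHILE, TOK_DO, TOK_IF, TOK_THEN, TOK_ELSE, TOK_END, TOK_READ, TOK_WRITE,
    TOK_OPERAND, TOK_EOF, TOK_ERROR
};

struct Token { TokKind kind; std::string text; };

std::vector<Token> tokenize(const std::string& src) {
    // Reserved words, listed in the same order as TOK_WHILE .. TOK_WRITE.
    static const char* words[] = {"while", "do", "if", "then",
                                  "else", "end", "read", "write"};
    assert(TOK_WRITE - TOK_WHILE == 7);  // the table must match the enum order
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c)) {           // operand or reserved word
            std::size_t start = i;
            while (i < src.size() &&
                   std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            std::string text = src.substr(start, i - start);
            TokKind kind = TOK_OPERAND;
            for (int w = 0; w < 8; ++w)
                if (text == words[w]) kind = static_cast<TokKind>(TOK_WHILE + w);
            out.push_back({kind, text});
            continue;
        }
        // One- and two-character punctuation; eq is true when '=' follows.
        bool eq = i + 1 < src.size() && src[i + 1] == '=';
        TokKind kind = TOK_ERROR;
        std::size_t len = 1;
        switch (c) {
            case ';': kind = TOK_SEMI;   break;
            case '(': kind = TOK_LPAREN; break;
            case ')': kind = TOK_RPAREN; break;
            case '+': kind = TOK_PLUS;   break;
            case '-': kind = TOK_MINUS;  break;
            case '*': kind = TOK_STAR;   break;
            case '/': kind = TOK_SLASH;  break;
            case '=': kind = eq ? TOK_EQOP : TOK_ASSIGN; len = eq ? 2 : 1; break;
            case '<': kind = eq ? TOK_LEOP : TOK_LTOP;   len = eq ? 2 : 1; break;
            case '>': kind = eq ? TOK_GEOP : TOK_GTOP;   len = eq ? 2 : 1; break;
            case '!': if (eq) { kind = TOK_NEOP; len = 2; } break;
            default:  break;             // unknown character: TOK_ERROR
        }
        out.push_back({kind, src.substr(i, len)});
        i += len;
    }
    out.push_back({TOK_EOF, ""});
    return out;
}
```

This sketch scans the whole input at once and returns a vector. A lexer with `peek()` and `get()` operations, as used by the parser fragment in this handout, could be layered on top of such a vector.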
Using the final expression grammar, along with the remaining parts of the first grammar, parsing is relatively straightforward. The following code fragment illustrates the recognition of an expression:
// <term> :: <factor><factor'>
void read_term() {
    read_fact();
    read_fact_();
}

// <term'> :: epsilon | <rel op><term><term'>
void read_term_() {
    Token t = lex.peek();
    switch( t.kind() ) {
    case TOK_EQOP:
    case TOK_NEOP:
    case TOK_LTOP:
    case TOK_LEOP:
    case TOK_GEOP:
    case TOK_GTOP:
        lex.get(); // consume the relational operator
        read_term();
        read_term_();
        break;
    default:
        break; // epsilon: consume nothing
    }
}

// <expr> :: <term><term'>
void read_expr() {
    read_term();
    read_term_();
}
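The fragment above leaves out the routines for the remaining nonterminals. The following self-contained C++ sketch fills them in for the final expression grammar so that the whole recognizer can be compiled and tried. It is one possible arrangement, not the required one: the `Lexer` stand-in works on a prepared vector of token kinds, the parser routines are grouped in a `Parser` struct, and the names `read_prim`, `read_prim_`, and `recognizes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

enum TokKind { TOK_OPERAND, TOK_LPAREN, TOK_RPAREN,
               TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH,
               TOK_EQOP, TOK_NEOP, TOK_LTOP, TOK_LEOP, TOK_GEOP, TOK_GTOP,
               TOK_EOF };

// Minimal stand-in for a lexer: serves token kinds from a vector.
class Lexer {
public:
    explicit Lexer(std::vector<TokKind> toks) : toks_(std::move(toks)) {
        toks_.push_back(TOK_EOF);
    }
    TokKind peek() const { assert(pos_ < toks_.size()); return toks_[pos_]; }
    TokKind get() {                       // never advances past eof
        TokKind t = peek();
        if (t != TOK_EOF) ++pos_;
        return t;
    }
private:
    std::vector<TokKind> toks_;
    std::size_t pos_ = 0;
};

struct Parser {
    Lexer lex;
    explicit Parser(std::vector<TokKind> toks) : lex(std::move(toks)) {}

    // Relies on the relational kinds being contiguous in TokKind.
    static bool is_rel(TokKind k) { return k >= TOK_EQOP && k <= TOK_GTOP; }

    void read_expr()  { read_term(); read_term_(); }
    void read_term_() {                   // epsilon | <rel op><term><term'>
        if (is_rel(lex.peek())) { lex.get(); read_term(); read_term_(); }
    }
    void read_term()  { read_fact(); read_fact_(); }
    void read_fact_() {                   // epsilon | <add op><factor><factor'>
        TokKind k = lex.peek();
        if (k == TOK_PLUS || k == TOK_MINUS) { lex.get(); read_fact(); read_fact_(); }
    }
    void read_fact()  { read_prim(); read_prim_(); }
    void read_prim_() {                   // epsilon | <mult op><primary><primary'>
        TokKind k = lex.peek();
        if (k == TOK_STAR || k == TOK_SLASH) { lex.get(); read_prim(); read_prim_(); }
    }
    void read_prim() {                    // <operand> | '-' <expr> | '(' <expr> ')'
        switch (lex.get()) {
        case TOK_OPERAND: break;
        case TOK_MINUS:   read_expr(); break;
        case TOK_LPAREN:
            read_expr();
            if (lex.get() != TOK_RPAREN) throw std::runtime_error("expected ')'");
            break;
        default:
            throw std::runtime_error("expected an operand, '-', or '('");
        }
    }
};

// True when the whole token sequence is a single <expr>.
bool recognizes(const std::vector<TokKind>& toks) {
    Parser p(toks);
    try { p.read_expr(); } catch (const std::runtime_error&) { return false; }
    return p.lex.peek() == TOK_EOF;
}
```

For instance, `recognizes({TOK_OPERAND, TOK_STAR, TOK_OPERAND, TOK_PLUS, TOK_OPERAND})` accepts the token sequence for a * b + c, while a sequence that ends in an operator is rejected. Each routine decides what to do by looking at a single token, which is what the final grammar was designed to allow.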