9.3 Sample Program: A Simple Expression Compiler

This section's sample program is a bit complex. In fact, the theory behind it is well beyond the scope of this text (since it involves compiler theory). However, this example is such a good demonstration of HLA's macro facilities and DSEL capabilities that it was too good not to include here. The following paragraphs explain how this compile-time program operates. If you have difficulty understanding what's going on, don't feel too bad; this code isn't the sort of thing that beginning assembly language programmers would normally develop on their own.

This program presents a (very) simple expression compiler. This code includes a macro, u32expr, that emits a sequence of instructions that compute the value of an arithmetic expression and leave that result sitting in one of the 80x86's 32-bit registers. The syntax for the u32expr macro invocation is the following:


 
u32expr( reg32, uns32_expression );
 

 

This macro emits the code that computes the following (HLL) statement:


 
reg32 := uns32_expression;
 

 

For example, the macro invocation "u32expr( eax, ebx+ecx*5 - edi );" computes the value of the expression "ebx+ecx*5 - edi" and leaves the result of this expression sitting in the EAX register.

The u32expr macro places several restrictions on the expression. First of all, as the name implies, it only computes the result of an uns32 expression. No other data types may appear within the expression. During computation, the macro uses the EAX and EDX registers, so expressions should not contain these registers as their values may be destroyed by the code that computes the expression (EAX or EDX may safely appear as the first operand of the expression, however). Finally, expressions may only contain the following operators:

<, <=, >, >=, <>, !=, =, ==
 
+, -
 
*, /
 
(, )
 

 

The "<>" and "!=" operators are equivalent (not equals) and the "=" and "==" operators are also equivalent (equals). The operators above are listed in order of increasing precedence; i.e., "*" has a higher precedence than "+" (as you would expect). You can override the precedence of an operator by using parentheses in the standard manner.

It is important to remember that u32expr is a macro, not a function. That is, the invocation of this macro expands into a sequence of 80x86 assembly language instructions that computes the desired expression; it is not a function call to some routine that computes the result.

To understand how this macro works, it would be a good idea to review the section on "Converting Arithmetic Expressions to Postfix Notation" on page 635. That section discusses how to convert floating point expressions to reverse polish notation; although the u32expr macro works with uns32 objects rather than floating point objects, it uses this same algorithm to translate expressions into assembly language. So if you don't remember how to translate expressions into reverse polish notation, it might be worthwhile to review that section of this text.

Converting floating point expressions to reverse polish notation is especially easy because the 80x86's FPU uses a stack architecture. Alas, the integer instructions on the 80x86 use a register architecture, and efficiently translating integer expressions to assembly language is a bit more difficult (see "Arithmetic Expressions" on page 597). We'll solve this problem by translating the expressions to assembly code in a somewhat less than efficient manner; we'll simulate an integer stack architecture by using the 80x86's hardware stack to hold temporary results during an integer calculation.
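
For example, here is a rough sketch of the approach (x, y, and z are assumed uns32 variables). The expression "x + y*z" converts to the postfix form "x y z * +", which maps directly onto the hardware stack:


		push( x );						// Postfix: x
		push( y );						// Postfix: y
		push( z );						// Postfix: z
		pop( eax );						// Postfix: * (compute y*z)...
		mul( (type dword [esp]) );
		mov( eax, [esp] );					// ...and leave the product on the TOS.
		pop( eax );						// Postfix: + (compute x + y*z)...
		add( eax, [esp] );					// ...and leave the sum on the TOS.


The following paragraphs describe each of these instruction sequences in detail.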

To push an integer constant or variable onto the 80x86 hardware stack, we need only use a PUSH or PUSHD instruction. This operation is trivial.

To add two values sitting on the top of stack together, leaving their sum on the stack, all we need do is pop those two values into registers, add the register values, and then push the result back onto the stack. We can do this operation slightly more efficiently, since addition is commutative, by using the following code:


 
// Compute X+Y where X is on NOS (next on stack) and Y is on TOS (top of stack):
 

 
		pop( eax );							// Get Y's value.
 
		add( eax, [esp] );							// Add with X's value and leave sum on TOS.
 

 

Subtraction is identical to addition. Although subtraction is not commutative, the operands just happen to be on the stack in the proper order to efficiently compute their difference. To compute "X-Y" where X is on NOS and Y is on TOS, we can use code like the following:


 
// Compute X-Y where X is on NOS and Y is on TOS:
 

 
		pop( eax );
 
		sub( eax, [esp] );
 

 

Multiplication of the two items on the top of stack is a little more complicated since we must use the MUL instruction (the only unsigned multiplication instruction available) and the destination operand must be the EDX:EAX register pair. Fortunately, multiplication is a commutative operation, so we can compute the product of NOS (next on stack) and TOS (top of stack) using code like the following:


 
// Compute X*Y where X is on NOS and Y is on TOS:
 

 
		pop( eax );
 
		mul( [esp], eax );							// Note that this wipes out the EDX register.
 
		mov( eax, [esp] );
 

 

Division is problematic because it is not a commutative operation and its operands on the stack are not in a convenient order. That is, to compute X/Y it would be really convenient if X were on TOS and Y were in the NOS position. Alas, as you'll soon see, it turns out that X is at NOS and Y is on the TOS. Resolving this issue requires slightly less efficient code than the sequences we've used above. Since the DIV instruction is so slow anyway, this hardly matters.


 
// Compute X/Y where X is on NOS and Y is on TOS:
 

 
		mov( [esp+4], eax );								// Get X from NOS.
 
		xor( edx, edx );								// Zero-extend EAX into EDX:EAX
 
		div( [esp], edx:eax );								// Compute their quotient.
 
		pop( edx );								// Remove unneeded Y value from the stack.
 
		mov( eax, [esp] );								// Store quotient to the TOS.
 

 

The remaining operators are the comparison operators. These operators compare the value on NOS with the value on TOS and leave true (1) or false (0) sitting on the stack based on the result of the comparison. While it is easy to work around the non-commutative aspect of many of the comparison operators, the big challenge is converting the result to true or false. The SETcc instructions are convenient for this purpose, but they only work on byte operands. Therefore, we will have to zero extend the result of the SETcc instructions to obtain an uns32 result we can push onto the stack. Ultimately, the code we must emit for a comparison is similar to the following:


 
// Compute X <= Y where X is on NOS and Y is on TOS.
 

 
		pop( eax );
 
		cmp( [esp], eax );
 
		setbe( al );							// This instruction changes for other operators.
 
		movzx( al, eax );
 
		mov( eax, [esp] );
 

 

As it turns out, the appearance of parentheses in an expression only affects the order of the instructions appearing in the sequence; it does not affect the number or type of instructions that correspond to the calculation of an expression. As you'll soon see, handling parentheses is an especially trivial operation.
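
For example, here is a sketch of the code emitted for "(a+b)*c" versus "a+b*c" (a, b, and c are assumed uns32 variables); both sequences contain exactly the same instructions, only their order differs:


	// Code emitted for (a+b)*c:

		push( a );
		push( b );
		pop( eax );
		add( eax, [esp] );					// a+b left on the TOS.
		push( c );
		pop( eax );
		mul( (type dword [esp]) );
		mov( eax, [esp] );					// (a+b)*c left on the TOS.

	// Code emitted for a+b*c:

		push( a );
		push( b );
		push( c );
		pop( eax );
		mul( (type dword [esp]) );
		mov( eax, [esp] );					// b*c left on the TOS.
		pop( eax );
		add( eax, [esp] );					// a+b*c left on the TOS.
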

With this short description of how to emit code for each type of arithmetic operator, it's time to discuss exactly how we will write a macro to automate this translation. Once again, a complete discussion of this topic is well beyond the scope of this text; however, a brief introduction to compiler theory will certainly ease the understanding of the u32expr macro.

For reasons of efficiency and convenience, most compilers are broken down into several components called phases. A compiler phase is a collection of logically related activities that take place during compilation. There are three general compiler phases we are going to consider here: (1) lexical analysis (or scanning), (2) parsing, and (3) code generation. It is important to realize that these three activities occur concurrently during compilation; that is, they take place at the same time rather than as three separate, serial activities. A compiler will typically run the lexical analysis phase for a short period, transfer control to the parsing phase, do a little code generation, and then, perhaps, do some more scanning, parsing, and code generation (not necessarily in that order). Real compilers have additional phases; the u32expr macro will only use these three (and if you look at the macro, you'll discover that it's difficult to separate the parsing and code generation phases).

Lexical analysis is the process of breaking down a string of characters, representing the expression to compile, into a sequence of tokens for use by the parser. For example, an expression of the form "MaxVal - x <= $1c" contains five distinct tokens:

	MaxVal     -     x     <=     $1c

Breaking any one of these tokens into smaller objects would destroy the intent of the expression (e.g., converting MaxVal to "Max" and "Val" or converting "<=" into "<" and "="). The job of the lexical analyzer is to break the string down into a sequence of constituent tokens and return this sequence of tokens to the parser (generally one token at a time, as the parser requests new tokens). Another task for the lexical analyzer is to remove any extra white space from the string of symbols (since expressions may generally contain an arbitrary amount of white space).

Fortunately, it is easy to extract the next available token in the input string by skipping all white space characters and then looking at the current character. Identifiers always begin with an alphabetic character or an underscore; numeric values always begin with a decimal digit, a dollar sign ("$"), or a percent sign ("%"). Operators always begin with the punctuation character that represents the operator. There are only two major issues here: how do we classify these tokens, and how do we differentiate two or more distinct tokens that start with the same character (e.g., "<", "<=", and "<>")? Fortunately, HLA's compile-time functions provide the tools we need to do this.

Consider the declaration of the u32expr macro:

#macro u32expr( reg, expr ):sexpr;
 

 

The expr parameter is a text object representing the expression to compile. The sexpr local symbol will contain the string equivalent of this text expression. The macro translates the text expr object to a string with the following statement:

?sexpr := @string:expr;
 

 

From this point forward, the macro works with the string in sexpr.

The lexer macro (compile-time function) handles the lexical analysis operation. This macro expects a single string parameter from which it extracts a single token and removes the string associated with that token from the front of the string. For example, the following macro invocation returns "2" as the function result and leaves "+3" in the parameter string (str2Lex):

	?str2Lex := "2+3";
 
	?TokenResult := lexer( str2Lex );
 

 

The lexer function actually returns a little more than the string it extracts from its parameter. The actual return value is a record constant that has the definition:

    tokType:
 
        record
 
        
 
            lexeme:string;
 
            tokClass:tokEnum;
 
            
 
        endrecord;
 
        
 

 

The lexeme field holds the actual string (e.g., "2" in this example) that the lexer macro returns. The tokClass field holds a small numeric value (see the tokEnum enumerated data type) that specifies the type of the token. In this example, the call to lexer stores the value intconst into the tokClass field. Having a single value (like intconst) when the lexeme could take on a large number of different forms (e.g., "2", "3", "4", ...) helps make the parser easier to write. The call to lexer in the previous example produces the following results:

str2Lex : "+3"
 
TokenResult.lexeme: "2"
 
TokenResult.tokClass: intconst
 

 

A subsequent call to lexer, immediately after the call above, will process the next available token in the string and return the following values:

str2Lex : "3"
 
TokenResult.lexeme: "+"
 
TokenResult.tokClass: plusOp
 

 

To see how lexer works, consider the first few lines of the lexer macro:


 
#macro lexer( input ):theLexeme,boolResult;
 

 
    ?theLexeme:string;      // Holds the string we scan.
 
    ?boolResult:boolean;    // Used only as a dummy value.
 
    
 
    // Check for an identifier.
 
    
 
    #if( @peekCset( input, tok1stIDChar ))
 
    
 
        // If it began with a legal ID character, extract all
 
        // ID characters that follow.  The extracted string
 
        // goes into "theLexeme" and this call also removes
 
        // those characters from the input string.
 
        
 
        ?boolResult := @oneOrMoreCset( input, tokIDChars, input, theLexeme );
 
        
 
        // Return a tokType constant with the identifier string and
 
        // the "identifier" token value:
 
        
 
        tokType:[ theLexeme, identifier ]
 
        
 
        
 
        
 
    // Check for a decimal numeric constant.
 
    
 
    #elseif( @peekCset( input, digits ))
 
		.
 
		.
 
		.
 

 

The real work begins with the #IF statement where the code uses the @peekCset function to see if the first character of the input parameter is a member of the tok1stIDChar set (which is the alphabetic characters plus an underscore, i.e., the set of characters that may appear as the first character of an identifier). If so, the code executes the @oneOrMoreCset function to extract all legal identifier characters (alphanumerics plus underscore), storing the result in the theLexeme string variable. Note that this call to @oneOrMoreCset also removes the string it matches from the front of the input string (see the description of @oneOrMoreCset for more details). The macro returns a tokType result by simply specifying a tokType constant containing theLexeme and the enum constant identifier.
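
Here is a rough compile-time sketch of this case (it assumes the tok1stIDChar and tokIDChars character set constants from Program 9.7 and uses "MaxVal-2" as an example input):


	?input := "MaxVal-2";
	?theLexeme:string;

	#if( @peekCset( input, tok1stIDChar ))		// True: 'M' is in tok1stIDChar.

		?boolResult := @oneOrMoreCset( input, tokIDChars, input, theLexeme );

		// At this point theLexeme = "MaxVal" and input = "-2".

	#endif
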

If the first character of the input string is not in the tok1stIDChar set, then the lexer macro checks to see if the first character is a legal decimal digit. If so, then this macro processes that string of digits in a manner very similar to identifiers. The code handles hexadecimal and binary constants in a similar fashion. About the only thing exciting in the whole macro is the way it differentiates tokens that begin with the same symbol. Once it determines that a token begins with a character common to several lexemes, it calls @matchStr to attempt to match the longer tokens before settling on the shorter lexeme (i.e., lexer attempts to match "<=" or "<>" before it decides the lexeme is just "<"). Other than this complication, the operation of the lexer is really quite simple.
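
For example (a sketch, assuming the remaining input is "<=5" when lexer reaches the comparison-operator case), matching the longer operator first produces:

	@matchStr( input, "<=", input, theLexeme )  succeeds, leaving:
	    theLexeme: "<="
	    input:     "5"

Had lexer tested the single-character set first, it would have scanned "<" and left "=5" behind, incorrectly breaking "<=" into two tokens.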

The operation of the parser/code generation phases is a bit more complex, especially since these macros are indirectly recursive; to simplify matters we will explore the parser/code generator in a bottom-up fashion.

The parser/code generator phases consist of four separate macros: doTerms, doMulOps, doAddOps, and doCmpOps. The reason for these four separate macros is to handle the different precedences of the arithmetic operators and the parentheses. An explanation of how these four macros handle the different arithmetic precedences is beyond the scope of this text; we'll just look at how these four macros do their job.
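
For reference, the four macros correspond to the following grammar productions (collected from the comments in Program 9.7); each macro handles one precedence level, with CmpOps at the lowest precedence and terms at the highest:

	CmpOps -> AddOps ( cmpOp AddOps )*
	AddOps -> MulOps ( addOp MulOps )*
	MulOps -> terms  ( mulOp terms )*
	terms  -> identifier | intconst | '(' CmpOps ')'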

The doTerms macro is responsible for handling identifiers, numeric constants, and subexpressions surrounded by parentheses. The single parameter is the current input string whose first (non-blank) character sequence is an identifier, constant, or parenthetical expression. Here is the full text for this macro:

#macro doTerms( expr ):termToken;
 

 
    // Begin by removing any leading white space from the string:
 

 
    ?expr := @trim( expr, 0 );
 

 
    // Okay, call the lexer to extract the next token from the input:
 

 
    ?termToken:tokType := lexer( expr );
 

 
    // See if the current token is an identifier.  If so, assume that
 
    // it's an uns32 identifier and emit the code to push its value onto 
 
    // the stack.
 

 
    #if( termToken.tokClass = identifier )
 
    
 
        // If we've got an identifier, emit the code to
 
        // push that identifier onto the stack.
 

 
        push( @text( termToken.lexeme ));
 
              
 
    // If it wasn't an identifier, see if it's a numeric constant.
 
    // If so, emit the code that will push this value onto the stack.
 

 
    #elseif( termToken.tokClass = intconst )
 
    
 
        // If we've got a constant, emit the code to push
 
        // that constant onto the stack.
 
        
 
        pushd( @text( termToken.lexeme ) );
 
        
 
    // If it's not an identifier or an integer constant, see if it's
 
    // a parenthesized subexpression.  If so, invoke the doCmpOps macro
 
    // to do the real work of code generation for the subexpression.
 
    // The call to the doCmpOps macro emits all the code needed to push
 
    // the result of the subexpression onto the stack;  note that this
 
    // macro doesn't need to emit any code for the parenthetical expression,
 
    // all the code emission is handled by doCmpOps.
 

 
    #elseif( termToken.tokClass = lparen )
 
    
 
        // If we've got a parenthetical expression, emit
 
        // the code to leave the parenthesized expression
 
        // sitting on the stack.
 
        
 
        doCmpOps( expr );
 

 
        // We must have a closing right parentheses after the subexpression.
 
        // Skip any white space and check for the closing ")" here.
 

 
        ?expr := @trim( expr, 0 );
 
        ?termToken:tokType := lexer( expr );
 
        #if( termToken.tokClass <> rparen )
 
        
 
            #error( "Expected closing parenthesis: " + termToken.lexeme )
 
            
 
        #endif
 
        
 
        
 
    // If we get to this point, then the lexer encountered something besides
 
    // an identifier, a numeric constant, or a parenthetical expression.
 

 
    #else
 
    
 
        #error( "Unexpected term: '" + termToken.lexeme + "'" )
 
        
 
    #endif
 
    
 
#endmacro;
 

 

The doTerms macro is responsible for leaving a single item sitting on the top of the 80x86 hardware stack. That stack item is either the value of an uns32 identifier, the value of an uns32 constant, or the value left on the stack by a parenthesized subexpression. The important thing to remember is that you can think of doTerms as a function that emits code that leaves a single item on the top of the 80x86 stack.
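
For instance, here is a sketch of what doTerms emits in each case (count is an assumed uns32 variable, and a and b are assumed uns32 operands):

	// doTerms sees "count ..."   ->   push( count );
	// doTerms sees "10 ..."      ->   pushd( 10 );
	// doTerms sees "(a+b) ..."   ->   whatever code doCmpOps emits for a+b,
	//                                 which leaves that sum on the TOS.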

The doMulOps macro handles expressions consisting of a single term (items handled by the doTerms macro) optionally followed by zero or more pairs consisting of a multiplicative operator ("*" or "/") and a second term. It is especially important to remember that the doMulOps macro does not require the presence of a multiplicative operator; it will legally process a single term (identifier, numeric constant, or parenthetical expression). If one or more multiplicative operator and term pairs are present, the doMulOps macro will emit the code that multiplies or divides the values of the two terms and pushes the result onto the stack. E.g., consider the following:

X * 5
 

 

Since there is a multiplicative operator present ("*"), the doMulOps macro will call doTerms to process the two terms (pushing X and then 5 onto the stack) and then the doMulOps macro will emit the code to multiply the two values on the stack, leaving their product on the stack, as the sketch below illustrates. The complete code for the doMulOps macro follows the sketch:
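
Roughly speaking (X is an assumed uns32 variable), the code emitted for "X * 5" is:


		push( X );						// From doTerms (first term).
		pushd( 5 );						// From doTerms (second term).
		pop( eax );						// Get 5 into EAX.
		mul( (type dword [esp]) );				// EDX:EAX := 5 * X.
		mov( eax, [esp] );					// Leave the product on the TOS.
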

#macro doMulOps( sexpr ):opToken;
 
    
 
    // Process the leading term (not optional).  Note that
 
    // this expansion leaves an item sitting on the stack.
 
    
 
    doTerms( sexpr );
 
    
 
    // Process all the MULOPs at the current precedence level.
 
    // (these are optional, there may be zero or more of them.)
 
    // Begin by removing any leading white space.
 

 
    ?sexpr := @trim( sexpr, 0 );
 
    #while( @peekCset( sexpr, MulOps ))
 

 
        // Save the operator so we know what code we should
 
        // generate later.
 
        
 
        ?opToken := lexer( sexpr );
 
        
 
        // Get the term following the operator.
 
        
 
        doTerms( sexpr );           
 
        
 
        // Okay, the code for the two terms is sitting on
 
        // the top of the stack (left operand at [esp+4] and
 
        // the right operand at [esp]).  Emit the code to
 
        // perform the specified operation.
 
        
 
        #if( opToken.lexeme = "*" )
 
        
 
            // For multiplication, compute
 
            // [esp+4] = [esp] * [esp+4] and
 
            // then pop the junk off the top of stack.
 
            
 
            pop( eax );
 
            mul( (type dword [esp]) );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = "/" )
 
        
 
            // For division, compute
 
            // [esp+4] = [esp+4] / [esp] and
 
            // then pop the junk off the top of stack.
 
            
 
            mov( [esp+4], eax );
 
            xor( edx, edx );
 
            div( [esp], edx:eax );
 
            pop( edx );
 
            mov( eax, [esp] );
 
            
 
        #endif
 
        ?sexpr := @trim( sexpr, 0 );
 
        
 
    #endwhile   
 
    
 
#endmacro;
 

 

Note the simplicity of the code generation. This macro assumes that doTerms has done its job, leaving two values sitting on the top of the stack. Therefore, the only code this macro has to generate is the code to pop these two values off the stack and then multiply or divide them, depending on the actual operator that is present. The code generation uses the sequences appearing earlier in this section.

The doAddOps and doCmpOps macros work in a manner nearly identical to doMulOps. The only difference is the operators these macros handle (and, of course, the code that they generate). See Program 9.7, below, for details concerning these macros.

Once we've got the lexer and the four parser/code generation macros written, writing the u32expr macro is quite easy. All that u32expr needs to do is call the doCmpOps macro to compile the expression and then pop the result off the stack and store it into the destination register appearing as the first operand. This requires little more than a single POP instruction.
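
For example (a sketch, assuming i and j are uns32 variables), the invocation "u32expr( ebx, i+j );" expands into code along the following lines:


		push( i );						// From doTerms.
		push( j );						// From doTerms.
		pop( eax );						// From doAddOps: compute i+j...
		add( eax, [esp] );					// ...leaving the sum on the TOS.
		pop( ebx );						// From u32expr: store the result in EBX.
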

About the only thing interesting in the u32expr macro is the presence of the RETURNS statement. This HLA statement takes the following form:


 
returns( { statements }, string_expression )
 

This statement simply compiles the sequence of statements appearing between the braces in the first operand and then it uses the second string_expression operand as the "returns" value for this statement. As you may recall from the discussion of instruction composition (see "Instruction Composition in HLA" on page 558), HLA substitutes the "returns" value of a statement in place of that statement if it appears as an operand to another expression. The RETURNS statement appearing in the u32expr macro returns the register you specify as the first parameter as the "returns" value for the macro invocation. This lets you invoke the u32expr macro as an operand to many different instructions (that accept a 32-bit register as an operand). For example, the following u32expr macro invocations are all legal:

		mov( u32expr( eax, i*j+k/15 - 2), m );
 
		if( u32expr( edx, eax < (ebx-2)*ecx) ) then ... endif;
 
		funcCall( u32expr( eax, (x*x + y*y)/z*z ), 16, 2 );
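
For instance (a sketch, assuming m is an uns32 variable), the first invocation above expands into the code that computes "i*j+k/15 - 2" and leaves that result in EAX, followed by the instruction the MOV actually assembles:

		mov( eax, m );						// The "returns" value (eax) becomes MOV's first operand.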
 

 

Well, without further ado, here's the complete code for the u32expr compiler and some test code that checks out the operation of this macro:


 
// u32expr.hla
 
//
 
// This program demonstrates how to write an "expression compiler"
 
// using the HLA compile-time language.  This code defines a macro
 
// (u32expr) that accepts an arithmetic expression as a parameter.
 
// This macro compiles that expression into a sequence of HLA
 
// machine language instructions that will compute the result of
 
// that expression at run-time.
 
//
 
//  The u32expr macro does have some severe limitations.
 
// First of all, it only supports uns32 operands.
 
// Second, it only supports the following arithmetic
 
// operations:
 
//
 
//  +, -, *, /, <, <=, >, >=, =, <>.
 
//
 
//  The comparison operators produce zero (false) or
 
// one (true) depending upon the result of the (unsigned)
 
// comparison.
 
//
 
//  The syntax for a call to u32expr is
 
//
 
//  u32expr( register, expression )
 
//
 
//  The macro computes the result of the expression and
 
// leaves this result sitting in the register specified
 
// as the first operand.  This macro overwrites the
 
// values in the EAX and EDX registers (though these
 
// two registers are fine as the destination for the
 
// result).
 
//
 
//  This macro also returns the first (register) parameter
 
// as its "returns" value, so you may use u32expr anywhere
 
// a 32-bit register is legal, e.g.,
 
//
 
//      if( u32expr( eax, (i*3-2) < j )) then
 
//
 
//          << do something if (i*3-2) < j >>
 
//
 
//      endif;
 
//
 
// The statement above computes true or false in EAX and the
 
// "if" statement processes this result accordingly.
 

 

 

 
program TestExpr;
 
#include( "stdlib.hhf" )
 

 
// Some special character classifications the lexical analyzer uses.
 

 
const
 

 
    // tok1stIDChar is the set of legal characters that
 
    // can begin an identifier.  tokIDChars is the set
 
    // of characters that may follow the first character
 
    // of an identifier.
 
    
 
    tok1stIDChar := { 'a'..'z', 'A'..'Z', '_' };
 
    tokIDChars := { 'a'..'z', 'A'..'Z', '0'..'9', '_' };
 
    
 
    // digits, hexDigits, and binDigits are the sets
 
    // of characters that are legal in integer constants.
 
    // Note that these definitions don't allow underscores
 
    // in numbers, although it would be a simple fix to
 
    // allow this.
 
    
 
    digits := { '0'..'9' };
 
    hexDigits := { '0'..'9', 'a'..'f', 'A'..'F' };
 
    binDigits := { '0'..'1' };
 
    
 
    
 
    // CmpOps, PlusOps, and MulOps are the sets of
 
    // operator characters legal at three levels
 
    // of precedence that this parser supports.
 
    
 
    CmpOps  := { '>', '<', '=', '!' };
 
    PlusOps := { '+', '-' };
 
    MulOps := { '*', '/' };
 

 

 

 
    
 
type
 

 
    // tokEnum-
 
    //
 
    // Data values the lexical analyzer returns to quickly
 
    // determine the classification of a lexeme.  By
 
    // classifying the token with one of these values, the
 
    // parser can more quickly process the current token.
 
    // I.e., rather than having to compare a scanned item
 
    // against the two strings "+" and "-", the parser can
 
    // simply check to see if the current item is a "plusOp"
 
    // (which indicates that the lexeme is "+" or "-").
 
    // This speeds up the compilation of the expression since
 
    // only half the comparisons are needed and they are
 
    // simple integer comparisons rather than string comparisons.
 
    
 
    tokEnum:    enum
 
                { 
 
                    identifier, 
 
                    intconst, 
 
                    lparen,
 
                    rparen,
 
                    plusOp,
 
                    mulOp,
 
                    cmpOp
 
                };
 
        
 
    // tokType-
 
    //
 
    //  This is the "token" type returned by the lexical analyzer.
 
    // The "lexeme" field contains the string that matches the
 
    // current item scanned by the lexer.  The "tokClass" field
 
    // contains a generic classification for the symbol (see the
 
    // "tokEnum" type above).
 
            
 
    tokType:
 
        record
 
        
 
            lexeme:string;
 
            tokClass:tokEnum;
 
            
 
        endrecord;
 
        
 

 

 
// lexer-
 
//
 
// This is the lexical analyzer.  On each call it extracts a
 
// lexical item from the front of the string passed to it as a
 
// parameter (it also removes this item from the front of the
 
// string).  If it successfully matches a token, this macro
 
// returns a tokType constant as its return value.
 

 
macro lexer( input ):theLexeme,boolResult;
 

 
    ?theLexeme:string;      // Holds the string we scan.
 
    ?boolResult:boolean;    // Used only as a dummy value.
 
    
 
    // Check for an identifier.
 
    
 
    #if( @peekCset( input, tok1stIDChar ))
 
    
 
        // If it began with a legal ID character, extract all
 
        // ID characters that follow.  The extracted string
 
        // goes into "theLexeme" and this call also removes
 
        // those characters from the input string.
 
        
 
        ?boolResult := @oneOrMoreCset( input, tokIDChars, input, theLexeme );
 
        
 
        // Return a tokType constant with the identifier string and
 
        // the "identifier" token value:
 
        
 
        tokType:[ theLexeme, identifier ]
 
        
 
        
 
    
 
    // Check for a decimal numeric constant.
 
    
 
    #elseif( @peekCset( input, digits ))
 
    
 
        // If the current item began with a decimal digit, extract
 
        // all the following digits and put them into "theLexeme".
 
        // Also remove these characters from the input string.
 
        
 
        ?boolResult := @oneOrMoreCset( input, digits, input, theLexeme );
 
        
 
        // Return an integer constant as the current token.
 
        
 
        tokType:[ theLexeme, intconst ]
 
    
 
    
 
    
 
    // Check for a hexadecimal numeric constant.
 
    
 
    #elseif( @peekChar( input, '$' ))
 
    
 
        // If we had a "$" symbol, grab it and any following
 
        // hexadecimal digits.  Set boolResult true if there
 
        // is at least one hexadecimal digit.  As usual, extract
 
        // the hex value to "theLexeme" and remove the value
 
        // from the input string:
 
        
 
        ?boolResult := @oneChar( input, '$', input ) &
 
                        @oneOrMoreCset( input, hexDigits, input, theLexeme );
 
        
 
        // Returns the hex constant string as an intconst object:
 
        
 
        tokType:[ '$' + theLexeme, intconst ]
 
    
 
    
 
    
 
    // Check for a binary numeric constant.
 
    
 
    #elseif( @peekChar( input, '%' ))
 
    
 
        // See the comments for hexadecimal constants.  This boolean
 
        // constant scanner works the same way.
 
        
 
        ?boolResult := @oneChar( input, '%', input ) &
 
                        @oneOrMoreCset( input, binDigits, input, theLexeme );
 
        tokType:[ '%' + theLexeme, intconst ]
 
    
 
    
 
    // Handle the "+" and "-" operators here.
 
    
 
    #elseif( @peekCset( input, PlusOps ))
 
    
 
        // If it was a "+" or "-" sign, extract it from the input
 
        // and return it as a "plusOp" token.
 
        
 
        ?boolResult := @oneCset( input, PlusOps, input, theLexeme );
 
        tokType:[ theLexeme, plusOp ]
 
    
 
    
 
    
 
    // Handle the "*" and "/" operators here.
 
    
 
    #elseif( @peekCset( input, MulOps ))
 
    
 
        // If it was a "*" or "/" sign, extract it from the input
 
        // and return it as a "mulOp" token.
 
        
 
        ?boolResult := @oneCset( input, MulOps, input, theLexeme );
 
        tokType:[ theLexeme, mulOp ]
 
        
 
        
 
        
 
    // Handle the "=" ("=="), "<>" ("!="), "<", "<=", ">", and ">="
 
    // operators here.
 
    
 
    #elseif( @peekCset( input, CmpOps ))
 
    
 
        // Note that we must check for two-character operators
 
        // first so we don't confuse them with the single
 
        // character operators:
 
        
 
        #if
 
        ( 
 
                @matchStr( input, ">=", input, theLexeme ) 
 
            |   @matchStr( input, "<=", input, theLexeme )
 
            |   @matchStr( input, "<>", input, theLexeme )
 
        )
 
        
 
            tokType:[ theLexeme, cmpOp ]
 
            
 
        #elseif( @matchStr( input, "!=", input, theLexeme ))
 
    
 
            tokType:[ "<>", cmpOp ]
 

 
        #elseif( @matchStr( input, "==", input, theLexeme ))
 
    
 
            tokType:[ "=", cmpOp ]
 

 
        #elseif( @oneCset( input, {'>', '<', '='}, input, theLexeme ))
 
        
 
            tokType:[ theLexeme, cmpOp ]
 
        
 
        #else
 
        
 
            #error( "Illegal comparison operator: " + input )
 
            
 
        #endif
 

 

 
    // Handle the parentheses down here.
 

 
    #elseif( @oneChar( input, '(', input, theLexeme ))
 
    
 
        tokType:[ "(", lparen ]
 
        
 
    #elseif( @oneChar( input, ')', input, theLexeme ))
 
    
 
        tokType:[ ")", rparen ]
 
        
 
            
 
    // Anything else is an illegal character.
 
    
 
    #else
 
    
 
        #error
 
        ( 
 
            "Illegal character in expression: '" + 
 
            @substr( input, 0, 1 ) + 
 
            "' ($" + 
 
            string( dword( @substr( input, 0, 1 ))) + 
 
            ")"
 
        )
 
        ?input := @substr( input, 1, @length(input) - 1 );
 
    
 
    #endif
 
     
 
endmacro;
 
        
 

 

 

 
// Handle identifiers, constants, and sub-expressions within
 
// parentheses within this macro.
 
//
 
//  terms-> identifier | intconst | '(' CmpOps ')'
 
//
 
// This compile time function does the following:
 
//
 
//  (1) If it encounters an identifier, it emits the
 
//      following instruction to the code stream:
 
//
 
//          push( identifier );
 
//
 
//  (2) If it encounters an (unsigned) integer constant, it emits
 
//      the following instruction to the code stream:
 
//
 
//          pushd( constant_value );
 
//
 
//  (3) If it encounters an expression surrounded by parentheses,
 
//      then it emits whatever instruction sequence is necessary
 
//      to leave the value of that (unsigned integer) expression
 
//      sitting on the top of the stack.
 
//
 
//  (4) If the current lexeme is none of the above, then this
 
//      macro prints an appropriate error message.
 
//
 
//  The end result of the execution of this macro is the emission
 
//  of some code that leaves a single 32-bit unsigned value sitting
 
//  on the top of the 80x86 stack (assuming no error).
 

 
macro doTerms( expr ):termToken;
 

 
    ?expr := @trim( expr, 0 );
 
    ?termToken:tokType := lexer( expr );
 
    #if( termToken.tokClass = identifier )
 
    
 
        // If we've got an identifier, emit the code to
 
        // push that identifier onto the stack.
 

 
        push( @text( termToken.lexeme ));
 
                                      
 
    #elseif( termToken.tokClass = intconst )
 
    
 
        // If we've got a constant, emit the code to push
 
        // that constant onto the stack.
 
        
 
        pushd( @text( termToken.lexeme ) );
 
        
 
    #elseif( termToken.tokClass = lparen )
 
    
 
        // If we've got a parenthetical expression, emit
 
        // the code to leave the parenthesized expression
 
        // sitting on the stack.
 
        
 
        doCmpOps( expr );
 
        ?expr := @trim( expr, 0 );
 
        ?termToken:tokType := lexer( expr );
 
        #if( termToken.tokClass <> rparen )
 
        
 
            #error( "Expected closing parenthesis: " + termToken.lexeme )
 
            
 
        #endif
 
        
 
        
 
    #else
 
    
 
        #error( "Unexpected term: '" + termToken.lexeme + "'" )
 
        
 
    #endif
 
    
 
endmacro;
 

 

 

 
// Handle the multiplication, division, and modulo operations here.
 
//
 
// MulOps-> terms ( mulOp terms )*
 
//
 
// The above grammar production tells us that a "MulOps" consists
 
// of a "terms" expansion followed by zero or more instances of a 
 
// "mulop" followed by a "terms" expansion (like wildcard filename
 
// expansions, the "*" indicates zero or more copies of the things
 
// inside the parentheses).
 
//
 
// This code assumes that "terms" leaves whatever operands/expressions
 
// it processes sitting on the 80x86 stack at run time.  If there is
 
// a single term (no optional mulOp/term following), then this code
 
// does nothing (it leaves the result on the stack that was pushed
 
// by the "terms" expansion).  If one or more mulOp/terms pairs are
 
// present, then for each pair this code assumes that the two "terms"
 
// expansions left some value on the stack.  This code will pop
 
// those two values off the stack and multiply or divide them and
 
// push the result back onto the stack (sort of like the way the
 
// FPU multiplies or divides values on the FPU stack).
 
//
 
// If there are three or more operands in a row, separated by
 
// mulops ("*" or "/") then this macro will process them in
 
// a left-to-right fashion, popping each pair of values off the
 
// stack, operating on them, pushing the result, and then processing
 
// the next pair.  E.g.,
 
//
 
//      i * j * k
 
//
 
//  yields:
 
//
 
//      push( i );  // From the "terms" macro.
 
//      push( j );  // From the "terms" macro.
 
//
 
//      pop( eax ); // Compute the product of i*j
 
//      mul( (type dword [esp]));
 
//      mov( eax, [esp]);
 
//
 
//      push( k );  // From the "terms" macro.
 
//
 
//      pop( eax );                 // Pop K
 
//      mul( (type dword [esp]));   // Compute K* (i*j) [i*j is value on TOS].
 
//      mov( eax, [esp]);           // Save product on TOS.
 

 

 
macro doMulOps( sexpr ):opToken;
 
    
 
    // Process the leading term (not optional).  Note that
 
    // this expansion leaves an item sitting on the stack.
 
    
 
    doTerms( sexpr );
 
    
 
    // Process all the MULOPs at the current precedence level.
 
    // (these are optional, there may be zero or more of them.)
 
    
 
    ?sexpr := @trim( sexpr, 0 );
 
    #while( @peekCset( sexpr, MulOps ))
 

 
        // Save the operator so we know what code we should
 
        // generate later.
 
        
 
        ?opToken := lexer( sexpr );
 
        
 
        // Get the term following the operator.
 
        
 
        doTerms( sexpr );           
 
        
 
        // Okay, the code for the two terms is sitting on
 
        // the top of the stack (left operand at [esp+4] and
 
        // the right operand at [esp]).  Emit the code to
 
        // perform the specified operation.
 
        
 
        #if( opToken.lexeme = "*" )
 
        
 
            // For multiplication, compute
 
            // [esp+4] = [esp] * [esp+4] and
 
            // then pop the junk off the top of stack.
 
            
 
            pop( eax );
 
            mul( (type dword [esp]) );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = "/" )
 
        
 
            // For division, compute
 
            // [esp+4] = [esp+4] / [esp] and
 
            // then pop the junk off the top of stack.
 
            
 
            mov( [esp+4], eax );
 
            xor( edx, edx );
 
            div( [esp], edx:eax );
 
            pop( edx );
 
            mov( eax, [esp] );
 
            
 
        #endif
 
        ?sexpr := @trim( sexpr, 0 );
 
        
 
    #endwhile   
 
    
 
endmacro;
 

 

 

 

 
// Handle the addition, and subtraction operations here.
 
//
 
// AddOps-> MulOps ( addOp MulOps )*
 
//
 
// The above grammar production tells us that an "AddOps" consists
 
// of a "MulOps" expansion followed by zero or more instances of an 
 
// "addOp" followed by a "MulOps" expansion.
 
//
 
// This code assumes that "MulOps" leaves whatever operands/expressions
 
// it processes sitting on the 80x86 stack at run time.  If there is
 
// a single MulOps item then this code does nothing.  If one or more 
 
// addOp/MulOps pairs are present, then for each pair this code 
 
// assumes that the two "MulOps" expansions left some value on the 
 
// stack.  This code will pop those two values off the stack and 
 
// add or subtract them and push the result back onto the stack. 
 

 
macro doAddOps( sexpr ):opToken;
 

 
    // Process the first operand (or subexpression):
 
    
 
    doMulOps( sexpr );
 
    
 
    // Process all the ADDOPs at the current precedence level.
 
    
 
    ?sexpr := @trim( sexpr, 0 );
 
    #while( @peekCset( sexpr, PlusOps ))
 
    
 
        // Save the operator so we know what code we should
 
        // generate later.
 
        
 
        ?opToken := lexer( sexpr );
 

 
        // Get the MulOp following the operator.
 
        
 
        doMulOps( sexpr );
 
        
 
        // Okay, emit the code associated with the operator.
 
        
 
        #if( opToken.lexeme = "+" )
 
        
 
            pop( eax );
 
            add( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = "-" )
 
        
 
            pop( eax );
 
            sub( eax, [esp] );
 
            
 
        #endif
 
        
 
    #endwhile
 
            
 
endmacro;
 

 

 

 
// Handle the comparison operations here.
 
//
 
// CmpOps-> AddOps ( cmpOp AddOps )*
 
//
 
// The above grammar production tells us that a "CmpOps" consists
 
// of an "AddOps" expansion followed by zero or more instances of an 
 
// "cmpOp" followed by an "AddOps" expansion.
 
//
 
// This code assumes that "AddOps" leaves whatever operands/expressions
 
// it processes sitting on the 80x86 stack at run time.  If there is
 
// a single AddOps item then this code does nothing.  If one or more 
 
// cmpOp/AddOps pairs are present, then for each pair this code 
 
// assumes that the two "AddOps" expansions left some value on the 
 
// stack.  This code will pop those two values off the stack, compare 
 
// them, and push true (1) or false (0) back onto the stack. 
 

 

 
macro doCmpOps( sexpr ):opToken;
 

 
    // Process the first operand:
 
    
 
    doAddOps( sexpr );
 
    
 
    // Process all the CMPOPs at the current precedence level.
 
    
 
    ?sexpr := @trim( sexpr, 0 );
 
    #while( @peekCset( sexpr, CmpOps ))
 
    
 
        // Save the operator for the code generation task later.
 
        
 
        ?opToken := lexer( sexpr );
 
        
 
        // Process the item after the comparison operator.
 
        
 
        doAddOps( sexpr );
 
        
 
        
 
        // Generate the code to compare [esp+4] against [esp]
 
        // and leave true/false sitting on the stack in place
 
        // of these two operands.
 
        
 
        #if( opToken.lexeme = "<" )
 
        
 
            pop( eax );
 
            cmp( [esp], eax );
 
            setb( al );
 
            movzx( al, eax );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = "<=" )
 
        
 
            pop( eax );
 
            cmp( [esp], eax );
 
            setbe( al );
 
            movzx( al, eax );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = ">" )
 
        
 
            pop( eax );
 
            cmp( [esp], eax );
 
            seta( al );
 
            movzx( al, eax );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = ">=" )
 
        
 
            pop( eax );
 
            cmp( [esp], eax );
 
            setae( al );
 
            movzx( al, eax );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = "=" )
 
        
 
            pop( eax );
 
            cmp( [esp], eax );
 
            sete( al );
 
            movzx( al, eax );
 
            mov( eax, [esp] );
 
            
 
        #elseif( opToken.lexeme = "<>" )
 
        
 
            pop( eax );
 
            cmp( [esp], eax );
 
            setne( al );
 
            movzx( al, eax );
 
            mov( eax, [esp] );
 
            
 
            
 
        #endif
 
        
 
    #endwhile
 
            
 
endmacro;
 

 

 

 
// General macro that does the expression compilation.
 
// The first parameter must be a 32-bit register where
 
// this macro will leave the result.  The second parameter
 
// is the expression to compile.  The expression compiler
 
// will destroy the value in EAX and may destroy the value
 
// in EDX (though EDX and EAX make fine destination registers
 
// for this macro).
 
//
 
// This macro generates poor machine code.  It is more a
 
// "proof of concept" rather than something you should use
 
// all the time.  Nevertheless, if you don't have serious
 
// size or time constraints on your code, this macro can be
 
// quite handy.  Writing an optimizer is left as an exercise
 
// to the interested reader.
 

 
macro u32expr( reg, expr):sexpr;
 

 
    // The "returns" statement processes the first operand
 
    // as a normal sequence of statements and then returns
 
    // the second operand as the "returns" value for this
 
    // macro.
 
    
 
    returns
 
    (
 
        {
 
        
 
            ?sexpr:string := @string:expr;
 
            #if( !@IsReg32( reg ) )
 
            
 
                #error( "Expected a 32-bit register" )
 
                
 
            #else
 
            
 
                // Process the expression and leave the
 
                // result sitting in the specified register.
 
                
 
                doCmpOps( sexpr );
 
                pop( reg );
 
                    
 
            #endif
 
        },
 
        
 
        // Return the specified register as the "returns"
 
        // value for this compilation:
 
        
 
        @string:reg
 
    )
 
    
 
        
 
    
 
endmacro;
 

 

 

 
// The following main program provides some examples of the
 
// use of the above macro:
 

 
static
 
    x:uns32;
 
    v:uns32 := 5;
 

 

 
begin TestExpr;
 

 
    mov( 10, x );
 
    mov( 12, ecx );
 

 
    // Compute:
 
    //
 
    //  edi := (x*3/v + %1010 == 16) + ecx;
 
    //
 
    //  This is equivalent to:
 
    //
 
    //  edi := (10*3/5 + %1010 == 16) + 12
 
    //      := ( 30/5 + %1010 == 16) + 12
 
    //      := ( 6 + 10 == 16) + 12
 
    //      := ( 16 == 16) + 12
 
    //      := ( 1 ) + 12
 
    //      := 13
 
    //
 
    //  This macro invocation emits the following code:
 
    //
 
    //  push(x);
 
    //  pushd(3);
 
    //  pop(eax);
 
    //  mul( (type dword [esp]) );
 
    //  mov( eax, [esp] );
 
    //  push( v );
 
    //  mov( [esp+4], eax );
 
    //  xor( edx, edx );
 
    //  div( [esp], edx:eax );
 
    //  pop( edx );
 
    //  mov( eax, [esp] );
 
    //  pushd( 10 );
 
    //  pop( eax );
 
    //  add( eax, [esp] );
 
    //  pushd( 16 );
 
    //  pop( eax );
 
    //  cmp( [esp], eax );
 
    //  sete( al );
 
    //  movzx( al, eax );
 
    //  mov( eax, [esp+0] );
 
    //  push( ecx );
 
    //  pop( eax );
 
    //  add( eax, [esp] );
 
    //  pop( edi );
 

 

 
    u32expr( edi, (x*3/v+%1010 == 16) + ecx );
 
    stdout.put( "Sum = ", (type uns32 edi), nl );
 
    
 

 

 
    // Now compute:
 
    //
 
    //  eax := x + ecx/2
 
    //      := 10 + 12/2
 
    //      := 10 + 6
 
    //      := 16
 
    //
 
    // This macro emits the following code:
 
    //
 
    //  push( x );
 
    //  push( ecx );
 
    //  pushd( 2 );
 
    //  mov( [esp+4], eax );
 
    //  xor( edx, edx );  
 
    //  div( [esp], edx:eax );
 
    //  pop( edx );
 
    //  mov( eax, [esp] );
 
    //  pop( eax );
 
    //  add( eax, [esp] );
 
    //  pop( eax );
 

 

 
    u32expr( eax, x+ecx/2 );
 
    stdout.put( "x=", x, " ecx=", (type uns32 ecx), " v=", v, nl );
 
    stdout.put( "x+ecx/2 = ", (type uns32 eax ), nl );
 
    
 

 
    // Now determine if (x+ecx/2) < v 
 
    //  (it is not since (x+ecx/2)=16 and v = 5.)
 
    //
 
    //  This macro invocation emits the following code:
 
    //
 
    //  push( x );
 
    //  push( ecx );
 
    //  pushd( 2 );
 
    //  mov( [esp+4], eax );
 
    //  xor( edx, edx );  
 
    //  div( [esp], edx:eax );
 
    //  pop( edx );
 
    //  mov( eax, [esp] );
 
    //  pop( eax );
 
    //  add( eax, [esp]);
 
    //  push( v );
 
    //  pop( eax );
 
    //  cmp( [esp], eax );
 
    //  setb( al );
 
    //  movzx( al, eax );
 
    //  mov( eax, [esp+0] );
 
    //  pop( eax );
 

 

 
    if( u32expr( eax, x+ecx/2 < v ) ) then
 
    
 
        stdout.put( "x+ecx/2 < v", nl );
 
        
 
    else
 
    
 
        stdout.put( "x+ecx/2 >= v", nl );
 
        
 
    endif;
 
    
 
end TestExpr;
 

 
Program 9.7	 Uns32 Expression Compiler
 

9.4 Putting It All Together

The ability to extend the HLA language is one of the most powerful features of the HLA language. In this chapter you got to explore the use of several tools that allow you to extend the base language. Although a complete treatise on language design and implementation is beyond the scope of this chapter, further study in the area of compiler construction will help you learn new techniques for extending the HLA language. Later volumes in this text, including the volume on advanced string handling, will cover additional topics of interest to those who want to design and implement their own language constructs.

