CS232, Spring 2008

Chapter 7

Assembly Process

 

Consider what an assembler must do as it converts assembly code to machine language.  The assembly code, remember, is just a text file (think of it as one long string of characters, including carriage return characters that mark the ends of lines).  The assembler needs to figure out what all those characters mean and translate them into machine instructions.  To make that possible, an assembly language program must follow a specific syntax so that the assembler knows what to look for and where to look for it.

 

General idea:  As the assembler steps through the text file, it handles one line at a time (a line consisting of the substring of those characters separated by carriage returns):

(a)   the assembler first forms a list of the words or tokens or symbols or names in the line.

(b)   if the line is blank or has just white space and/or comments, it throws it away.

(c)    if the line is a machine instruction, it replaces the instruction name with the corresponding opcode and each operand with the corresponding value

(d)  if the line is a .data pseudoinstruction, the assembler replaces it with the corresponding number of bytes of data

(e)   There are other pseudoinstructions as well, such as macro, include, and conditional compilation instructions, that we'll look at later.
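
Steps (a) through (d) can be sketched in a few lines of Python. This is only an illustration, not the code of any particular assembler: the function names tokenize and classify are made up for this sketch, and it assumes that ";" starts a comment and that commas and white space separate tokens.

```python
def tokenize(line):
    """Step (a): split one source line into tokens, dropping any ';' comment."""
    code = line.split(";", 1)[0]            # discard the comment, if any
    return code.replace(",", " ").split()   # commas and white space separate tokens


def classify(line):
    """Return a rough category for the line, mirroring steps (b) through (d)."""
    tokens = tokenize(line)
    if not tokens:
        return "blank"                      # step (b): nothing to assemble
    if tokens[0].endswith(":"):
        tokens = tokens[1:]                 # skip a leading label
    if not tokens:
        return "label only"
    return "data" if tokens[0] == ".data" else "instruction"


print(tokenize("L:  iload 2   ; load local variable 2"))
print(classify("    .data 3 0"))
```

A real assembler would go on to look up each instruction name in its opcode table; here the classification stops at telling the kinds of lines apart.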

 

For example, consider the JVM assembly code (left) and corresponding machine code (right)

    .data 3 0                           0      0      0

    iconst_0                            3

L:  iload 2                             21     2

    ifeq X                              153   ?

    iinc 2 1                            132   2      1

    goto L                              167   ?

X:  stop                                252

 

Problem:  If labels are used as operands, how does the assembler know what their values are?  This is especially a problem if the label refers to a line further down in the program that has not yet been assembled (such a reference is called a forward reference).

 

Solution:  Use a 2-pass assembler.

 

 

Pass 1:

Uses

1.       opcode table that gives the opcode for each operation and the size of the instruction and the number of bits per operand.  This table is built ahead of time.

2.      ILC (instruction location counter) to keep track of the location in memory where the current instruction will be placed.  In pass 1 it is advanced past each instruction or block of data as that line is processed.

 

Builds

1.      symbol table that gives the memory address corresponding to each label

2.      macro table of the macro body associated with each macro name

3.      EQU table of all EQUs defined in the program.

 

Note that pass 1 just builds three tables and does not generate machine language code.

 

Pass 1 initializes ILC to 0 and then steps through assembly code. 

(a)   When it encounters a regular instruction, the assembler increments the ILC by the size of the instruction.

(b)   When it encounters a label, it adds the label and current ILC value to the symbol table.

(c)    When it encounters a macro definition, it adds the macro name and body to the macro table. (See below)

(d)  When it encounters a macro call, it replaces the call (by string substitution) with the macro body.

(e)   When it encounters a ".include <file>" pseudoinstruction, it textually replaces the pseudoinstruction with the body of the file.  (See below)

(f)     When it encounters an "EQU" pseudoinstruction, it adds the name and value to the EQU table.

(g)   When it encounters a ".data" pseudoinstruction, it increments the ILC by the number of bytes of data being initialized by the pseudoinstruction.
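
As a rough sketch of steps (a), (b), and (g), here is pass 1 applied to the JVM-style example from earlier in these notes. Several things are assumptions made only for this sketch: every opcode and every operand occupies one byte, the first operand of .data is the number of bytes to reserve, and the table name OPCODES and function name pass1 are invented. Macros, includes, and EQUs are left out.

```python
# Hypothetical opcode table: name -> (opcode, size of the instruction in bytes).
# Sizes assume one byte for the opcode and one byte per operand.
OPCODES = {
    "iconst_0": (3, 1),
    "iload": (21, 2),
    "ifeq": (153, 2),
    "iinc": (132, 3),
    "goto": (167, 2),
    "stop": (252, 1),
}


def pass1(lines):
    """Build the symbol table; no machine code is generated here."""
    symbols = {}
    ilc = 0                                  # instruction location counter
    for line in lines:
        tokens = line.split(";", 1)[0].split()
        if tokens and tokens[0].endswith(":"):
            symbols[tokens[0][:-1]] = ilc    # step (b): label -> current ILC
            tokens = tokens[1:]
        if not tokens:
            continue                         # blank, comment-only, or label-only
        if tokens[0] == ".data":
            ilc += int(tokens[1])            # step (g): skip over the data bytes
        else:
            ilc += OPCODES[tokens[0]][1]     # step (a): skip over the instruction
    return symbols


PROGRAM = [
    "    .data 3 0",
    "    iconst_0",
    "L:  iload 2",
    "    ifeq X",
    "    iinc 2 1",
    "    goto L",
    "X:  stop",
]

print(pass1(PROGRAM))   # {'L': 4, 'X': 13}
```

Notice that the forward reference to X causes no trouble here: pass 1 never looks at operands, it only measures sizes and records where each label falls.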

 

Pass 2:

After all this processing has been done, it is fairly straightforward in Pass 2 to replace all instructions with their binary values.

1.       For each instruction, it replaces the instruction name with the binary opcode (as obtained from the opcode table)

2.       For each operand that is not already a literal, it replaces the name with the value (as obtained from the symbol table or EQU table).

3.       For each .data pseudoinstruction, it inserts the (binary) data values into the compiled code.

4.       It does not use the macro table since that was needed only in Pass 1.
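
Here is a minimal sketch of pass 2 for the JVM-style example from earlier in these notes. It is written under assumptions that the notes do not state: every opcode and operand is one byte, a label operand is replaced by the absolute address of the label, and the symbol table is the one pass 1 would produce under those same assumptions. (The real JVM encodes branch targets as two-byte relative offsets, so the numbers below are illustrative only.) The names OPCODES, SYMBOLS, EQUS, value_of, and pass2 are invented for the sketch.

```python
OPCODES = {"iconst_0": 3, "iload": 21, "ifeq": 153,
           "iinc": 132, "goto": 167, "stop": 252}    # name -> opcode
SYMBOLS = {"L": 4, "X": 13}   # assumed result of pass 1 for PROGRAM below
EQUS = {}                     # this program defines no EQUs


def value_of(operand):
    """Resolve an operand: literal first, then symbol table, then EQU table."""
    try:
        return int(operand)
    except ValueError:
        pass
    if operand in SYMBOLS:
        return SYMBOLS[operand]
    return EQUS[operand]      # a KeyError here means an undefined name


def pass2(lines):
    """Translate each line to a list of byte values."""
    code = []
    for line in lines:
        tokens = line.split(";", 1)[0].split()
        if tokens and tokens[0].endswith(":"):
            tokens = tokens[1:]                # labels were recorded in pass 1
        if not tokens:
            continue
        if tokens[0] == ".data":
            count, value = int(tokens[1]), value_of(tokens[2])
            code.extend([value] * count)       # step 3: insert the data bytes
        else:
            code.append(OPCODES[tokens[0]])    # step 1: name -> opcode
            code.extend(value_of(t) for t in tokens[1:])   # step 2: operands
    return code


PROGRAM = ["    .data 3 0", "    iconst_0", "L:  iload 2", "    ifeq X",
           "    iinc 2 1", "    goto L", "X:  stop"]

print(pass2(PROGRAM))
# [0, 0, 0, 3, 21, 2, 153, 13, 132, 2, 1, 167, 4, 252]
```

Because every label already has an address by the time pass 2 starts, the forward reference in "ifeq X" is resolved by a simple table lookup.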

 

 

.include Pseudoinstructions

In assembly language programs, as in high-level language programs, it's nice to be able to divide a large program into several files.  For example, you might want to put all the EQUs and macros in a separate file if they are commonly used.

 

A common way to divide code into separate files is to use the .include pseudoinstruction in the main file.  It tells the assembler to include the text from another file in this file before assembling.  For example,

        .include "def.h"

tells the assembler to get the text from the file "def.h" and include it here in this file before assembling.

 

When pass 1 sees an ".include filename" pseudoinstruction, it pastes the text of the file into the new source code file where the include pseudoinstruction was.  It then continues the preprocessing at the start of the newly-included statements.  It needs to preprocess the text of the file since it in turn might have some include pseudoinstructions or other pseudoinstructions the preprocessor needs to deal with.
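
The recursive nature of include processing can be sketched as follows. To keep the example self-contained, the "files" are held in a Python dictionary rather than read from disk; that dictionary, the function name expand_includes, and the check for a file that includes itself are all additions made for this sketch, not something the notes prescribe. File names containing spaces are not handled.

```python
import re

# Matches a line such as:  .include "def.h"   (the quotes are optional here)
INCLUDE = re.compile(r'^\s*\.include\s+"?([^"\s]+)"?\s*$')


def expand_includes(lines, files, active=()):
    """Return lines with every .include replaced by the named file's text.

    files maps a file name to its list of lines (a stand-in for disk access).
    active holds the names currently being expanded, so that a file that
    includes itself is reported instead of recursing forever.
    """
    out = []
    for line in lines:
        match = INCLUDE.match(line)
        if not match:
            out.append(line)
            continue
        name = match.group(1)
        if name in active:
            raise ValueError(f"circular include of {name}")
        # Preprocess the included text too: it may contain more includes.
        out.extend(expand_includes(files[name], files, active + (name,)))
    return out


FILES = {
    "def.h": ["R0 EQU 0", '.include "more.h"'],
    "more.h": ["R1 EQU 1"],
}
print(expand_includes(['.include "def.h"', "add R0, R1"], FILES))
# ['R0 EQU 0', 'R1 EQU 1', 'add R0, R1']
```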

 

Macros

When a set of statements is repeated over and over again in the code, it can be handled in three ways:

1.       It can be written over and over again by the programmer, resulting in longer and less readable and more error-prone code.

2.       It can be incorporated into a subroutine, with the resulting overhead in setting up and destroying a stack frame.

3.       It can be dealt with by a macro.

 

Here's how a macro is handled by the preprocessor.

1.       If the preprocessor sees a macro definition, it adds the definition to a Macro table of name & body pairs.

2.       If it sees a macro call, it looks it up in the macro table and then pastes the body of the macro into the new source code file where the macro call was, after replacing any parameters with their values in the macro body.  Note that this is purely a textual substitution.  No evaluation or conversion to machine language goes on.

3.       If there are any local labels inside the macro body, the preprocessor replaces them with unique labels and replaces any occurrence of a local label in the macro body with the corresponding unique label.

 

Note that a disassembler would never know there was a macro in the original source code.
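
The three steps above can be sketched as a small text-substitution routine. The syntax it accepts (macro ... endmacro, parameters written $1 and $2, a "local" line naming local labels, unique labels of the form L$1) is an assumption chosen for this sketch, as are the function names. It does not handle macros that call other macros, a label in front of a macro call, or EQU substitution.

```python
import re

NAME = re.compile(r"[A-Za-z_$][\w$]*")   # whole names such as mov, $1, R3, L


def substitute(text, mapping):
    """Replace each whole name found in mapping; leave everything else alone."""
    return NAME.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


def expand_macros(lines):
    """Record macro definitions and expand macro calls by text substitution."""
    macros = {}     # name -> (parameter names, local labels, body lines)
    out = []
    counter = 0     # makes local labels unique from one expansion to the next
    it = iter(lines)
    for line in it:
        tokens = line.split(";", 1)[0].replace(",", " ").split()
        if tokens and tokens[0] == "macro":          # step 1: a definition
            name, params = tokens[1], tokens[2:]
            local_labels, body = [], []
            for body_line in it:                     # read up to endmacro
                words = body_line.split()
                if words == ["endmacro"]:
                    break
                if words and words[0] == "local":
                    local_labels += words[1:]
                else:
                    body.append(body_line.strip())
            macros[name] = (params, local_labels, body)
        elif tokens and tokens[0] in macros:         # step 2: a call
            params, local_labels, body = macros[tokens[0]]
            mapping = dict(zip(params, tokens[1:]))
            if local_labels:                         # step 3: unique labels
                counter += 1
                mapping.update({lab: f"{lab}${counter}" for lab in local_labels})
            out.extend(substitute(b, mapping) for b in body)
        else:
            out.append(line)
    return out


SOURCE = [
    "macro swap $1, $2",
    "    mov $1, R3",
    "    mov $2, $1",
    "    mov R3, $2",
    "endmacro",
    "macro test",
    "    local L",
    " L: add R0, R1",
    "    jmpn R0, L",
    "endmacro",
    "swap R0, R1",
    "test",
    "test",
]
for expanded in expand_macros(SOURCE):
    print(expanded)
```

Each call to test gets its own copy of the label (L$1, then L$2), which is what keeps two expansions of the same macro from defining the same label twice.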

 

 

Example:  Consider the following source code:

 

;; start of source code file

;; first some pseudoinstructions

.include "def.h"

n EQU 2

macro swap $1, $2  ;;uses R3 as a temp register

     mov $1, R3

     mov $2, $1

     mov R3, $2

endmacro

;; now the actual instructions

add R0, R1

swap R0, R1

swap R1, R2

test

load R0, x

test

M: stop

x: .data 1 3

;; end of source code file

 

Assume that def.h was a file with the following in it:

R0 EQU 0

R1 EQU 1

R2 EQU 2

R3 EQU 3

macro test

        local L

     L: add R0, R1

        jmpn R0, L

        jmp M  ;;M is not a local label in the macro

endmacro

 

From these two files, the preprocessor would create the following new source code (without the comments):

 

;; start of new source code

add 0, 1

mov 0, 3

mov 1, 0

mov 3, 1

mov 1, 3

mov 2, 1

mov 3, 2

L$1: add 0, 1  ;; L$1 is a new unique label

jmpn 0, L$1

jmp M  ;;M is not a local label

load 0, x

L$2: add 0, 1  ;; L$2 is a new unique label

jmpn 0, L$2

jmp M  ;;M is not a local label

M:   stop

x:   .data 1 3

;; end of new source code

 

 

 

Error checking

How does the assembler know whether the program is legal?  What are all the things that can go wrong?

 

Some syntax checking is done in Pass 1 and some in Pass 2.

Pass 1 checks for

  1. legal instruction and pseudoinstruction names
  2. proper label forms
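
The two pass 1 checks can be sketched as below. The set KNOWN_NAMES is a made-up stand-in for the real opcode table, the label pattern follows the rule that a symbol starts with a letter and continues with letters, digits, +, -, or _, and the function name pass1_errors is invented for this sketch.

```python
import re

SYMBOL = re.compile(r"[A-Za-z][A-Za-z0-9_+\-]*")
# Hypothetical set of legal instruction and pseudoinstruction names.
KNOWN_NAMES = {"add", "mov", "load", "jmp", "jmpn", "stop", ".data"}


def pass1_errors(lines):
    """Return (line number, message) pairs for the two pass 1 checks."""
    errors = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split(";", 1)[0].replace(",", " ").split()
        if tokens and tokens[0].endswith(":"):
            if not SYMBOL.fullmatch(tokens[0][:-1]):     # check 2: label form
                errors.append((number, f"bad label {tokens[0]!r}"))
            tokens = tokens[1:]
        if tokens and tokens[0] not in KNOWN_NAMES:      # check 1: legal name
            errors.append((number, f"unknown instruction {tokens[0]!r}"))
    return errors


print(pass1_errors(["M: stop", "9x: add R0, R1", "    frob R0", "; comment"]))
# [(2, "bad label '9x:'"), (3, "unknown instruction 'frob'")]
```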

 

Pass 2 checks for legal structure of the whole program.  It uses a context-free grammar (CFG).

 

 

 

 

Example: CPU Sim assembly language grammar

 

The following context-free grammar (CFG) gives the syntax of legal assembly language programs in CPU Sim. The start symbol for the CFG is Program. EOL is an end-of-line token and EOF is an end-of-file token. Square brackets indicate an optional item. Items in parentheses followed by "*" indicate 0 or more copies of those items. Items in parentheses followed by "+" indicate 1 or more copies of those items. To separate tokens that the assembler would otherwise treat as one token, use one or more spaces or tab characters. Items in quotes are terminal symbols. Terminal symbols are case sensitive.

 

Program -> [CommentsAndEOLs] EquMacroIncludePart InstructionPart EOF

CommentsAndEOLs -> ([Comment] EOL)+

Comment -> ";" <any-sequence-of-characters-not-including-EOF-or-EOL>

EquMacroIncludePart -> ((EquDeclaration | MacroDeclaration | Include)
                        CommentsAndEOLs)*

Include -> ".include" <quoted-sequence-of-characters-not-including-EOL-or-EOF>

EquDeclaration -> Symbol "EQU" Operand

MacroDeclaration -> "MACRO" Symbol [Symbol ([","] Symbol)*]
                    CommentsAndEOLs InstructionPart "ENDM"

InstructionPart -> ((RegularInstructionCall | DataPseudoinstructionCall)
                    CommentsAndEOLs)*

RegularInstructionCall -> (Label CommentsAndEOLs)* [Label] Symbol
                          [Operand ([","] Operand)*]

DataPseudoinstructionCall -> (Label CommentsAndEOLs)* [Label] ".data"
                             Operand [","] ( Operand | "[" [Operand ([","] Operand)*] "]")

Label -> Symbol ":"

Operand -> Symbol | Literal

Literal -> [ "-" | "+" ] ( 0-9 )+

Symbol -> (<letter> ) (<letter or digit or - or + or _> )*

 

Here is a summary of the parts of an assembly language program. The basic building blocks consist of the following items.

 

A literal is a decimal integer, with an optional plus or minus sign in front. No commas or decimal points are allowed in literals.

 

A symbol consists of a sequence of characters, the first character of which must be an upper or lower case letter. The remaining characters must be letters or digits or + or - or _. CPU Sim distinguishes between upper and lower case letters; hence, "Data" and "data" are considered different symbols.

 

A label is a symbol followed immediately by a colon. The colon is just a separator and is not considered part of the label. The label and colon pair is an optional feature on every line of assembly language programs including those lines that are otherwise blank or contain only comments. In the latter two cases, the assembler will treat the label as if it referred to the next regular instruction or data pseudoinstruction. Labels can be used as operands in statements.
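
The rules for literals, symbols, and labels translate directly into regular expressions. The sketch below is one way to write them; the function name classify_token is invented, and it assumes that "letter" means the ASCII letters A-Z and a-z.

```python
import re

LITERAL = re.compile(r"[+-]?[0-9]+")              # optional sign, then digits
SYMBOL = re.compile(r"[A-Za-z][A-Za-z0-9_+\-]*")  # letter, then letters/digits/+/-/_


def classify_token(token):
    """Say whether a token is a literal, a label, a symbol, or none of these."""
    if LITERAL.fullmatch(token):
        return "literal"
    if token.endswith(":") and SYMBOL.fullmatch(token[:-1]):
        return "label"        # a symbol followed immediately by a colon
    if SYMBOL.fullmatch(token):
        return "symbol"
    return "invalid"


for token in ["-42", "1,000", "3.5", "Data", "data", "x+1", "loop:", "9lives"]:
    print(token, "->", classify_token(token))
```

Note that "x+1" is classified as a single symbol, not as an expression, because + and - are legal characters inside a symbol. That is why tokens that would otherwise run together must be separated by spaces or tabs.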

 

A comment is any sequence of characters preceded by a semicolon ";" and ending at the end of the line or at the end of the file. Any line of a program can contain a comment. Blank lines and lines containing only comments are also allowed in assembly language programs, and are ignored by the assembler. When the program is assembled, the comments after regular instructions and data pseudoinstructions are saved and appear on each line of the assembled program. However, remember that blank lines or lines with only comments are discarded when the program gets assembled.

 

A program consists of two parts. The first part contains any number of EQU declarations, include directives, and macro definitions in any order. The second part contains any number of regular instructions and data pseudoinstructions, one per line.