CS232, Spring 2008
Chapter 7
Assembly Process
Consider what an assembler must do as it converts assembly code to machine language. The assembly code, remember, is just a text file (think of it as one long string of characters, including the carriage return characters that mark the ends of lines). The assembler needs to figure out what all those characters mean and translate them into machine instructions. To make that possible, the assembly language must have a specific syntax so that the assembler knows what to look for and where to look for it.
General idea: The assembler steps through the text file one line at a time (a line being the substring of characters between two carriage returns):
(a) The assembler first forms a list of the words (also called tokens, symbols, or names) in the line.
(b) If the line is blank or contains only white space and/or comments, the assembler throws it away.
(c) If the line is a machine instruction, the assembler replaces the instruction name with the corresponding opcode and each operand with the corresponding value.
(d) If the line is a .data pseudoinstruction, the assembler replaces it with the corresponding number of bytes of data.
(e) There are other pseudoinstructions as well, such as macro, include, and conditional compilation instructions, that we'll look at later.
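Steps (a) through (d) can be sketched in Python as follows. This is a minimal illustration, not the code of any particular assembler: the function names `tokenize` and `classify` are made up for this example, and it assumes that comments start with ";" and that commas and white space separate tokens.

```python
def tokenize(line):
    """Split one source line into tokens, dropping any ';' comment."""
    code = line.split(";", 1)[0]            # discard the comment, if any
    return code.replace(",", " ").split()   # commas and white space separate tokens


def classify(line):
    """Return 'blank', 'data', or 'instruction' for one source line."""
    tokens = tokenize(line)
    # A leading label such as "L:" does not change the kind of line.
    if tokens and tokens[0].endswith(":"):
        tokens = tokens[1:]
    if not tokens:
        return "blank"          # step (b): nothing left to assemble
    if tokens[0] == ".data":
        return "data"           # step (d)
    return "instruction"        # step (c)
```

Note that in this sketch a line holding only a label is reported as "blank", since there is nothing on it to translate; a fuller assembler would still record the label.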
For example, consider the following JVM assembly code (left) and the corresponding machine code (right):
Assembly code     Machine code
.data 3 0         0 0 0
iconst_0          3
L: iload 2        21 2
ifeq X            153 ?
iinc 2 1          132 2 1
goto L            167 ?
X: stop           252
Problem: If labels are used as operands, how does the assembler know what their values are? This is especially a problem if the label refers to a line further down in the program that has not yet been assembled (such a reference is called a forward reference).
Solution: Use a 2-pass assembler.
Pass 1:
Uses
1. An opcode table that gives, for each operation, the opcode, the size of the instruction, and the number of bits per operand. This table is built ahead of time.
2. An ILC (instruction location counter) that keeps track of the location in memory where the current instruction will be placed. It is advanced as each line is processed in Pass 1.
Builds
1. A symbol table that gives the memory address corresponding to each label.
2. A macro table that gives the macro body associated with each macro name.
3. An EQU table of all EQUs defined in the program.
Note that Pass 1 just builds these three tables and does not generate machine language code.
Pass 1 initializes the ILC to 0 and then steps through the assembly code.
(a) When it encounters a regular instruction, the assembler increments the ILC by the size of the instruction.
(b) When it encounters a label, it adds the label and current ILC value to the symbol table.
(c) When it encounters a macro definition, it adds the macro name and body to the macro table. (See below)
(d) When it encounters a macro call, it replaces the call (by string substitution) with the macro body.
(e) When it encounters a ".include <file>" pseudoinstruction, it textually replaces the pseudoinstruction with the body of the file. (See below)
(f) When it encounters an "EQU" pseudoinstruction, it adds the name and value to the EQU table.
(g) When it encounters a ".data" pseudoinstruction, it increments the ILC by the number of bytes of data being initialized by the pseudoinstruction.
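Steps (a), (b), and (g) can be sketched in Python as follows, using the JVM-style example from earlier in the chapter. This is only an illustration under stated assumptions: the names `OPCODES`, `PROGRAM`, and `pass1` are made up here, the instruction sizes are inferred from the example's machine code, the first line of the example is read as `.data 3 0` with the first operand giving the number of bytes, and macros, includes, and EQUs are left out.

```python
# Hypothetical opcode table: mnemonic -> (opcode, instruction size in bytes).
# Opcodes come from the chapter's example; sizes are inferred from it.
OPCODES = {
    "iconst_0": (3, 1),
    "iload": (21, 2),
    "ifeq": (153, 2),
    "iinc": (132, 3),
    "goto": (167, 2),
    "stop": (252, 1),
}

PROGRAM = [
    ".data 3 0",
    "iconst_0",
    "L: iload 2",
    "ifeq X",
    "iinc 2 1",
    "goto L",
    "X: stop",
]


def pass1(lines):
    """Build the symbol table (label -> address); no machine code is emitted."""
    symbols = {}
    ilc = 0                                   # instruction location counter
    for line in lines:
        tokens = line.split(";", 1)[0].replace(",", " ").split()
        # (b) record each leading label at the current ILC
        while tokens and tokens[0].endswith(":"):
            symbols[tokens[0][:-1]] = ilc
            tokens = tokens[1:]
        if not tokens:
            continue                          # blank or comment-only line
        if tokens[0] == ".data":
            ilc += int(tokens[1])             # (g) assumed: first operand = byte count
        else:
            ilc += OPCODES[tokens[0]][1]      # (a) advance by the instruction size
    return symbols, ilc


symbols, size = pass1(PROGRAM)
print(symbols, size)   # {'L': 4, 'X': 13} 14
```

Because X is entered in the table when its own line is reached, the forward reference in `ifeq X` causes no trouble: Pass 1 never needs the value of an operand, only the size of each line.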
Pass 2:
After all this processing has been done, it is fairly straightforward in Pass 2 to replace all instructions with their binary values.
1. For each instruction, it replaces the instruction name with the binary opcode (as obtained from the opcode table)
2. For each operand that is not already a literal, it replaces the name with the value (as obtained from the symbol table or EQU table).
3. For each .data pseudoinstruction, it inserts the (binary) data values into the assembled code.
4. It does not use the macro table since that was needed only in Pass 1.
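Here is a minimal Python sketch of Pass 2 applied to the JVM-style example from earlier in the chapter. Several things are assumptions made for illustration: the names `OPCODES`, `PROGRAM`, and `pass2` are hypothetical, every operand is taken to occupy one byte, a label operand is encoded as its absolute address, and `.data n v` is read as n bytes holding the non-negative value v. (The real JVM encodes branch targets as two-byte relative offsets, so the output below matches the chapter's simplified example, not real bytecode.) The symbol table is passed in as it would come out of Pass 1.

```python
# Hypothetical opcode table: mnemonic -> opcode (values from the chapter's example).
OPCODES = {"iconst_0": 3, "iload": 21, "ifeq": 153,
           "iinc": 132, "goto": 167, "stop": 252}

PROGRAM = [
    ".data 3 0",
    "iconst_0",
    "L: iload 2",
    "ifeq X",
    "iinc 2 1",
    "goto L",
    "X: stop",
]


def pass2(lines, symbols):
    """Translate each line to a list of byte values, resolving labels via symbols."""
    code = []
    for line in lines:
        tokens = line.split(";", 1)[0].replace(",", " ").split()
        while tokens and tokens[0].endswith(":"):
            tokens = tokens[1:]               # labels were already handled in Pass 1
        if not tokens:
            continue
        if tokens[0] == ".data":
            count, value = int(tokens[1]), int(tokens[2])
            code.extend(value.to_bytes(count, "big"))   # step 3: insert the data
            continue
        code.append(OPCODES[tokens[0]])       # step 1: name -> opcode
        for operand in tokens[1:]:
            if operand.lstrip("+-").isdigit():
                code.append(int(operand))     # already a literal
            else:
                code.append(symbols[operand]) # step 2: name -> value
    return code


print(pass2(PROGRAM, {"L": 4, "X": 13}))
# [0, 0, 0, 3, 21, 2, 153, 13, 132, 2, 1, 167, 4, 252]
```

Under these assumptions the two "?" entries in the example become 13 (the address of X) and 4 (the address of L).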
.include Pseudoinstructions
In assembly language programs, as in high-level language programs, it's nice to be able to divide a large program into several files. For example, you might want to put commonly used EQUs and macros in a separate file.
A common way to divide code into separate files is to use the .include pseudoinstruction in the main file. It tells the assembler to include the text from another file in this file before assembling. For example,
.include "def.h"
tells the assembler to get the text from the file "def.h" and include it here in this file before assembling.
When pass 1 sees an ".include filename" pseudoinstruction, it pastes the text of the file into the new source code file where the include pseudoinstruction was. It then continues the preprocessing at the start of the newly-included statements. It needs to preprocess the text of the file since it in turn might have some include pseudoinstructions or other pseudoinstructions the preprocessor needs to deal with.
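The recursive nature of include processing can be sketched in Python as follows. To keep the example self-contained, the "files" are held in a dictionary instead of being read from disk; the name `expand_includes` and the check for circular includes are additions made for this illustration and are not described in these notes. File names containing spaces are not handled.

```python
def expand_includes(lines, files, active=()):
    """Return lines with each .include replaced by the named file's lines.

    files  -- dict mapping a file name to its list of lines (stands in for the disk)
    active -- names of files currently being expanded, to detect circular includes
    """
    out = []
    for line in lines:
        tokens = line.split(";", 1)[0].split()
        if len(tokens) >= 2 and tokens[0] == ".include":
            name = tokens[1].strip('"')
            if name in active:
                raise ValueError("circular include of " + name)
            # The included text is itself preprocessed, since it may
            # contain further .include pseudoinstructions.
            out.extend(expand_includes(files[name], files, active + (name,)))
        else:
            out.append(line)
    return out


files = {
    "def.h": ["R0 EQU 0", '.include "more.h"'],
    "more.h": ["R1 EQU 1"],
}
print(expand_includes(['.include "def.h"', "add R0, R1"], files))
# ['R0 EQU 0', 'R1 EQU 1', 'add R0, R1']
```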
Macros
When a set of statements is repeated over and over again in the code, it can be handled 3 ways:
1. It can be written over and over again by the programmer, resulting in longer and less readable and more error-prone code.
2. It can be incorporated into a subroutine, with the resulting overhead in setting up and destroying a stack frame.
3. It can be dealt with by a macro.
Here's how a macro is handled by the preprocessor.
1. If the preprocessor sees a macro definition, it adds the definition to a Macro table of name & body pairs.
2. If it sees a macro call, it looks it up in the macro table and then pastes the body of the macro into the new source code file where the macro call was, after replacing any parameters with their values in the macro body. Note that this is purely a textual substitution; no evaluation or conversion to machine language goes on.
3. If there are any local labels inside the macro body, the preprocessor replaces them with unique labels and replaces any occurrence of a local label in the macro body with the corresponding unique label.
Note that a disassembler would never know there was a macro in the original source code.
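The three steps above can be sketched in Python as follows. This is an illustration only: the function name `expand_macros` is made up, the `macro` / `local` / `endmacro` keywords and the `L$1` style of unique label are assumed for the sake of the sketch, and macro calls nested inside macro bodies or preceded by a label are not handled. Each body line is rewritten in a single pass over its tokens so that a freshly generated label such as `L$1` cannot be mistaken for the parameter `$1`.

```python
import re


def expand_macros(lines):
    """Return lines with macro definitions removed and macro calls expanded."""
    macros = {}   # macro table: name -> (parameters, local labels, body lines)
    out = []
    calls = 0     # number of expansions so far that needed unique labels
    it = iter(lines)
    for line in it:
        tokens = line.replace(",", " ").split()
        if tokens and tokens[0] == "macro":
            # Step 1: store the definition in the macro table.
            name, params = tokens[1], tokens[2:]
            local_labels, body = [], []
            for body_line in it:              # consume lines up to endmacro
                parts = body_line.split()
                if parts == ["endmacro"]:
                    break
                if parts and parts[0] == "local":
                    local_labels += parts[1:]
                else:
                    body.append(body_line)
            macros[name] = (params, local_labels, body)
        elif tokens and tokens[0] in macros:
            # Step 2: paste the body, substituting arguments for parameters.
            params, local_labels, body = macros[tokens[0]]
            mapping = dict(zip(params, tokens[1:]))
            if local_labels:
                # Step 3: give each local label a name unique to this call.
                calls += 1
                for label in local_labels:
                    mapping[label] = label + "$" + str(calls)
            for body_line in body:
                out.append(re.sub(r"[\w$]+",
                                  lambda m: mapping.get(m.group(), m.group()),
                                  body_line))
        else:
            out.append(line)
    return out
```

For example, given a definition `macro swap $1, $2` whose body begins `mov $1, R3`, the call `swap R0, R1` produces `mov R0, R3` as its first line. Replacing EQU names such as R0 with their values is a separate step that this sketch leaves out.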
Example: Consider the following source code:
;; start of source code file
;; first some pseudoinstructions
.include def.h
n EQU 2
macro swap $1, $2
;;uses R3 as a temp register
mov $1, R3
mov $2, $1
mov R3, $2
endmacro
;; now the actual instructions
add R0, R1
swap R0, R1
swap R1, R2
test
load R0, x
test
M: stop
x: .data 1 3
;; end of source code file
Assume that def.h was a file with the following in it:
R0 EQU 0
R1 EQU 1
R2 EQU 2
R3 EQU 3
macro test
local L
L:
add R0, R1
jmpn R0, L
jmp M ;;M is not a local label in the macro
endmacro
From these two files, the preprocessor would create the following new source code (without the comments):
;; start of new source code
add 0, 1
mov 0, 3
mov 1, 0
mov 3, 1
mov 1, 3
mov 2, 1
mov 3, 2
L$1: add 0, 1 ;; L$1 is a new unique label
jmpn 0, L$1
jmp M ;;M is not a local label
load 0, x
L$2: add 0, 1 ;; L$2 is a new unique label
jmpn 0, L$2
jmp M ;;M is not a local label
M: stop
x: .data 1 3
;; end of new source code
Error checking
How does the assembler know whether the program is legal? What are all the things that can go wrong?
Some syntax checking is done in Pass 1 and some in Pass 2.
Pass 1 checks for
Pass 2 checks for legal structure of the whole program. It uses a context-free grammar (CFG).
Example: CPU Sim assembly language grammar
The following context-free grammar (CFG) gives the syntax of legal assembly language programs in CPU Sim. The start symbol for the CFG is Program. EOL is an end-of-line token and EOF is an end-of-file token. Square brackets indicate an optional item. Items in parentheses followed by "*" indicate 0 or more copies of those items. Items in parentheses followed by "+" indicate 1 or more copies of those items. To separate tokens that the assembler would otherwise treat as one token, use one or more spaces or tab characters. Items in quotes are terminal symbols. Terminal symbols are case sensitive.
Program -> [CommentsAndEOLs] EquMacroIncludePart InstructionPart EOF
CommentsAndEOLs -> ([Comment] EOL)+
Comment -> ";" <any-sequence-of-characters-not-including-EOF-or-EOL>
EquMacroIncludePart -> ((EquDeclaration | MacroDeclaration | Include) CommentsAndEOLs)*
Include -> ".include" <quoted-sequence-of-characters-not-including-EOL-or-EOF>
EquDeclaration -> Symbol "EQU" Operand
MacroDeclaration -> "MACRO" Symbol [Symbol ([","] Symbol)*] CommentsAndEOLs InstructionPart "ENDM"
InstructionPart -> ((RegularInstructionCall | DataPseudoinstructionCall) CommentsAndEOLs)*
RegularInstructionCall -> (Label CommentsAndEOLs)* [Label] Symbol [Operand ([","] Operand)*]
DataPseudoinstructionCall -> (Label CommentsAndEOLs)* [Label] ".data" Operand [","] ( Operand | "[" [Operand ([","] Operand)*] "]")
Label -> Symbol ":"
Operand -> Symbol | Literal
Literal -> [ "-" | "+" ] ( 0-9 )+
Symbol -> (<letter>) (<letter or digit or - or + or _>)*
Here is a summary of the parts of an assembly language program. The basic building blocks consist of the following items.
A literal is a decimal integer, with an optional plus or minus sign in front. No commas or decimal points are allowed in literals.
A symbol consists of a sequence of characters, the first character of which must be an upper or lower case letter. The remaining characters must be letters or digits or + or - or _. CPU Sim distinguishes between upper and lower case letters; hence, "Data" and "data" are considered different symbols.
A label is a symbol followed immediately by a colon. The colon is just a separator and is not considered part of the label. The label and colon pair is an optional feature on every line of assembly language programs including those lines that are otherwise blank or contain only comments. In the latter two cases, the assembler will treat the label as if it referred to the next regular instruction or data pseudoinstruction. Labels can be used as operands in statements.
A comment is any sequence of characters preceded by a semicolon ";" and ending at the end of the line or at the end of the file. Any line of a program can contain a comment. Blank lines and lines containing only comments are also allowed in assembly language programs, and are ignored by the assembler. When the program is assembled, the comments after regular instructions and data pseudoinstructions are saved and appear on each line of the assembled program. However, remember that blank lines or lines with only comments are discarded when the program gets assembled.
A program consists of two parts. The first part contains any number of EQU declarations, include directives, and macro definitions in any order. The second part contains any number of regular instructions and data pseudoinstructions, one per line.
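The rules for literals, symbols, labels, and comments summarized above can be written as regular expressions. The following Python sketch is one possible reading of those rules, not CPU Sim's actual implementation; the helper names are made up for this example, and "letter" is assumed to mean an ASCII letter.

```python
import re

# Literal -> [ "-" | "+" ] ( 0-9 )+
LITERAL = re.compile(r"[+-]?[0-9]+")
# Symbol -> a letter followed by letters, digits, "+", "-", or "_"
SYMBOL = re.compile(r"[A-Za-z][A-Za-z0-9+\-_]*")


def is_literal(text):
    """True if text is a decimal integer with an optional sign."""
    return LITERAL.fullmatch(text) is not None


def is_symbol(text):
    """True if text is a legal symbol (case is preserved, so Data != data)."""
    return SYMBOL.fullmatch(text) is not None


def is_label(text):
    """True if text is a symbol followed immediately by a colon."""
    return text.endswith(":") and is_symbol(text[:-1])


def strip_comment(line):
    """Remove a ';' comment and any white space before it."""
    return line.split(";", 1)[0].rstrip()
```

With these definitions, `is_literal("-42")` holds while `is_literal("1,000")` and `is_literal("3.5")` do not, and `is_symbol("x+1")` holds because "+" is allowed after the first character of a symbol.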