What is FASM G

FASM G is an assembly engine that lets you create an assembler for any processor

Author: Tomasz Grysztar

link to original article

What is flat assembler g?

It is an assembly engine designed as a successor to the one used in flat assembler 1, one of the recognized assemblers for x86 processors. It is a bare engine that by itself has no ability to recognize and encode the instructions of any processor; however, it can become an assembler for any CPU architecture. Its macroinstruction language is substantially improved compared to the one provided by flat assembler 1, and it makes it easy to implement instruction encoders in the form of customizable macroinstructions. This approach offers great flexibility at the cost of performance.

If very fast assembly, comparable to that of flat assembler 1, were required and the performance drawback of macroinstructions were not acceptable, the issue could be addressed by building a custom assembler based on this engine; the complete source code is available to anyone who would like to try it. But the focus of this package is on the use of flat assembler g in its pure form.

The source code of this tool can be compiled with flat assembler 1, but it is also possible to use flat assembler g itself to compile it. The source contains clauses that include different header files depending on the assembler used. When flat assembler g compiles itself, it uses the macroinstructions that come with the supplied example programs, since they implement x86 instructions and formats with a syntax compatible with flat assembler 1.

The macroinstructions that process the syntax of x86 instructions are complex and take a long time to assemble, but incidentally the time it takes for flat assembler g to compile itself on an ordinary modern machine is comparable to the time that an early version of flat assembler 1 needed to assemble itself a decade and a half earlier on a computer that was then similarly mediocre. This can be seen as an interesting demonstration of how the software may be getting slower at the same rate as the hardware becomes faster.

The example programs for the x86 architecture that come in this package are selected samples that originally came with flat assembler 1, with the addition of sets of macroinstructions that implement the instruction encoders and output formatters required to assemble them just like the original flat assembler did. While these sets are not complete, they are intended to encourage the creation of further sets of macroinstructions that would provide more instructions and output formats.

To demonstrate how the instruction sets of different architectures may be implemented, there are some example programs for the microcontrollers, 8051 and AVR. They have been kept simple and therefore they do not provide a complete framework for programming such CPUs, though they may provide a solid base for the creation of such environments.

There is also an example of assembling the JVM bytecode, which is a conversion of the sample originally created for flat assembler 1. For this reason it is somewhat crude and does not fully utilize the capabilities offered by the new engine. However it is good at visualising the structure of a class file.

How does this work?

The essential function of flat assembler g is to generate output defined by the instructions in the source code. Given the single line of text shown below, the assembler would generate a single byte with the stated value:

        db 90h

The macroinstructions can be defined to generate specific sequences of data depending on the provided parameters. They may correspond to the instructions of a chosen machine language, as in the following example, but they could just as well be defined to generate other kinds of data, for various purposes.

        macro int number
                if number = 3
                        db 0CCh
                else
                        db 0CDh, number
                end if
        end macro

        int 20h         ; generates two bytes        
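
A quick way to check both branches, sketched here on the assumption that the "int" macroinstruction above is in effect (the "mark" symbol is introduced only for this illustration), is to measure how far the current address moves:

        mark = $
        int 3                           ; first branch: the single byte 0CCh
        assert $ - mark = 1
        int 21h                         ; second branch: 0CDh followed by 21h
        assert $ - mark = 3

The "assert" directive signals an error when its condition does not hold, so these lines assemble only if the sizes are as expected.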

Assembly seen this way may be considered a kind of interpreted language, and the assembler certainly has many characteristics of an interpreter. However, it also shares certain aspects with a compiler. It is possible for an instruction to use a value which is defined later in the source and which may depend on the instructions that come before that definition, as demonstrated by the following sample.

        macro jmpi target
                if target-($+2) < 80h & target-($+2) >= -80h                    
                        db 0EBh
                        db target-($+1)
                else
                        db 0E9h
                        dw target-($+2)
                end if 
        end macro

                jmpi start  
                db 'some data'  
        start:

The "jmpi" macroinstruction defined above produces the code of a jump instruction as in the 8086 architecture. Such code contains the relative offset of the target of the jump, stored in either a single byte or a 16-bit word. The relative offset is computed as the difference between the address of the target and the address of the next instruction. The special symbol "$" provides the current address, which advances as bytes are generated (hence "$+2" is used before the opcode byte is emitted and "$+1" after it in the short form). It is used to calculate the relative offset and to determine whether it may fit in a single byte.

Therefore the code generated by "jmpi start" in the above sample depends on the value of the address labeled as "start", and this in turn depends on the length of the output of all the instructions that precede it, including the said jump. This creates a loop of dependencies and the assembler needs to find a solution that fulfills all the constraints created by the source text. This would not be possible if the assembler were simply a straightforward interpreter.
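
The other branch can be exercised by moving the target further away. In this sketch, which reuses the "jmpi" macroinstruction above ("origin" and "far_away" are names made up for the illustration), the target lies more than 7Fh bytes past the jump, so only the three-byte form is consistent with the resulting addresses:

        origin = $
                jmpi far_away
                db 200 dup (0)
        far_away:
                assert far_away - origin = 3 + 200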

Finding a solution for such circular dependencies may resemble solving an equation, and it is even possible to construct an example where flat assembler g is indeed capable of solving one:

        x = (x-1)*(x+2)/2-2*(x+1)
        db x

The circular reference has been reduced here to a single definition that references itself to construct the value. flat assembler g is able to find a solution in this case, though in many others it may fail. The method used by this assembler is to perform multiple passes over the source text and then try to predict all the values with the knowledge gathered this way. This approach is in most cases good enough for the assembly of machine code, but it rarely suffices to solve complex equations, and the above sample is one of the exceptions.
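
Rearranged by hand, the definition above becomes x*x - 5*x - 6 = 0, which has two integer solutions, 6 and -1. The text does not say which of them the assembler arrives at; the following lines merely confirm, with plain constants, that both satisfy the original definition:

        assert 6 = (6-1)*(6+2)/2-2*(6+1)
        assert -1 = (-1-1)*(-1+2)/2-2*(-1+1)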

What are the means of parsing the arguments of an instruction?

Not all instructions have a simple syntax like the ones in the previous examples. To aid in the processing of arguments that may contain special constructions, flat assembler g provides a few capable tools, demonstrated below with examples that implement a few selected instructions of the Z80 processor. The rules governing the use of the presented features are found in the manual.

When an instruction has a very small set of allowed arguments, each one of them can be treated separately with the "match" construction:

        macro EX? first,second
                match (=SP?), first
                        match =HL?, second
                                db 0E3h
                        else match =IX?, second
                                db 0DDh,0E3h
                        else match =IY?, second
                                db 0FDh,0E3h
                        else
                                err "incorrect second argument"
                        end match
                else match =AF?, first
                        match =AF'?, second
                                db 08h
                        else
                                err "incorrect second argument"
                        end match
                else match =DE?, first
                        match =HL?, second
                                db 0EBh
                        else
                                err "incorrect second argument"
                        end match
                else
                        err "incorrect first argument"
                end match
        end macro

        EX (SP),HL
        EX (SP),IX
        EX AF,AF'
        EX DE,HL

The "?" character appears in many places to mark the names as case-insensitive and all these occurrences could be removed to further simplify the example.

When the set of possible values of an argument is larger but has some regularities, the textual substitutions can be defined to replace some of the symbols with carefully chosen constructions that can then be recognized and parsed:

        A? equ [:111b:]
        B? equ [:000b:]
        C? equ [:001b:]
        D? equ [:010b:]
        E? equ [:011b:]
        H? equ [:100b:]
        L? equ [:101b:]

        macro INC? argument
                match [:r:], argument
                        db 100b + r shl 3
                else match (=HL?), argument
                        db 34h
                else match (=IX?+d), argument
                        db 0DDh,34h,d
                else match (=IY?+d), argument
                        db 0FDh,34h,d
                else
                        err "incorrect argument"
                end match
        end macro

        INC A
        INC B
        INC (HL)
        INC (IX+2)
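
To see what the substitution produces, the generated byte can be read back with "load" and checked. This is a sketch that assumes the definitions above are in effect; "opcode" is a name chosen for the illustration:

        INC A
        load opcode : byte from $ - 1
        assert opcode = 3Ch             ; 100b + 111b shl 3

        INC L
        load opcode : byte from $ - 1
        assert opcode = 2Ch             ; 100b + 101b shl 3

In the first case "A" is replaced with "[:111b:]", the first pattern matches with "r" equal to 111b, and the expression evaluates to 3Ch.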

In the case of an argument structured like "(IX+d)" it may sometimes be desirable to allow other algebraically equivalent forms of the expression, like "(d+IX)" or "(c+IX+d)". Instead of parsing every possible variant individually, it is possible to let the assembler evaluate the expression while treating the selected symbol in a distinct way. When a symbol is declared as an "element", it has no value, and when it is used in an expression, it is treated algebraically like a variable term in a polynomial.

        element HL?
        element IX? 
        element IY? 

        macro INC? argument
                match [:r:], argument
                        db 100b + r shl 3
                else match (a), argument
                        if a eq HL
                                db 34h
                        else if a relativeto IX
                                db 0DDh,34h,a-IX
                        else if a relativeto IY
                                db 0FDh,34h,a-IY
                        else
                                err "incorrect argument"
                        end if
                else
                        err "incorrect argument"
                end match
        end macro

        INC (3*8+IX+1)

        virtual at IX
                x db ?
                y db ?
        end virtual        

        INC (y)
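
The arithmetic on such symbols can also be tried directly. The lines below are a small sketch relying on the "element" declarations and the "virtual" block from the sample above, with "disp" being a name invented for the illustration:

        disp = 3*8+IX+1
        assert disp relativeto IX       ; differs from IX only by a number
        assert disp - IX = 25           ; the displacement that gets encoded
        assert ~ disp relativeto IY     ; not comparable with another element
        assert y - IX = 1               ; "y" is the second byte at IX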

There is a small problem with the above macroinstruction. A parameter may contain any text and when such value is placed into an expression, it may induce erratic behavior. For example if "INC (1|0)" was processed, it would turn the "a eq HL" expression into "1|0 eq HL" and this logical expression is correct and true even though the argument was malformed. To prevent this from happening, a local variable may be used as a proxy holding the value of an argument:

        macro INC? argument
                match [:r:], argument
                        db 100b + r shl 3
                else match (a), argument
                        local value
                        value = a
                        if value eq HL
                                db 34h
                        else if value relativeto IX
                                db 0DDh,34h,value-IX
                        else if value relativeto IY
                                db 0FDh,34h,value-IY
                        else
                                err "incorrect argument"
                        end if
                else
                        err "incorrect argument"
                end match
        end macro

There is an additional advantage to such a proxy variable, thanks to the fact that its value is computed before the macroinstruction begins to generate any output. When an expression contains a symbol like "$", it may give different values depending on where it is calculated, and the use of a proxy variable ensures that the value taken is the one obtained by evaluating the argument before generating the code of an instruction.

When the set of symbols allowed in expressions is larger, it is better to have a single construction to process an entire family of them. An "element" declaration may associate an additional value with a symbol, and this information can then be retrieved with the "metadata" operator applied to a linear polynomial that contains the given symbol as a variable. The following example is another variant of the previous macroinstruction that demonstrates the use of this feature:

        element register
        element A? : register + 111b
        element B? : register + 000b
        element C? : register + 001b
        element D? : register + 010b
        element E? : register + 011b
        element H? : register + 100b
        element L? : register + 101b

        element HL?
        element IX? 
        element IY? 

        macro INC? argument
                local value
                match (a), argument
                        value = a
                        if value eq HL
                                db 34h
                        else if value relativeto IX
                                db 0DDh,34h,value-IX
                        else if value relativeto IY
                                db 0FDh,34h,value-IY
                        else
                                err "incorrect argument"
                        end if
                else match any more, argument
                        err "incorrect argument"
                else
                        value = argument
                        if value eq value element 1 & value metadata 1 relativeto register
                                db 100b + (value metadata 1 - register) shl 3
                        else
                                err "incorrect argument"
                        end if
                end match
        end macro

The "any more" pattern is there to catch any argument that contains a complex expression consisting of more than one token. This prevents the use of syntax like "INC A+0" or "INC A+B-A". But for some instruction sets, the inclusion of such a constraint may depend on personal preference.

The "value eq value element 1" condition ensures that the value does not contain any terms other than the name of a register. Even when an argument is forced to contain no more than a single token, it is still possible that it has a complex value, for instance if there were definitions like "X = A + B" or "Y = 2 * A". Both "INC X" and "INC Y" would then cause the operator "element 1" to return the value "A", which differs from the value checked in either case.
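
These operators can be observed in isolation. The following sketch assumes the "element" definitions from the sample above and reuses the "Y = 2 * A" definition mentioned in the text. It also uses the "scale" operator, which is not discussed here and which extracts the coefficient of a variable term:

        Y = 2 * A
        assert Y element 1 eq A                 ; the only variable term is A
        assert Y scale 1 = 2                    ; but its coefficient is 2
        assert ~ Y eq Y element 1               ; so "INC Y" would be rejected
        assert A metadata 1 - register = 111b   ; the number attached to A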

How are the labels processed?

A standard way of defining a label is by following its name with ":" (this also acts like a line break and any other command, including another label, may follow in the same line). Such label simply defines a symbol with the value equal to the current address, which initially is zero and increases when any bytes are added into the output.

In some variants of assembly language it may be desirable to allow a label to precede an instruction without an additional ":" in between. It is then necessary to create a labeled macroinstruction that, after defining a label, passes processing to the original macroinstruction with the same name:

        struc INC? argument
                .:
                INC argument
        end struc

        start   INC A
                INC B

This has to be done for every instruction that needs to allow this kind of syntax. A simple loop like the following one would suffice:

        iterate instruction, EX,INC
                struc instruction? argument
                        .: instruction argument
                end struc
        end iterate

Every built-in instruction that defines data already has the labeled variant.

By defining a labeled instruction that has "?" in place of name it is possible to intercept every line that starts with an identifier that is not a known instruction and is therefore assumed to be a label. The following one would allow a label without ":" to begin any line in the source text (it also handles the special cases so that labels followed with ":" or with "=" and a value would still work):

        struc ? tail&
                match :, tail 
                        .: 
                else match : instruction, tail
                        .: instruction
                else match == value, tail
                        . = value
                else 
                        .: tail
                end match 
        end struc
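
With this definition in effect, all three forms below should be accepted. This is a sketch that additionally assumes one of the "INC" macroinstructions defined earlier; the names "first", "second" and "count" are arbitrary:

        first   INC A           ; no ":" needed, handled by the last branch
        second: INC B           ; a label followed by ":" and an instruction
        count = 2               ; a label followed by "=" and a value
        assert second - first = 1 & count = 2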

Obviously, it is no longer necessary to define any specific labeled macroinstructions when a global effect of this kind is applied. A variant should be chosen depending on the type of syntax that needs to be allowed.

Intercepting even the labels defined with ":" may become useful when the value of current address requires some additional processing before being assigned to a label - for example when a processor uses addresses with a unit larger than a byte. The intercepting macroinstruction might then look like this:

        struc ? tail&
                match :, tail 
                        label . at $ shr 1
                else match : instruction, tail
                        label . at $ shr 1
                        instruction
                else
                        . tail
                end match
        end struc
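
As a sketch of the effect, assume that the macroinstruction above is in effect and that the output starts at address zero (the explicit "org 0" and the name "here" are added only for this illustration). After six bytes of output a label then gets the value 3, the address counted in words:

        org 0
        dw 0,0,0
        here:
        assert here = 3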

The value of current address that is used to define labels may be altered with "org". If the labels need to be differentiated from absolute values, a symbol defined with "element" may be used to form an address:

        element CODEBASE
        org CODEBASE + 0

        macro CALL? argument
                local value
                value = argument
                if value relativeto CODEBASE
                        db 0CDh
                        dw value - CODEBASE
                else
                        err "incorrect argument"
                end if 
        end macro

To define labels in an address space that is not going to be reflected in the output, a "virtual" block should be declared. The following sample prepares macroinstructions "DATA" and "CODE" to switch between generating program instructions and data labels. Only the instruction codes would go to the output:

        element DATA
        DATA_OFFSET = 2000h
        element CODE
        CODE_OFFSET = 1000h

        macro DATA?
                _END
                virtual at DATA + DATA_OFFSET
        end macro

        macro CODE?
                _END
                org CODE + CODE_OFFSET
        end macro

        macro _END?
                if $ relativeto DATA
                        DATA_OFFSET = $ - DATA
                        end virtual
                else if $ relativeto CODE
                        CODE_OFFSET = $ - CODE
                end if
        end macro

        postpone
                _END
        end postpone

        CODE

The "postpone" block is used here to ensure that the "virtual" block always gets closed correctly, even if source text ends with data definitions.

Within the environment prepared by the above sample any instruction would be able to distinguish data labels from the ones defined within program. For example a branching instruction could be made to accept an argument being either a label within a program or an absolute value, but to disallow any label of data:

        macro CALL? argument
                local value
                value = argument
                if value relativeto CODE
                        db 0CDh
                        dw value - CODE
                else if value relativeto 0
                        db 0CDh
                        dw value
                else
                        err "incorrect argument"
                end if 
        end macro

        DATA

        variable db ?

        CODE

        routine:

In this context either "CALL routine" or "CALL 1000h" would be allowed, while "CALL variable" would not be.

When the labels have values that are not absolute numbers, it is possible to generate relocations for instructions that use them. A special "virtual" block may be used to store the offsets of values inside the program that need to be relocated when its base changes:

        virtual at 0
                Relocations::
                rw RELOCATION_COUNT
        end virtual

        RELOCATION_INDEX = 0

        postpone
                RELOCATION_COUNT := RELOCATION_INDEX                
        end postpone

        macro WORD? value
                if value relativeto CODE
                        store $ - CODE : 2 at Relocations : RELOCATION_INDEX shl 1
                        RELOCATION_INDEX = RELOCATION_INDEX + 1
                        dw value - CODE
                else
                        dw value
                end if
        end macro 

        macro CALL? argument
                local value
                value = argument
                if value relativeto CODE | value relativeto 0
                        db 0CDh
                        word value
                else
                        err "incorrect argument"
                end if 
        end macro 

The table of relocations that is created this way can then be accessed with "load". The following two lines could be used to put the table in its entirety somewhere in the output:

        load RELOCATIONS : RELOCATION_COUNT shl 1 from Relocations : 0
        dw RELOCATIONS

The "load" reads the whole table into a single string, then "dw" writes it into output (padded to multiple of a word, but in this case the string never requires such padding).

What options are there to parse other kinds of syntax?

In some cases a command that assembler needs to parse may begin with something different than a name of instruction or a label. It may be that a name is preceded by a special character, like "." or "!", or that it is an entirely different kind of construction. It is then necessary to use "macro ?" to intercept whole lines of source text and process any special syntax of such kind.

For example, if it was required to allow a command written as ".CODE", it would not be possible to implement it directly as a macroinstruction, because initial dot causes the symbol to be interpreted as a local one and globally defined instruction could never be executed this way. The intercepting macroinstruction provides a solution:

        macro ? line&
                match .=CODE?, line
                        CODE
                else match .=DATA?, line
                        DATA
                else
                        line
                end match
        end macro  

The lines that contain either ".CODE" or ".DATA" text are processed here in such a way that they invoke the global macroinstruction with the corresponding name, while all other intercepted lines are executed without changes. This method makes it possible to filter out any special syntax and let the assembler process the regular instructions as usual.

Sometimes unconventional syntax is expected only in a specific area of source text, like inside a block with defined boundaries. The parsing macroinstruction should then be applied only in this place, and removed with "purge" when the block ends:

        macro concise
                macro ? line&
                        match =end =concise, line
                                purge ?
                        else match dest+==src, line
                                ADD dest,src
                        else match dest-==src, line
                                SUB dest,src
                        else match dest==src, line
                                LD dest,src
                        else match dest++, line
                                INC dest
                        else match dest--, line
                                DEC dest
                        else match any, line
                                err "syntax error"
                        end match
                end macro
        end macro

        concise
                C=0
                B++
                A+=2
        end concise
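
Under these rules the three lines inside the block are rewritten into ordinary instructions. Assuming that "LD" and "ADD" macroinstructions exist alongside "INC" (this text does not define them), the block above should be equivalent to:

        LD C,0
        INC B
        ADD A,2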

Copyright © 2004-2016 Tomasz Grysztar.
