Goodasm

Published at

6 days ago

Main Article

https://github.com/travisgoodspeed/goodasm

Howdy y'all,

These days, the world has many excellent disassemblers, but when writing assembly in a weird languages, many of us just fall back to handwriting bytes in Nasm or painfully adjusting to yet another macro assembler. GoodASM is my attempt to fix that.

This assembler is very easy to retarget by extending C++ class and calling a few functions. You pretty much just define the endianness and byte size, then copy the instruction definitions almost verbatim from the programmer's guide.

Defining a new target is completely symmetric. While assembly is the primary purpose of this tool, a disassembler is also produced. An example for each instruction automatically becomes a unit test, insuring that you don't accidentally fat-finger a bitfield or two.

When writing assembly, particularly shellcode, you often have questions that a calculator should answer. What are all the instruction that begin with ro in this language? What are all of the potential forms of the jmp instruction? Does ror a assemble to ASCII? An interactive REPL mode answers these in a jiffy, with tab completion.

If you like this program, please buy my book on Microcontroller Exploits for yourself or for a clever student.

--Travis

Status

Some critical features remain to be written before a public release, but enough features are supported for 6502 and the GameBoy to write real programs. See the Issue Tracker for known bugs and missing features.

A language is alpha if it is relatively unused, beta if successful test programs have been written for it, and stable if it has been used for real projects and compared against competing assemblers for accuracy. partial languages are not yet complete, and you should check the issue tracker for their status.

Source code and binaries will be publicly released at DistrictCon on Feb 21, 2025.

Building

For GUI development, install the Qt Dev Kit and then open CMakefile.txt in Qt Creator.

To build in Linux without the GUI, first install qt6-declarative-dev, qml6-module-\* and cmake, then run the following:

git clone https://github.com/travisgoodspeed/goodasm
cd goodasm
mkdir build
cd build
cmake ..
make -j 8 clean all

The preferred executable is goodasm. The GUI for iOS and Android is more of a fun toy than a tool.

Examples

Demo 3/1 from 2600 Games

tests/6502/nanochess contains translations of the first few chapters of Programming Games for Atari 2600 by Oscar Toledo G. Hashes are compared to make sure that our assembler produces the same binaries as Oscar's, and pull requests are waiting if you convert any of the remaining demos from that book into this assembler's style.

GameBoy Example

tests/sm83/hello.asm is a simple Hello World for the SM83 architecture of the Nintendo Game Boy.

NipperTool dumping a Dish Network ROM3 Card

tests/6805/nipperpatch.asm is the 6805 shellcode from my NipperTool exploit for the Nagra ROM3 cards that Dish Network used twenty-five years ago. In NipperTool, this code is exported as Go source code with a map of symbol names, so that the tool can patch the target address being dumped by the shellcode.

Using the Disassembler

Running goodasm --help will show the command line arguments, supported languages and other features.

air% goodasm --help
Usage: goodasm [options] input
GoodASM is an easily retargetable assembler for CISC microcontrollers.

Options:
  -h, --help                Displays help on commandline options.
  --help-all                Displays help, including generic Qt options.
  -v, --version             Displays version information.
  -V, --verbose             Talk too much.
  --cheat                   Print a cheat sheet of mnemonics.
  --opcodetable             Print a table of opcodes.
  -t, --test                Selftest the selected language.
  --fuzz                    Fuzz test to find crashes.
  -g, --grade               Is this binary in the chosen language?
  -d, --dis                 Disassemble a binary input.
  --base <base>             Base address of an input binary.
  -o, --export <file>       Output filename.
  -S, --symboltable <file>  Export symbol table
  --repl                    Interactive shell.
  --wait                    Wait after cleanup.
  -L, --list                Print a listing to stdout.
  -H, --hex                 Print hex of output to stdout.
  -N, --nasm                Output is legal in NASM.
  -C                        Output is legal in C.
  -G, --golang              Output is legal in Golang.
  -Y, --yara-x              Output is legal in Yara-X.
  -M, --markdown            Output is legal in Markdown.
  -b, --bytes               List bytes in output.
  -B, --bits                List bits in output.
  -A, --auto                Comment with cheat sheet when available.
  -a, --adr                 List addresses in output.
  --8086                    Intel 8086 (broken)
  --6502                    MOS 6502 (stable)
  --6805                    68HC05 (stable)
  --fcard                   68HC05SC21 DSS P1 F-Card (stable)
  --sm83, --gb              Sharp SM83 / Gameboy (stable)
  --z80                     Zilog Z80 (stable)
  --tlcs47                  Toshiba TLCS47 (alpha)
  --8051                    Intel 8051 (alpha)
  --ucom43                  Sharp UCOM43 (alpha)
  --s2000, --emz1001        AMI S2000 / Iskra EMZ1001 (alpha)
  --pic16c5x                PIC16C5x (12-bit) (alpha)
  --marc4                   Atmel MARC4 (alpha)
  --chip8                   Chip-8 (broken)

Arguments:
  input                     ROM image to decode or source to assemble.

Let's begin with the guesses test file for AMI's S2000 architecture, an assembly language I expect you've never heard of. Its biggest claim to fame is that it was also called EMZ1001, Yugoslavia's only locally manufactured microcontroller.

By default, GoodASM has no language defined, so it will just output db directives.

% goodasm -da guesses 
0x00: 	db 0x00
0x01: 	db 0x01
0x02: 	db 0x02
0x03: 	db 0x03
0x04: 	db 0x6f
0x05: 	db 0xcf

We can get a more useful listing by specifying the language with --s2000. Of the single letter parameters, d disassembles, b prints the bytes within the disassembly, a includes the address of each instruction, and A auto-comments with the help entry of each instruction.

% goodasm -dbaA --s2000 guesses
0000:   00      nop                       ; No operation.
0001:   01      brk                       ; Break.
0002:   02      rt                        ; Return from subroutine.
0003:   03      rts                       ; Return from subroutine and skip next instruction.
0004:   6f      pp #0x000f                ; Prepare page or bank register.
0005:   cf      jmp #0x000f               ; Jump to immediate in prepared page and bank.

You can also use the interactive mode to type in bytes for disassembly. This is useful when writing shellcode by hand or working out errors in a damaged disassembly.

carbon% goodasm -dba --s2000 --repl
goodasm> 00 01 02 03 6f cf
0000:   00      nop                       
0001:   01      brk                       
0002:   02      rt                        
0003:   03      rts                       
0004:   6f      pp #0x000f                
0005:   cf      jmp #0x000f               
goodasm> 6fcf
0000:   6f      pp #0x000f                
0001:   cf      jmp #0x000f               
goodasm>

Using the Assembler

The assembler functions from the exact same definitions as the disassembler, so there is no language that works with one but not the other.

Use the -H parameter to dump the output as text or -o as a binary file. -L prints an instruction listing, matching source code to the binary. The same language flags work as in the disassembler, but you can also use the .lang directive to specify it within your source file.

% cat undefined.asm
	.lang 6502
	nop			; Do nothing.
	brk			; 1-byte in documentation,
	nop			; but we follow with padding.
% goodasm -H undefined.asm -o undefined.bin
ea 00 ea
% goodasm -Lba undefined.asm 
0000:   ea         nop                    ; Do nothing.
0001:   00         brk                    ; 1-byte in docs.
0002:   ea         nop                    ; but we follow with padding.

Just like in disassembly, you can use the interactive mode to try out instructions. Tab completion will give you verb names and parameters.

% goodasm --6502 
goodasm> st<TAB>
sta  stx  sty  
goodasm> stx <TAB>
stx @0xdead
stx @0x35
stx @0x35, y
goodasm> stx @0xdead
[0] 8e ad de 
goodasm>

Because every assembly language is unique, and it's a devil to remember all the forms of a new one, a handy cheat sheet from the instruction text cases will remind you of the available instructions.

% goodasm --6502 --cheat
adc #0xff           ;; Add to A with Carry. (immediate)
adc @0xdead         ;; Add to A with Carry. (absolute)
adc @0xdead, x      ;; Add to A with Carry. (X-indexed absolute)
adc @0xdead, y      ;; Add to A with Carry. (Y-indexed absolute)
adc @0x35           ;; Add to A with Carry. (zero-page)
adc @0x35, x        ;; Add to A with Carry. (X-indexed zero-page)
adc (@0x35, x)      ;; Add to A with Carry. (X-Indexed Zero Page Indirect)
adc (@0x35), y      ;; Add to A with Carry. (Zero Page Indirect Y-Indexed)
...

The rules of syntax will differ a bit from other languages, but generally it works like this:

# begins an immediate value, a literal constant.
Only code address targets are used without a prefix, as in call or jmp.
@ begins an absolute value, referring to the value at the address.
() are grouping symbols, usually referring to the value at the address within the group.
, separates parameters within a group.
Integers begin as 0x for hex, 0b for binary. Default is decimal.
A label is defined with a symbol name before a colon, or by .equ.
Labels needn't be defined on first usage, but they must be defined by the end.
Assembly passes continue until all symbols settle to consistent values.

A number of directives are available for all languages:

.org 0xff00          ;; First call sets the base address.
.org 0xff10          ;; Second call fills the binary until target adr.
.lang 6502           ;; Switch the language to 6502.
.include "foo.asm"   ;; Includes a sourceflie.
.incbin "foo.bin"    ;; Includes the bytes of a binary file.
.equ foo 0xcafe      ;; Defines the symbol foo to be 0xCAFE.
.db 0xde, 0xad, 0xbe ;; Includes the bytes DEADBE.
.db "hello", 0       ;; Null terminated string.
.dh 9d de ad be ef   ;; Hex bytes, as data.
.ib 9d de ad be ef   ;; Instruction bytes, disassembled then reloaded.

Some are intended for interactive use in REPL mode.

.cheat               ;; Print the cheat sheet.
.symbols             ;; Print the symbol table.

iOS and Android Screenshots

A little less versatile than the REPL mode, we also support an interactive mode for iOS and Android. Please don't do real work this way, but it's handy when studying an instruction set with pen and paper, away from a real laptop.

Defining a New Language

You can find instruction definitions in galang8051.cpp/.h. Many instructions are opcodes with no parameters, so they are defined as just a name and an opcode byte. gamnem is defined to be new GAMnemonic, and it's a common design pattern to shorten definitions like that when defining languages.

The simplest instruction is nop, which does nothing. It has an opcode of 00. It is defined like this within the constructor of GALang8051, along with a short help string and a shorter example. The example doubles as a unit test, and goodasm --8051 --test will fail if the examples do not properly assemble to their matching mnemonics.

insert(mnem("nop", 1, "\x00", "\xff"))
  ->help("No operation.")
  ->example("nop");

The swap instruction with opcode c4 is a little more complex, taking the accumulator as an argument. In this case, I first insert my mnemonic and define its help and example strings, then I call regname() to specify that the parameter must be a. Note that this defines both the machine language instruction c4 and all the parsing needed for the assembly language instruction swap a.

insert(mnem("swap", 1, "\xc4", "\xff"))
  ->help("Swap nybbles of the accumulator.")
  ->example("swap a")
  ->regname("a");

For instructions with common parameter types, we can quickly define arguments. You've already seen regname() which specifies exactly one register of a particular name, but other common types include abs() to specify the byte in memory at particular address, imm() to specify an immediate value in bytes, adr() and rel() to specify absolute and relative addresses, and others that are defined in gaparameter.h.

Here's how the pop instruction is written in 8051, popping a byte from the stack into memory at a specified address. We might write pop @0xde to pop one word from the stack into memory at DE; this will assemble to D0 DE.

insert(mnem("pop", 2, "\xd0\x00", "\xff\x00"))
   ->help("Pop from stack to mem[adr].")
   ->example("pop @0xff")
   ->abs("\x00\xff");

Relative addressing is used for short jumps in many architectures. The standard form assumes a signed relative offset from the program counter, and you should always specify an constant offset. This might be zero, but on CISC chips like 8051 it's more likely the length of the jump instruction, as the program counter has already advanced when the jump occurs.

Here we define a short jump with a relative offset in its second byte, plus a constant of 2 for the length of the instruction. Other 8051 jump instructions are three bytes long, and they use 3 for the constant offset in rel().

insert(mnem("sjmp", 2, "\x80\x00", "\xff\x00"))
        ->help("Short Jump.")
        ->example("loop: sjmp loop")
        ->rel("\x00\xff", 2);

Defining new Parameter Types

Sometimes you'll have a parameter that is totally unique to your chip, and these must be defined from scratch. Generally you need to define a constructor that sets the mask, a match() method that determines whether assembly code is matching, and encode()/decode() methods to convert between machine language and assembly. The parameter is self-contained; it does not care about other parameters or the opcode.

Sticking with the 8051 example, that machine has short jumps and short calls with an 11-bit address. This address is not relative, but rather overwrites the low 11 bits of the program counter, after the program counter is incremented by the instruction size.

First, we make a new class that inherits from GAParameter, overriding some important functions.

//Represents an 11-bit address offset on 8051
class GAParameter8051Addr11 : public GAParameter {
public:
    GAParameter8051Addr11(const char* mask);
    int match(GAParserOperand *op, int len) override;

    QString decode(GALanguage *lang, uint64_t adr,
                   const char *bytes, int inslen) override;
    void encode(GALanguage *lang,
                uint64_t adr, QByteArray &bytes,
                GAParserOperand op,
                int inslen
                ) override;
};

We don't want to keep re-implementing bit masking, so our methods here will call rawdecode() or rawencode() to read or write the appropriate bits of the instruction. In encoding, the GAParserOperand parameter provides the value as a string or a decoded integer.

Here is how we encode this 11 bit address.

void GAParameter8051Addr11::encode(GALanguage *lang,
            uint64_t adr, QByteArray &bytes,
            GAParserOperand op,
            int inslen
            ){
    int64_t val=op.uint64(true);
    val-=2;       // PC+2 will be applied when the CPU decodes the instruction.
    val&=~0xf800; // Mask off upper bits, which are not encoded.

    rawencode(lang,adr,bytes,op,inslen,val);
}

Once the methods are in place, a preprocessor directive lets us easily insert it into instructions.

//11 bit address field, kinda like an in-page branch.
#define addr11(x) insert(new GAParameter8051Addr11((x)))

insert(mnem("acall", 2, "\x11\x00", "\x1f\x00"))
  ->help("Calls a subroutine within 2kB.")
  ->example("loop: acall loop")
  ->addr11("\xe0\xff");

Classes to Know

I've taken care to comment these classes. Please read first the .h and then the .cpp of any class that you intend to use.

GALanguage tracks an assembly language, and later, it might also be used to track dialects. By default, the language has no instructions, and any unknown instructions will disassembler to a db directive for a data byte. galang*.cpp subclass this for specific languages by adding GAMnemonic records for every supported instruction.

GoodASM tracks the project as a whole. It is constructed around a GALanguage instance, and many of the flags that you specify on the CLI are implemented by public variables of this class, such as listadr/-a and listbytes/-b.

GAMnemonic implements the abstract idea of an instruction, roughly equivalent to one opcode in CISC. These are loaded into a GALanguage instance to teach it the individual opcodes, names, and parameters. Overlaps are resolved by setting priorities, and new parameter types can be generated by class inheritance when your particular language is a little strange.

GAInstruction implements a concrete instruction of a program, and its ->next() method can advance to the next instruction. This differs from GAMnemonic, whose job is to describe the generic instruction and its parameters rather than a known sequence of bytes. A GAInstruction might have come from a line of source code, or it might have come from byte in a file.

main.cpp implements the CLI and all its parameters. Until other examples of using this as a library are added, this will be your best source of examples for using the library.

galexer.cpp implements the low-level lexer, which feeds into gaparser.cpp for parsing. Parsing is very strictly shared by all languages, so that someday the parser can be rewritten without breaking code compatibility.

Similar Assembler/Disassemblers

Naken ASM

Similar Assemblers

Gasm80 and tinyasm by Oscar Toledo G, from whose Atari 2600 book many of the 6502 tests were built.

dasm the 8-bit Macro Assembler

The Macroassembler AS

Similar Disassemblers

MAME's Universal Disassembler

gittech. site