Skip to content

Lexical_analysis

k-off edited this page Sep 27, 2019 · 5 revisions

Before the actual translation of the assembly code into the byte-code, program will perform a so called lexical analysis of the source code.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).

Source: Lexical analysis — Wikipedia

the campion's code will be parsed into components, each of them being assigned to a certain type.

For example, following string:

loop: sti r1, %:live, %1

... will be decomposed into tokens:

Nr Content Type
1 loop: LABEL
2 sti INSTRUCTION
3 r1 REGISTER
4 , SEPARATOR
5 %:live DIRECT_LABEL
6 , SEPARATOR
7 %1 DIRECT

It is possible to study in detail how the provided translator program identifies each type of token using the error messages it displays.

How it works

Brute-forcing original asm with wrong inputs and reading it's error messages we can find out the following:

Commands order

Places of commands for name and comment definitions can be switched

.comment    "This city needs me"
.name       "Batman"

STRING

STRING token contains everything between opening and closing double quotation marks ", even if there are new line characters \n and comment characters (# or ;) in between.

The following example is absolutely correct:

.name "one
#two
three"

Extra tabs and spaces

If translator can distinguish two components without whitespaces in between them, spaces can be omitted (valid examples):

live%42
loop:sti r1, %:live, %1
.name"Batman"
.comment"This city needs me"

But obviously there are cases where whitespaces are mandatory.

±0

Arguments can contain -, but a + will lead to an error message.

There also can be present some leading zeroes:

live %0000042

Registries

The provided asm recognises as a registry a string containing character r with one or two trailing digits.

The following examples are valid:

r2
r01
r99

Last example deserves some attention.

Lack of error messages on registry 99 means asm doesn't know about virtual machines' configuration and is not connected to the constant REG_NUMBER.

This is absolutely logic. Translator asm and the virtual machine are different programs which are not aware of each other's configuration (or even existence).

Invalid byte-code may become valid (and vice versa) if we change value of the constant REG_NUMBER.

If value of REG_NUMBER is 16, argument r42 is incorrect. If value of constant is 49 argument becomes valid.

Two important cases (r0 and r00)

This two arguments are translated without error messages by the original asm. But this contradicts to the subject of the project, which clearly states:

Registry: (r1 <–> rx with x = REG_NUMBER)

So if we can explain behaviour of translator with the top boundary (it's not connected to REG_NUMBER), arguments r0 and r00 are always invalid.

Executable code size

The size of executable code is not restricted by the translator. It will translate both files with one instruction and with one million instructions.

The size is limited in the virtual machine. The max size is defined in the constant CHAMP_MAX_SIZE in the file op.h:

# define MEM_SIZE           (4 * 1024)
// ...
# define CHAMP_MAX_SIZE     (MEM_SIZE / 6)

So max size is 682 bytes.

The min size is not limited by the virtual machine (it will run a champion without executable code).

But it is quiet difficult to create such a file. If .s file contains name and comment of champion and nothing more, it is invalid. But if we add at least one label to the end of file, it will be successfully translated.

.comment    "This city needs me"
.name       "Batman"

loop:
# EOF