posts/20240817-aarch64-bootstrapping-1.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266

---
date: 2024-08-17
title: AArch64 Bootstrapping 1 - Seed Groundwork
tags:
- bootstrapping
- aarch64
- assembly
---

I got a new laptop[^newlaptop] recently, and in the spirit of how open and
accessible and tinkerable the whole system is, I thought it would be fun to try
and do a little bootstrapping challenge. Starting from a tiny hand-made ELF
executable "seed," I'm going to try to build a complete programming environment
from scratch.

<!--more-->
[^newlaptop]: [MNT Pocket Reform](https://www.crowdsupply.com/mnt/pocket-reform)

Ideally I'd like to have started off by `cat`ting raw hexadecimal into a file
and building up from there,[^bcompiler] or something to that effect, but my new
laptop uses AArch64, also known as Arm64.[^aarch64] That's the instruction set
used in most smartphones and newer Macs, and as a RISC-ish ISA it's quite a bit
cleaner than the convoluted mess x86/64 has grown into, but it's also
significantly more difficult to write by hand. I gave it a good shot, but on
top of the instruction fields being weirdly aligned and my near-complete
inexperience with directly using the instruction set, I finally relented when I
realized that ELF executables have to be little-endian,[^endian] meaning I'd
have to write the raw, unfamiliar instructions *backwards.*

[^bcompiler]: [`bcompiler` --- one of the inspirations for this project](https://github.com/certik/bcompiler)
[^aarch64]: [AArch64 instruction set quick reference sheet](https://courses.cs.washington.edu/courses/cse469/19wi/arm64.pdf)
[^endian]: [Wikipedia: Endianness](https://en.wikipedia.org/wiki/Endianness)

Fortunately, the endianness thing is a pretty straightforward problem to solve,
so I thought I'd try using that as a warmup. What I'd like is a program that
takes in a space-delimited stream of varying sizes of numbers in hexadecimal,
and spits them out the other side "compiled" to binary in little-endian byte
order. Giving it a first go in C to make sure I had my implementation concept
down right, I came up with this:

```c:0
// hex.c
#include <stdio.h>

int main(void) {
  char c = 0;      // Input character
  int count = 0;   // Hex digits parsed for this number
  long long v = 0; // Current value

  while ((c = getchar()) != EOF) {

    // Skip comments
    if (c == '\\') { while ((c = getchar()) != EOF && c != '\n') {} }

    // Process incoming digits and add to running value
    else if (('0' <= c) && (c <= '9')) { v = (v << 4) + (c - '0'); count++; }
    else if (('a' <= c) && (c <= 'f')) { v = (v << 4) + (c - 'a' + 10); count++; }

    // Send the finished number out the other end
    else if (c == ' ') {

      // Turn the number of hex digits into a number of bytes
      if (count & 1) count++;
      count >>= 1;

      // Emit each byte, littlest first
      for (; count>0; count--) {
        c = v & 0xFF;
        putchar(c);
        v >>= 8;
      }
    }
  }
}
```

That worked like a charm, but C is still pretty high-level for where we're
going, so I'm not going to spend too long going over it here. Instead, I
rewrote it in assembly. Given that I went for a mostly-direct translation, odds
are there's plenty more efficiency that could be squeezed out of it in terms of
speed or instruction count, but I'm really only planning on using this on
itself and the very next stage, so I figure if it works correctly, that's good
enough. I used the GNU `as` assembler because I was already familiar with it,
and I find the ephemeral numbered labels especially nifty.

Starting off, we set up a couple "global" variables, then begin the main loop
by pulling in a new character and checking if we've reached the end of the
input, and exiting if so.

```armasm:0:2
// hex.s
.section .data
buf:
  .byte 0       // input/output character buffer

.section .text
.global _start
_start:
  mov x19, #0   // value
  mov x20, #0   // count

loop:
  mov x8, #63   // read
  mov x0, #0    // stdin
  adr x1, buf
  mov x2, #1
  svc 0

  cmp x0, #0    // count of bytes read; 0 means EOF
  b.ne 1f

exit:
  mov x8, #93   // exit
  mov x0, #0    // error code (no error)
  svc 0
```

System calls are made in AArch64 Linux by loading the system call
number[^syscalls] into the register `x8`, plus the syscall arguments into some
of the other registers, and then running the "supervisor call" instruction `svc
0` which handles the transition over to the kernel's control.

[^syscalls]: [AArch64 Linux system call reference sheet](https://arm64.syscall.sh/)

For the `read` syscall (63), the arguments are the source file descriptor, a
pointer to the target character buffer(!), and the requested length. Note that
the call specifically requires a destination *buffer;* as far as I can tell,
there's no way to request a single char read directly into a register. So, we
need to load a register with a pointer to our one-byte buffer, then after the
syscall returns, we load the character from there.

AArch64 has a special instruction, `adr`, for loading a register with an
absolute memory address given a program-counter-relative offset. My
understanding is that this is the idiomatic way to do this, even though it
might prove to be a bit of a pain when we're manually calculating addresses and
offsets later.

Once the syscall has placed the byte into the buffer, we pull it out and check
to see if it's a backslash, meaning we're at the start of a comment. If so, we
enter an inner loop where we read characters repeatedly until we find the end
of the line, then we start the main loop again from the top.

```armasm:26:2
1:
  adr x1, buf
  ldurb w9, [x1]

  cmp w9, #92   // backslash; start of comment
  b.ne 2f

comment:
  mov x8, #63   // read
  mov x0, #0    // stdin
  adr x1, buf
  mov x2, #1
  svc 0

  cmp x0, #0
  b.eq exit

  adr x1, buf
  ldurb w9, [x1]
  cmp w9, '\n'
  b.ne comment
  b loop
```

If it's not a comment, then it might be a hex digit we need to collect into the
current value. I broke this part up into three sections: one that finds the
value of decimal digits, one that finds the value of hex digits `a` through `f`
(case sensitive!), and one that shifts the digit into the current large value
we're building. Once that's done, we head back up to the top of the main loop
to process the next character.

```armasm:49:2
2:
  cmp w9, '0'
  b.lt 3f
  cmp w9, '9'
  b.gt 3f

  sub w9, w9, '0'
  b 4f

3:
  cmp w9, 'a'
  b.lt 5f
  cmp w9, 'f'
  b.gt 5f

  sub w9, w9, ('a' - 10)
  b 4f

4:
  lsl x19, x19, #4
  add x19, x19, w9, uxtb
  add x20, x20, #1
  b loop
```

Note the use of the... flag? `uxtb` in the `add` instruction on line 69. This
signals the processor to treat the value in `w9` as an unsigned (i.e.
zero-extended) byte for the purposes of the addition. Technically we only need
a nibble (4 bits, not 8), but since we know the range of values in `w9` is only
0--15, byte-sized extension works fine. To be honest, you could probably just
use the full-width `x9` and avoid needing to specify an extension scheme at
all, but I felt like it would be nice to keep using `w9` for consistency. It
also showed me something about the ISA structure that I wouldn't otherwise have
thought to look into.

If it's not a comment and it's not a hex digit, then it's either a character
we're ignoring (like a line break) or it's a space, which we use as the cue to
flush the current value to the output.

```armasm:73:2
5:
  cmp w9, ' '
  b.ne loop
```

To flush the output, first we normalize the count of nibbles into a count of
bytes: we round up (i.e. increment if it's odd), then we shift right one bit
(dividing the count by two). The `tst` instruction used here differs from the
`cmp` instruction we've been using everywhere else in that it does the
comparison using a bitwise "and" operation, rather than subtraction.

```armasm:77:2
  tst x20, #1
  cinc x20, x20, ne
  lsr x20, x20, #1
```

Finally, we enter a loop for each byte. We store the unsigned low-order byte
from our big value in `x19` (accessed through the 32-bit sized `w19`), send it
out using the `write` syscall, shift our big value over by 8 to put the next
byte in the low-order spot, then decrement the byte count until we're done
handling this value and we jump back up to the top of the main loop to start on
the next value.

```armasm:81:2
6:
  cmp x20, #0
  b.eq loop

  mov x8, #64   // write
  mov x0, #1    // stdout
  adr x1, buf
  sturb w19, [x1]
  mov x2, #1
  svc 0

  lsr x19, x19, #8

  sub x20, x20, #1
  b 6b
```

That wasn't too bad! In the next post, we'll test this new "language" out with
a few small executables, then rewrite this program in hex itself, then *maybe*
start on the next stage, a tiny little Forth implementation.

Until then! :wave:

---

Files:&ensp;[hex.c](/20240817-aarch64-bootstrapping-1/hex.c)&ensp;[hex.s](/20240817-aarch64-bootstrapping-1/hex.s)