[post] AArch64 Bootstrapping 1

author: Alexis Hovorka <[email protected]> 2024-08-17 19:32:33 -0600
committer: Alexis Hovorka <[email protected]> 2024-08-17 19:32:33 -0600
commit: 3712c0af48b9cc68fe52729cb7aa24006e35cb5d (patch)
tree: 14ebf05a0335624d56b171c7f17450d0baa339bb /posts
parent: d2a685e50d20101a42b7d9d08d1b90bb92defcd3 (diff)
1 files changed, 265 insertions, 0 deletions
diff --git a/posts/20240817-aarch64-bootstrapping-1.md b/posts/20240817-aarch64-bootstrapping-1.md
new file mode 100644
index 0000000..34e7b76
--- /dev/null
+++ b/posts/20240817-aarch64-bootstrapping-1.md
@@ -0,0 +1,265 @@
+---
+date: 2024-08-17
+title: AArch64 Bootstrapping 1 - Seed Groundwork
+tags:
+- bootstrapping
+- aarch64
+- assembly
+---
+
+I got a new laptop[^newlaptop] recently, and in the spirit of how open and
+accessible and tinkerable the whole system is, I thought it would be fun to try
+and do a little bootstrapping challenge. Starting from a tiny hand-made ELF
+executable "seed," I'm going to try to build a complete programming environment
+from scratch.
+
+<!--more-->
+[^newlaptop]: [MNT Pocket Reform](https://www.crowdsupply.com/mnt/pocket-reform)
+
+Ideally I'd like to have started off by `cat`ting raw hexadecimal into a file
+and building up from there, or something to that effect, but my new laptop uses
+AArch64, also known as Arm64.[^aarch64] That's the instruction set used in most
+smartphones and newer Macs, and as a RISC-ish ISA it's quite a bit cleaner than
+the convoluted mess x86/64 has grown into, but it's also significantly more
+difficult to write by hand. I gave it a good shot, but on top of the
+instruction fields being weirdly aligned and my near-complete inexperience with
+directly using the instruction set, I finally relented when I realized that ELF
+executables have to be little-endian,[^endian] meaning I'd have to write the
+raw, unfamiliar instructions *backwards.*
+
+[^aarch64]: [AArch64 instruction set quick reference sheet](https://courses.cs.washington.edu/courses/cse469/19wi/arm64.pdf)
+[^endian]: [Wikipedia: Endianness](https://en.wikipedia.org/wiki/Endianness)
+
+Fortunately, the endianness thing is a pretty straightforward problem to solve,
+so I thought I'd try using that as a warmup. What I'd like is a program that
+takes in a space-delimited stream of varying sizes of numbers in hexadecimal,
+and spits them out the other side "compiled" to binary in little-endian byte
+order. Giving it a first go in C to make sure I had my implementation concept
+down right, I came up with this:
+
+```c:0
+// hex.c
+#include <stdio.h>
+
+int main(void) {
+  char c = 0;      // Input character
+  int count = 0;   // Hex digits parsed for this number
+  long long v = 0; // Current value
+
+  while ((c = getchar()) != EOF) {
+
+    // Skip comments
+    if (c == '\\') { while ((c = getchar()) != EOF && c != '\n') {} }
+
+    // Process incoming digits and add to running value
+    else if (('0' <= c) && (c <= '9')) { v = (v << 4) + (c - '0'); count++; }
+    else if (('a' <= c) && (c <= 'f')) { v = (v << 4) + (c - 'a' + 10); count++; }
+
+    // Send the finished number out the other end
+    else if (c == ' ') {
+
+      // Turn the number of hex digits into a number of bytes
+      if (count & 1) count++;
+      count >>= 1;
+
+      // Emit each byte, littlest first
+      for (; count>0; count--) {
+        c = v & 0xFF;
+        putchar(c);
+        v >>= 8;
+      }
+    }
+  }
+}
+```
+
+That worked like a charm, but C is still pretty high-level for where we're
+going, so I'm not going to spend too long going over it here. Instead, I
+rewrote it in assembly. Given that I went for a mostly-direct translation, odds
+are there's plenty more efficiency that could be squeezed out of it in terms of
+speed or instruction count, but I'm really only planning on using this on
+itself and the very next stage, so I figure if it works correctly, that's good
+enough. I used the GNU `as` assembler because I was already familiar with it,
+and I find the ephemeral numbered labels especially nifty.
+
+Starting off, we set up a couple "global" variables, then begin the main loop
+by pulling in a new character and checking if we've reached the end of the
+input, and exiting if so.
+
+```armasm:0:2
+// hex.s
+.section .data
+buf:
+  .byte 0       // input/output character buffer
+
+.section .text
+.global _start
+_start:
+  mov x19, #0   // value
+  mov x20, #0   // count
+
+loop:
+  mov x8, #63   // read
+  mov x0, #0    // stdin
+  adr x1, buf
+  mov x2, #1
+  svc 0
+
+  cmp x0, #0    // count of bytes read; 0 means EOF
+  b.ne 1f
+
+exit:
+  mov x8, #93   // exit
+  mov x0, #0    // error code (no error)
+  svc 0
+```
+
+System calls are made in AArch64 Linux by loading the system call
+number[^syscalls] into the register `x8`, plus the syscall arguments into some
+of the other registers, and then running the "supervisor call" instruction `svc
+0` which handles the transition over to the kernel's control.
+
+[^syscalls]: [AArch64 Linux system call reference sheet](https://arm64.syscall.sh/)
+
+For the `read` syscall (63), the arguments are the source file descriptor, a
+pointer to the target character buffer(!), and the requested length. Note that
+the call specifically requires a destination *buffer;* as far as I can tell,
+there's no way to request a single char read directly into a register. So, we
+need to load a register with a pointer to our one-byte buffer, then after the
+syscall returns, we load the character from there.
+
+AArch64 has a special instruction, `adr`, for loading a register with an
+absolute memory address given a program-counter-relative offset. My
+understanding is that this is the idiomatic way to do this, even though it
+might prove to be a bit of a pain when we're manually calculating addresses and
+offsets later.
+
+Once the syscall has placed the byte into the buffer, we pull it out and check
+to see if it's a backslash, meaning we're at the start of a comment. If so, we
+enter an inner loop where we read characters repeatedly until we find the end
+of the line, then we start the main loop again from the top.
+
+```armasm:26:2
+1:
+  adr x1, buf
+  ldurb w9, [x1]
+
+  cmp w9, #92   // backslash; start of comment
+  b.ne 2f
+
+comment:
+  mov x8, #63   // read
+  mov x0, #0    // stdin
+  adr x1, buf
+  mov x2, #1
+  svc 0
+
+  cmp x0, #0
+  b.eq exit
+
+  adr x1, buf
+  ldurb w9, [x1]
+  cmp w9, '\n'
+  b.ne comment
+  b loop
+```
+
+If it's not a comment, then it might be a hex digit we need to collect into the
+current value. I broke this part up into three sections: one that finds the
+value of decimal digits, one that finds the value of hex digits `a` through `f`
+(case sensitive!), and one that shifts the digit into the current large value
+we're building. Once that's done, we head back up to the top of the main loop
+to process the next character.
+
+```armasm:49:2
+2:
+  cmp w9, '0'
+  b.lt 3f
+  cmp w9, '9'
+  b.gt 3f
+
+  sub w9, w9, '0'
+  b 4f
+
+3:
+  cmp w9, 'a'
+  b.lt 5f
+  cmp w9, 'f'
+  b.gt 5f
+
+  sub w9, w9, ('a' - 10)
+  b 4f
+
+4:
+  lsl x19, x19, #4
+  add x19, x19, w9, uxtb
+  add x20, x20, #1
+  b loop
+```
+
+Note the use of the... flag? `uxtb` in the `add` instruction on line 69. This
+signals the processor to treat the value in `w9` as an unsigned (i.e.
+zero-extended) byte for the purposes of the addition. Technically we only need
+a nibble (4 bits, not 8), but since we know the range of values in `w9` is only
+0--15, byte-sized extension works fine. To be honest, you could probably just
+use the full-width `x9` and avoid needing to specify an extension scheme at
+all, but I felt like it would be nice to keep using `w9` for consistency. It
+also showed me something about the ISA structure that I wouldn't otherwise have
+thought to look into.
+
+If it's not a comment and it's not a hex digit, then it's either a character
+we're ignoring (like a line break) or it's a space, which we use as the cue to
+flush the current value to the output.
+
+```armasm:73:2
+5:
+  cmp w9, ' '
+  b.ne loop
+```
+
+To flush the output, first we normalize the count of nibbles into a count of
+bytes: we round up (i.e. increment if it's odd), then we shift right one bit
+(dividing the count by two). The `tst` instruction used here differs from the
+`cmp` instruction we've been using everywhere else in that it does the
+comparison using a bitwise "and" operation, rather than subtraction.
+
+```armasm:77:2
+  tst x20, #1
+  cinc x20, x20, ne
+  lsr x20, x20, #1
+```
+
+Finally, we enter a loop for each byte. We store the unsigned low-order byte
+from our big value in `x19` (accessed through the 32-bit sized `w19`), send it
+out using the `write` syscall, shift our big value over by 8 to put the next
+byte in the low-order spot, then decrement the byte count until we're done
+handling this value and we jump back up to the top of the main loop to start on
+the next value.
+
+```armasm:81:2
+6:
+  cmp x20, #0
+  b.eq loop
+
+  mov x8, #64   // write
+  mov x0, #1    // stdout
+  adr x1, buf
+  sturb w19, [x1]
+  mov x2, #1
+  svc 0
+
+  lsr x19, x19, #8
+
+  sub x20, x20, #1
+  b 6b
+```
+
+That wasn't too bad! In the next post, we'll test this new "language" out with
+a few small executables, then rewrite this program in hex itself, then *maybe*
+start on the next stage, a tiny little Forth implementation.
+
+Until then! :wave:
+
+---
+
+Files:&ensp;[hex.c](/20240817-aarch64-bootstrapping-1/hex.c)&ensp;[hex.s](/20240817-aarch64-bootstrapping-1/hex.s)
author	Alexis Hovorka <[email protected]>	2024-08-17 19:32:33 -0600
committer	Alexis Hovorka <[email protected]>	2024-08-17 19:32:33 -0600
commit	3712c0af48b9cc68fe52729cb7aa24006e35cb5d (patch)
tree	14ebf05a0335624d56b171c7f17450d0baa339bb /posts
parent	d2a685e50d20101a42b7d9d08d1b90bb92defcd3 (diff)