Knowing when to look past your code

There’s a weird tension in programming — on the one hand, as you learn the ropes, you (hopefully) learn very quickly that the problem is almost always in your code, and not, say, the compiler, stdlib, kernel, etc. This is usually very correct; the people who’ve worked on those things have many times the experience you did when you decided that there must be a bug in printf or something.

You’ll later realise you tried to print something through a pointer to a stack-allocated variable that’s long since gone. These accusations tend to wane as you gain familiarity with your subject matter, and wax as you step out into lands populated with ever more footguns, exposing more of the architecture than you ever suspected was there. (See also: the emails from me to the libev mailing list in 2011.)

At some point, though, your journies will take you to places where things aren’t so clear cut, and you’ll start to gain a sixth sense; a kind of visceral experience that things are not as they have been promised to be.

A few weeks ago, that sixth sense whispered in my ear: “what if, instead of your cruddy bootloader written in a pre-1.0 systems language for a platform you don’t fully understand, it’s the 20 year-old project with 80,000 commits that’s wrong?” And it was right.

Daintree’s bootloader, dainboot, worked great on QEMU, but would fail hard and fast on hardware with synchronous aborts. It’s a UEFI application, which means we get a lot for free – see how easy it is to search connected FAT filesystems for binaries. I don’t particularly want to spend much effort on a bootloader, so this makes sense to me.

QEMU comes bundled with a build of TianoCore EDK2, which makes it really easy to get started. On my ROCKPro64 I have a build of U-Boot installed to the eMMC, an extremely versatile bootloader found on all kinds of devices1. It makes a TFTP-based development cycle remarkably pleasant.

But different bootloaders mean very different execution environments. For one, EDK2 seems to execute the UEFI application in EL1, whereas U-Boot gives over control in EL2. There are many, many differences in the state of the various system control registers. And in this case, right in the beginning, we were getting an exception in U-Boot before we’d done barely any work:

Booting /efi\boot\BOOTAA64.efi
AC"Synchronous Abort" handler, esr 0x96000010
elr: fffffffffd1c4d98 lr : fffffffffd1c4d5c (reloc)
elr: 0000000078f07d98 lr : 0000000078f07d5c
x0 : 0000000000000000 x1 : 0000000000000000
x2 : 000000007bfdf450 x3 : 000000000000004c
x4 : 0000000000002800 x5 : 000000007bfdf480
x6 : 000000007bfdcb50 x7 : 0000000079f71680
x8 : 0000000078f0ab78 x9 : 0000000000003b5c
x10: 0000000000000000 x11: 0000000000000020
x12: 000000000000ed83 x13: 000000000000ed9c
x14: 0000000079f29d28 x15: 0000000008100000
x16: 0000000000000010 x17: 0000000000000000
x18: 0000000000000000 x19: 000000007bf43b78
x20: 0000000079f29fa0 x21: 0000000078f10040
x22: 0000000000005800 x23: 0000000079f542e0
x24: 000000007bff4eac x25: 0000000000000000
x26: 0000000000000000 x27: 0000000000000000
x28: 0000000079f4ba00 x29: 0000000079f29a30

Code: f9400fe8 f9400109 f94007ea 8b0a0129 (3940012b)
UEFI image [0x0000000078f07000:0x0000000078f0c35f] pc=0xd98

The disassembly showed that at pc=0xd98 in the image we were attempting to load a byte from the address pointed to by register x9:

d98: 2b 01 40 39                   ldrb    w11, [x9]

In the register dump, x9 has the value 0000000000003b5c, whereas we are clearly relocated much higher in memory (note the UEFI image offset; the pc is relative to that). Should x9 actually have a higher computed address? I hacked some things together to get a register dump from a similar place in QEMU (on EDK2, where this all worked):

(gdb) info registers
x0             0x0                 0
x1             0x2                 2
x2             0x1                 1
x3             0x5f46d944          1598478660
x4             0x43                67
x5             0x0                 0
x6             0x70616d6d          1885433197
x7             0x0                 0
x8             0x5c1f5b78          1545558904
x9             0x5c1f5b5c          1545558876
x10            0x0                 0
x11            0x64                100
x12            0x0                 0
x13            0x8                 8
x14            0x0                 0
x15            0x0                 0
x16            0x5f6b4ab0          1600866992
x17            0xffffa6ac          4294944428
x18            0x0                 0
x19            0x0                 0
x20            0x5f37d000          1597493248
x21            0x5f37f000          1597501440
x22            0x0                 0
x23            0x5f37f000          1597501440
x24            0x0                 0
x25            0x1                 1
x26            0x0                 0
x27            0x5f37f000          1597501440
x28            0x5f37d55d          1597494621
x29            0x5f6b45b0          1600865712
x30            0x5c1f2d5c          1545547100
sp             0x0                 0x0
pc             0x5c1f2da0          0x5c1f2da0

I’ve highlighted x9 – it has a value that’s clearly much more valid-pointer-looking, and it even ends in ...b5c, which makes me think it’s a corrected version of the 3b5c value we saw on the ROCKPro64.

What could cause a register to have a correct-looking address on one platform but not on the other? Let’s look at the code up and until the point where things fall apart. Unfortunately, UEFI code objects are all COFFs. I’m super inexperienced with these, and so too it turns out is the tooling in the area; I think it must be a bit of a hack that Zig or LLVM knows how to produce them, because it also produces a PDB alongside that presumably contains the debugging/line info, but then llvm-objdump refuses to use the same thing, helpfully declaring:

llvm-objdump: warning: 'dainboot/zig-cache/bin/BOOTAA64.
rockpro64.efi': failed to parse debug information for

Hand-written disassembly.

(The right-hand page has the disassembly in reverse, since we only have the state of registers at the point in time of the last instruction and have to trace data dependencies backwards. The precise values are different because they shifted every time I added some breaks or debugging helpers anywhere.)

I have no line numbers – it’s up to disassembly and guesswork. Here’s the former:

d8c: 09 01 40 f9                   ldr     x9, [x8]
d90: ea 07 40 f9                   ldr     x10, [sp, #8]
d94: 29 01 0a 8b                   add     x9, x9, x10
d98: 2b 01 40 39                   ldrb    w11, [x9]

I’ve grown very fond of aarch64 (dis)assembly. Look at those four-byte instructions. This experience was a crash course in learning it.

So what have we here? Translated into pseudo-C:

  • The faulting instruction tries to load a byte from the address stored in x9, which we know to be 00003b5c, i.e. w11 = *(u8 *)x9.
  • x9 was calculated as x9 = x9 + x10.
  • x10 came from x10 = *(sp + 8).
  • x9 came from x9 = *x8.

At first I thought that x10 must’ve been some relocation base, but it’s zero on both QEMU and hardware. The pointer is whatever we get from x8, which is 0000000078f0ab78 on hardware and 0x5c1f5b78 on QEMU. The last few digits line up nicely, again, so it looks like whatever’s at that address is a relocated address on QEMU/EDK2 but just not on ROCKPro64/U-Boot. What even is at that address? Let’s look at the symbol table.

U-Boot told us in the exception dump that the UEFI image was loaded at 0x78f07000, so 0x78f0ab78 is at offset 0x3b78 from the start of the image. What’s that? objdump to the rescue. It’s in our .data section:

3b70 287b737d 290d0a00 5c3b0000 00000000  ({s})...........

I really had to squint at this for a moment before realising that 0x3b78 actually contains a 64-bit pointer value, 0x00000000003b5c — in other words, the exact value of x8 we saw! So this was pulled directly out of the loaded image. Two questions arose: what is it? And how is it that QEMU/EDK2 got something different here?

It felt off that it should point to a value directly before itself in memory. Here’s context:

3b50 02200000 00000000 00000000 6461696e  ............dain
3b60 74726565 20626f6f 746c6f61 64657220  tree bootloader 
3b70 287b737d 290d0a00 5c3b0000 00000000  ({s})...........

A string! 0x3b78 points to 0x3b5c, which is the greeting string printed from the bootloader. Why this indirection in the binary itself?

There are three sections in the PE/COFF files generated by the build process: .text, which contains the executable code, .data, which contains strings and other bits, and .reloc. It still felt like this was a relocation issue, so I read Microsoft’s PE Format documentation carefully. objdump was doing a lot of the heavy lifting for us, but to really understand it, I wanted to pull apart the format myself. This approach would turn out to be invaluable.

The PE relocation section is comprised of a number of base relocation blocks, which defines a 32-bit base address for the block, each with any number of entries that specify the type of relocation and the 12-bit offset in that block to apply it at. Here’s a decoded .reloc we got:

Page RVA: 0xa000
28 relocations:
  0xaca8 0xacb8 0xacc8 0xacd8 0xace8 0xacf8 0xad08 0xad18
  0xad28 0xad38 0xad48 0xad58 0xad68 0xad78 0xadc8 0xadd8
  0xae10 0xae40 0xae90 0xaeb0 0xaef8 0xaf00 0xaf08 0xaf10
  0xaf48 0xaf88 0xafb8 0xafe0
Page RVA: 0xb000
53 relocations:
  0xb010 0xb040 0xb070 0xb0c0 0xb100 0xb160 0xb188 0xb1a0
  0xb210 0xb240 0xb270 0xb2a0 0xb2d0 0xb300 0xb328 0xb370
  0xb3b0 0xb3f8 0xb450 0xb4a0 0xb4b0 0xb4c8 0xb528 0xb578
  0xb5c0 0xb610 0xb680 0xb6d0 0xb720 0xb778 0xb7c8 0xb7e0
  0xb848 0xb898 0xb8e8 0xb938 0xb9c8 0xba18 0xba68 0xbab8
  0xbb08 0xbb58 0xbba8 0xbbd8 0xbc08 0xbc40 0xbcd0 0xbd20
  0xbd70 0xbdc0 0xbe18 0xbe68 0xbea8

objdump couldn’t give us this! (The addresses don’t line up with the above because I continued to hack on and modify things, shifting everything in the binary around. This was as frustrating as it might seem. These lined up exactly with the indirection seen above.)

For my purposes, I ignore all entries except of the type IMAGE_REL_BASED_DIR64: “The base relocation applies the difference to the 64-bit field at offset.” It turns out Zig/LLVM only generates those, and they work as simply as they sound: at the address specified by the relocation entry, treat it as a 64-bit field and add the relocation offset to it.

So, a PE loader should, after loading all data, visit all those addresses and add the relocation offset to them. It seemed like that was happening correctly on QEMU/EDK2 — when we loaded the addresses, they had been shifted for us. But why not on U-Boot?

To answer this, I looked at the U-Boot source code. And when looking wasn’t enough, it was time to printf debug, which meant getting a U-Boot build that actually ran on my hardware. There’s a 4+ hour gap between the two blocks of chat here:

"finally got my own uboot build running!! fuck"

(I nuked the boot environment quite a few times getting to this point.)

printf debugging to the rescue: U-Boot wasn’t doing any of the relocations, at all. It thought there weren’t any:

doing relocations (rel 0x0000000078efb2e0 rel_size 0x68
                   efi_reloc 0x0000000078ef6000
                   image_base 0x0)
DAINDBG: delta: 0x78ef6000
rel: 0x0000000078efb2e0, end: 0x0000000078efb348,
     rel->SizeOfBlock: 0x0

rel->SizeOfBlock here corresponds to the Block Size field in the base relocation block. My tools were clearly showing that block had a very non-zero size, so how was U-Boot getting a different idea?

I tried dumping the table raw, 32 bytes as hex to see what we got, and it was all zero. Something was still up. I had to keep going back.

"so now i need to suspect the code immediately

I riddled the loader with printfs:

DAINDBG: loading section[2]: VA 0x0000000078efb2e0 --
    setting 0x68 bytes to zero, then loading 0x200 bytes
    from 0x0000000002085600
    (efi 0x0000000002080000 + PointerToRawData 0x5600)
DAINDBG: loading section[1]: VA 0x0000000078ef9860 --
    setting 0x1a75 bytes to zero, then loading 0x1c00 bytes
    from 0x0000000002083a00
    (efi 0x0000000002080000 + PointerToRawData 0x3a00)
DAINDBG: loading section[0]: VA 0x0000000078ef6200 --
    setting 0x364c bytes to zero, then loading 0x3800 bytes
    from 0x0000000002080200
    (efi 0x0000000002080000 + PointerToRawData 0x200)

Here’s the code that corresponds to the above, with the printf re-approximated post-hoc for your reading pleasure:

/* Load sections into RAM */
for (i = num_sections - 1; i >= 0; i--) {
	/* XXX */
	printf("DAINDBG: loading section[%d]: VA %p -- "
	       "setting 0x%x bytes to zero, then loading "
	       "0x%x bytes from %p (efi %p + "
	       "PointerToRawData %x)\n", 
	       efi_reloc + sec->VirtualAddress,
	       efi + sec->PointerToRawData,
	/* XXX */
	IMAGE_SECTION_HEADER *sec = &sections[i];
	memset(efi_reloc + sec->VirtualAddress, 0,
	memcpy(efi_reloc + sec->VirtualAddress,
	       efi + sec->PointerToRawData,

Firstly, we load them in reverse. Weird, but okay. Secondly, it became clear that section[1] (.data) overlaps section[2] (.reloc) by 384 bytes (!!):

  1. .reloc: load at 0x78efb2e0, zero 0x68 bytes at start, then load 0x200 bytes at start, until 0x78efb4e0 non-incl.
  2. .data: load at 0x78ef9860, zero 0x1a75 bytes at start, then load 0x1c00 bytes at start, until 0x78efb460 non-incl. (0x180 bytes into .reloc target !!)
  3. .text: load at 0x78ef6200, zero 0x364c bytes at start, then load 0x3800 bytes at start, until 0x78ef9a00 non-incl. (0x1a0 bytes into .data target !!)

Oh boy. I didn’t even notice until now that the .text overlapped .data too. Who knows what would’ve happened if the sections aligned such that the relocations worked. It would’ve been much harder to diagnose.

Why would this happen? It seems like VirtualAddress + SizeOfRawData overlaps with other VirtualAddresses. What does the PE format say about VirtualAddress?

For executable images, the address of the first byte of the section relative to the image base when the section is loaded into memory. For object files, this field is the address of the first byte before relocation is applied; for simplicity, compilers should set this to zero. Otherwise, it is an arbitrary value that is subtracted from offsets during relocation.

SizeOfRawData is below it and caught my eye. Emphasis mine:

The size of the section (for object files) or the size of the initialized data on disk (for image files). For executable images, this must be a multiple of FileAlignment from the optional header. If this is less than VirtualSize, the remainder of the section is zero-filled. Because the SizeOfRawData field is rounded but the VirtualSize field is not, it is possible for SizeOfRawData to be greater than VirtualSize as well. When a section contains only uninitialized data, this field should be zero.

SizeOfRawData is rounded to a multiple of FileAlignment, which the docs tell us is typically 512 bytes. (Notice they all divide by 0x200.) But that’s presumably just because of the way they align in the file! What is VirtualSize, which until now we’re only using to determine how much to zero?

The total size of the section when loaded into memory. If this value is greater than SizeOfRawData, the section is zero-padded. This field is valid only for executable images and should be set to zero for object files.

The total size. It seems pretty clear that we should never load more than this many bytes, even if SizeOfRawData happens to be bigger. The size of the section can’t be bigger than this. If we were to constrain our memcpys to min(VirtualSize, SizeOfRawData), we get this instead:

  1. .reloc: load at 0x78efb2e0, zero 0x68 bytes at start, then load 0x200 0x68 bytes at start, until 0x78efb348 non-incl.
  2. .data: load at 0x78ef9860, zero 0x1a75 bytes at start, then load 0x1c00 0x1a75 bytes at start, until 0x78efb2d5 non-incl. (before .reloc begins)
  3. .text: load at 0x78ef6200, zero 0x364c bytes at start, then load 0x3800 0x364c bytes at start, until 0x78ef984c non-incl. (before .data begins)

It looked like a bug. If U-Boot loaded sections forwards, this wouldn’t have been exposed, but either way it appeared to be an error to do this at all. The section shouldn’t be loaded beyond its VirtualSize.

A quick trip to the EDK2 PE loader shows they load at most min(VirtualSize, SizeOfRawData) bytes into memory, and then pad up to VirtualSize with zeroes if needed. (The zeroing behaviour is for BSS-style initialisation.) They never touch memory past VirtualSize bytes.

 memcpy(efi_reloc + sec->VirtualAddress,
        efi + sec->PointerToRawData,
-       sec->SizeOfRawData);
+       min(sec->Misc.VirtualSize, sec->SizeOfRawData));

One short conversation later, and the bug was fixed.

Screenshot of it working.

(This is what success looks like.)

Sometimes, the problem is not in your code.

  1. Maybe it’d be nice if I could slap EDK2 on the ROCKPro64 – edk2-porting/edk2-rk3399 appears to be the strongest effort in this area so far, and it doesn’t look great. U-Boot is mature on many embedded platforms.