Knowing when to look past your code
There’s a weird tension in programming — on the one hand, as you learn the
ropes, you (hopefully) learn very quickly that the problem is almost always
in your code, and not, say, the compiler, stdlib, kernel, etc. This is usually
very correct; the people who’ve worked on those things have many times the
experience you did when you decided that there must be a bug in printf
or
something.
You’ll later realise you tried to print something through a pointer to a stack-allocated variable that’s long since gone. These accusations tend to wane as you gain familiarity with your subject matter, and wax as you step out into lands populated with ever more footguns, exposing more of the architecture than you ever suspected was there. (See also: the emails from me to the libev mailing list in 2011.)
At some point, though, your journies will take you to places where things aren’t so clear cut, and you’ll start to gain a sixth sense; a kind of visceral experience that things are not as they have been promised to be.
A few weeks ago, that sixth sense whispered in my ear: “what if, instead of your cruddy bootloader written in a pre-1.0 systems language for a platform you don’t fully understand, it’s the 20 year-old project with 80,000 commits that’s wrong?” And it was right.
Daintree’s bootloader, dainboot, worked great on QEMU, but would fail hard and fast on hardware with synchronous aborts. It’s a UEFI application, which means we get a lot for free – see how easy it is to search connected FAT filesystems for binaries. I don’t particularly want to spend much effort on a bootloader, so this makes sense to me.
QEMU comes bundled with a build of TianoCore EDK2, which makes it really easy to get started. On my ROCKPro64 I have a build of U-Boot installed to the eMMC, an extremely versatile bootloader found on all kinds of devices1. It makes a TFTP-based development cycle remarkably pleasant.
But different bootloaders mean very different execution environments. For one, EDK2 seems to execute the UEFI application in EL1, whereas U-Boot gives over control in EL2. There are many, many differences in the state of the various system control registers. And in this case, right in the beginning, we were getting an exception in U-Boot before we’d done barely any work:
Booting /efi\boot\BOOTAA64.efi
AC"Synchronous Abort" handler, esr 0x96000010
elr: fffffffffd1c4d98 lr : fffffffffd1c4d5c (reloc)
elr: 0000000078f07d98 lr : 0000000078f07d5c
x0 : 0000000000000000 x1 : 0000000000000000
x2 : 000000007bfdf450 x3 : 000000000000004c
x4 : 0000000000002800 x5 : 000000007bfdf480
x6 : 000000007bfdcb50 x7 : 0000000079f71680
x8 : 0000000078f0ab78 x9 : 0000000000003b5c
x10: 0000000000000000 x11: 0000000000000020
x12: 000000000000ed83 x13: 000000000000ed9c
x14: 0000000079f29d28 x15: 0000000008100000
x16: 0000000000000010 x17: 0000000000000000
x18: 0000000000000000 x19: 000000007bf43b78
x20: 0000000079f29fa0 x21: 0000000078f10040
x22: 0000000000005800 x23: 0000000079f542e0
x24: 000000007bff4eac x25: 0000000000000000
x26: 0000000000000000 x27: 0000000000000000
x28: 0000000079f4ba00 x29: 0000000079f29a30
Code: f9400fe8 f9400109 f94007ea 8b0a0129 (3940012b)
UEFI image [0x0000000078f07000:0x0000000078f0c35f] pc=0xd98
'/efi\boot\BOOTAA64.efi'
The disassembly showed that at pc=0xd98
in the image we were attempting to
load a byte from the address pointed to by register x9
:
d98: 2b 01 40 39 ldrb w11, [x9]
In the register dump, x9
has the value 0000000000003b5c
, whereas we are
clearly relocated much higher in memory (note the UEFI image offset; the pc
is relative to that). Should x9
actually have a higher computed address? I
hacked some things together to get a register dump from a similar place in QEMU
(on EDK2, where this all worked):
(gdb) info registers
x0 0x0 0
x1 0x2 2
x2 0x1 1
x3 0x5f46d944 1598478660
x4 0x43 67
x5 0x0 0
x6 0x70616d6d 1885433197
x7 0x0 0
x8 0x5c1f5b78 1545558904
x9 0x5c1f5b5c 1545558876
x10 0x0 0
x11 0x64 100
x12 0x0 0
x13 0x8 8
x14 0x0 0
x15 0x0 0
x16 0x5f6b4ab0 1600866992
x17 0xffffa6ac 4294944428
x18 0x0 0
x19 0x0 0
x20 0x5f37d000 1597493248
x21 0x5f37f000 1597501440
x22 0x0 0
x23 0x5f37f000 1597501440
x24 0x0 0
x25 0x1 1
x26 0x0 0
x27 0x5f37f000 1597501440
x28 0x5f37d55d 1597494621
x29 0x5f6b45b0 1600865712
x30 0x5c1f2d5c 1545547100
sp 0x0 0x0
pc 0x5c1f2da0 0x5c1f2da0
I’ve highlighted x9
– it has a value that’s clearly much more
valid-pointer-looking, and it even ends in ...b5c
, which makes me think it’s
a corrected version of the 3b5c
value we saw on the ROCKPro64.
What could cause a register to have a correct-looking address on one platform
but not on the other? Let’s look at the code up and until the point where
things fall apart. Unfortunately, UEFI code objects are all
COFFs. I’m super inexperienced with
these, and so too it turns out is the tooling in the area; I think it must be a
bit of a hack that Zig or LLVM knows how to produce them, because it also
produces a PDB alongside that
presumably contains the debugging/line info, but then llvm-objdump
refuses to
use the same thing, helpfully declaring:
llvm-objdump: warning: 'dainboot/zig-cache/bin/BOOTAA64.
rockpro64.efi': failed to parse debug information for
dainboot/zig-cache/bin/BOOTAA64.rockpro64.efi
(The right-hand page has the disassembly in reverse, since we only have the state of registers at the point in time of the last instruction and have to trace data dependencies backwards. The precise values are different because they shifted every time I added some breaks or debugging helpers anywhere.)
I have no line numbers – it’s up to disassembly and guesswork. Here’s the former:
d8c: 09 01 40 f9 ldr x9, [x8]
d90: ea 07 40 f9 ldr x10, [sp, #8]
d94: 29 01 0a 8b add x9, x9, x10
d98: 2b 01 40 39 ldrb w11, [x9]
I’ve grown very fond of aarch64 (dis)assembly. Look at those four-byte instructions. This experience was a crash course in learning it.
So what have we here? Translated into pseudo-C:
- The faulting instruction tries to load a byte from the address stored in
x9
, which we know to be00003b5c
, i.e.w11 = *(u8 *)x9
. x9
was calculated asx9 = x9 + x10
.x10
came fromx10 = *(sp + 8)
.x9
came fromx9 = *x8
.
At first I thought that x10
must’ve been some relocation base, but it’s zero
on both QEMU and hardware. The pointer is whatever we get from x8
, which is
0000000078f0ab78
on hardware and 0x5c1f5b78
on QEMU. The last few digits
line up nicely, again, so it looks like whatever’s at that address is a
relocated address on QEMU/EDK2 but just not on ROCKPro64/U-Boot. What even
is at that address? Let’s look at the symbol table.
U-Boot told us in the exception dump that the UEFI image was loaded at
0x78f07000
, so 0x78f0ab78
is at offset 0x3b78
from the start of the
image. What’s that? objdump
to the rescue. It’s in our .data
section:
3b70 287b737d 290d0a00 5c3b0000 00000000 ({s})...........
I really had to squint at this for a moment before realising that 0x3b78
actually contains a 64-bit pointer value, 0x00000000003b5c
— in other
words, the exact value of x8
we saw! So this was pulled directly out of the
loaded image. Two questions arose: what is it? And how is it that QEMU/EDK2
got something different here?
It felt off that it should point to a value directly before itself in memory. Here’s context:
3b50 02200000 00000000 00000000 6461696e ............dain
3b60 74726565 20626f6f 746c6f61 64657220 tree bootloader
3b70 287b737d 290d0a00 5c3b0000 00000000 ({s})...........
A string! 0x3b78
points to 0x3b5c
, which is the greeting string printed
from the bootloader. Why this indirection in the binary itself?
There are three sections in the PE/COFF files generated by the build process:
.text
, which contains the executable code, .data
, which contains strings
and other bits, and .reloc
. It still felt like this was a relocation issue,
so I read Microsoft’s PE
Format
documentation carefully. objdump
was doing a lot of the heavy lifting for
us, but to really understand it, I wanted to pull apart the format
myself.
This approach would turn out to be invaluable.
The PE relocation section is comprised of a number of base relocation blocks,
which defines a 32-bit base address for the block, each with any number of
entries that specify the type of relocation and the 12-bit offset in that block
to apply it at. Here’s a decoded .reloc
we got:
Page RVA: 0xa000
28 relocations:
0xaca8 0xacb8 0xacc8 0xacd8 0xace8 0xacf8 0xad08 0xad18
0xad28 0xad38 0xad48 0xad58 0xad68 0xad78 0xadc8 0xadd8
0xae10 0xae40 0xae90 0xaeb0 0xaef8 0xaf00 0xaf08 0xaf10
0xaf48 0xaf88 0xafb8 0xafe0
Page RVA: 0xb000
53 relocations:
0xb010 0xb040 0xb070 0xb0c0 0xb100 0xb160 0xb188 0xb1a0
0xb210 0xb240 0xb270 0xb2a0 0xb2d0 0xb300 0xb328 0xb370
0xb3b0 0xb3f8 0xb450 0xb4a0 0xb4b0 0xb4c8 0xb528 0xb578
0xb5c0 0xb610 0xb680 0xb6d0 0xb720 0xb778 0xb7c8 0xb7e0
0xb848 0xb898 0xb8e8 0xb938 0xb9c8 0xba18 0xba68 0xbab8
0xbb08 0xbb58 0xbba8 0xbbd8 0xbc08 0xbc40 0xbcd0 0xbd20
0xbd70 0xbdc0 0xbe18 0xbe68 0xbea8
objdump
couldn’t give us this! (The addresses don’t line up with the above
because I continued to hack on and modify things, shifting everything in the
binary around. This was as frustrating as it might seem. These lined up
exactly with the indirection seen above.)
For my purposes, I ignore all entries except of the type
IMAGE_REL_BASED_DIR64
: “The base relocation applies the difference to the
64-bit field at offset.” It turns out Zig/LLVM only generates those, and they
work as simply as they sound: at the address specified by the relocation entry,
treat it as a 64-bit field and add the relocation offset to it.
So, a PE loader should, after loading all data, visit all those addresses and add the relocation offset to them. It seemed like that was happening correctly on QEMU/EDK2 — when we loaded the addresses, they had been shifted for us. But why not on U-Boot?
To answer this, I looked at the U-Boot source code. And when looking wasn’t
enough, it was time to printf
debug, which meant getting a U-Boot build that
actually ran on my hardware. There’s a 4+ hour gap between the two blocks of
chat here:
(I nuked the boot environment quite a few times getting to this point.)
printf
debugging to the rescue: U-Boot wasn’t doing any of the relocations,
at all. It thought there weren’t any:
doing relocations (rel 0x0000000078efb2e0 rel_size 0x68
efi_reloc 0x0000000078ef6000
image_base 0x0)
DAINDBG: delta: 0x78ef6000
rel: 0x0000000078efb2e0, end: 0x0000000078efb348,
rel->SizeOfBlock: 0x0
DAINDBG: done
rel->SizeOfBlock
here corresponds to the Block
Size
field in the base relocation block. My tools were clearly showing that block
had a very non-zero size, so how was U-Boot getting a different idea?
I tried dumping the table raw, 32 bytes as hex to see what we got, and it was all zero. Something was still up. I had to keep going back.
I riddled the loader with printf
s:
DAINDBG: loading section[2]: VA 0x0000000078efb2e0 --
setting 0x68 bytes to zero, then loading 0x200 bytes
from 0x0000000002085600
(efi 0x0000000002080000 + PointerToRawData 0x5600)
DAINDBG: loading section[1]: VA 0x0000000078ef9860 --
setting 0x1a75 bytes to zero, then loading 0x1c00 bytes
from 0x0000000002083a00
(efi 0x0000000002080000 + PointerToRawData 0x3a00)
DAINDBG: loading section[0]: VA 0x0000000078ef6200 --
setting 0x364c bytes to zero, then loading 0x3800 bytes
from 0x0000000002080200
(efi 0x0000000002080000 + PointerToRawData 0x200)
Here’s the code that corresponds to the above, with the printf
re-approximated post-hoc for your reading pleasure:
/* Load sections into RAM */
for (i = num_sections - 1; i >= 0; i--) {
/* XXX */
printf("DAINDBG: loading section[%d]: VA %p -- "
"setting 0x%x bytes to zero, then loading "
"0x%x bytes from %p (efi %p + "
"PointerToRawData %x)\n",
i,
efi_reloc + sec->VirtualAddress,
sec->Misc.VirtualSize,
sec->SizeOfRawData,
efi + sec->PointerToRawData,
efi,
sec->PointerToRawData);
/* XXX */
IMAGE_SECTION_HEADER *sec = §ions[i];
memset(efi_reloc + sec->VirtualAddress, 0,
sec->Misc.VirtualSize);
memcpy(efi_reloc + sec->VirtualAddress,
efi + sec->PointerToRawData,
sec->SizeOfRawData);
}
Firstly, we load them in reverse. Weird, but okay. Secondly, it became clear
that section[1]
(.data
) overlaps section[2]
(.reloc
) by 384 bytes (!!):
.reloc
: load at0x78efb2e0
, zero0x68
bytes at start, then load0x200
bytes at start, until0x78efb4e0
non-incl..data
: load at0x78ef9860
, zero0x1a75
bytes at start, then load0x1c00
bytes at start, until0x78efb460
non-incl. (0x180
bytes into.reloc
target !!).text
: load at0x78ef6200
, zero0x364c
bytes at start, then load0x3800
bytes at start, until0x78ef9a00
non-incl. (0x1a0
bytes into.data
target !!)
Oh boy. I didn’t even notice until now that the .text
overlapped .data
too.
Who knows what would’ve happened if the sections aligned such that the
relocations worked. It would’ve been much harder to diagnose.
Why would this happen? It seems like VirtualAddress + SizeOfRawData
overlaps
with other VirtualAddress
es. What does the PE
format
say about VirtualAddress
?
For executable images, the address of the first byte of the section relative to the image base when the section is loaded into memory. For object files, this field is the address of the first byte before relocation is applied; for simplicity, compilers should set this to zero. Otherwise, it is an arbitrary value that is subtracted from offsets during relocation.
SizeOfRawData
is below it and caught my eye. Emphasis mine:
The size of the section (for object files) or the size of the initialized data on disk (for image files). For executable images, this must be a multiple of FileAlignment from the optional header. If this is less than VirtualSize, the remainder of the section is zero-filled. Because the SizeOfRawData field is rounded but the VirtualSize field is not, it is possible for SizeOfRawData to be greater than VirtualSize as well. When a section contains only uninitialized data, this field should be zero.
SizeOfRawData
is rounded to a multiple of FileAlignment
, which the docs
tell us is typically 512 bytes. (Notice they all divide by 0x200
.) But
that’s presumably just because of the way they align in the file! What is
VirtualSize
, which until now we’re only using to determine how much to zero?
The total size of the section when loaded into memory. If this value is greater than SizeOfRawData, the section is zero-padded. This field is valid only for executable images and should be set to zero for object files.
The total size. It seems pretty clear that we should never load more than
this many bytes, even if SizeOfRawData
happens to be bigger. The size of the
section can’t be bigger than this. If we were to constrain our memcpy
s to
min(VirtualSize, SizeOfRawData)
, we get this instead:
.reloc
: load at0x78efb2e0
, zero0x68
bytes at start, then load0x200
0x68
bytes at start, until0x78efb348
non-incl..data
: load at0x78ef9860
, zero0x1a75
bytes at start, then load0x1c00
0x1a75
bytes at start, until0x78efb2d5
non-incl. (before.reloc
begins).text
: load at0x78ef6200
, zero0x364c
bytes at start, then load0x3800
0x364c
bytes at start, until0x78ef984c
non-incl. (before.data
begins)
It looked like a bug. If U-Boot loaded sections forwards, this wouldn’t have been exposed, but either way it appeared to be an error to do this at all. The section shouldn’t be loaded beyond its VirtualSize.
A quick trip to the EDK2 PE
loader
shows they load at most min(VirtualSize, SizeOfRawData)
bytes into memory,
and then pad up to VirtualSize
with zeroes if needed. (The zeroing behaviour
is for BSS-style initialisation.) They never touch memory past VirtualSize
bytes.
memcpy(efi_reloc + sec->VirtualAddress,
efi + sec->PointerToRawData,
- sec->SizeOfRawData);
+ min(sec->Misc.VirtualSize, sec->SizeOfRawData));
One short conversation later, and the bug was fixed.
(This is what success looks like.)
Sometimes, the problem is not in your code.
-
Maybe it’d be nice if I could slap EDK2 on the ROCKPro64 – edk2-porting/edk2-rk3399 appears to be the strongest effort in this area so far, and it doesn’t look great. U-Boot is mature on many embedded platforms. ↩