Untangling cycles

2023-06-29

This is straight from my journal, so it starts without warning.


The bit packing is turning out to be surprisingly tricky!

Memory is synchronous but our uses of addr[0] were all comb, so they didn’t align with the actual target in the cycle it got transmitted from memory when we were advancing addr every cycle. This was a really good exercise in Being Confused As Heck.

Going to try to explicate the above a bit more clearly for my own elucidation. Ignoring the write half of the equation for simplicity—the issues faced are the same.

This post is literate Python. Why not. We have the following as baseline:

import math
from typing import Optional

from amaranth import Elaboratable, Memory, Module, Record, Signal
from amaranth.build import Platform
from amaranth.hdl.ast import ShapeCastable
from amaranth.hdl.mem import ReadPort
from amaranth.hdl.rec import DIR_FANIN, DIR_FANOUT
from amaranth.sim import Simulator


class ROMBus(Record):
    def __init__(self, addr: ShapeCastable, data: ShapeCastable):
        super().__init__(
            [
                ("addr", addr, DIR_FANIN),
                ("data", data, DIR_FANOUT),
            ],
            name="ROMBus",
        )

class Downstream(Record):
    def __init__(self):
        super().__init__(
            [
                ("data", 8, DIR_FANIN),
                ("stb", 1, DIR_FANIN),
            ]
        )

A ROMBus is a connectable path to access some read-only memory. Downstream here is a hypothetical recipient of data being read from ROM. (The ROM is actually a RAM that gets filled on power-on from flash.)

The key problem I was solving was that, until now, I’ve been storing all my data in 8-bit wide Memory instances, but a lot of the actual embedded block RAM I’m using has 16-bit wide words. As a result, the upper 8 bits of every word have been left unused.

It’d be nice to add a translation layer that transparently forwarded reads and writes from an 8-bit addressable space into the 16-bit words. Even bytes in the lower halves, odd bytes in the upper halves. Here’s what that’d look like:

ROM_CONTENT_PACKED = [0x2211, 0x4433, 0x6655, 0x8877]
ROM_LENGTH = 8

The length of the ROM that all the downstream consumers care about is the 8-bit addressable one—address 0 has 0x11, address 1 0x22, etc. The fact that we have 8 bytes packed into 4 words of 16 bits is irrelevant to them.
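As an aside (separate from the literate program), here is a minimal plain-Python sketch of the packing described above. The helper name pack_bytes is made up for illustration, and zero-padding an odd-length input is my assumption; the post only deals with an even number of bytes.

```python
from typing import List


def pack_bytes(data: bytes) -> List[int]:
    """Pack bytes into 16-bit words: even addresses in the low half,
    odd addresses in the high half."""
    # Assumption: an odd-length input is padded with a single zero byte.
    padded = data + b"\x00" * (len(data) % 2)
    return [padded[i] | (padded[i + 1] << 8) for i in range(0, len(padded), 2)]


rom_bytes = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
packed = pack_bytes(rom_bytes)
print([f"{word:#06x}" for word in packed])
```

Packing the eight bytes 0x11 through 0x88 this way gives the same four words as ROM_CONTENT_PACKED.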

Here’s where our example will play out:

class Example(Elaboratable):
    def __init__(self):
        self.downstream = Downstream()

    def elaborate(self, platform: Optional[Platform]):
        m = Module()

The Downstream is exposed on the instance so we can access it from our simulator process.

We now need to do the following things:

        packed_size = math.ceil(ROM_LENGTH / 2)
        rom_mem = Memory(
            width=16,
            depth=packed_size,
            init=ROM_CONTENT_PACKED,
        )
        m.submodules.rom_rd = rom_rd = rom_mem.read_port()
        assert len(rom_rd.addr) == 2
        assert len(rom_rd.data) == 16

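rom_rd.addr determines the address in the 16-bit-wide RAM (0x0 to 0x3), and rom_rd.data returns those 16 bits. Memory is synchronous by default (and the read enable is also always on under default settings), so, given a made-up mem[x] operator, the following timeline applies: an address x assigned with a sync statement in cycle n is seen by the read port in cycle n+1, and rom_rd.data holds mem[x] in cycle n+2.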
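To make that latency concrete, here is a rough software model, again outside the literate program. It is not Amaranth: run and its issued argument are hypothetical, None stands in for a value we don't care about yet, and the model assumes both registers update on the clock edge from the values visible just before it.

```python
MEM = [0x2211, 0x4433, 0x6655, 0x8877]


def run(issued):
    """issued[n] is the address assigned with a sync statement during
    cycle n, or None if nothing is assigned in that cycle."""
    addr = None  # registered address, as seen by the read port
    data = None  # registered read data
    trace = []
    for cycle, new_addr in enumerate(issued):
        trace.append((cycle, addr, data))
        # Clock edge: data latches mem[addr] using the address that was
        # visible *before* the edge, then the address register updates.
        data = MEM[addr] if addr is not None else data
        addr = new_addr if new_addr is not None else addr
    return trace


# Assign x = 2 in cycle 0 and y = 3 in cycle 1.
for cycle, addr, data in run([2, 3, None, None]):
    shown = "x" if data is None else f"{data:#06x}"
    print(cycle, addr, shown)
```

In this model, mem[2] shows up in cycle 2 and mem[3] in cycle 3, even though the second address was assigned before the first read had come back.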

Now we’ll create our ROMBus. This is what all the RTL I had was already using—it was connected directly to the read port of the 8-wide memory.

        rom_bus = ROMBus(range(ROM_LENGTH), 8)
        assert len(rom_bus.addr) == 3
        assert len(rom_bus.data) == 8

We’re going to put the actual translation logic and state machine in separate functions, so they can be changed later while preserving the literacy of this post. Why not.

        self.translation(m, rom_rd, rom_bus)
        self.fsm(m, rom_bus)

        return m

We want to hook up the ROM bus to the memory in a transparent fashion. Here’s what I started with:

    def translation(self, m: Module, rom_rd: ReadPort, rom_bus: ROMBus):
        m.d.comb += [
            rom_rd.addr.eq(rom_bus.addr >> 1),
            rom_bus.data.eq(
                rom_rd.data.word_select(rom_bus.addr[0], 8)
            ),
        ]
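For reference, this is what the two comb statements compute if timing is ignored entirely, sketched in plain Python rather than Amaranth. The function names are mine; word_select here is a stand-in that mimics picking the index-th chunk of a given width, counting from the least significant bit.

```python
def word_select(value: int, index: int, width: int) -> int:
    """Return the index-th width-bit chunk of value, counting from the LSB."""
    return (value >> (index * width)) & ((1 << width) - 1)


def read_byte(packed, byte_addr: int) -> int:
    word = packed[byte_addr >> 1]               # like rom_rd.addr = rom_bus.addr >> 1
    return word_select(word, byte_addr & 1, 8)  # like word_select(rom_bus.addr[0], 8)


packed_rom = [0x2211, 0x4433, 0x6655, 0x8877]
print([f"{read_byte(packed_rom, a):02x}" for a in range(8)])
```

This model has no notion of cycles, so it says nothing about when each value is valid, only which value belongs to which address.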

Now we implement a reader from our ROM:

    def fsm(self, m: Module, rom_bus: ROMBus):
        m.d.sync += self.downstream.stb.eq(0)

        with m.FSM():
            with m.State("INITIAL"):
                # cycle n+0
                m.d.sync += rom_bus.addr.eq(0)
                m.next = "WAIT"

            with m.State("WAIT"):
                # cycle n+1 / n'+1
                m.next = "READ"

            with m.State("READ"):
                # cycle n+2, n'+0
                m.d.sync += [
                    self.downstream.data.eq(rom_bus.data),
                    self.downstream.stb.eq(1),
                    rom_bus.addr.eq(rom_bus.addr + 1),
                ]
                m.next = "WAIT"

This is a simple process that reads data and passes it along to some downstream process (which needs to be able to accept this data as fast as we give it!).

We end up strobing the downstream every other cycle. (That strobe is seen in the n+1 / n’+1 cycle.)

Let’s simulate it and report the results:

def main():
    dut = Example()

    def process():
        count = 0
        yield
        while True:
            if (yield dut.downstream.stb):
                print(f"data: {(yield dut.downstream.data):02x}")
                count += 1
                if count == 8:
                    return
            yield

    sim = Simulator(dut)
    sim.add_clock(1e-6)
    sim.add_sync_process(process)
    sim.run()

This can now be run:

$ python -c 'import ex; ex.main()'
data: 11
data: 22
data: 33
data: 44
data: 55
data: 66
data: 77
data: 88

It’s perfect!

Almost. Let’s revisit the timeline for accessing the synchronous memory:

The important part is that you can assign a new address y in cycle n+1, without impacting what happens in cycle n+2, such that mem[y] is now available to use in cycle n+3. The read port will only see the address y in the same cycle that it’s already propagated mem[x] into its data register.

Let’s now change our state machine to take advantage of this:

def fsm(self: Example, m: Module, rom_bus: ROMBus):
    m.d.sync += self.downstream.stb.eq(0)

    with m.FSM():
        with m.State("INITIAL"):
            # cycle n+0
            m.d.sync += rom_bus.addr.eq(0)
            m.next = "WAIT"

        with m.State("WAIT"):
            # cycle n+1, n'+0
            m.d.sync += rom_bus.addr.eq(1)
            m.next = "READ"

        with m.State("READ"):
            # cycle n+2, n'+1, n''+0
            m.d.sync += [
                self.downstream.data.eq(rom_bus.data),
                self.downstream.stb.eq(1),
                rom_bus.addr.eq(rom_bus.addr + 1),
            ]


Example.fsm = fsm

We don’t change state once we’re in READ: every cycle we hand to downstream the data from the address we set two cycles ago; every cycle the memory is seeing the address we gave one cycle ago; every cycle we increment the address to keep it going.

(My wording here muddles up the timing of when we “set” a given value quite a lot. Really, we initiated the setting of the address two cycles ago; one cycle ago it was set (and seen by the memory); this cycle we see the data returned for it.)

This is pretty theoretical in this form, but I have a few state machines that do this kind of sliding continuous read in a limited fashion.

So what happens?

$ python -c 'import ex; ex.main()'
data: 22
data: 11
data: 44
data: 33
data: 66
data: 55
data: 88
data: 77

Every pair of bytes is swapped! (This was a lot weirder to debug when the same problem might have been affecting the initial write to RAM, too.)

Why?

We’ll review the translation statements:

m.d.comb += [
    rom_rd.addr.eq(rom_bus.addr >> 1),
    rom_bus.data.eq(
        rom_rd.data.word_select(rom_bus.addr[0], 8)
    ),
]

This translation happens in the combinatorial domain, meaning that rom_rd.addr will change to rom_bus.addr >> 1 as soon as a change on rom_bus.addr is registered — there isn’t an additional cycle between the requested 8-bit address on the ROM bus changing and the read port’s 16-bit address changing:

cycle  statement issued    ROM bus addr  read port addr  read port data
0      rom_bus.addr.eq(0)  x             x               x
1      rom_bus.addr.eq(1)  0             0               x
2      rom_bus.addr.eq(2)  1             0               0x2211
3      rom_bus.addr.eq(3)  2             1               0x2211
4      rom_bus.addr.eq(4)  3             1               0x4433
5      rom_bus.addr.eq(5)  4             2               0x4433

Similarly, the ROM bus data port will be updated as soon as the read port’s data port (rom_rd.data) changes.

It will also be updated as soon as the LSB of the ROM bus’s requested address changes (rom_bus.addr[0]).

But by the time we’re actually getting data in the read port for an address, the ROM bus has registered the next address! Thus we select the half of the 16-bit word based on the LSB of the following address, which (given the addresses are sequential) will always be the opposite half to the one we really want:

cycle  ROM bus addr  read port data  ROM bus addr [0]  ROM bus data
0      x             x               x                 x
1      0             x               0                 x
2      1             0x2211          1                 0x22
3      2             0x2211          0                 0x11
4      3             0x4433          1                 0x44
5      4             0x4433          0                 0x33
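The table can be reproduced with a small cycle-by-cycle model in plain Python. This is a sketch under my own assumptions, not the Amaranth simulator: simulate_comb_select is a hypothetical name, None plays the role of x, and the address is incremented every cycle starting from 0, wrapping at 8 like a 3-bit signal.

```python
MEM = [0x2211, 0x4433, 0x6655, 0x8877]


def simulate_comb_select(cycles: int):
    bus_addr = None  # rom_bus.addr (registered)
    rd_data = None   # rom_rd.data (registered)
    rows = []
    for cycle in range(cycles):
        # Combinational: pick the half using the *current* address LSB.
        if bus_addr is None or rd_data is None:
            bus_data = None
        else:
            bus_data = (rd_data >> (8 * (bus_addr & 1))) & 0xFF
        rows.append((cycle, bus_addr, rd_data, bus_data))
        # Clock edge: the memory latches the word for the address it saw
        # this cycle, and the bus address advances.
        rd_data = MEM[bus_addr >> 1] if bus_addr is not None else None
        bus_addr = 0 if bus_addr is None else (bus_addr + 1) % 8
    return rows


for row in simulate_comb_select(6):
    print(row)
```

From cycle 2 onward the selected byte is always the neighbour of the one that was asked for, because the LSB used for selection belongs to the next address.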

We need to introduce a delay in the address as used by the translation on the way back out, to account for the fact that read data corresponds to the address from the previous registered cycle, not this one:

def translation(
    self: Example,
    m: Module,
    rom_rd: ReadPort,
    rom_bus: ROMBus,
):
    last_addr = Signal.like(rom_bus.addr)
    m.d.sync += last_addr.eq(rom_bus.addr)

    m.d.comb += [
        rom_rd.addr.eq(rom_bus.addr >> 1),
        rom_bus.data.eq(rom_rd.data.word_select(last_addr[0], 8)),
    ]


Example.translation = translation

This gives:

cycle  ROM bus addr  last ROM bus addr  read port data  last ROM bus addr [0]  ROM bus data
0      x             x                  x               x                      x
1      0             x                  x               x                      x
2      1             0                  0x2211          0                      0x11
3      2             1                  0x2211          1                      0x22
4      3             2                  0x4433          0                      0x33
5      4             3                  0x4433          1                      0x44

And so:

$ python -c 'import ex; ex.main()'
data: 11
data: 22
data: 33
data: 44
data: 55
data: 66
data: 77
data: 88

I like how the x’s in this table don’t flow back “up” in time as the data dependencies flow right, whereas in the previous table, they do.
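As a final cross-check, here is the delayed-select version as a plain-Python cycle model. It is only a sketch, not the Amaranth simulation: simulate_delayed_select is a made-up name, None stands for x, and the address is assumed to increment every cycle from 0, wrapping at 8.

```python
MEM = [0x2211, 0x4433, 0x6655, 0x8877]


def simulate_delayed_select(cycles: int):
    bus_addr = None   # rom_bus.addr (registered)
    last_addr = None  # one-cycle-delayed copy of rom_bus.addr
    rd_data = None    # rom_rd.data (registered)
    out = []
    for _ in range(cycles):
        # Combinational: pick the half using the *previous* address LSB,
        # which is the address the current read data belongs to.
        if last_addr is not None and rd_data is not None:
            out.append((rd_data >> (8 * (last_addr & 1))) & 0xFF)
        # Clock edge: all three registers update from pre-edge values.
        rd_data = MEM[bus_addr >> 1] if bus_addr is not None else None
        last_addr = bus_addr
        bus_addr = 0 if bus_addr is None else (bus_addr + 1) % 8
    return out


print([f"{byte:02x}" for byte in simulate_delayed_select(10)])
```

Ten cycles are enough for the model to emit all eight bytes in address order, since the first two cycles produce nothing.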