[Open-graphics] Multipliers in oga1hq
wpmcnamara at yahoo.com
Mon Sep 3 22:09:59 EDT 2007
Timothy Normand Miller wrote:
> On 9/3/07, Patrick McNamara <wpmcnamara at yahoo.com> wrote:
>> Timothy Normand Miller wrote:
>>> Well, that's not a bad idea. It's also worth pondering architectures
>>> that have 512 local registers, unifying the scratch space with the
>>> register file. But that may be too radical.
>> This is pretty common in micro controllers. In the 8 bit AVR series,
>> there are 32 registers that are the low 32 bytes of the memory space.
>> Same goes for the 8051 series, though the 8051 also supports a
>> completely separate external memory space as well. Personally, I think it
>> makes perfect sense to not distinguish between the register set, scratch
>> memory, and I/O space. It reduces the number of instruction types,
>> reduces the amount of decode logic, etc.
>> While we obviously want to make the controller as easily programmable as
>> possible, efficiency in execution and efficiency in implementation take
>> precedence in my mind.
> Can you give a little more detail on what you're envisioning here?
> The advantage to a REG-REG architecture is that the instructions are
> simple and fixed-size. Mapping registers into the memory space has
> some interesting theoretical advantages, but now you need more logic
> to distinguish, and you lose some of the benefits of the way a RISC
> processor is pipelined.
First, my apologies as I have only been skimming the posts related to
the nano-controller. I haven't had the time available to get deeply
involved in them which I tend to do with things that really interest
me. So forgive me if I am re-hashing prior discussions.
As a starting point for having a single memory space for registers and
RAM, take for example the ATtiny45. This controller has 32 general
purpose registers, 64 I/O registers, and 256 bytes of RAM. The memory
map looks something like this:
0x0000-0x001F: general purpose registers
0x0020-0x005F: I/O registers
0x0060-0x015F: RAM (256 bytes)
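To make the single-address-space idea concrete, here is a minimal Python sketch that classifies an address on an ATtiny45-style data map. The SRAM range 0x0060-0x015F follows from the 256 bytes of RAM sitting directly above the I/O registers. The names REGIONS and region_of are made up for illustration and are not part of any AVR tool.

```python
# Data-space regions as (first address, last address, name).
REGIONS = [
    (0x0000, 0x001F, "general purpose registers"),
    (0x0020, 0x005F, "I/O registers"),
    (0x0060, 0x015F, "SRAM"),
]

def region_of(addr):
    """Return the name of the region containing addr, or None if unmapped."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    return None
```

The point is only that one address field is enough to reach registers, I/O, and RAM alike.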
I won't go into the AVR instruction set, but I can access any location
within the memory map with a single instruction type. The AVR ISA does
still preserve register syntax in a number of different instruction
mnemonics, and we could as well. Nothing says we couldn't map several
mnemonics to a single instruction. All instructions now consist of
two source and one destination address fields. If we allow for
immediates, they replace one or both of the source addresses. To go
with Petter's example, the high bit selects IO space or memory space.
Assuming we allow for more memory space than we need for IO space, it
would be quite ok to mirror the IO space. For example, if you have a
128 byte memory space but only need 32 bytes of IO space, you can
ignore bits 5 and 6 in the IO address, which replicates the IO space
four times. I'm afraid I haven't been paying
close enough attention lately to have a good feel for how big of a
scratch RAM space is needed.
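The mirroring described above can be sketched as a small address decoder. This is a Python model for illustration only. It assumes an 8-bit operand address whose high bit set means IO space (the polarity is an assumption), with the 128 byte and 32 byte sizes taken from the example. The names decode, IO_SELECT_BIT, MEM_MASK, and IO_MASK are hypothetical.

```python
IO_SELECT_BIT = 0x80  # assumed: high bit set selects IO space
MEM_MASK = 0x7F       # 128-byte memory space (registers plus scratch)
IO_MASK = 0x1F        # 32 bytes of IO implemented; bits 5 and 6 are ignored

def decode(addr):
    """Return (space, offset) for an 8-bit operand address."""
    addr &= 0xFF
    if addr & IO_SELECT_BIT:
        return ("io", addr & IO_MASK)
    return ("mem", addr & MEM_MASK)

# Every IO-space address that lands on IO offset 3.
mirrors = [a for a in range(0x80, 0x100) if decode(a) == ("io", 3)]
```

Because bits 5 and 6 never reach the IO decoder, each IO location shows up at four addresses, and the decode logic is a single mask.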
My assumption in all this is that the controller does not have direct
access to the card memory space. Card memory access would be done
through IO ports, or we would have explicit instructions for card memory
access à la the MOVX instruction on the Intel 8051 series. All working
memory for the controller would be in the controller core.
IIRC, a BRAM is 512x36, correct? Since the BRAM is dual ported, we are
allowed two reads and two writes per cycle, assuming you read on one clock
edge and write on the other. We could break the BRAM in two, using the lower
half for memory/registers and the upper half as dedicated stack space. Even
if you only get one read and write per cycle, appropriately designing
the pipeline could work around this.
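As a rough sketch of the split, assuming the 512-word depth recalled above is right, bit 8 of the BRAM address can separate the two halves. The Python below only models the address arithmetic. The names data_addr and stack_addr are hypothetical, and nothing here says which way the stack grows.

```python
BRAM_WORDS = 512        # depth as recalled in the post, not verified here
HALF = BRAM_WORDS // 2  # 256 words per half

def data_addr(offset):
    """Register/scratch access: lower half, address bit 8 forced to 0."""
    return offset & (HALF - 1)

def stack_addr(sp):
    """Stack access: upper half, address bit 8 forced to 1."""
    return HALF | (sp & (HALF - 1))
```

With this layout a runaway stack pointer wraps within the upper half instead of overwriting registers.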
Something else I was thinking about relates to using the same controller
core for both PCI and VGA duties. We effectively have to be able to
context switch to do this, and we have to be able to do it quickly to
meet PCI timing requirements. I don't know what our BRAM budget is
right now, but could we effectively have two sets of memory/registers
and stack for the core? When we need to context switch, we switch BRAMs.
You could actually expand this to as many BRAMs as you want to use. To
keep from having to flush the pipe, add a context pipeline that marches
in step with the normal processor pipeline. For two contexts this is
just an n bit shift register (where n is the number of stages in the
pipeline). The value of the bit at any given stage in the pipeline
selects the target BRAM for that stage. More than two hardware contexts
means expanding the width of course.
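A behavioural sketch of that context pipeline, in Python for illustration only. The four-stage depth and the names ContextPipeline, clock, and bram_for_stage are assumptions, not anything from the actual design.

```python
class ContextPipeline:
    """One context tag per pipeline stage, shifted along every cycle."""

    def __init__(self, stages):
        self.tags = [0] * stages  # tags[0] is the first stage (fetch)

    def clock(self, active_context):
        # The instruction entering the pipe carries the active context;
        # older tags move one stage down and the oldest drops out.
        self.tags = [active_context] + self.tags[:-1]

    def bram_for_stage(self, stage):
        # The tag at a stage selects which BRAM that stage reads or writes.
        return self.tags[stage]

pipe = ContextPipeline(stages=4)
pipe.clock(0)
pipe.clock(0)
pipe.clock(1)  # context switch: new instructions now belong to context 1
pipe.clock(1)
```

After the switch, the two oldest instructions still target BRAM 0 while the two newest target BRAM 1, so nothing has to be flushed. With more than two contexts each tag becomes a multi-bit value rather than a single bit.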
Context switching does of course bring us back to the problem of the
multiplier. If multiplying doesn't stall the pipe waiting for the
answer, then we really don't want to context switch (or interrupt) in
the middle of a multiply. This causes all sorts of problems though, since
we are effectively working in a realtime environment. If we need to go
service a PCI transaction, we can't wait 10-20 cycles for a pending
multiply to finish. This means that the multiplier output has to be
context (or interrupt) aware. If the multiplier is context aware, then
the answer could be written to a separate output as necessary.
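One way to picture a context-aware multiplier is to tag each multiply with the context that issued it and give every context its own result register. The Python model below is a sketch under several assumptions: a single multiply in flight at a time, an arbitrary three-cycle latency for the demo, and made-up names (ContextAwareMultiplier, start, clock, result).

```python
class ContextAwareMultiplier:
    """Multi-cycle multiplier whose result lands in the issuing context's register."""

    def __init__(self, latency, contexts=2):
        self.latency = latency
        self.pending = None               # [cycles_left, context, product]
        self.result = [None] * contexts   # one result register per context

    def start(self, context, a, b):
        # Assumption: only one multiply is in flight at any time.
        self.pending = [self.latency, context, a * b]

    def clock(self):
        if self.pending is None:
            return
        self.pending[0] -= 1
        if self.pending[0] == 0:
            _, context, product = self.pending
            self.result[context] = product  # written even if another context is running
            self.pending = None

mul = ContextAwareMultiplier(latency=3)
mul.start(0, 6, 7)   # context 0 issues a multiply
for _ in range(3):   # the core switches to context 1 while this completes
    mul.clock()
```

Context 0 finds its product in result[0] when it resumes, and context 1 never sees it. What should happen if the second context also wants the multiplier while one is pending is left open here.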
Which brings me to a question that has been tickling the back of my head
for a bit. Why aren't we using the multipliers embedded in the FPGA? I
know there are limitations on how the BRAMs can be configured and still
use the multipliers, but I couldn't find anything quickly in my archive
of list messages.
I suppose I need to go take a look at what Petter has been working on
and then ask some more questions. :)