[Open-graphics] Multipliers in oga1hq
wpmcnamara at yahoo.com
Tue Sep 4 07:49:38 EDT 2007
Nicholas S-A wrote:
> On Sep 3, 2007, at 10:09 PM, Patrick McNamara wrote:
>> As a starting point for having a single memory space for registers and
>> RAM take for example the ATtiny45. This controller has 32 general
>> purpose registers, 64 I/O registers, and 256 bytes of RAM. The memory
>> map looks something like this:
>> 0x0000-0x001F: general purpose registers
>> 0x0020-0x005F: I/O registers
>> 0x0060-0x015F: RAM
>> I won't go into the AVR instruction set, but I can access any location
>> within the memory map with a single instruction type. The AVR ISA does
>> still preserve register syntax in a number of different instruction
>> mnemonics, and we could as well. Nothing says we couldn't map several
>> mnemonics to a single instruction. All instructions now consist of
>> two source and one destination address fields. If we allow for
>> immediates, they replace one or both of the source addresses. To go
>> with Petter's example, the high bit selects IO space or memory space.
>> Assuming we allow for more memory space than we need for IO space, it
>> would be quite ok to mirror the IO space. Say, for example, you have a
>> 128 byte memory space but only need 32 bytes of IO space: you can
>> ignore bits 5 and 6 of the IO address and effectively
>> replicate the IO space four times. I'm afraid I haven't been paying
>> close enough attention lately to have a good feel for how big of a
>> scratch RAM space is needed.
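The mirroring described above can be sketched in a few lines of Python. This is only an illustration under assumed parameters: an 8-bit operand address, bit 7 set meaning IO space (the polarity is a guess), and the names IO_SELECT_BIT, OFFSET_MASK, IO_SIZE and decode() are all made up for the example.

```python
# Hypothetical operand address layout (an assumption, not a settled design):
#   bit 7     : 1 = IO space, 0 = memory space
#   bits 6..0 : offset within the selected 128-byte space
IO_SELECT_BIT = 0x80
OFFSET_MASK = 0x7F
IO_SIZE = 32  # only 32 bytes of IO are actually implemented


def decode(addr):
    """Return (space, offset) for an 8-bit operand address."""
    offset = addr & OFFSET_MASK
    if addr & IO_SELECT_BIT:
        # Dropping offset bits 5 and 6 makes the 32-byte IO block
        # appear four times within the 128-byte IO space.
        return ("io", offset & (IO_SIZE - 1))
    return ("mem", offset)
```

With this decode, addresses 0x83, 0xA3, 0xC3 and 0xE3 all land on IO offset 3, which is the four-fold replication.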
> What is stopping us from just having 512 registers? Is it the
> instruction size?
> If so, don't we have 36 bits, not 32?
Only instruction size. Instruction size and the number of instructions
needed control the address space available for each instruction.
>> IIRC, a BRAM is 512x36, correct? Since the BRAM is dual ported, we are
>> allowed two reads and two writes per cycle, assuming you read on one clock
>> edge and write on the other. We could break the BRAM in two, using half
>> for memory/register and the upper half as dedicated stack space. Even
>> if you only get one read and write per cycle, appropriately designing
>> the pipeline could work around this.
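As a rough sketch of the split described above, assuming a 512-word BRAM with the lower 256 words used for memory/registers and the upper 256 words for the stack. The names and the wrap-around behaviour are assumptions made for illustration only.

```python
BRAM_DEPTH = 512          # words in one 512x36 BRAM
HALF = BRAM_DEPTH // 2    # 256 words per half


def mem_addr(offset):
    """Physical BRAM address for a memory/register offset (lower half)."""
    return offset % HALF  # assumed: offsets wrap inside the lower half


def stack_addr(sp):
    """Physical BRAM address for a stack pointer value (upper half)."""
    return HALF + (sp % HALF)  # assumed: stack wraps inside the upper half
```

In hardware this amounts to forcing the top address bit to 0 for memory/register accesses and to 1 for stack accesses.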
>> Something else I was thinking about relates to using the same controller
>> core for both PCI and VGA duties. We effectively have to be able to
>> context switch to do this, and we have to be able to do it quickly to
>> meet PCI timing requirements. I don't know what our BRAM budget is
>> right now, but could we effectively have two sets of memory/registers
>> and stack for the core? When we need to context switch we switch BRAMs.
>> You could actually expand this to as many BRAMs as you want to use. To
>> keep from having to flush the pipe, add a context pipeline that marches
>> in step with the normal processor pipeline. For two contexts this is
>> just an n bit shift register (where n is the number of stages in the
>> pipeline). The value of the bit at any given stage in the pipeline
>> selects the target BRAM for that stage. More than two hardware contexts
>> means expanding the width of course.
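A small behavioural model of the context pipeline described above, for the two-context case. It is a sketch only: the class name, the method names, and the choice of stage 0 as the fetch stage are assumptions, not part of any agreed design.

```python
from collections import deque


class ContextPipeline:
    """n-bit shift register that marches in step with an n-stage pipeline.

    bits[0] belongs to the first (fetch) stage, bits[-1] to the last stage.
    Each bit names the BRAM (context 0 or 1) that the stage should use.
    """

    def __init__(self, stages):
        # maxlen makes appendleft() drop the oldest bit off the far end.
        self.bits = deque([0] * stages, maxlen=stages)

    def clock(self, fetch_context):
        """Advance one cycle; fetch_context tags the newly fetched instruction."""
        self.bits.appendleft(fetch_context)

    def bram_for_stage(self, stage):
        """Which BRAM the given pipeline stage should address this cycle."""
        return self.bits[stage]
```

After a switch from context 0 to context 1, the early stages select BRAM 1 while instructions already in flight keep selecting BRAM 0 until they drain, so nothing has to be flushed. For more than two contexts each entry would hold a wider context number instead of a single bit.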
> Interesting. We only have 24 BRAMs on the device, and need some for
> but I think that this might work.
Also consider that the FPGA version only needs to be a proof of
concept. If we only allow for two contexts, but get them working, then
with the ASIC where we may not be as constrained on real estate then we
>> Context switching does of course bring us back to the problem of the
>> multiplier. If multiplying doesn't stall the pipe waiting for the
>> answer, then we really don't want to context switch (or interrupt) in
>> the middle of a multiply. This causes all sorts of problems though since
>> we are effectively working in a realtime environment. If we need to go
>> service a PCI transaction, we can't wait 10-20 cycles for a pending
>> multiply to finish. This means that we have to have the output be
>> context (or interrupt) aware. If the multiplier is context aware then
>> the answer could be written to a separate output as necessary.
> Hmm, that is a pretty big problem. Even just ignoring the context switch,
> we are not going to ever need a multiply for PCI, right?
Not sure. I'm not even sure we have to have a multiply to do the VGA
conversion routines, mainly because I haven't sat down to think real
hard about what is necessary. The actual question is "Do we need
variable multiplication?" If all the multiplication we need to do is by
a fixed amount, say the line number times screen width, then using a
multiply instruction is probably slower than using shifts and adds.
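To illustrate the shifts-and-adds point, here is a sketch that assumes a screen width of 640 (an example value only; the function names are made up). Since 640 = 512 + 128, the multiply becomes two shifts and one add.

```python
def times_640(line):
    """line * 640 using two shifts and one add (640 = 2**9 + 2**7)."""
    return (line << 9) + (line << 7)


def shift_add_terms(constant):
    """Shift amounts, one per set bit, that reproduce a multiply by constant."""
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_constant(value, constant):
    """Multiply by a fixed non-negative constant using only shifts and adds."""
    return sum(value << shift for shift in shift_add_terms(constant))
```

The number of adds grows with the number of set bits in the constant, so this is cheapest for widths with few set bits.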
>> Which brings me to a question that has been tickling the back of my head
>> for a bit. Why aren't we using the multipliers embedded in the FPGA? I
>> know there are limitations on how the BRAMs can be configured and still
>> use the multipliers, but I couldn't find anything quickly in my archive
>> of list messages.
> The basic problem is that we are using the Lattice XP FPGA instead of the
> Spartan for the nanocontroller to give us more room for the OpenGL
> They have distributed (9K) RAM but no multipliers.
Doh! That makes sense.