[Open-graphics] Multipliers in oga1hq
wpmcnamara at yahoo.com
Tue Sep 4 13:04:32 EDT 2007
Timothy Normand Miller wrote:
> On 9/3/07, Patrick McNamara <wpmcnamara at yahoo.com> wrote:
>> As a starting point for having a single memory space for registers and
>> RAM take for example the ATtiny45. This controller has 32 general
>> purpose registers, 64 I/O registers, and 256 bytes of RAM. The memory
>> map looks something like this.
> One of my concerns is that since we now have gotten some outside
> attention wrt our VGA implementation, we may want to just go with
> basically what we have, for the sake of expediency.
Obviously a concern. One of the problems to consider here is that we don't
know how we are going to make VGA work. For example, how are we going
to handle reads from VGA memory? What about font bitmaps? There are a
lot of details surrounding the VGA pipeline that will need to be worked
out too. We don't want to get stuck in an architecture that can't
support our final needs either.
>> My assumption in all this is that the controller does not have direct
>> access to the card memory space. That card memory access would be done
>> through IO ports or we would have explicit instructions for card memory
>> access ala the MOVX instruction on the Intel 8051 series. All working
>> memory for the controller would be in the controller core.
> This is correct.
>> IIRC, a BRAM is 512x36, correct? Since the BRAM is dual ported, we are
>> allowed two reads and two writes per cycle, assuming you read on one clock
>> edge and write on the other. We could break the BRAM in two, using half
>> for memory/register and the upper half as dedicated stack space. Even
>> if you only get one read and write per cycle, appropriately designing
>> the pipeline could work around this.
> One thing about the BRAM. If we were to try to use it as the primary
> register file, I'm not sure we could double-pump it like we need to.
> Routing to/from the RAM may impose too much delay. One of the
> advantages of the current architecture is that it is effectively
> triple-ported. We can write one reg and read two at the same time.
> If we use the BRAM as you describe, we serialize it, making any
> instruction that requires access to three operands take 3 cycles.
That is definitely a concern. Allowing for only one read and write per
cycle would require addition of a second fetch stage in the pipeline.
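
To put rough numbers on the port question, here is a minimal sketch in
Python. It only counts cycles; the function name and the port counts
passed in are my own illustration, not anything in the current design.

```python
def access_cycles(accesses: int, ports_per_cycle: int) -> int:
    """Cycles needed to perform `accesses` register-file accesses when
    at most `ports_per_cycle` of them can happen in one cycle.
    Hypothetical helper for illustration only."""
    if accesses < 0 or ports_per_cycle < 1:
        raise ValueError("need accesses >= 0 and ports_per_cycle >= 1")
    # Integer ceiling division.
    return -(-accesses // ports_per_cycle)


# Fully serialized BRAM: three operand accesses take three cycles.
serialized = access_cycles(3, 1)
# One read per cycle: two source operands need two fetch stages.
one_read_port = access_cycles(2, 1)
# Two read ports (as in the current register file): one fetch stage.
two_read_ports = access_cycles(2, 2)
```

The write-back is left out of the last two cases on the assumption that
it uses its own port and overlaps with later fetches.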
> We have a nice MIPS-like pipeline working. Going to a different
> architecture would require a completely different pipeline, and
> frankly, I don't know how to lay it out so that it's efficient. Most
> well-pipelined processors use this REG-REG RISC architecture like
> MIPS, at least on an abstract level, but usually quite concretely.
>> Something else I was thinking about relates to using the same controller
>> core for both PCI and VGA duties. We effectively have to be able to
>> context switch to do this, and we have to be able to do it quickly to
> Did you mean DMA and VGA? We'd never be doing DMA at the same time as VGA.
> If you're referring to the fact that it has to handle VGA translation
> at the same time as intercepting PCI transactions so it can do the
> rest of the VGA stuff, then you're right.
Are you positive we won't be doing DMA at the same time as VGA
transactions? We won't be initiating DMA, but we may very well be the
target. But, yes, I was referring to PCI and VGA.
>> meet PCI timing requirements. I don't know what our BRAM budget is
>> right now, but could we effectively have two sets of memory/registers
>> and stack for the core. When we need to context switch we switch BRAMS.
> Switching BRAMs involves adding multiplexing. I might suggest instead
> that we virtually break the BRAM into four sections. Two stacks and
> two reg files of 128 entries each. This way, the context switch
> involves changing static address inputs to the BRAM.
This is certainly faster and, as long as you have enough space, simpler
to design for. Your context controls the high bits of your register
address. If you separate scratch RAM from registers, you can do the
same thing there as well if needed.
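
As a sketch of how the addressing could work, assuming a 512-entry BRAM
split into four 128-entry sections: the bit assignment below (bit 8
selects stack vs. registers, bit 7 selects the context) is my guess at
one workable layout, not a settled decision, and the names are made up.

```python
SECTION_SIZE = 128  # entries per section (two reg files, two stacks)
BRAM_DEPTH = 512    # assumed BRAM depth


def bram_address(context: int, is_stack: bool, index: int) -> int:
    """Form a 9-bit BRAM address from a context bit, a region bit, and a
    7-bit index. The bit positions are an assumption for illustration."""
    if context not in (0, 1):
        raise ValueError("context must be 0 or 1")
    if not 0 <= index < SECTION_SIZE:
        raise ValueError("index must fit in 7 bits")
    # bit 8: region (0 = registers, 1 = stack, i.e. the upper half)
    # bit 7: context, driven statically by the context-switch logic
    # bits 6..0: register number or stack offset from the instruction
    return (int(is_stack) << 8) | (context << 7) | index


# The same register number lands in a different section per context.
r5_ctx0 = bram_address(0, False, 5)
r5_ctx1 = bram_address(1, False, 5)
```

A context switch then only flips the value driven onto bit 7; nothing is
copied and no extra multiplexer sits in the data path.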
>> Context switching does of course bring us back to the problem of the
>> multiplier. If multiplying doesn't stall the pipe waiting for the
>> answer, then we really don't want to context switch (or interrupt) in
>> the middle of a multiply. This causes all sorts of problems though since
> I think we can cope with this. Petter and I both have some ideas
> about handling the context switch. As long as the timing isn't
> critical (you can't read the product early, but you can read it late),
> then we can manage it.
My concern is that on interrupt or context switch, a multiply is needed
early on in the execution. Two independent threads of execution should
not have to know what the other was doing, especially since interrupts
cannot know what the other was doing. If the code has to check whether
a multiply is pending from another execution path before submitting its
own, this has a further impact on multiply performance.
>> we are effectively working in a realtime environment. If we need to go
>> service a PCI transaction, we can't wait 10-20 cycles for a pending
>> multiply to finish. This means that we have to have the output be
>> context (or interrupt) aware. If the multiplier is context aware then
>> the answer could be written to a separate output as necessary.
> If the multiplier is pipelined, then we can just give it more than one
> set of control ports. One for user context, the other for interrupt.
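
A toy model of what I understand this to mean, assuming each operation
carries a context tag down the pipe and is written to that context's own
result register when it comes out. The class name, the latency of 4, and
the two-context split are all hypothetical.

```python
from collections import deque


class TaggedMultiplier:
    """Toy model of a pipelined multiplier with one result register per
    context. Not the oga1hq design, just an illustration."""

    def __init__(self, latency: int = 4, contexts: int = 2):
        # One slot per pipeline stage; None marks an empty stage.
        self.pipe = deque([None] * latency)
        # One result register per context (e.g. 0 = user, 1 = interrupt).
        self.result = [None] * contexts

    def clock(self, issue=None):
        """Advance one cycle. `issue` is None or a (context, a, b) tuple."""
        done = self.pipe.pop()  # operation leaving the last stage
        if done is not None:
            ctx, product = done
            self.result[ctx] = product  # held until that context overwrites it
        if issue is None:
            self.pipe.appendleft(None)
        else:
            ctx, a, b = issue
            self.pipe.appendleft((ctx, a * b))


m = TaggedMultiplier(latency=4)
m.clock((0, 3, 5))   # user context starts a multiply
m.clock((1, 2, 7))   # interrupt arrives and issues its own right away
for _ in range(3):
    m.clock()
user_done_first = list(m.result)  # user product is out, interrupt's is not
m.clock()
both_done = list(m.result)
```

Neither context has to check whether the other has a multiply in flight,
and a product can be read late because its register is only overwritten
by a later multiply from the same context.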
>> Which brings me to a question that has been tickling the back of my head
>> for a bit. Why aren't we using the multipliers embedded in the FPGA? I
> The XP10 doesn't have any embedded multipliers.
I was reminded of that. :)