[Open-graphics] Multipliers in oga1hq
urkedal at nbi.dk
Sat Sep 1 15:08:48 EDT 2007
On 2007-09-01, Timothy Normand Miller wrote:
> I'm not sure we want to add additional MUXing after the REG stage. It
> might be better to move it into the MEM stage. This is especially not
> a problem since we have gobs of time to schedule when the product is
My idea is to put it after the register fetches are registered, in
parallel with the other QOP_ cases in the ALU. Note that the expensive
part of the ALU, the add, is already MUXed separately on the ALU output,
    assign res_o = res_is_add_r ? res_if_add_r : res_r;
where res_is_add_r is registered. So I don't think adding a multiply
case (or actually, replacing the current case with "sink" and "source"
cases) will have an effect on timing. At least in theory, though as
pointed out not long ago, the synthesis results are easily disturbed.
> Having a special instruction to initiate the multiply would save us
> one cycle (worth it?). Otherwise, there would be two moves into the
> scratch/io space. But the product is only a single word fetch.
> Putting it into r31 would save a cycle, because we wouldn't have to
> move it into a register first before using it as an operand to another
> My main concerns are the extra multiplexing logic hurting our max clock rate.
I think the IO approach is nice due to the fact that it completely
decouples the operation from the CPU, thus separate people can easily
maintain the two pieces of code. However, that presumes we don't push
the result into a fixed register. But, it seems a bit odd to me if we
write the operands to IO ports 8 and 9 and magically get back the result
in r31 without an IO-read; did I misunderstand?
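
To make the fully IO-based variant concrete, here is a rough behavioural
sketch in Python (not HDL, and not the oga1hq code). Ports 8 and 9 for the
operands come from the discussion above; the result port, the 32-bit word
width and all class and function names are assumptions made for
illustration. Latency is not modelled.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word width


class IoMultiplier:
    """Multiplier that is reached only through IO ports (behavioural model)."""

    PORT_A = 8       # operand ports as mentioned in the thread
    PORT_B = 9
    PORT_RESULT = 8  # hypothetical: the thread does not say which port holds the product

    def __init__(self):
        self.a = 0
        self.b = 0

    def io_write(self, port, value):
        """Latch one operand; any other port is rejected."""
        if port == self.PORT_A:
            self.a = value & MASK32
        elif port == self.PORT_B:
            self.b = value & MASK32
        else:
            raise ValueError("not a multiplier operand port: %d" % port)

    def io_read(self, port):
        """Return the low word of the product of the latched operands."""
        if port != self.PORT_RESULT:
            raise ValueError("not the multiplier result port: %d" % port)
        return (self.a * self.b) & MASK32


def multiply_via_io(mul, x, y):
    """The three-instruction sequence: two IO writes, then one IO read."""
    mul.io_write(IoMultiplier.PORT_A, x)
    mul.io_write(IoMultiplier.PORT_B, y)
    return mul.io_read(IoMultiplier.PORT_RESULT)
```

The point of the sketch is the decoupling: the CPU side only needs generic
IO reads and writes, and nothing here touches a fixed register such as r31.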
Also, it's worth considering that even if I am right about the timing,
we save a bit of instruction decoding logic by using a fully IO-port based
approach. I think the trade-off is 3-instruction multiply in 36 cycles
versus 1-instruction multiply in 34 cycles. So, it's not a big
> I think we may in fact need interrupts, and I'm struggling with it.
> The problem is VGA graphics modes. In 640x480x16 and such,
> framebuffer reads and writes are not simple accesses. You can apply
> raster operators to writes, and you can make reads fill a blt buffer
> larger than your word size so that when you write, it causes more than
> a word size to get written out. This way, you can bitblt faster than
> you can move data over the bus.
> Now, for VGA mode, mostly what the controller does is read VGA text or
> pixels and convert them in the background into pixels suitable for our
> video controller. At the same time, we want the controller to handle
> the extra smarts of VGA. One way to do this is to support interrupts;
> when a PCI access comes in, we can intercept it and do the extra
> stuff. While writes could be queued for us to process periodically,
> reads have to be processed as soon as possible.
I don't know much about VGA, so maybe someone can enlighten me:
* A framebuffer driver can use the 3D pipeline directly.
* The BIOS VGA code we write could use the 3D pipeline directly.
If I recall previous discussions, I suspect the sole reason we need VGA
in hardware is that existing code bypasses the BIOS calls. Right?
> Interrupts won't stall lower parts of the pipeline, but they would
> divert the instruction flow. We need to determine how this will
> affect our static instruction scheduling.
> Correct me if I'm wrong, but a subroutine call stores the return
> address into r31, right? Of course, since that's under main program
> control, no problem! But with interrupts, I think we should dump the
> return address into a redefined address in the scratch memory.
The return address is directed to the write-back register encoded in the
instruction word. We probably need a special instruction to return from
an interrupt, since the nanocontroller may need to know that it is no
longer running interrupt code. Therefore it may make sense to store the
interrupt return-address in a dedicated place.
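
A small behavioural sketch in Python of the distinction drawn here: an
ordinary call writes the return address to the write-back register named in
the instruction, while an interrupt saves it in a dedicated slot and needs
its own return instruction to clear the "in interrupt" state. All names are
hypothetical, the register count of 32 is an assumption, and delay slots
are ignored (the return address is simply pc + 1).

```python
class NanoController:
    """Toy model of call versus interrupt return-address handling."""

    def __init__(self):
        self.pc = 0
        self.regs = [0] * 32       # assumed register count
        self.in_interrupt = False
        self.irq_return_pc = 0     # dedicated storage, outside the register file

    def call(self, target, wb_reg):
        """Ordinary call: return address goes to the encoded write-back register."""
        self.regs[wb_reg] = self.pc + 1
        self.pc = target

    def take_interrupt(self, vector):
        """Divert to the handler; refuse if a handler is already running."""
        if self.in_interrupt:
            return False
        self.irq_return_pc = self.pc
        self.in_interrupt = True
        self.pc = vector
        return True

    def return_from_interrupt(self):
        """Special return: restore the PC and leave interrupt mode."""
        if not self.in_interrupt:
            raise RuntimeError("return-from-interrupt outside a handler")
        self.pc = self.irq_return_pc
        self.in_interrupt = False
```

Because the interrupt never touches a general register, the main program's
own return addresses survive the handler unchanged.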
> What about context switches? Should we require the ISR to copy
> registers to the scratch memory? That's a fair amount of overhead,
> depending on how many we need to clobber. How about doubling the size
> of the register file? The lower half for normal execution, the upper
> half for interrupts. (Like how the Z80 did it.) (In this case, the
> interrupt return address appears in what we might internally call
If interrupt handlers are simpler than normal code, we could just decide
they will only use a few registers; otherwise, doubling the register
file makes sense. We don't need to change the instruction word. The
nanocontroller will arrange to fill in the upper bit on read and
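
The doubled register file can be sketched as follows, again as a Python
behavioural model rather than HDL. It assumes 5-bit register fields (r31 is
the highest register mentioned in this thread) and that the extra address
bit is simply the "in interrupt" flag; the class and method names are
hypothetical.

```python
class BankedRegisterFile:
    """Two banks behind one 5-bit register field (behavioural model)."""

    REGS_PER_BANK = 32  # assumption: 5-bit register fields, r0..r31

    def __init__(self):
        self.storage = [0] * (2 * self.REGS_PER_BANK)

    def _physical(self, reg, in_interrupt):
        """Prepend the bank bit; the instruction word itself is unchanged."""
        if not 0 <= reg < self.REGS_PER_BANK:
            raise ValueError("register index out of range: %d" % reg)
        return (int(bool(in_interrupt)) << 5) | reg

    def read(self, reg, in_interrupt):
        return self.storage[self._physical(reg, in_interrupt)]

    def write(self, reg, value, in_interrupt):
        self.storage[self._physical(reg, in_interrupt)] = value
```

With this arrangement an interrupt handler writing its r3 cannot clobber the
main program's r3, so no registers need to be copied to scratch memory.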
> Oh, and don't forget the delayed branch issue and how it'll
> affect interrupt--one extra instruction from the main program will get
> executed, so the return PC must account for that, and be sure to
> consider the situation where the interrupt arrives at the same time as
> a branch instruction is being fetched in the main program.
I think the second issue you raise is the most severe. One way to solve
it is to disable interrupts while executing the delay slot.
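
As a rough illustration of that rule, the Python sketch below works on a
dynamic trace of fetched instructions and finds the first fetch at which a
pending interrupt may be accepted. It assumes a single delay slot per
branch and that "accepting at fetch i" means instruction i is the one
resumed afterwards; the function name and the trace representation are
made up for this example.

```python
def first_accept_slot(is_branch, irq_slot):
    """Earliest fetch index >= irq_slot that is not a delay slot.

    is_branch[i] is True when the i-th fetched instruction is a branch;
    the instruction fetched right after a branch is its delay slot.
    Returns None if the trace ends before the interrupt can be accepted.
    """
    for i in range(irq_slot, len(is_branch)):
        in_delay_slot = i > 0 and is_branch[i - 1]
        if not in_delay_slot:
            return i
    return None
```

For a trace such as [False, True, False, False], an interrupt arriving at
fetch 2 (the delay slot of the branch at fetch 1) is held until fetch 3,
while one arriving at fetch 1 is accepted before the branch itself, which
is then simply re-executed on return.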