[Open-graphics] Multipliers in oga1hq
Timothy Normand Miller
theosib at gmail.com
Sat Sep 1 18:03:22 EDT 2007
On 9/1/07, Petter Urkedal <urkedal at nbi.dk> wrote:
> On 2007-09-01, Timothy Normand Miller wrote:
> > I'm not sure we want to add additional MUXing after the REG stage. It
> > might be better to move it into the MEM stage. This is especially not
> > a problem since we have gobs of time to schedule when the product is
> > grabbed.
> My idea is to put it after the register fetches are registered and
> parallel to the other QOP_ cases in the ALU. Note that the expensive
> part of the ALU, the add, is already MUXed separately on the ALU output,
> assign res_o = res_is_add_r? res_if_add_r : res_r;
> where res_is_add_r is registered. So, I don't think adding a multiply
> case (or actually, replacing the current case with "sink" and "source"
> cases), will have an effect on timing. At least in theory, though as
> pointed out not long ago, the synthesis results are easily disturbed.
I think you're right that 3-to-1 or 4-to-1 shouldn't be much worse
than 2-to-1. We'll see. (Of course, we cannot inject it _before_ the
register, because that's built into the memory for the register file.)
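For reference, here is a minimal behavioral sketch (Python, not RTL) of what widening that output select from 2-to-1 to 3-to-1 would look like. The names ResultSel, alu_output and res_product are made up for illustration; only the add/other split comes from the quoted assign statement.

```python
from enum import Enum


class ResultSel(Enum):
    """Hypothetical registered select for the ALU output MUX."""
    ADD = 0      # adder result (the res_if_add_r path in the quoted line)
    PRODUCT = 1  # assumed extra case for the multiplier result
    OTHER = 2    # every other QOP_ case (the res_r path in the quoted line)


def alu_output(sel, res_add, res_product, res_other):
    """Model the output MUX: pick one of three precomputed results."""
    if sel is ResultSel.ADD:
        return res_add
    if sel is ResultSel.PRODUCT:
        return res_product
    return res_other
```

Whether the extra input actually costs anything in timing is something only synthesis can tell us.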
> I think the IO approach is nice due to the fact that it completely
> decouples the operation from the CPU, thus separate people can easily
> maintain the two pieces of code. However, that presumes we don't push
> the result into a fixed register. But, it seems a bit odd to me if we
> write the operand to IO-ports 8 and 9 and magically get back the result
> in r31 without an IO-read; did I misunderstand?
One extreme is to have an instruction that initiates the multiply
"directly" and have the result magically appear in r31. Another
extreme is to use I/O ports 8 and 9 for the inputs and then later read
from port 10 for the output. There are also combinations of those
things, and we shouldn't avoid one just because it seems weird.
If we're going to 'override' r31, I think we should make it dependent
only on whether or not there's a pending product. That is, if a
product is not pending, we read the real r31. If a product is
pending, reading r31 causes the product to be fetched AND resets the
override. (Thus, we have only one opportunity to fetch the product
before it effectively disappears.) Writes to r31 would have no effect
either way. This is a bit weird, I have to admit, so maybe we should
consider other ways of going about it.
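To make the override rule concrete, here is a small behavioral sketch in Python. It is only a model of the semantics described above, not the register file RTL. The class and method names are invented, and I am assuming that "writes have no effect" means a write still lands in the real r31 but never sets or clears the override.

```python
class RegisterFile:
    """Behavioral model of the proposed r31 override (hypothetical names)."""

    PRODUCT_REG = 31

    def __init__(self, nregs=32):
        self.regs = [0] * nregs
        self.product = None  # None means no product is pending

    def multiplier_done(self, value):
        """Called when the multiplier has a result ready."""
        self.product = value

    def read(self, index):
        """Read a register; a pending product shadows r31 exactly once."""
        if index == self.PRODUCT_REG and self.product is not None:
            value, self.product = self.product, None  # fetch AND reset
            return value
        return self.regs[index]

    def write(self, index, value):
        """Assumption: writes go to the real register and leave the
        override state untouched."""
        self.regs[index] = value
```

Usage: after multiplier_done(42), the first read(31) returns 42 and the second returns whatever the real r31 holds.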
> Also, it's worth considering that even if I am right about the timing,
> we save a bit of instruction decoding logic by using fully IO-port based
> approach. I think the trade-off is 3-instruction multiply in 36 cycles
> versus 1-instruction multiply in 34 cycles. So, it's not a big
That's what I was thinking, but there's also an argument to be made
for using some dedicated instructions. When doing it only through I/O,
we need to execute three instructions (and use up the corresponding
three additional cycles), two to set up the product, and one to fetch
the result. When using a dedicated instruction and a register
override, we need one extra instruction (one to set up the multiply,
zero to fetch it, because it can be an operand directly in another
> I don't know much about VGA, so maybe someone can enlighten me:
> * A framebuffer driver can use the 3D pipeline directly.
Depends on what you mean by a "framebuffer driver." In the usual
sense, there is no use of an engine in a framebuffer driver. In a
proper, fully-accelerated driver, it would use the 3D pipeline.
> * The BIOS VGA code we write could use the 3D pipeline directly.
If you're talking about the VESA BIOS extensions, yes. Otherwise, the
nano controller would be doing things that probably wouldn't benefit
from using the 3D engine.
> If I recall previous discussions, I suspect the sole reason we need VGA
> in hardware is that existing code by-passes the BIOS calls. Right?
> > Interrupts won't stall lower parts of the pipeline, but they would
> > divert the instruction flow. We need to determine how this will
> > affect our static instruction scheduling.
> > Correct me if I'm wrong, but a subroutine call stores the return
> > address into r31, right? Of course, since that's under main program
> > control, no problem! But with interrupts, I think we should dump the
> > return address into a predefined address in the scratch memory.
> The return address is directed to the write-back register encoded in the
> instruction word.
Oh, yeah. Even better. :)
> We probably need a special instruction to return from
> an interrupt, since the nanocontroller may need to know that it is no
> longer running interrupt code. Therefore it may make sense to store the
> interrupt return-address in a dedicated place.
Yes, quite true. There's no benefit to making it referenceable in I/O
space, because you need a special RTI instruction anyhow to change the
> > What about context switches? Should we require the ISR to copy
> > registers to the scratch memory? That's a fair amount of overhead,
> > depending on how many we need to clobber. How about doubling the size
> > of the register file? The lower half for normal execution, the upper
> > half for interrupts. (Like how the Z80 did it.) (In this case, the
> > interrupt return address appears in what we might internally call
> > r63.)
> If interrupt handlers are simpler than normal code, we could just decide
> it will only use a few registers, but otherwise doubling the register
> file makes sense. We don't need to change the instruction word. The
> nanocontroller will arrange to fill in the upper bit on read and
Yeah. Sort of an "interrupt mode" state bit that is used as the
higher-order bit of the register index.
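As a sketch of that index mapping (Python, illustrative only; the function name and the 5-bit field width are assumptions based on the r0..r31 numbering used in this thread):

```python
REG_INDEX_BITS = 5  # assumed: 32 architectural registers, r0..r31


def physical_index(reg, interrupt_mode):
    """Map an instruction-word register number to a slot in the
    doubled register file, using the mode bit as the top index bit."""
    low = reg & ((1 << REG_INDEX_BITS) - 1)
    return (int(bool(interrupt_mode)) << REG_INDEX_BITS) | low
```

With this mapping, r31 in interrupt mode lands in physical slot 63, which is the "r63" mentioned earlier, while r31 in normal mode stays at 31.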
> > Oh, and don't forget the delayed branch issue and how it'll
> > affect interrupts--one extra instruction from the main program will get
> > executed, so the return PC must account for that, and be sure to
> > consider the situation where the interrupt arrives at the same time as
> > a branch instruction is being fetched in the main program.
> I think the second issue you raise is the most severe. One way to solve
> it is to disable interrupts while executing the delay-slot.
It's probably the case that either the branch instruction or its delay
slot will cause a problem but not both. But maybe both. We probably
need to draw a timing diagram to work it out.
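Until we have that diagram, here is a toy instruction-level model in Python that shows one way the delay slot can go wrong. It is not a model of our actual pipeline: it assumes a single delay slot, an ISR that saves only one return PC and resumes sequentially, and interrupts that are taken between instructions. All names are made up.

```python
def run(branches, length, irq_step=None, mask_delay_slot=False, max_steps=100):
    """Return the main-program PCs executed, in order.

    branches: dict mapping a branch's pc to its target; all other
              addresses hold plain instructions.
    irq_step: step at which an interrupt request arrives (None = never).
    mask_delay_slot: hold the interrupt off while a delay slot is due.
    """
    pc, npc = 0, 1  # current instruction and the one after it
    trace = []
    irq_pending = False
    for step in range(max_steps):
        if pc >= length:
            break
        if step == irq_step:
            irq_pending = True
        in_delay_slot = npc != pc + 1  # a taken branch is still in flight
        if irq_pending and not (mask_delay_slot and in_delay_slot):
            irq_pending = False
            # ISR body not modelled. Only pc was saved, so on return
            # execution resumes sequentially and any branch target is lost.
            npc = pc + 1
        trace.append(pc)
        if pc in branches:
            pc, npc = npc, branches[pc]  # delay slot first, then target
        else:
            pc, npc = npc, npc + 1
    return trace
```

With a branch at address 1 targeting address 4 in a 6-instruction program, the undisturbed trace is 0, 1, 2, 4, 5. An interrupt arriving just before the delay slot (step 2) with a single saved PC yields 0, 1, 2, 3, 4, 5, so the branch is silently dropped. Holding the interrupt off until after the delay slot restores the correct trace. Saving both PCs would be another way out, but that is beyond what this sketch models.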
Timothy Normand Miller
Open Graphics Project