[Open-graphics] Synthesizing oga1hq
Mark
mark at jarvin.net
Mon Aug 13 16:19:50 EDT 2007
Petter Urkedal wrote:
> On 2007-08-13, Mark wrote:
>> Petter Urkedal wrote:
>>> If we want to do a compromise, we could instead implement 32x16->32
>>> multiply. That is, two multipliers in the ALU stage, and an adder in
>>> the IO stage. If again we incorporate the shifts, we are down to 4
>>> instructions to compute a 32x32->32 product:
>>> mul_32x32_from_32x16:
>>>     mul/h  r0, r1, r3   ; r3 := r0 * r1[31:16]
>>>     mul/l  r0, r1, r2   ; r2 := r0 * r1[15:0]
>>>     shift  r3, 16, r3
>>>     add    r2, r3, r2
>>> Note that register forwarding does not work fully for the mul
>>> instruction in this case, since it's split over two stages. There is a
>>> 1 cycle delay before we can use the result, which means this is the only
>>> way to order the instructions.
>>> My guess is that the 16x16->32 multiplier with shifts on both the second
>>> operand and the result is much cheaper than the extra adder and
>>> multiplier of the 32x16->32 solution, and we save only one instruction
>>> by going to 32x16->32.
>> How about if the shift was implicit in mul/h? That should be cheap in
>> terms of hardware and it would decrease the cost of the soft 32x32 multiply
>> to three cycles -- wouldn't it? (Sorry -- I have yet to read up on your
>> architecture in detail.)
>
> That's what I did in the 16x16->32 case, but in this case, the two-stage
> mul/l instruction will not have a result ready at the point of the shift
> instruction, so we can't save that cycle anyway.
>
The benefit, in my mind, is that you get a slot in which you can
schedule something else.
    ALU           IO
    issue mulh    -
    issue mull    mulh completes
    <free>        mull completes
    issue add     <free>
    -             add ready
Besides, if the adder is in the IO stage, can't you pack that all into
three cycles?
    ALU           IO
    issue mulh    -
    issue mull    mulh completes
    issue add     mull completes
    -             perform addition
Of course, I could be missing something fundamentally obvious here.
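The two schedules above can be reproduced with a toy timing model. This is only a sketch under assumptions that are not from the actual oga1hq design: an in-order, single-issue pipeline, a multiply result usable two cycles after issue, an ALU-stage add usable one cycle after issue, and an IO-stage adder that reads its operands one cycle after issue. The function name schedule and the tuple layout are made up for illustration.

```python
def schedule(program):
    """Return the issue cycle of each instruction on a hypothetical
    in-order, single-issue, two-stage (ALU, IO) pipeline.

    Each entry is (name, dest, srcs, latency, read_stage):
      latency    = cycles from issue until dest can be consumed (assumed)
      read_stage = 0 if operands are read in the ALU stage, 1 if in IO
    """
    ready = {}      # register -> first cycle its value can be consumed
    issue = {}      # instruction name -> issue cycle
    next_slot = 0   # earliest free issue slot
    for name, dest, srcs, latency, read_stage in program:
        # Stall until every source is ready at the stage that reads it.
        cycle = max([next_slot] + [ready.get(s, 0) - read_stage for s in srcs])
        issue[name] = cycle
        ready[dest] = cycle + latency
        next_slot = cycle + 1
    return issue

# mulh with the implicit shift, followed by mull (both assumed two-stage).
MUL = [("mulh", "r3", ["r0", "r1"], 2, 0),
       ("mull", "r2", ["r0", "r1"], 2, 0)]

add_in_alu = MUL + [("add", "r2", ["r2", "r3"], 1, 0)]  # adder in the ALU stage
add_in_io = MUL + [("add", "r2", ["r2", "r3"], 2, 1)]   # adder in the IO stage

print(schedule(add_in_alu))
print(schedule(add_in_io))
```

Under these assumptions the add issues in cycle 3 when the adder sits in the ALU stage, which leaves cycle 2 free for an unrelated instruction, and in cycle 2 when the adder sits in the IO stage.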
Just to be clear, I'm suggesting that mulh be something like
rC[31:16] := rA * rB[31:16]
rC[15:0] := 0
All this requires is a 2:1 mux at the output of the multiplier (possibly
retimed into a later stage). I'm not suggesting you reuse your ALU
barrel-shifter for this simple, special-purpose shift.
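As a sanity check on the arithmetic, here is a small software model of the suggested mulh alongside an assumed mull that returns the low 32 bits of rA * rB[15:0]. The function names and the 32-bit masking are illustrative assumptions, not a description of the real hardware; the point is only that adding the two results gives the low 32 bits of the full product.

```python
MASK32 = 0xFFFFFFFF

def mull(ra, rb):
    # Assumed semantics: low 32 bits of rA * rB[15:0].
    return (ra * (rb & 0xFFFF)) & MASK32

def mulh(ra, rb):
    # Suggested semantics: rC[31:16] := rA * rB[31:16], rC[15:0] := 0.
    # Only the low 16 bits of the partial product fit in rC[31:16].
    partial = ra * ((rb >> 16) & 0xFFFF)
    return (partial & 0xFFFF) << 16

def mul32(ra, rb):
    # 32x32->32 product: one add combines the two partial products.
    return (mull(ra, rb) + mulh(ra, rb)) & MASK32

for a, b in [(3, 5), (0x12345678, 0x9ABCDEF0), (MASK32, MASK32)]:
    assert mul32(a, b) == (a * b) & MASK32
```

The bits of rA * rB[31:16] above bit 15 would land at or above bit 32 after the shift, so dropping them loses nothing in a 32-bit result.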