[Open-graphics] Synthesizing oga1hq

Mark mark at jarvin.net
Mon Aug 13 16:19:50 EDT 2007


Petter Urkedal wrote:
> On 2007-08-13, Mark wrote:
>> Petter Urkedal wrote:
>>> If we want to do a compromise, we could instead implement 32x16->32
>>> multiply.  That is, two multipliers in the ALU stage, and an adder in
>>> the IO stage.  If again we incorporate the shifts, we are down to 4
>>> instructions to compute a 32x32->32 product:
>>> mul_32x32_from_32x16:
>>>         mul/h   r0, r1, r3	; r3 := r0 * r1[31:16]
>>>         mul/l   r0, r1, r2	; r2 := r0 * r1[15:0]
>>>         shift   r3, 16, r3
>>>         add     r2, r3, r2
>>> Note that register forwarding does not work fully for the mul
>>> instruction in this case, since it's split over two stages.  There is a
>>> 1 cycle delay before we can use the result, which means this is the only
>>> way to order the instructions.
>>> My guess is that the 16x16->32 multiplier with shifts on both the second
>>> operand and the result is much cheaper than the extra adder and
>>> multiplier of the 32x16->32 solution, and we save only one instruction
>>> by going to 32x16->32.
>> How about if the shift was implicit in mul/h?  That should be cheap in 
>> terms of hardware and it would decrease the cost of the soft 32x32 multiply 
>> to three cycles -- wouldn't it? (Sorry -- I have yet to read up on your 
>> architecture in detail.)
> 
> That's what I did in the 16x16->32 case, but in this case, the two-stage
> mul/l instruction will not have a result ready at the point of the shift
> instruction, so we can't save that cycle anyway.
> 
The benefit, in my mind, is that you get a slot in which you can 
schedule something else.

ALU                     IO
issue mulh              -
issue mull              mulh completes
<free>                  mull completes
issue add               <free>
-                       add ready

Besides, if the adder is in the IO stage, can't you pack that all into 
three cycles?

ALU                     IO
issue mulh              -
issue mull              mulh completes
issue add               mull completes
-                       perform addition

Of course, I could be missing something fundamentally obvious here.

Just to be clear, I'm suggesting that mulh be something like
   rC[31:16] := (rA * rB[31:16])[15:0]
   rC[15:0]  := 0
i.e. the low half of the partial product lands in the upper half of rC.
All this requires is a 2:1 mux at the output of the multiplier (possibly 
retimed into a later stage).  I'm not suggesting you reuse your ALU 
barrel-shifter for this simple, special-purpose shift.
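
To check the arithmetic, here is a minimal Python sketch of the
three-instruction sequence. It assumes unsigned 32-bit registers and
assumes mull is the 32x16->32 multiply truncated to 32 bits. The names
mulh, mull and mul32 are hypothetical models for illustration, not the
actual oga1hq instruction definitions.

```python
MASK32 = 0xFFFFFFFF  # registers are assumed to be 32 bits wide


def mulh(a, b):
    """Model of the proposed mulh: multiply a by b[31:16] and place the
    low 16 bits of the product in the upper half of the result."""
    a &= MASK32
    b &= MASK32
    partial = a * (b >> 16)            # a * b[31:16]
    return (partial & 0xFFFF) << 16    # rC[31:16] := low half, rC[15:0] := 0


def mull(a, b):
    """Model of mull: a * b[15:0], truncated to 32 bits (assumption)."""
    a &= MASK32
    b &= MASK32
    return (a * (b & 0xFFFF)) & MASK32


def mul32(a, b):
    """32x32->32 product built from mulh, mull and one add."""
    return (mulh(a, b) + mull(a, b)) & MASK32
```

The sequence works because only the low 16 bits of a * b[31:16] survive
a left shift by 16 in a 32-bit register, so the mux at the multiplier
output loses nothing that the separate shift would have kept.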

