[Open-graphics] Synthesizing oga1hq

Mark mark at jarvin.net
Tue Aug 14 12:20:58 EDT 2007


Timothy Normand Miller wrote:
> On 8/14/07, Farhan Mohamed Ali <farhan at cmu.edu> wrote:
>> Speaking of multipliers, i was wondering what speed we are targeting for the floatmult25 or the entire FPGA in general? I have managed to get a 3 stage version working at <9ns according to the tools (targeted for 3S1500 but i'm not sure how accurate the auto generated timing constraints are. Seems a bit quirky to me, it does not respond as i expect to changes i make (as in, why do changes i make in stage1 affect the critical path which is in stage2, and weird stuff like that). I'm used to working on full custom ASICs where i control everything, so i don't find the synthesizer to be very intuitive :\
>>
> It's trying to do the P&R automatically, and it's using simulated
> annealing to do it.  It's an optimization problem using randomization.
>  So to begin with, what you get isn't deterministic.  But since
> there's competition for resources, changing something in one place
> will affect everything else.  It can be frustrating sometimes.  We
> find that when we're on the edge of being able to meet timing, we'll
> have to run P&R several times before it gives us what we want.
> 
In addition to the nondeterminism in simulated annealing (if indeed 
Xilinx uses simulated annealing for placement, which isn't necessarily 
the case), XST probably does at least some retiming on your circuit 
(depending on your synthesis script), which could move critical paths 
between stages.  There's tech mapping, packing, and routing, too -- all 
of which are heuristic and nonlinear.

If you ask me, the bigger problem is that the whole synthesis flow is 
"nonlinear", insofar as a small change in the input could result in a 
massive perturbation in the output.  Xilinx's tools (and probably 
others) are deterministic insofar as identical input will always yield 
the same output (I'm fairly sure of this).  However, a tiny perturbation 
in the input (RTL, constraints...) can completely change the final 
implementation.

You can work around the unpredictability in critical portions of the 
design by explicitly instantiating LUTs, carry chains, etc. and locking 
them to specific locations on the part (again, at least with Xilinx). 
You can use FPGA Editor to rewire a post-PAR design.  You don't _have_ 
to use synthesis at all if the 
control/performance/effort/maintainability/portability trade-off of 
doing full-custom designs looks right.

I think that unless you impose external timing constraints, Xilinx's 
tools try for the lowest possible delay first and minimize area as a 
secondary goal (possibly with power optimizations as a tertiary goal in 
recent releases).  I'd worry less about constraints and more about 
synthesizing for the right part if you're looking for a reasonable 
timing estimate.  I'm assuming "the right part" means XC3S4000-fg676-5 
or LFXP10C-5F256C based on the svn BOM, datasheets, and this discussion.

>> Back to the XP10, if it doesn't have hard multipliers, we can make our own :) But again i'm not sure how well that works out for FPGAs.
>>
> 
> Yeah.  It's not worth the extra logic to fully pipeline it (nor could
> we keep a 32-stage multiplier pipeline fully fed).  We could have
> separate logic that would run in parallel.  Or we could have special
> mult-stepping instructions.  With the latter, partial multiplies can
> be optimized to take fewer cycles.
> 
> Here's what I'm thinking....
> 
> If we had a stand-alone multstep instruction, it would need four
> operands:  (1) an accumulator, (2) the mutiplicand, (3) one bit from
> the multiplier to determine whether or not the multiplicand is added
> to the accumulator, and (4) a loop counter from which to compute the
> multiplicand left-shift and which bit to take from the multiplier.
> 
> Now, I don't like the idea of adding extra state.  What if we want to
> add the ability to handle interrupts?  But we can tinker with the
> idea:  Have one special instruction whose job is to load the counter
> and the multiplier.  The step instruction would have the accumulator
> (as a source and the target) and the muliplicand.  Each step would
> step the counter, shift the multiplier, and add (or not, depending on
> the bit from the multiplier) the shifted multiplicand to the
> accumulator.  That puts a shifter in line with an adder, though, so
> maybe we want to load the multiplier and multiplicand in the first
> instruction (so they're shifted 1 each cycle) and then specify the
> counter and accumulator in the step?  We'll have to work out the
> permutations.
> 
> BTW, the only reason to do this is because without it, a multiply
> would take at least 4 times longer due to the overhead of explicit
> shifts and branches.  Do we care?
> 

I would imagine Lattice's tools still infer multipliers and I believe 
there is a fast(ish) multiplier implementation for the Lattice XP 
architecture using LUTs and carry chains (it's alluded to in the datasheet).

A fixed one-bit shift after the adder should be so cheap as to make no 
odds.  It's just a two-input MUX -- it'll fit in one LUT, possibly 
packed with something else.

All of these implementation ideas seem to be very complex as compared to 
simply blocking the pipeline.  This nanocontroller isn't going to be an 
OOO ILP-exploiting powerhouse anyhow.  Why not just pipeline the 
multiplier enough to maintain the clock speed you're shooting for and 
let that determine the number of cycles that the multiply instruction 
blocks the ALU stage of the pipeline?  This eliminates polluting the 
instruction set with odd instructions, is straightforward to implement, 
and won't slow down the instruction stream any more than any other 
multicycle multiplication would (plus code size would be negligibly 
smaller).

A slightly more complex solution that could significantly pick up ILP 
would be to let the pipeline proceed and block only when the 
multiplication result is used (no avoiding that) or another 
multiplication is issued (assuming the multiplier is a simple serial 
implementation and not fully pipelined).  Whoever's writing assembly (or 
the assembler itself) could try to schedule multiplications so that the 
blocking condition occurs rarely, if at all.  Or, you could simply 
eliminate the hazard-detection logic altogether and presume that the 
coder or toolchain is dealing with it.

It should still be feasible to implement an early-exit for narrow 
multiplications.

Another approach might be to consider moving the nanocontroller to the 
Xilinx part to leverage the hard multipliers.  What's the interface 
between the Spartan and the XP?  Is it possible to leave the DMA on the 
XP and have the nanocontroller on the Spartan?  Or to move both to the 
Spartan?  How is the system partitioned between the two FPGAs?

What kind of code will the nanocontroller be running?  Who'll be writing 
code for it?  Will it be written in a high-level language or strictly in 
assembly?  What kind of throughput is required (or, in other words, 
where does the 10ns constraint come from)?  Is there any way to figure 
out how useful a 32x32 multiply is before worrying about how to 
implement it?

(Is all this documented somewhere?  If so, please forgive my ignorance 
and pass on the URL.)

I'm going to try to get a feel for the cost of multipliers on the 
Lattice part.  They haven't sent me a Synplify license yet, though.

Incidentally, sorry for coming out of nowhere with all this.  I've been 
lurking on the list for a couple of months and just spotted somewhere I 
felt I could chime in helpfully.  I'm a grad student at the University 
of Toronto interested in helping however I can.  If I'm in breach of any 
etiquette, please let me know.  Cheers,

Mark.


More information about the Open-graphics mailing list