[Open-graphics] Synthesizing oga1hq
Farhan Mohamed Ali
farhan at cmu.edu
Tue Aug 14 13:11:43 EDT 2007
On Tue, August 14, 2007 12:20 pm, Mark said:
> Timothy Normand Miller wrote:
>> On 8/14/07, Farhan Mohamed Ali <farhan at cmu.edu> wrote:
>>> Speaking of multipliers, i was wondering what speed we are targeting
>>> for the floatmult25 or the entire FPGA in general? I have managed to
>>> get a 3 stage version working at <9ns according to the tools
>>> (targeted for 3S1500 but i'm not sure how accurate the auto
>>> generated timing constraints are. Seems a bit quirky to me, it does
>>> not respond as i expect to changes i make (as in, why do changes i
>>> make in stage1 affect the critical path which is in stage2, and
>>> weird stuff like that). I'm used to working on full custom ASICs
>>> where i control everything, so i don't find the synthesizer to be
>>> very intuitive :\
>>>
>> It's trying to do the P&R automatically, and it's using simulated
>> annealing to do it. It's an optimization problem using randomization.
>> So to begin with, what you get isn't deterministic. But since there's
>> competition for resources, changing something in one place will affect
>> everything else. It can be frustrating sometimes. We find that when
>> we're on the edge of being able to meet timing, we'll have to run P&R
>> several times before it gives us what we want.
>>
> In addition to the nondeterminism in simulated annealing (if indeed
> Xilinx uses simulated annealing for placement, which isn't necessarily
> the case), XST probably does at least some retiming on your circuit
> (depending on your synthesis script), which could move critical paths
> between stages. There's tech mapping, packing, and routing, too -- all
> of which are heuristic and nonlinear.
>
> If you ask me, the bigger problem is that the whole synthesis flow is
> "nonlinear", insofar as a small change in the input could result in a
> massive perturbation in the output. Xilinx's tools (and probably others)
> are deterministic insofar as identical input will always yield the same
> output (I'm fairly sure of this). However, a tiny perturbation in the
> input (RTL, constraints...) can completely change the final
> implementation.
>
> You can work around the unpredictability in critical portions of the
> design by explicitly instantiating LUTs, carry chains, etc. and locking
> them to specific locations on the part (again, at least with Xilinx). You
> can use FPGA Editor to rewire a post-PAR design. You don't _have_ to use
> synthesis at all if the
> control/performance/effort/maintainability/portability trade-off of doing
> full-custom designs looks right.
>
> I think that unless you impose external timing constraints, Xilinx's
> tools try for the lowest possible delay first and minimize area as a
> secondary goal (possibly with power optimizations as a tertiary goal in
> recent releases). I'd worry less about constraints and more about
> synthesizing for the right part if you're looking for a reasonable timing
> estimate. I'm assuming "the right part" means XC3S4000-fg676-5 or
> LFXP10C-5F256C based on the svn BOM, datasheets, and this discussion.
>
You are right, a fixed input generates a fixed output. It's just that
even small changes in the RTL change the synthesis results in
unpredictable ways, and i was wondering whether i was doing something
wrong, but apparently that's just the way it works. I did not set any
constraints, as i'm just learning to use the tools. I will be of more help when it comes to the ASIC version, i hope.
>>> Back to the XP10, if it doesn't have hard multipliers, we can make
>>> our own :) But again i'm not sure how well that works out for FPGAs.
>>>
>>>
>>
>> Yeah. It's not worth the extra logic to fully pipeline it (nor could
>> we keep a 32-stage multiplier pipeline fully fed). We could have
>> separate logic that would run in parallel. Or we could have special
>> mult-stepping instructions. With the latter, partial multiplies can be
>> optimized to take fewer cycles.
>>
>> Here's what I'm thinking....
>>
>> If we had a stand-alone multstep instruction, it would need four
>> operands: (1) an accumulator, (2) the mutiplicand, (3) one bit from
>> the multiplier to determine whether or not the multiplicand is added to
>> the accumulator, and (4) a loop counter from which to compute the
>> multiplicand left-shift and which bit to take from the multiplier.
>>
>> Now, I don't like the idea of adding extra state. What if we want to
>> add the ability to handle interrupts? But we can tinker with the idea:
>> Have one special instruction whose job is to load the counter and the
>> multiplier. The step instruction would have the accumulator (as a
>> source and the target) and the muliplicand. Each step would step the
>> counter, shift the multiplier, and add (or not, depending on the bit
>> from the multiplier) the shifted multiplicand to the accumulator. That
>> puts a shifter in line with an adder, though, so maybe we want to load
>> the multiplier and multiplicand in the first instruction (so they're
>> shifted 1 each cycle) and then specify the counter and accumulator in
>> the step? We'll have to work out the permutations.
>>
>> BTW, the only reason to do this is because without it, a multiply would
>> take at least 4 times longer due to the overhead of explicit shifts and
>> branches. Do we care?
>>
>
> I would imagine Lattice's tools still infer multipliers and I believe
> there is a fast(ish) multiplier implementation for the Lattice XP
> architecture using LUTs and carry chains (it's alluded to in the
> datasheet).
>
> A fixed one-bit shift after the adder should be so cheap as to make no
> odds. It's just a two-input MUX -- it'll fit in one LUT, possibly packed
> with something else.
>
> All of these implementation ideas seem to be very complex as compared to
> simply blocking the pipeline. This nanocontroller isn't going to be an
> OOO ILP-exploiting powerhouse anyhow. Why not just pipeline the
> multiplier enough to maintain the clock speed you're shooting for and let
> that determine the number of cycles that the multiply instruction blocks
> the ALU stage of the pipeline? This eliminates polluting the instruction
> set with odd instructions, is straightforward to implement, and won't
> slow down the instruction stream any more than any other multicycle
> multiplication would (plus code size would be negligibly smaller).
>
> A slightly more complex solution that could significantly pick up ILP
> would be to let the pipeline proceed and block only when the
> multiplication result is used (no avoiding that) or another
> multiplication is issued (assuming the multiplier is a simple serial
> implementation and not fully pipelined). Whoever's writing assembly (or
> the assembler itself) could try to schedule multiplications so that the
> blocking condition occurs rarely, if at all. Or, you could simply
> eliminate the hazard-detection logic altogether and presume that the
> coder or toolchain is dealing with it.
>
> It should still be feasible to implement an early-exit for narrow
> multiplications.
>
> Another approach might be to consider moving the nanocontroller to the
> Xilinx part to leverage the hard multipliers. What's the interface
> between the Spartan and the XP? Is it possible to leave the DMA on the
> XP and have the nanocontroller on the Spartan? Or to move both to the
> Spartan? How is the system partitioned between the two FPGAs?
>
> What kind of code will the nanocontroller be running? Who'll be writing
> code for it? Will it be written in a high-level language or strictly
> in assembly? What kind of throughput is required (or, in other words,
> where does the 10ns constraint come from)? Is there any way to figure
> out how useful a 32x32 multiply is before worrying about how to implement
> it?
>
> (Is all this documented somewhere? If so, please forgive my ignorance
> and pass on the URL.)
>
> I'm going to try to get a feel for the cost of multipliers on the Lattice
> part. They haven't sent me a Synplify license yet, though.
>
> Incidentally, sorry for coming out of nowhere with all this. I've been
> lurking on the list for a couple of months and just spotted somewhere I
> felt I could chime in helpfully. I'm a grad student at the University of
> Toronto interested in helping however I can. If I'm in breach of any
> etiquette, please let me know. Cheers,
>
> Mark. _______________________________________________ Open-graphics
> mailing list Open-graphics at duskglow.com
> http://lists.duskglow.com/mailman/listinfo/open-graphics List service
> provided by Duskglow Consulting, LLC (www.duskglow.com)
>
>
More information about the Open-graphics
mailing list