[Open-graphics] Sample VGA translation code, for nanocontroller
urkedal at nbi.dk
Sat Sep 8 07:45:18 EDT 2007
On 2007-09-06, Timothy Normand Miller wrote:
> On 9/6/07, Petter Urkedal <urkedal at nbi.dk> wrote:
> > > I'm just worried about the race conditions. In fact, I know they'll
> > > be a problem. We really don't want to give the nanocontroller a
> > > separate pipe to memory. So if we have pending reads, then data will
> > > come back out of order.
> > I thought we could solve that by encoding the thread number in the
> > request and reply. An attempt to read data which the thread does not
> > own would cause a context switch.
> I'm trying to imagine this, and I can't see a solution that wouldn't
> be more complicated than what I'm proposing. Could you go into some
> detail about assumptions you're making about the memory system? How
> is it to know if a read word was requested by one thread or the other?
> And does this mean that thread switching is completely automatic?
> What other conditions would cause a thread switch?
(Note that this is mostly of academic interest, since we'll try to avoid
context switching if possible. The nanocontroller is a simple design.)
Let's focus on the tricky part, the reads. My assumption is that the
nanocontroller can write requests to the sink-end of a FIFO and read
from the source-end of another FIFO. The threading extension involves
tagging each request with a thread number, which will be propagated back
to the other FIFO.
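
The tagging scheme can be sketched as a toy software model. This is only
an illustration under assumptions: the class name TaggedReadPort, the
dict-backed memory, and the one-request-per-step service model are all
hypothetical and not part of the nanocontroller design.

```python
from collections import deque


class TaggedReadPort:
    """Toy model of a request FIFO and a reply FIFO that carry thread tags."""

    def __init__(self, memory):
        self.memory = memory          # hypothetical backing store: address -> word
        self.request_fifo = deque()   # entries are (thread_id, address)
        self.reply_fifo = deque()     # entries are (thread_id, data)

    def request_read(self, thread_id, address):
        # The nanocontroller writes a tagged request to the sink end.
        self.request_fifo.append((thread_id, address))

    def service_one(self):
        # The memory side answers the oldest request and propagates its tag.
        if self.request_fifo:
            thread_id, address = self.request_fifo.popleft()
            self.reply_fifo.append((thread_id, self.memory[address]))

    def head_owner(self):
        # Thread that owns the next reply, or None if the source is empty.
        return self.reply_fifo[0][0] if self.reply_fifo else None


port = TaggedReadPort({0x10: 111, 0x20: 222})
port.request_read(1, 0x20)
port.request_read(0, 0x10)
port.service_one()
port.service_one()
```

After these calls the reply FIFO holds thread 1's word first, so a read
attempted by thread 0 would find data it does not own at the head.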
First, a simplified suboptimal solution without continuous
thread-switching: When a thread requests reading from the source, one
of two things can happen:
* The source is nonempty and the next data belongs to the current thread.
* The source is empty or the next data belongs to another thread.
In the former case the instruction is executed. In the latter case, the
read instruction will be propagated down the pipeline as a noop, the PC
is reset to point to the same read instruction, and a context switch
happens. Another case which could trigger a context switch, would be
However, if we can multi-task the most critical code, we should try to
I1. The same thread never runs in two consecutive cycles.
That way we
* get rid of the delay slot,
* can rip out half of the register forwarding, and
* can pipeline the adder over two stages without imposing
data-dependency constraints in the machine code.
How would we implement a smart scheduler which runs threads
alternately and is predictive enough about reads that it causes the
minimum number of noops to be inserted?
First, with only 2 contexts, I1 predetermines that the two threads must
be run alternately. That could easily mean that half of the cycles
will process inserted noops when both threads try to read. Therefore,
with 2 threads we should abandon I1, or use a smarter read-data FIFO which
splits the replies into two sources, one for each thread.
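
Splitting the replies into one source per thread could look roughly like
the following software sketch. The class SplitReplySource and its method
names are made up for illustration; real hardware would be a demultiplexer
in front of two small FIFOs rather than Python deques.

```python
from collections import deque


class SplitReplySource:
    """Toy model: route each tagged reply to its owner's private source."""

    def __init__(self, n_threads=2):
        self.sources = [deque() for _ in range(n_threads)]

    def push_reply(self, thread_id, data):
        # The tag selects which per-thread source receives the word.
        self.sources[thread_id].append(data)

    def try_read(self, thread_id):
        # A thread only ever sees its own replies; None means "would block".
        queue = self.sources[thread_id]
        return queue.popleft() if queue else None


split = SplitReplySource()
split.push_reply(0, 10)   # thread 0's reply arrives first
split.push_reply(1, 20)
first_for_thread_1 = split.try_read(1)
first_for_thread_0 = split.try_read(0)
second_for_thread_0 = split.try_read(0)
```

Here thread 1 gets its word even though thread 0's reply arrived earlier,
which is the situation that forces a noop with a single shared source.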
With 4 contexts the scheduling gets more interesting. One thing we can
decide right away is that
S1. If there is data in the read-reply FIFO and the owner was not
run on the previous cycle, then we run the owner.
Otherwise, we know that any thread which tries to read cannot proceed.
Further heuristics are difficult, since we don't know whether the next
instruction for a thread is a read before we fetch it. The easiest
solution is probably to make an attempt; if we fail, we set a
"pending-read" bit for that thread, and we don't re-schedule it before
S1 applies. A more predictive heuristic is to keep track of how many
pending read-replies each thread has, and run the thread with the fewest
pending replies, the argument being that it is the least likely to
block. It seems that with 4 contexts we could also benefit from routing
read-replies to separate sources for each thread.
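
As a rough sketch of the selection logic above, here is one way it could
be written down. Several things are assumptions for the sake of the
example: the function name pick_thread, combining the "pending-read" bit
with the fewest-pending-replies heuristic in one pass, honouring I1 in the
fallback path, and breaking ties by lowest thread number.

```python
from collections import deque


def pick_thread(reply_fifo, prev_thread, pending_read, pending_replies):
    """Return the thread to run next cycle, or None to insert a noop.

    reply_fifo      -- deque of (owner, data), oldest reply first
    prev_thread     -- thread run in the previous cycle, or None
    pending_read    -- per-thread bool, set after a failed read attempt
    pending_replies -- per-thread count of outstanding read replies
    """
    # S1: data is waiting and its owner did not run on the previous cycle.
    if reply_fifo and reply_fifo[0][0] != prev_thread:
        return reply_fifo[0][0]

    # Otherwise skip the previous thread (I1) and threads blocked on a read.
    candidates = [
        t for t in range(len(pending_read))
        if t != prev_thread and not pending_read[t]
    ]
    if not candidates:
        return None

    # Prefer the thread with the fewest outstanding replies (assumed tie-break:
    # lowest thread number).
    return min(candidates, key=lambda t: (pending_replies[t], t))


s1_choice = pick_thread(deque([(2, 0xAB)]), 1,
                        [False, False, True, False], [0, 0, 1, 0])
fallback_choice = pick_thread(deque([(2, 0xAB)]), 2,
                              [False, False, False, False], [2, 1, 1, 0])
stalled = pick_thread(deque(), 0, [False, True, True, True], [0, 1, 1, 1])
```

The caller would be expected to clear the owner's pending-read bit when S1
selects it; the function itself does not modify any state.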
If we have more than one async source to deal with, scheduling decisions
become more complex. However, in such a case we could probably exploit
the independence of the resources, so that only selected threads have
access to each resource.
> > Given these conclusions, we are close to the desired nanocontroller:
> > First, let's `ifdef out the multiplier logic. Then, maybe we turn IO
> > access into registers. If not, a minor practical-aesthetic point is
> > that I'd suggest negative addresses for IO-ports because it lets us
> > expand the scratch memory without changing the IO base address. Then,
> > we can try to synthesise it again.
> Minor nit-pick. I'd suggest choosing a high address bit (but
> something within the range of our immediates) to specify the bottom of
> I/O port space. 16384, I guess, from 15-bit unsigned immediates. The
> reason I don't like negative addresses is that it introduces
> additional math that I don't want to deal with, even if that doesn't
> translate into any real hardware. Note that if the immediate gets
> sign extended, it doesn't make any difference. We just ignore the
> upper bits anyhow, and we treat -16384 as the offset for I/O ports.
> So we use bit 15 as the "is it scratch or I/O" flag.
In the current version, at least, immediates are always sign extended, and
I don't see a reason to change that. So, the two schemes are equivalent
seen from any immediate-encodable address. However, if we were to
extend scratch memory beyond the range of immediates, then the I/O area
will shadow the range just above the highest positive immediate address,
whereas if we use bit 31 as the "scratch or I/O"-flag then I/O space will
always be out of the way.
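
To make the comparison concrete, here is a small sketch of the two decode
rules. It assumes 15-bit immediates (so the top immediate bit has the
value 16384) and 32-bit addresses; the function names are invented for
this example.

```python
IMM_BITS = 15    # assumed immediate width, top bit has value 16384
WORD_BITS = 32   # assumed address/register width
WORD_MASK = (1 << WORD_BITS) - 1


def sign_extend(imm):
    """Sign-extend an IMM_BITS-wide immediate to a WORD_BITS-wide address."""
    imm &= (1 << IMM_BITS) - 1
    if imm & (1 << (IMM_BITS - 1)):
        imm -= 1 << IMM_BITS
    return imm & WORD_MASK


def is_io_by_top_immediate_bit(address):
    # Scheme 1: the bit with value 16384 selects I/O, upper bits ignored.
    return bool(address & (1 << (IMM_BITS - 1)))


def is_io_by_sign_bit(address):
    # Scheme 2: bit 31 (negative address) selects I/O.
    return bool(address & (1 << (WORD_BITS - 1)))


# Every immediate-encodable address decodes the same way under both schemes.
schemes_agree = all(
    is_io_by_top_immediate_bit(sign_extend(i)) == is_io_by_sign_bit(sign_extend(i))
    for i in range(1 << IMM_BITS)
)

# An address just past the largest positive immediate, reached via a register.
beyond_immediates = 16384
```

For the address 16384 the first rule reports I/O while the second reports
scratch, which is the shadowing effect described above.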