
Re: The Tera MTA

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sat, 17 Jul 2021 18:27:58 -0500
Organization: A noiseless patient Spider
Lines: 202
Message-ID: <scvp1v$dgh$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
<scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 17 Jul 2021 23:27:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="79d1ca3378643581b91c4335e953a407";
logging-data="13841"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1909YfSc7vudqlEwvtkcsXb"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:euFCsxwvds7sa+neVKIgtHMUNeA=
In-Reply-To: <1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
Content-Language: en-US

On 7/17/2021 2:39 PM, MitchAlsup wrote:
> On Saturday, July 17, 2021 at 11:46:52 AM UTC-5, BGB wrote:
>> On 7/16/2021 10:40 AM, MitchAlsup wrote:
>>> On Friday, July 16, 2021 at 10:00:17 AM UTC-5, Quadibloc wrote:
>>>> The Cray XMT (not to be confused with the X-MP) recently came to my
>>>> attention again when I happened to stumble upon an eBay sale for a
>>>> Threadstorm 4.0 chip.
>>>>
>>>> I remembered that at one point, AMD allowed other companies to make
>>>> chips that would fit into the same sockets as an AMD processor; in
>>>> connection with that, I remember hearing of an FPGA accelerator chip that
>>>> did this.
>>>>
>>>> But I didn't know there was also the Threadstorm 4.0 from Cray, which
>>>> could also fit in an Opteron socket.
>>>>
>>>> I knew that Sun and/or Oracle made versions of the SPARC chip that took
>>>> simultaneous multi-threading (SMT) beyond what Intel did with
>>>> Hyper-Threading - instead of two simultaneous threads, their chips could
>>>> offer up to _eight_.
>>>>
>>>> But the Threadstorm chip from Cray (and it fit into an AMD socket... and
>>>> currently AMD has a product it calls the Threadripper...) took SMT to a
>>>> rather higher level, with 128 simultaneous threads.
>>> <
>>> Lookup Burton Smith
>>> <
>>> Tera was another Denelcor HEP-like processor architecture.
>>>>
>>>> It seems, from the history, that the Tera MTA couldn't have been a
>>>> complete failure.
>>>>
>>>> The Tera MTA later became known as the Cray MTA. When I heard that
>>>> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
>>>> Cray - from Silicon Graphics. And then decided that the name "Cray" had
>>>> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
>>>> that name for the combined company.
>>>>
>>>> The same architecture was used in later versions of the machine; the
>>>> chip for sale on eBay was, after all, a Threadstorm 4.0, and the original
>>>> Tera MTA didn't use a single-chip processor.
>>>>
>>>> The rationale behind using so many threads on a processor was so that
>>>> the processor could be doing something useful while waiting, and
>>>> waiting, and waiting for requested data to arrive from main memory.
>>>> So, if one had enough threads, _memory latency_ would not be an
>>>> issue.
>>> <
>>> This same approach is used in GPUs, where one switches threads every
>>> instruction! I don't remember a lot about Tera, but at Denelcor the processor
>>> had a number of threads and a number of calculation units and memory.
>>> A thread was not eligible to run a second instruction until all aspects of
>>> the first instruction had been performed. Each word (64 bits) of memory
>>> could be {Read, Written, Read-if-Full, Written-if-Empty}. So a memory ref
>>> could be sent out and literally not return for thousands of clocks. The
>>> registers had the same kind of stuff but we didn't use that--just the memory.
>>>>
>>>> Apparently, this processor had eight data registers, and eight address
>>>> modification registers. A bit like a 68000, or even my original
>>>> Concertina architecture. But I haven't been able to find any information
>>>> about its instruction set.
>>>>
>>>> Of course, having 128 threads on a single-core chip, while it does,
>>>> without diminishing throughput, slow each individual thread down to
>>>> the pace of memory, might seem to create bandwidth issues. However,
>>>> that didn't stop Intel from giving the world Xeon Phi.
>>>>
>>>> In any case, was this a valid approach, or is there a good reason why
>>>> it is no longer viable?
>>> <
>>> It migrated into GPUs, where a single instruction may cause 32 (or 64) calculations
>>> or memory references to 64 different cache lines, in 64 different pages,.....
>>> The average memory reference takes something like 400 cycles, so the GPU
>>> better have other things to do while it waits. So, switching to a new set of
>>> threads every cycle dramatically improves throughput even if dragging out the
>>> latency on a per thread basis.
>> Hmm...
>>
>> Reminds me of an observation I made while fiddling around with stuff for
>> my own ISA:
> <
>> If the CPU core ran at half the effective clock-speed, but only had to
>> spend half as many cycles on cache misses, the relative impact on
>> performance would be "surprisingly minor" for things like Doom (the
>> clock speed drops but the IPC increases enough to compensate).

As noted, this seems to be because Doom is mostly bound by memory
accesses, spending the majority of its clock cycles waiting on L2 misses.

Poking at it some more:

If memory stays at the same speed, Doom is still fairly playable at
around 12MHz or 16MHz, though at these speeds it transitions into being
mostly instruction-bound.

Its behavior and properties seem to change slightly, and the "low/high
detail" option becomes "actually relevant" (has a more obvious effect on
framerate).

A similar property seems to hold for Quake, which only seems to suffer a
modest slowdown (relative to 50MHz) when running at 12MHz (still low
single-digit framerates).

GLQuake performance tanks pretty badly at 12MHz though (averaging 0 fps;
this scenario somewhat favors software-rendered Quake).
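
To put rough numbers on the memory-bound vs instruction-bound crossover,
here is a minimal cost model in C; every parameter (instruction count,
CPI, miss rate, miss latency) is a made-up illustrative value, not a
measurement of any actual core:

/* Toy model: run time = core-bound time + memory-stall time, with the
   miss service time fixed in nanoseconds regardless of core clock.
   All parameters are illustrative guesses, not measured values. */
#include <stdio.h>

int main(void) {
    double insts = 100e6;         /* instructions per unit of work (assumed) */
    double cpi = 1.5;             /* CPI ignoring L2 misses (assumed)        */
    double miss_per_inst = 0.05;  /* L2 misses per instruction (assumed)    */
    double miss_ns = 600.0;       /* miss service time in ns (assumed)      */
    double mhz[] = { 50, 25, 12 };

    for (int i = 0; i < 3; i++) {
        double t_core = insts * cpi / (mhz[i] * 1e6);
        double t_mem  = insts * miss_per_inst * miss_ns * 1e-9;
        printf("%2.0f MHz: core %5.2fs + mem %5.2fs = %5.2fs\n",
               mhz[i], t_core, t_mem, t_core + t_mem);
    }
    return 0;
}

With these numbers, the memory term dominates at 50MHz (so halving the
clock costs much less than 2x), while at 12MHz the core-bound term takes
over, matching the instruction-bound behavior described above.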

> <
> I have noticed several occurrences of similar merit::
> <
> As the length of the pipeline is decreased the pressure on the predictors
> {branch, Jump, Call/Return} is reduced because the recovery multiplier is
> smaller. Conversely, a 15-stage pipeline needs a predictor with no more
> than 1/3rd the mispredicts of a 5-stage pipeline to achieve similar
> misprediction recovery cycles.

OK. Branch misses don't appear to be too major a problem with an
8-stage pipeline. The predictors I have mostly seem to work OK.
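
For concreteness, the 1/3rd figure falls out if the recovery cost is
taken as roughly proportional to pipeline depth; a throwaway check (the
mispredict rates here are illustrative, not measured):

/* Mispredict recovery cycles per 1000 instructions, assuming the
   refill penalty is roughly proportional to pipeline depth.
   The rates are illustrative, not measured. */
#include <stdio.h>

int main(void) {
    double rate5 = 0.030, rate15 = 0.010;  /* mispredicts/instruction */
    printf("5-stage:  %.0f cycles/1000 insts\n", 1000 * rate5 * 5);
    printf("15-stage: %.0f cycles/1000 insts\n", 1000 * rate15 * 15);
    return 0;
}

Both come out to 150 cycles per 1000 instructions: the 15-stage pipeline
breaks even only with a third of the mispredict rate.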

> <
> As the execution width becomes wider, one needs to service more AGENs
> per cycle:: 1-wide needs 1-AGEN, 4-wide needs 2 AGENs, 6-wide needs
> 3 AGENs--all these AGENs end up colliding at the cache ports and if you
> can't service this many on a per cycle basis, you might want to reconsider
> the width you are trying to achieve.

OK, in my 3-wide core, there is only 1 memory port.

So, I guess from the pattern above, the idea is that, as the number of
lanes increases, one needs roughly n/2 memory ports to be worthwhile (as
opposed to, say, doing a 6-wide core with 1 memory port, and the other 5
lanes only able to do ALU ops).
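
One way to sanity-check the n/2 figure: if some fraction of the issued
mix is loads/stores, the width a given number of data-cache ports can
sustain is about ports/fraction. A quick sketch (the ~0.35 load/store
fraction is a generic rule of thumb, not a measured figure):

/* Issue width sustainable per data-cache port, assuming ~35% of the
   instruction mix is loads/stores (a generic rule of thumb). */
#include <stdio.h>

int main(void) {
    double mem_frac = 0.35;   /* assumed load/store fraction */
    for (int ports = 1; ports <= 3; ports++)
        printf("%d port(s): ~%.1f-wide sustainable\n",
               ports, ports / mem_frac);
    return 0;
}

Which lands near the figures above: roughly 1 port for a 3-wide core,
2-3 ports for a 6-wide one.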

> <
> As the AGENs per cycle goes up, the TLB must end up servicing all of them
> on a per cycle basis, too.
> <
> As width goes up, fetching wide does not improve as much as expected
> because one ends up taking a branch every 8-9 instructions. Somewhere
> around 6-instruction ISSUEs per cycle is where one needs to issue
> instructions from a <predicted> taken basic block along with the
> instructions that take you to the predicted basic block.

This can be partly sidestepped by putting the TLB on the L1<->L2
interface, so that the TLB only gets invoked on L1 misses. The tradeoff
is that L1 cache lines now need to track both virtual and physical
addresses, and double-mapping may introduce awkward cache-consistency
issues.

Putting the TLB inline with L1 access would likely mean needing more
cycles to access the L1 cache.

Then again, a simpler workaround to the double-map cache-consistency
issue would be using non-hashed direct mapping, which basically
eliminates the issue in cases where the L1 cache is no larger than the
page size (at the cost of a higher L1 miss rate).
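
A minimal sketch of why that works, assuming a direct-mapped L1 no
larger than the page size (the sizes here are examples only, not any
particular core's configuration):

/* With a direct-mapped L1 whose size <= page size, the set index comes
   entirely from page-offset bits, which are untranslated, so the
   virtual and physical index always agree and aliases can't land in
   two different lines. Sizes here are examples only. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u
#define LINE_SIZE 32u
#define L1_SIZE   4096u                  /* direct-mapped, == page size */
#define NUM_LINES (L1_SIZE / LINE_SIZE)

static uint32_t l1_index(uint32_t addr) {
    return (addr / LINE_SIZE) % NUM_LINES;   /* uses only bits 5..11 */
}

int main(void) {
    /* Two virtual aliases of one physical page share their low 12
       bits, so they always land on the same L1 line. */
    uint32_t va1 = 0x10001234u, va2 = 0x7FFF1234u;
    printf("index(va1)=%u, index(va2)=%u\n", l1_index(va1), l1_index(va2));
    return 0;
}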

>>
>>
>> It seems like SMT could be done in theory, it would just effectively
>> double the size of the register file (each hardware thread effectively
>> banking all the registers).
> <
> The register file size is multiplied by the degree of SMT-ness.
>

Yeah.

This seems like a possible issue with the idea.

Another possible issue is that a core running such an SMT configuration
could suffer from a higher cache miss rate than a core running a single
thread.

It could have a lower resource cost than dual-core, but to be similarly
effective would likely mean roughly doubling the sizes of the L1 caches.
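
Rough support for the "double the L1" guess: with a square-root-style
miss curve (miss rate ~ 1/sqrt(capacity), a common approximation rather
than a law), two threads splitting one L1 each see half the capacity:

/* Rule-of-thumb miss-rate scaling: miss ~ C^-0.5 (an approximation,
   not a measurement). Two threads sharing an L1 each effectively get
   half the capacity; doubling the L1 restores it. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double base = 0.050;   /* assumed single-thread L1 miss rate */
    printf("1 thread,  1x L1: %.3f\n", base);
    printf("2 threads, 1x L1: %.3f each\n", base * sqrt(2.0));
    printf("2 threads, 2x L1: %.3f each\n", base);
    return 0;
}

That is, roughly a 1.4x higher per-thread miss rate unless the cache is
scaled up along with the thread count.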

A naive approach (single pipeline) would have the big drawback (besides
the lower effective clock speed) that both threads stall whenever either
thread has an L1 miss. Avoiding this seems like it would require two
separate pipelines, and a bit of additional multiplexing capability.

In this case, it could likely make sense to only do the half-speed
operation when both threads are running instructions; if one thread is
stalled on an L1 miss, the non-stalled thread switches to full-speed
operation.
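
A minimal sketch of that selection policy (hypothetical; just the
thread-pick logic, none of the pipeline plumbing): round-robin while
both threads are runnable, with the surviving thread taking every slot
when its partner stalls:

/* Toy 2-thread fine-grained scheduling policy: alternate threads when
   both are runnable; when one stalls on an L1 miss the other gets
   every cycle (full-speed operation). Hypothetical sketch only. */
#include <stdbool.h>
#include <stdio.h>

static int pick_thread(const bool stalled[2], int last) {
    int other = last ^ 1;
    if (!stalled[other]) return other;   /* normal case: alternate      */
    if (!stalled[last])  return last;    /* partner stalled: full speed */
    return -1;                           /* both stalled: bubble        */
}

int main(void) {
    bool stalled[2] = { false, false };
    int last = 1;
    for (int cyc = 0; cyc < 8; cyc++) {
        stalled[0] = (cyc >= 3 && cyc < 6);   /* thread 0 misses here */
        int t = pick_thread(stalled, last);
        printf("cycle %d: %s\n", cyc,
               t < 0 ? "bubble" : (t ? "thread 1" : "thread 0"));
        if (t >= 0) last = t;
    }
    return 0;
}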

Another question would be whether the CPU tries to behave as if it has
two independent cores, or whether this is left up to the OS scheduler:
using SMT effectively would involve the OS scheduler setting up both
threads at the same time, or (at its discretion) scheduling only a
single thread (which then runs at a higher effective clock speed).

....
