devel / comp.arch / The Tera MTA

Subject  Author
* The Tera MTA  Quadibloc
+* Re: The Tera MTA  MitchAlsup
|`* Re: The Tera MTA  BGB
| `* Re: The Tera MTA  MitchAlsup
|  +* Re: The Tera MTA  BGB
|  |+* Re: The Tera MTA  MitchAlsup
|  ||+* Re: The Tera MTA  Quadibloc
|  |||`- Re: The Tera MTA  Anton Ertl
|  ||`- Re: The Tera MTA  BGB
|  |`* Re: The Tera MTA  Marcus
|  | `* Re: The Tera MTA  Quadibloc
|  |  +- Re: The Tera MTA  Quadibloc
|  |  +* Re: The Tera MTA  Stephen Fuld
|  |  |+- Re: The Tera MTA  BGB
|  |  |`* Re: The Tera MTA  Quadibloc
|  |  | +* Re: The Tera MTA  MitchAlsup
|  |  | |`* Re: The Tera MTA  Quadibloc
|  |  | | +* Re: The Tera MTA  Terje Mathisen
|  |  | | |`- Re: The Tera MTA  Ivan Godard
|  |  | | +- Re: The Tera MTA  Stephen Fuld
|  |  | | `- Re: The Tera MTA  Bill Findlay
|  |  | `* Re: The Tera MTA  Marcus
|  |  |  +- Re: The Tera MTA  Chris M. Thomasson
|  |  |  `* Re: The Tera MTA  BGB
|  |  |   `- Re: The Tera MTA  Marcus
|  |  `* Re: The Tera MTA  MitchAlsup
|  |   `* Re: The Tera MTA  Quadibloc
|  |    `- Re: The Tera MTA  MitchAlsup
|  `* Re: The Tera MTA  Paul A. Clayton
|   `* Re: The Tera MTA  Quadibloc
|    `* Re: The Tera MTA  Paul A. Clayton
|     +* Re: The Tera MTA  Stephen Fuld
|     |+* Re: The Tera MTA  Paul A. Clayton
|     ||`- Re: The Tera MTA  Stefan Monnier
|     |`- Re: The Tera MTA  Quadibloc
|     `- Re: The Tera MTA  Paul A. Clayton
+* Re: The Tera MTA  Tom Gardner
|+* Re: The Tera MTA  Quadibloc
||`* Re: The Tera MTA  Quadibloc
|| +* Re: The Tera MTA  Quadibloc
|| |`* Re: The Tera MTA  MitchAlsup
|| | +- Re: The Tera MTA  Quadibloc
|| | `* Re: The Tera MTA  Paul A. Clayton
|| |  +* Re: The Tera MTA  Quadibloc
|| |  |`- Re: The Tera MTA  Ivan Godard
|| |  `* Re: The Tera MTA  Paul A. Clayton
|| |   `* Re: The Tera MTA  Anton Ertl
|| |    `* Re: The Tera MTA  Michael S
|| |     `* Re: The Tera MTA  Anton Ertl
|| |      `- Re: The Tera MTA  Michael S
|| `- Re: The Tera MTA  Anton Ertl
|`* Re: The Tera MTA  Anton Ertl
| +- Re: The Tera MTA  John Dallman
| `- Re: The Tera MTA  Quadibloc
`- Re: The Tera MTA  Quadibloc

The Tera MTA

<d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18838&group=comp.arch#18838

X-Received: by 2002:a05:622a:1c4:: with SMTP id t4mr9575688qtw.140.1626447616020;
Fri, 16 Jul 2021 08:00:16 -0700 (PDT)
X-Received: by 2002:a05:6808:d54:: with SMTP id w20mr12498793oik.175.1626447615679;
Fri, 16 Jul 2021 08:00:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 16 Jul 2021 08:00:15 -0700 (PDT)
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:2556:f07:e0d:c581;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:2556:f07:e0d:c581
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
Subject: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Fri, 16 Jul 2021 15:00:15 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Fri, 16 Jul 2021 15:00 UTC

The Cray XMT (not to be confused with the X-MP) recently came to my
attention again when I happened to stumble upon an eBay sale for a
Threadcrusher 4.0 chip.

I remembered that at one point, AMD allowed other companies to make
chips that would fit into the same sockets as an AMD processor; in
connection with that, I remember hearing of an FPGA accelerator chip that
did this.

But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
fit in an Opteron socket also.

I knew that Sun and/or Oracle made versions of the SPARC chip that took
simultaneous multi-threading (SMT) beyond what Intel did with
Hyper-Threading - instead of two simultaneous threads, their chips could
offer up to _eight_.

But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
currently AMD has a product it calls the Threadripper...) took SMT to a
rather higher level, with 128 simultaneous threads.

It seems, from the history, that the Tera MTA couldn't have been a
complete failure.

The Tera MTA later became known as the Cray MTA. When I heard that
I thought, oh, Cray bought Tera. But no, this happened after Tera bought
Cray - from Silicon Graphics. And then decided that the name "Cray" had
a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
that name for the combined company.

The same architecture was used in later versions of the machine; the
chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
Tera MTA didn't use a single-chip processor.

The rationale behind using so many threads on a processor was so that
the processor could be doing something useful while waiting, and
waiting, and waiting for requested data to arrive from main memory.
So, if one had enough threads, _memory latency_ would not be an
issue.

Apparently, this processor had eight data registers, and eight address
modification registers. A bit like a 68000, or even my original
Concertina architecture. But I haven't been able to find any information
about its instruction set.

Of course, having 128 threads on a single-core chip, while it does,
without diminishing throughput, slow each individual thread down to
the pace of memory, might seem to create bandwidth issues. However,
that didn't stop Intel from giving the world Xeon Phi.

In any case, was this a valid approach, or is there a good reason why
it is no longer viable?

John Savard

Re: The Tera MTA

<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18841&group=comp.arch#18841

X-Received: by 2002:a05:620a:1465:: with SMTP id j5mr2499704qkl.63.1626450057424;
Fri, 16 Jul 2021 08:40:57 -0700 (PDT)
X-Received: by 2002:a9d:4e0a:: with SMTP id p10mr8472545otf.329.1626450057188;
Fri, 16 Jul 2021 08:40:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 16 Jul 2021 08:40:56 -0700 (PDT)
In-Reply-To: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:e016:c858:874a:cac3;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:e016:c858:874a:cac3
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
Subject: Re: The Tera MTA
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 16 Jul 2021 15:40:57 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 74
 by: MitchAlsup - Fri, 16 Jul 2021 15:40 UTC

On Friday, July 16, 2021 at 10:00:17 AM UTC-5, Quadibloc wrote:
> The Cray XMT (not to be confused with the X-MP) recently came to my
> attention again when I happened to stumble upon an eBay sale for a
> Threadcrusher 4.0 chip.
>
> I remembered that at one point, AMD allowed other companies to make
> chips that would fit into the same sockets as an AMD processor; in
> connection with that, I remember hearing of an FPGA accelerator chip that
> did this.
>
> But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
> fit in an Opteron socket also.
>
> I knew that Sun and/or Oracle made versions of the SPARC chip that took
> simultaneous multi-threading (SMT) beyond what Intel did with
> Hyper-Threading - instead of two simultaneous threads, their chips could
> offer up to _eight_.
>
> But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
> currently AMD has a product it calls the Threadripper...) took SMT to a
> rather higher level, with 128 simultaneous threads.
<
Lookup Burton Smith
<
Tera was another Denelore HEP-like processor architecture.
>
> It seems, from the history, that the Tera MTA couldn't have been a
> complete failure.
>
> The Tera MTA later became known as the Cray MTA. When I heard that
> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
> Cray - from Silicon Graphics. And then decided that the name "Cray" had
> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
> that name for the combined company.
>
> The same architecture was used in later versions of the machine; the
> chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
> Tera MTA didn't use a single-chip processor.
>
> The rationale behind using so many threads on a processor was so that
> the processor could be doing something useful while waiting, and
> waiting, and waiting for requested data to arrive from main memory.
> So, if one had enough threads, _memory latency_ would not be an
> issue.
<
This same approach is used in GPUs, where one switches threads every
instruction! I don't remember a lot about Tera, but at Denelcore the processor
had a number of threads and a number of calculation units and memory.
A thread was not eligible to run a second instruction until all aspects of
the first instruction had been performed. Each word (64-bits) of memory
could be {Read, Written, Read-if-Full, Written-if-empty} So a memory ref
could be sent out and literally not return for thousands of clocks. The
registers had the same kind of stuff but we didn't use that--just the memory.
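
For anyone who hasn't seen full/empty bits before, here is a minimal sketch
in C of what those access modes amount to (the struct and function names are
invented for illustration; a real HEP/MTA parks the waiting thread instead of
returning a failure code as this toy does):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A 64-bit memory word tagged with a full/empty bit, in the style of the
   memory described above.  Names are made up for illustration.           */
typedef struct {
    uint64_t value;
    bool     full;          /* the full/empty tag bit */
} tagged_word;

/* read-if-full: only succeeds when the word is full, and leaves it empty.
   Returning false stands in for "the thread stays parked until it fills". */
static bool read_if_full(tagged_word *w, uint64_t *out)
{
    if (!w->full)
        return false;
    *out    = w->value;
    w->full = false;
    return true;
}

/* write-if-empty: only succeeds when the word is empty, and fills it. */
static bool write_if_empty(tagged_word *w, uint64_t v)
{
    if (w->full)
        return false;
    w->value = v;
    w->full  = true;
    return true;
}

int main(void)
{
    tagged_word w = { 0, false };
    uint64_t v = 0;
    bool ok;

    ok = write_if_empty(&w, 42);          /* fills the empty word       */
    printf("first write:  %s\n", ok ? "ok" : "would wait");
    ok = write_if_empty(&w, 43);          /* second producer must wait  */
    printf("second write: %s\n", ok ? "ok" : "would wait");
    ok = read_if_full(&w, &v);            /* consumer drains the word   */
    printf("first read:   %s, v=%llu\n", ok ? "ok" : "would wait",
           (unsigned long long)v);
    ok = read_if_full(&w, &v);            /* word is empty again        */
    printf("second read:  %s\n", ok ? "ok" : "would wait");
    return 0;
}
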
>
> Apparently, this processor had eight data registers, and eight address
> modification registers. A bit like a 68000, or even my original
> Concertina architecture. But I haven't been able to find any information
> about its instruction set.
>
> Of course, having 128 threads on a single-core chip, while it does,
> without diminishing throughput, slow each individual thread down to
> the pace of memory, might seem to create bandwidth issues. However,
> that didn't stop Intel from giving the world Xeon Phi.
>
> In any case, was this a valid approach, or is there a good reason why
> it is no longer viable?
<
It migrated into GPUs, where a single instruction may cause 32 (or 64) calculations
or memory references to 64 different cache lines, in 64 different pages,.....
The average memory reference takes something like 400 cycles, so the GPU
better have other things to do while it waits. So, switching to a new set of
threads every cycle dramatically improves throughput even if dragging out the
latency on a per thread basis.
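
As a back-of-the-envelope illustration of why so many resident threads are
needed (the numbers below are assumptions, not measurements from any GPU):
with a ~400-cycle reference and only a handful of instructions per thread
between references, the thread count needed to keep the machine busy is
roughly the latency divided by the per-thread work between misses.

#include <stdio.h>

/* Back-of-the-envelope latency hiding, in the spirit of the HEP/MTA/GPU
   argument above.  All numbers are assumptions.                          */
int main(void)
{
    const double latency = 400.0;  /* cycles until a memory reference returns */
    const double work    = 10.0;   /* instructions a thread can issue before
                                      it must wait on its next reference      */

    /* A single thread issues 'work' instructions out of every
       (work + latency) cycles:                                           */
    double util_one = work / (work + latency);

    /* To have an instruction ready every cycle, enough threads must be
       resident to cover the dead time:                                   */
    double threads_needed = (work + latency) / work;

    printf("single-thread utilization : %.1f%%\n", 100.0 * util_one);
    printf("threads needed to hide it : %.0f\n", threads_needed);
    return 0;
}
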
>
> John Savard

Re: The Tera MTA

<scuq5c$h7h$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=18868&group=comp.arch#18868

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: spamj...@blueyonder.co.uk (Tom Gardner)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sat, 17 Jul 2021 15:40:36 +0100
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <scuq5c$h7h$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 17 Jul 2021 14:40:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e7af8fa42ddfece1c449e989d81bf92b";
logging-data="17649"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18cvHNbcyPt+ptie/hEc7DP"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
Firefox/52.0 SeaMonkey/2.49.4
Cancel-Lock: sha1:XIXOOKwyVcxe5HaPz4HgQIKAZEc=
In-Reply-To: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
 by: Tom Gardner - Sat, 17 Jul 2021 14:40 UTC

On 16/07/21 16:00, Quadibloc wrote:
> The Cray XMT (not to be confused with the X-MP) recently came to my
> attention again when I happened to stumble upon an eBay sale for a
> Threadcrusher 4.0 chip.
>
> I remembered that at one point, AMD allowed other companies to make
> chips that would fit into the same sockets as an AMD processor; in
> connection with that, I remember hearing of an FPGA accelerator chip that
> did this.
>
> But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
> fit in an Opteron socket also.
>
> I knew that Sun and/or Oracle made versions of the SPARC chip that took
> simultaneous multi-threading (SMT) beyond what Intel did with
> Hyper-Threading - instead of two simultaneous threads, their chips could
> offer up to _eight_.
>
> But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
> currently AMD has a product it calls the Threadripper...) took SMT to a
> rather higher level, with 128 simultaneous threads.
>
> It seems, from the history, that the Tera MTA couldn't have been a
> complete failure.
>
> The Tera MTA later became known as the Cray MTA. When I heard that
> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
> Cray - from Silicon Graphics. And then decided that the name "Cray" had
> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
> that name for the combined company.
>
> The same architecture was used in later versions of the machine; the
> chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
> Tera MTA didn't use a single-chip processor.
>
> The rationale behind using so many threads on a processor was so that
> the processor could be doing something useful while waiting, and
> waiting, and waiting for requested data to arrive from main memory.
> So, if one had enough threads, _memory latency_ would not be an
> issue.
>
> Apparently, this processor had eight data registers, and eight address
> modification registers. A bit like a 68000, or even my original
> Concertina architecture. But I haven't been able to find any information
> about its instruction set.
>
> Of course, having 128 threads on a single-core chip, while it does,
> without diminishing throughput, slow each individual thread down to
> the pace of memory, might seem to create bandwidth issues. However,
> that didn't stop Intel from giving the world Xeon Phi.

See Sun's Niagara processors, and the XMOS xCORE processors.
With the latter, understanding how it works together with xC
gives a valuable, different perspective.

I liked the Niagara processors for embarrassingly parallel
applications.

I like the xCORE processors for hard realtime embedded
applications.

> In any case, was this a valid approach, or is there a good reason why
> it is no longer viable?

I don't know whether a one word answer is correct or
sufficient: "Oracle".

Re: The Tera MTA

<scv1hp$gh9$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=18870&group=comp.arch#18870

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sat, 17 Jul 2021 11:46:45 -0500
Organization: A noiseless patient Spider
Lines: 113
Message-ID: <scv1hp$gh9$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 17 Jul 2021 16:46:49 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="54c432bccd893a13c6f850c42fa94b65";
logging-data="16937"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/wye1lyAjjzw+nioJc9o5y"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:+oHieYhfImvTcJsg24FaJYPuh3I=
In-Reply-To: <127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 17 Jul 2021 16:46 UTC

On 7/16/2021 10:40 AM, MitchAlsup wrote:
> On Friday, July 16, 2021 at 10:00:17 AM UTC-5, Quadibloc wrote:
>> The Cray XMT (not to be confused with the X-MP) recently came to my
>> attention again when I happened to stumble upon an eBay sale for a
>> Threadcrusher 4.0 chip.
>>
>> I remembered that at one point, AMD allowed other companies to make
>> chips that would fit into the same sockets as an AMD processor; in
>> connection with that, I remember hearing of an FPGA accelerator chip that
>> did this.
>>
>> But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
>> fit in an Opteron socket also.
>>
>> I knew that Sun and/or Oracle made versions of the SPARC chip that took
>> simultaneous multi-threading (SMT) beyond what Intel did with
>> Hyper-Threading - instead of two simultaneous threads, their chips could
>> offer up to _eight_.
>>
>> But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
>> currently AMD has a product it calls the Threadripper...) took SMT to a
>> rather higher level, with 128 simultaneous threads.
> <
> Lookup Burton Smith
> <
> Tera was another Denelore HEP-like processor architecture.
>>
>> It seems, from the history, that the Tera MTA couldn't have been a
>> complete failure.
>>
>> The Tera MTA later became known as the Cray MTA. When I heard that
>> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
>> Cray - from Silicon Graphics. And then decided that the name "Cray" had
>> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
>> that name for the combined company.
>>
>> The same architecture was used in later versions of the machine; the
>> chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
>> Tera MTA didn't use a single-chip processor.
>>
>> The rationale behind using so many threads on a processor was so that
>> the processor could be doing something useful while waiting, and
>> waiting, and waiting for requested data to arrive from main memory.
>> So, if one had enough threads, _memory latency_ would not be an
>> issue.
> <
> This same approach is used in GPUs, where one switches threads every
> instruction! I don't remember a lot about Tera, but at Denelcore the processor
> had a number of threads and a number of calculation units and memory.
> A thread was not eligible to run a second instruction until all aspects of
> the first instruction had been performed. Each word (64-bits) of memory
> could be {Read, Written, Read-if-Full, Written-if-empty} So a memory ref
> could be sent out and literally not return for thousands of clocks. The
> registers had the same kind of stuff but we didn't use that--just the memory.
>>
>> Apparently, this processor had eight data registers, and eight address
>> modification registers. A bit like a 68000, or even my original
>> Concertina architecture. But I haven't been able to find any information
>> about its instruction set.
>>
>> Of course, having 128 threads on a single-core chip, while it does,
>> without diminishing throughput, slow each individual thread down to
>> the pace of memory, might seem to create bandwidth issues. However,
>> that didn't stop Intel from giving the world Xeon Phi.
>>
>> In any case, was this a valid approach, or is there a good reason why
>> it is no longer viable?
> <
> It migrated into GPUs, where a single instruction may cause 32 (or 64) calculations
> or memory references to 64 different cache lines, in 64 different pages,.....
> The average memory reference takes something like 400 cycles, so the GPU
> better have other things to do while it waits. So, switching to a new set of
> threads every cycle dramatically improves throughput even if dragging out the
> latency on a per thread basis.

Hmm...

Reminds me of an observation I made while fiddling around with stuff for
my own ISA:
If the CPU core ran at half the effective clock-speed, but only had to
spend half as many cycles on cache misses, the relative impact on
performance would be "surprisingly minor" for things like Doom (the
clock speed drops but the IPC increases enough to compensate).
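
A toy model of that trade-off, with made-up numbers (the key assumption is
that the miss penalty is fixed in nanoseconds, since DRAM does not track the
core clock, so halving the core clock halves the penalty counted in cycles):

#include <stdio.h>

/* Execution time = core cycles / frequency + fixed wall-clock miss time.
   All numbers below are invented for illustration.                       */
static double runtime_s(double freq_hz, double core_cycles, double miss_ns_total)
{
    return core_cycles / freq_hz + miss_ns_total * 1e-9;
}

int main(void)
{
    const double core_cycles   = 40e6;   /* cycles actually spent computing */
    const double miss_ns_total = 1.6e9;  /* time spent waiting on memory, ns */

    double t50 = runtime_s(50e6, core_cycles, miss_ns_total);
    double t25 = runtime_s(25e6, core_cycles, miss_ns_total);

    printf("50 MHz: %.2f s\n", t50);
    printf("25 MHz: %.2f s  (%.0f%% slower, not 100%%)\n",
           t25, 100.0 * (t25 - t50) / t50);
    return 0;
}
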

It seems like SMT could be done in theory, just it would effectively
double the size of the register file (each hardware thread effectively
banking all the registers).
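
A minimal sketch of that per-thread banking (register and thread counts are
placeholders, not the actual ISA): each hardware thread gets its own copy of
the architectural registers, so the storage grows linearly with the number
of threads.

#include <stdint.h>
#include <stdio.h>

/* Register file banked per hardware thread, as in an SMT core: each thread
   sees its own 32 GPRs, so storage is NUM_THREADS times a single-thread
   file.  Sizes here are placeholders.                                      */
#define NUM_THREADS 2
#define NUM_GPRS    32

typedef struct {
    uint64_t gpr[NUM_THREADS][NUM_GPRS];
} banked_regfile;

static uint64_t rf_read(const banked_regfile *rf, int thread, int reg)
{
    return rf->gpr[thread][reg];
}

static void rf_write(banked_regfile *rf, int thread, int reg, uint64_t v)
{
    rf->gpr[thread][reg] = v;
}

int main(void)
{
    banked_regfile rf = {0};

    rf_write(&rf, 0, 5, 111);          /* thread 0 writes its R5 */
    rf_write(&rf, 1, 5, 222);          /* thread 1 writes its R5 */

    printf("thread 0, R5 = %llu\n", (unsigned long long)rf_read(&rf, 0, 5));
    printf("thread 1, R5 = %llu\n", (unsigned long long)rf_read(&rf, 1, 5));
    printf("total registers = %d (vs %d single-threaded)\n",
           NUM_THREADS * NUM_GPRS, NUM_GPRS);
    return 0;
}
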

Don't think I will go this route though, as some other workloads (such
as my software GL), would take a pretty big performance hit if run at 25
MHz.

OTOH: If there are FPGAs which can socket into a PC MOBO and aren't
super expensive, that might be interesting to know about.

Though, looking into it, it seems the FPGA board I have is roughly the
same layout as Nano-ITX (though finding a sanely-priced Nano-ITX case is
the hard part).

It looks like some of the other boards either use a Mini-ITX layout, or
could be shoehorned into a Mini-ITX case (with some tweaks made to the
case to allow for different screw-mount positions, shoving a 147x170mm
board into a case meant for 170x170mm).

Somehow I had failed to notice this previously...

Though, yeah, some sort of case to put the FPGA board into would be an
improvement.

Re: The Tera MTA

<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18873&group=comp.arch#18873

X-Received: by 2002:a0c:a422:: with SMTP id w31mr16917016qvw.34.1626550754940;
Sat, 17 Jul 2021 12:39:14 -0700 (PDT)
X-Received: by 2002:a05:6808:d54:: with SMTP id w20mr16782941oik.175.1626550754692;
Sat, 17 Jul 2021 12:39:14 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 12:39:14 -0700 (PDT)
In-Reply-To: <scv1hp$gh9$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:95ef:5480:867b:c854;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:95ef:5480:867b:c854
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
Subject: Re: The Tera MTA
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 17 Jul 2021 19:39:14 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 17 Jul 2021 19:39 UTC

On Saturday, July 17, 2021 at 11:46:52 AM UTC-5, BGB wrote:
> On 7/16/2021 10:40 AM, MitchAlsup wrote:
> > On Friday, July 16, 2021 at 10:00:17 AM UTC-5, Quadibloc wrote:
> >> The Cray XMT (not to be confused with the X-MP) recently came to my
> >> attention again when I happened to stumble upon an eBay sale for a
> >> Threadcrusher 4.0 chip.
> >>
> >> I remembered that at one point, AMD allowed other companies to make
> >> chips that would fit into the same sockets as an AMD processor; in
> >> connection with that, I remember hearing of an FPGA accelerator chip that
> >> did this.
> >>
> >> But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
> >> fit in an Opteron socket also.
> >>
> >> I knew that Sun and/or Oracle made versions of the SPARC chip that took
> >> simultaneous multi-threading (SMT) beyond what Intel did with
> >> Hyper-Threading - instead of two simultaneous threads, their chips could
> >> offer up to _eight_.
> >>
> >> But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
> >> currently AMD has a product it calls the Threadripper...) took SMT to a
> >> rather higher level, with 128 simultaneous threads.
> > <
> > Lookup Burton Smith
> > <
> > Tera was another Denelore HEP-like processor architecture.
> >>
> >> It seems, from the history, that the Tera MTA couldn't have been a
> >> complete failure.
> >>
> >> The Tera MTA later became known as the Cray MTA. When I heard that
> >> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
> >> Cray - from Silicon Graphics. And then decided that the name "Cray" had
> >> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
> >> that name for the combined company.
> >>
> >> The same architecture was used in later versions of the machine; the
> >> chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
> >> Tera MTA didn't use a single-chip processor.
> >>
> >> The rationale behind using so many threads on a processor was so that
> >> the processor could be doing something useful while waiting, and
> >> waiting, and waiting for requested data to arrive from main memory.
> >> So, if one had enough threads, _memory latency_ would not be an
> >> issue.
> > <
> > This same approach is used in GPUs, where one switches threads every
> > instruction! I don't remember a lot about Tera, but at Denelcore the processor
> > had a number of threads and a number of calculation units and memory.
> > A thread was not eligible to run a second instruction until all aspects of
> > the first instruction had been performed. Each word (64-bits) of memory
> > could be {Read, Written, Read-if-Full, Written-if-empty} So a memory ref
> > could be sent out and literally not return for thousands of clocks. The
> > registers had the same kind of stuff but we didn't use that--just the memory.
> >>
> >> Apparently, this processor had eight data registers, and eight address
> >> modification registers. A bit like a 68000, or even my original
> >> Concertina architecture. But I haven't been able to find any information
> >> about its instruction set.
> >>
> >> Of course, having 128 threads on a single-core chip, while it does,
> >> without diminishing throughput, slow each individual thread down to
> >> the pace of memory, might seem to create bandwidth issues. However,
> >> that didn't stop Intel from giving the world Xeon Phi.
> >>
> >> In any case, was this a valid approach, or is there a good reason why
> >> it is no longer viable?
> > <
> > It migrated into GPUs, where a single instruction may cause 32 (or 64) calculations
> > or memory references to 64 different cache lines, in 64 different pages,.....
> > The average memory reference takes something like 400 cycles, so the GPU
> > better have other things to do while it waits. So, switching to a new set of
> > threads every cycle dramatically improves throughput even if dragging out the
> > latency on a per thread basis.
> Hmm...
>
> Reminds me of an observation I made while fiddling around with stuff for
> my own ISA:
<
> If the CPU core ran at half the effective clock-speed, but only had to
> spend half as many cycles on cache misses, the relative impact on
> performance would be "surprisingly minor" for things like Doom (the
> clock speed drops but the IPC increases enough to compensate).
<
I have noticed several occurrences of similar merit::
<
As the length of the pipeline is decreased the pressure on the predictors
{branch, Jump, Call/Return} is reduced because the recovery multiplier is
smaller. Conversely, a 15 stage pipeline needs a predictor that has no more
than 1/3rd of the mispredicts as a 5 stage pipeline to achieve similar miss
prediction recovery cycles.
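
A quick worked check of that ratio (the mispredict rates are invented, and
the whole pipeline depth is used as the refill penalty for simplicity):

#include <stdio.h>

/* Recovery cycles lost per instruction ~= mispredicts/instruction * refill
   depth.  Shows why a 15-stage pipeline needs roughly 1/3 the mispredict
   rate of a 5-stage pipeline to eat the same number of recovery cycles.   */
static double recovery_cpi(double mispredict_rate, int pipeline_depth)
{
    return mispredict_rate * pipeline_depth;
}

int main(void)
{
    printf(" 5-stage, 3%% mispredicts : %.3f recovery cycles/inst\n",
           recovery_cpi(0.03, 5));
    printf("15-stage, 3%% mispredicts : %.3f recovery cycles/inst\n",
           recovery_cpi(0.03, 15));
    printf("15-stage, 1%% mispredicts : %.3f recovery cycles/inst"
           "  <- back to the 5-stage cost\n",
           recovery_cpi(0.01, 15));
    return 0;
}
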
<
As the execution width becomes wider, one needs to service more AGENs
per cycle:: 1-wide needs 1-AGEN, 4-wide needs 2 AGENs, 6-wide needs
3 AGENs--all these AGENs end up colliding at the cache ports and if you
can't service this many on a per cycle basis, you might want to reconsider
the width you are trying to achieve.
<
As the AGENs per cycle goes up, the TLB must end up servicing all of them
on a per cycle basis, too.
<
As width goes up, fetching wide does not improve as much as expected
because one ends up taking a branch every 8-9 instructions. Somewhere
around 6-instruction ISSUEs per cycle is where one needs to issue inst-
ructions from a <predicted> taken basic block along with the instructions
that take you to the predicted basic block.
>
>
> It seems like SMT could be done in theory, just it would effectively
> double the size of the register file (each hardware thread effectively
> banking all the registers).
<
The register file size is multiplied by the degree of SMT-ness.

Re: The Tera MTA

<0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18874&group=comp.arch#18874

X-Received: by 2002:a0c:e54e:: with SMTP id n14mr17031007qvm.41.1626552108636;
Sat, 17 Jul 2021 13:01:48 -0700 (PDT)
X-Received: by 2002:a05:6830:1658:: with SMTP id h24mr2794920otr.182.1626552108383;
Sat, 17 Jul 2021 13:01:48 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 13:01:48 -0700 (PDT)
In-Reply-To: <scuq5c$h7h$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:d407:3113:3c50:f5d2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:d407:3113:3c50:f5d2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com> <scuq5c$h7h$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sat, 17 Jul 2021 20:01:48 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sat, 17 Jul 2021 20:01 UTC

On Saturday, July 17, 2021 at 8:40:47 AM UTC-6, Tom Gardner wrote:
> On 16/07/21 16:00, Quadibloc wrote:

> > In any case, was this a valid approach, or is there a good reason why
> > it is no longer viable?

> I don't know whether a one word answer is correct or
> sufficient: "Oracle".

Insufficient: not being aware of Oracle's current doings, I
do not know if they are evidence of the validity of this
approach, or of its failure.

John Savard

Re: The Tera MTA

<37a5ba05-c242-444a-a1bd-7ed146e8559cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18875&group=comp.arch#18875

X-Received: by 2002:a05:622a:1893:: with SMTP id v19mr15550965qtc.222.1626554587018;
Sat, 17 Jul 2021 13:43:07 -0700 (PDT)
X-Received: by 2002:a9d:4c9a:: with SMTP id m26mr12574401otf.110.1626554586819;
Sat, 17 Jul 2021 13:43:06 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 13:43:06 -0700 (PDT)
In-Reply-To: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:d407:3113:3c50:f5d2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:d407:3113:3c50:f5d2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <37a5ba05-c242-444a-a1bd-7ed146e8559cn@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sat, 17 Jul 2021 20:43:07 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1440
 by: Quadibloc - Sat, 17 Jul 2021 20:43 UTC

My memory doesn't seem to be working as well as it should.

The chip in question was the Threadstorm 4.0, not the
Threadcrusher 4.0.

Also, I see it went into a 1207-pin socket, Socket F... whereas
the FPGA that went into an AMD socket went into Socket 940
which had 940 pins.

John Savard

Re: The Tera MTA

<51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18876&group=comp.arch#18876

X-Received: by 2002:ac8:41d2:: with SMTP id o18mr15634227qtm.10.1626555257019;
Sat, 17 Jul 2021 13:54:17 -0700 (PDT)
X-Received: by 2002:a9d:7f91:: with SMTP id t17mr13265824otp.22.1626555256787;
Sat, 17 Jul 2021 13:54:16 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 13:54:16 -0700 (PDT)
In-Reply-To: <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:d407:3113:3c50:f5d2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:d407:3113:3c50:f5d2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<scuq5c$h7h$1@dont-email.me> <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sat, 17 Jul 2021 20:54:17 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sat, 17 Jul 2021 20:54 UTC

On Saturday, July 17, 2021 at 2:01:49 PM UTC-6, Quadibloc wrote:
> On Saturday, July 17, 2021 at 8:40:47 AM UTC-6, Tom Gardner wrote:

> > I don't know whether a one word answer is correct or
> > sufficient: "Oracle".

> Insufficient: not being aware of Oracle's current doings, I
> do not know if they are evidence of the validity of this
> approach, or of its failure.

A quick web search shows that Oracle is still selling
SPARC-based servers, based on the M8 processor,
which has 32 cores and 256 threads, so 8 threads per
core, so apparently the approach hasn't been given up
yet.

John Savard

Re: The Tera MTA

<aeab88d8-c9c6-49aa-9738-c778ac4174den@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18877&group=comp.arch#18877

X-Received: by 2002:ac8:4741:: with SMTP id k1mr15615089qtp.374.1626556009752;
Sat, 17 Jul 2021 14:06:49 -0700 (PDT)
X-Received: by 2002:aca:c7cb:: with SMTP id x194mr12924063oif.119.1626556009557;
Sat, 17 Jul 2021 14:06:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 14:06:49 -0700 (PDT)
In-Reply-To: <51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:d407:3113:3c50:f5d2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:d407:3113:3c50:f5d2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<scuq5c$h7h$1@dont-email.me> <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>
<51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <aeab88d8-c9c6-49aa-9738-c778ac4174den@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sat, 17 Jul 2021 21:06:49 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sat, 17 Jul 2021 21:06 UTC

On Saturday, July 17, 2021 at 2:54:18 PM UTC-6, Quadibloc wrote:

> A quick web search shows that Oracle is still selling
> SPARC-based servers, based on the M8 processor,
> which has 32 cores and 256 threads, so 8 threads per
> core, so apparently the approach hasn't been given up
> yet.

Looking into this a bit more, I now see that what I remembered
reading before was coverage of the introduction of the
M7 processor - it was exciting at the time because it had
10 billion transistors, 32 cores, and 8 threads per core.

The M8 also has 32 cores and 8 threads per core. Not only that,
but I see it's still on the same 20nm process!

However, where I learned _that_

https://www.nextplatform.com/2017/09/18/m8-last-hurrah-oracle-sparc/

notes that there were architectural improvements, and from the table
of the specs, I see they added a lot of cache. Which I think is a good idea,
given all the cores and threads they've put on those chips!

John Savard

Re: The Tera MTA

<02416980-bf6b-4094-8cea-34bb69937472n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18878&group=comp.arch#18878

X-Received: by 2002:ad4:5c49:: with SMTP id a9mr8555401qva.27.1626557640936;
Sat, 17 Jul 2021 14:34:00 -0700 (PDT)
X-Received: by 2002:a9d:6f84:: with SMTP id h4mr13851730otq.240.1626557640692;
Sat, 17 Jul 2021 14:34:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 14:34:00 -0700 (PDT)
In-Reply-To: <aeab88d8-c9c6-49aa-9738-c778ac4174den@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:95ef:5480:867b:c854;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:95ef:5480:867b:c854
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<scuq5c$h7h$1@dont-email.me> <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>
<51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com> <aeab88d8-c9c6-49aa-9738-c778ac4174den@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <02416980-bf6b-4094-8cea-34bb69937472n@googlegroups.com>
Subject: Re: The Tera MTA
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 17 Jul 2021 21:34:00 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 17 Jul 2021 21:34 UTC

On Saturday, July 17, 2021 at 4:06:50 PM UTC-5, Quadibloc wrote:
> On Saturday, July 17, 2021 at 2:54:18 PM UTC-6, Quadibloc wrote:
>
> > A quick web search shows that Oracle is still selling
> > SPARC-based servers, based on the M8 processor,
> > which has 32 cores and 256 threads, so 8 threads per
> > core, so apparently the approach hasn't been given up
> > yet.
> Looking into this a bit more, I now see that what I remembered
> reading before was coverage of the introduction of the
> M7 processor - it was exciting at the time because it had
> 10 billion transistors, 32 cores, and 8 threads per core.
>
> The M8 also has 32 cores and 8 threads per core. Not only that,
> but I see it's still on the same 20nm process!
>
> However, where I learned _that_
>
> https://www.nextplatform.com/2017/09/18/m8-last-hurrah-oracle-sparc/
>
> notes that there were architectural improvements, and from the table
> of the specs, I see they added a lot of cache. Which I think is a good idea,
> given all the cores and threads they've put on those chips!
<
Notice that they doubled the ICache and kept the DCache the same.
Servers need larger I relative to D while application processors go the
other direction.
<
Secondly, 20nm still is cheaper per transistor than 14nm, 10nm, 7nm or 5nm.
>
> John Savard

Re: The Tera MTA

<scvp1v$dgh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=18879&group=comp.arch#18879

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sat, 17 Jul 2021 18:27:58 -0500
Organization: A noiseless patient Spider
Lines: 202
Message-ID: <scvp1v$dgh$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
<scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 17 Jul 2021 23:27:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="79d1ca3378643581b91c4335e953a407";
logging-data="13841"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1909YfSc7vudqlEwvtkcsXb"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:euFCsxwvds7sa+neVKIgtHMUNeA=
In-Reply-To: <1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 17 Jul 2021 23:27 UTC

On 7/17/2021 2:39 PM, MitchAlsup wrote:
> On Saturday, July 17, 2021 at 11:46:52 AM UTC-5, BGB wrote:
>> On 7/16/2021 10:40 AM, MitchAlsup wrote:
>>> On Friday, July 16, 2021 at 10:00:17 AM UTC-5, Quadibloc wrote:
>>>> The Cray XMT (not to be confused with the X-MP) recently came to my
>>>> attention again when I happened to stumble upon an eBay sale for a
>>>> Threadcrusher 4.0 chip.
>>>>
>>>> I remembered that at one point, AMD allowed other companies to make
>>>> chips that would fit into the same sockets as an AMD processor; in
>>>> connection with that, I remember hearing of an FPGA accelerator chip that
>>>> did this.
>>>>
>>>> But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
>>>> fit in an Opteron socket also.
>>>>
>>>> I knew that Sun and/or Oracle made versions of the SPARC chip that took
>>>> simultaneous multi-threading (SMT) beyond what Intel did with
>>>> Hyper-Threading - instead of two simultaneous threads, their chips could
>>>> offer up to _eight_.
>>>>
>>>> But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
>>>> currently AMD has a product it calls the Threadripper...) took SMT to a
>>>> rather higher level, with 128 simultaneous threads.
>>> <
>>> Lookup Burton Smith
>>> <
>>> Tera was another Denelore HEP-like processor architecture.
>>>>
>>>> It seems, from the history, that the Tera MTA couldn't have been a
>>>> complete failure.
>>>>
>>>> The Tera MTA later became known as the Cray MTA. When I heard that
>>>> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
>>>> Cray - from Silicon Graphics. And then decided that the name "Cray" had
>>>> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
>>>> that name for the combined company.
>>>>
>>>> The same architecture was used in later versions of the machine; the
>>>> chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
>>>> Tera MTA didn't use a single-chip processor.
>>>>
>>>> The rationale behind using so many threads on a processor was so that
>>>> the processor could be doing something useful while waiting, and
>>>> waiting, and waiting for requested data to arrive from main memory.
>>>> So, if one had enough threads, _memory latency_ would not be an
>>>> issue.
>>> <
>>> This same approach is used in GPUs, where one switches threads every
>>> instruction! I don't remember a lot about Tera, but at Denelcore the processor
>>> had a number of threads and a number of calculation units and memory.
>>> A thread was not eligible to run a second instruction until all aspects of
>>> the first instruction had been performed. Each word (64-bits) of memory
>>> could be {Read, Written, Read-if-Full, Written-if-empty} So a memory ref
>>> could be sent out and literally not return for thousands of clocks. The
>>> registers had the same kind of stuff but we didn't use that--just the memory.
>>>>
>>>> Apparently, this processor had eight data registers, and eight address
>>>> modification registers. A bit like a 68000, or even my original
>>>> Concertina architecture. But I haven't been able to find any information
>>>> about its instruction set.
>>>>
>>>> Of course, having 128 threads on a single-core chip, while it does,
>>>> without diminishing throughput, slow each individual thread down to
>>>> the pace of memory, might seem to create bandwidth issues. However,
>>>> that didn't stop Intel from giving the world Xeon Phi.
>>>>
>>>> In any case, was this a valid approach, or is there a good reason why
>>>> it is no longer viable?
>>> <
>>> It migrated into GPUs, where a single instruction may cause 32 (or 64) calculations
>>> or memory references to 64 different cache lines, in 64 different pages,.....
>>> The average memory reference takes something like 400 cycles, so the GPU
>>> better have other things to do while it waits. So, switching to a new set of
>>> threads every cycle dramatically improves throughput even if dragging out the
>>> latency on a per thread basis.
>> Hmm...
>>
>> Reminds me of an observation I made while fiddling around with stuff for
>> my own ISA:
> <
>> If the CPU core ran at half the effective clock-speed, but only had to
>> spend half as many cycles on cache misses, the relative impact on
>> performance would be "surprisingly minor" for things like Doom (the
>> clock speed drops but the IPC increases enough to compensate).

As noted, this seems to be because Doom is mostly bound by memory
accesses, spending the majority of its clock cycles waiting on L2 misses.

Poking at it some more:

If memory stays at the same speed, Doom is still fairly playable at
around 12MHz or 16MHz, though at these speeds it transitions over into
being mostly instruction-bound.

Its behavior and properties seem to change slightly, and the "low/high
detail" option becomes "actually relevant" (has a more obvious effect on
framerate).

A similar property seems to hold for Quake, which only seems to suffer a
modest slowdown (relative to 50MHz) when running at 12MHz (still low
single digit framerates).

GLQuake performance tanks pretty bad at 12MHz though (averages 0 fps,
this scenario somewhat favors software-rendered Quake).

> <
> I have noticed several occurrences of similar merit::
> <
> As the length of the pipeline is decreased the pressure on the predictors
> {branch, Jump, Call/Return} is reduced because the recovery multiplier is
> smaller. Conversely, a 15 stage pipeline needs a predictor that has no more
> than 1/3rd of the mispredicts as a 5 stage pipeline to achieve similar miss
> prediction recovery cycles.

OK. Branch misses don't appear to be too major of a problem with an
8-stage pipeline. The predictors I have mostly seem to work OK.

> <
> As the execution width becomes wider, one needs to service more AGENs
> per cycle:: 1-wide needs 1-AGEN, 4-wide needs 2 AGENs, 6-wide needs
> 3 AGENs--all these AGENs end up colliding at the cache ports and it you
> can't service this many on a per cycle basis, you might want to reconsider
> the width you are trying to achieve.

OK, in my 3-wide core, there is only 1 memory port.

So, I guess from the pattern above, the idea is that, as the number of
lanes increases, one needs roughly n/2 memory ports to be worthwhile (as
opposed to, say, doing a 6-wide core with 1 memory port, and the other 5
lanes only able to do ALU ops).

> <
> As the AGENs per cycle goes up, the TLB must end up servicing all of them
> on a per cycle basis, too.
> <
> As width goes up, fetching wide does not improve as much as expected
> because one ends up taking a branch every 8-9 instructions. Somewhere
> around 6-instruction ISSUEs per cycle is where one needs to issue inst-
> ructions from a <predicted> taken basic block along with the instructions
> that take you to the predicted basic block.

This can be partly sidestepped by putting the TLB on the L1<->L2
interface, so that the TLB only gets invoked on L1 misses. The tradeoff is
that L1 cache lines now need to keep track of both virtual and physical
addresses, and double-mapping may introduce awkward cache-consistency issues.

Putting the TLB inline with L1 access would likely mean needing more
cycles to access the L1 cache.

Then again, another simpler workaround to the double-map cache
consistency issue would be using non-hashed direct mapping, which
basically eliminates the issue in cases where the size of the L1 cache
is less than the page size (at the cost of a higher L1 miss rate).
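
A rough sketch of what such an L1 line might look like (field names and
sizes are invented): the line is indexed and tagged by the virtual address
so hits need no translation, it carries the physical address filled in by
the TLB on the miss path for the L2 side, and the direct-mapped cache is
kept no larger than a page so the aliasing problem mostly goes away.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 32
#define NUM_LINES  128            /* direct-mapped, 4 KB: no larger than a page */

typedef struct {
    bool     valid;
    uint64_t vtag;                /* virtual tag, checked on every access      */
    uint64_t paddr;               /* physical line address, from TLB on fill   */
    uint8_t  data[LINE_BYTES];
} l1_line;

static l1_line l1[NUM_LINES];

/* Stand-in for the TLB / page-table walk, invoked only when L1 misses. */
static uint64_t translate(uint64_t vaddr)
{
    return vaddr + 0x40000000ull; /* fake mapping for illustration */
}

static l1_line *l1_access(uint64_t vaddr)
{
    uint64_t line_addr = vaddr / LINE_BYTES;
    l1_line *ln = &l1[line_addr % NUM_LINES];

    if (ln->valid && ln->vtag == line_addr)
        return ln;                            /* hit: no translation done */

    /* miss path: now (and only now) is the TLB consulted */
    ln->valid = true;
    ln->vtag  = line_addr;
    ln->paddr = translate(vaddr) / LINE_BYTES;
    /* ...fetch data from L2 using ln->paddr... */
    return ln;
}

int main(void)
{
    l1_access(0x1000);            /* miss: translate() runs               */
    l1_access(0x1008);            /* hit in the same line: no translation */
    printf("line 0x1000 -> physical line 0x%llx\n",
           (unsigned long long)l1[(0x1000 / LINE_BYTES) % NUM_LINES].paddr);
    return 0;
}
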

>>
>>
>> It seems like SMT could be done in theory, just it would effectively
>> double the size of the register file (each hardware thread effectively
>> banking all the registers).
> <
> The register file size is multiplied by the degree of SMT-ness.
>

Yeah.

This seems like a possible issue with the idea.

Another possible issue is that a core running such an SMT configuration
could suffer from a higher cache miss rate than a core running a single
thread.

It could have a lower resource cost than dual-core, but to be similarly
effective would likely mean roughly doubling the sizes of the L1 caches.


Re: The Tera MTA

<26747ef3-5f6e-4df9-a0f7-16bf879119e3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18880&group=comp.arch#18880

X-Received: by 2002:ad4:56e4:: with SMTP id cr4mr17957479qvb.54.1626570092787;
Sat, 17 Jul 2021 18:01:32 -0700 (PDT)
X-Received: by 2002:aca:dac5:: with SMTP id r188mr13439954oig.78.1626570092483;
Sat, 17 Jul 2021 18:01:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 18:01:32 -0700 (PDT)
In-Reply-To: <02416980-bf6b-4094-8cea-34bb69937472n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:cc51:6590:e4c:98a2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:cc51:6590:e4c:98a2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<scuq5c$h7h$1@dont-email.me> <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com>
<51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com> <aeab88d8-c9c6-49aa-9738-c778ac4174den@googlegroups.com>
<02416980-bf6b-4094-8cea-34bb69937472n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <26747ef3-5f6e-4df9-a0f7-16bf879119e3n@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 18 Jul 2021 01:01:32 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sun, 18 Jul 2021 01:01 UTC

On Saturday, July 17, 2021 at 3:34:02 PM UTC-6, MitchAlsup wrote:

> Notice that they doubled the ICache and kept the DCache the same.
> Servers need larger I relative to D while application processors go the
> other direction.

I had misread the table; I thought they made major increases in the cache
at all levels.

> Secondly, 20nm still is cheaper per transistor than 14nm, 10nm, 7nm or 5nm.

True. But at least the M7 chip, its predecessor, was monolithic. Yields on
a 10 billion transistor die have to be considered as well, and power consumption.

Of course, as there were no die shots of the M8, and Oracle was too
embarrassed to even talk about it at Hot Chips, it could be they split the
die up for a chiplet design this time, as the Register speculated.

John Savard

Re: The Tera MTA

<98aaf00d-0858-4a5c-bad1-1022cb3f5f9dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18881&group=comp.arch#18881

X-Received: by 2002:a05:622a:11c3:: with SMTP id n3mr16361033qtk.211.1626576340823; Sat, 17 Jul 2021 19:45:40 -0700 (PDT)
X-Received: by 2002:aca:4946:: with SMTP id w67mr10950521oia.155.1626576340546; Sat, 17 Jul 2021 19:45:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.uzoreto.com!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 17 Jul 2021 19:45:40 -0700 (PDT)
In-Reply-To: <scvp1v$dgh$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:95ef:5480:867b:c854; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:95ef:5480:867b:c854
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com> <127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me> <1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <98aaf00d-0858-4a5c-bad1-1022cb3f5f9dn@googlegroups.com>
Subject: Re: The Tera MTA
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 18 Jul 2021 02:45:40 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 208
 by: MitchAlsup - Sun, 18 Jul 2021 02:45 UTC

On Saturday, July 17, 2021 at 6:28:02 PM UTC-5, BGB wrote:
> On 7/17/2021 2:39 PM, MitchAlsup wrote:
> > On Saturday, July 17, 2021 at 11:46:52 AM UTC-5, BGB wrote:
> >> On 7/16/2021 10:40 AM, MitchAlsup wrote:
> >>> On Friday, July 16, 2021 at 10:00:17 AM UTC-5, Quadibloc wrote:
> >>>> The Cray XMT (not to be confused with the X-MP) recently came to my
> >>>> attention again when I happened to stumble upon an eBay sale for a
> >>>> Threadcrusher 4.0 chip.
> >>>>
> >>>> I remembered that at one point, AMD allowed other companies to make
> >>>> chips that would fit into the same sockets as an AMD processor; in
> >>>> connection with that, I remember hearing of an FPGA accelerator chip that
> >>>> did this.
> >>>>
> >>>> But I didn't know there was also the Threadcrusher 4.0 from Cray, which could
> >>>> fit in an Opteron socket also.
> >>>>
> >>>> I knew that Sun and/or Oracle made versions of the SPARC chip that took
> >>>> simultaneous multi-threading (SMT) beyond what Intel did with
> >>>> Hyper-Threading - instead of two simultaneous threads, their chips could
> >>>> offer up to _eight_.
> >>>>
> >>>> But the Threadcrusher chip from Cray (and it fit into an AMD socket... and
> >>>> currently AMD has a product it calls the Threadripper...) took SMT to a
> >>>> rather higher level, with 128 simultaneous threads.
> >>> <
> >>> Lookup Burton Smith
> >>> <
> >>> Tera was another Denelore HEP-like processor architecture.
> >>>>
> >>>> It seems, from the history, that the Tera MTA couldn't have been a
> >>>> complete failure.
> >>>>
> >>>> The Tera MTA later became known as the Cray MTA. When I heard that
> >>>> I thought, oh, Cray bought Tera. But no, this happened after Tera bought
> >>>> Cray - from Silicon Graphics. And then decided that the name "Cray" had
> >>>> a bit more of a _cachet_ than the name "Tera", and adopted, therefore,
> >>>> that name for the combined company.
> >>>>
> >>>> The same architecture was used in later versions of the machine; the
> >>>> chip for sale on eBay was, after all, a Threadcrusher 4.0, and the original
> >>>> Tera MTA didn't use a single-chip processor.
> >>>>
> >>>> The rationale behind using so many threads on a processor was so that
> >>>> the processor could be doing something useful while waiting, and
> >>>> waiting, and waiting for requested data to arrive from main memory.
> >>>> So, if one had enough threads, _memory latency_ would not be an
> >>>> issue.
> >>> <
> >>> This same approach is used in GPUs, where one switches threads every
> >>> instruction! I don't remember a lot about Tera, but at Denelcore the processor
> >>> had a number of threads and a number of calculation units and memory.
> >>> A thread was not eligible to run a second instruction until all aspects of
> >>> the first instruction had been performed. Each word (64-bits) of memory
> >>> could be {Read, Written, Read-if-Full, Written-if-empty} So a memory ref
> >>> could be sent out and literally not return for thousands of clocks. The
> >>> registers had the same kind of stuff but we didn't use that--just the memory.
> >>>>
> >>>> Apparently, this processor had eight data registers, and eight address
> >>>> modification registers. A bit like a 68000, or even my original
> >>>> Concertina architecture. But I haven't been able to find any information
> >>>> about its instruction set.
> >>>>
> >>>> Of course, having 128 threads on a single-core chip, while it does,
> >>>> without diminishing throughput, slow each individual thread down to
> >>>> the pace of memory, might seem to create bandwidth issues. However,
> >>>> that didn't stop Intel from giving the world Xeon Phi.
> >>>>
> >>>> In any case, was this a valid approach, or is there a good reason why
> >>>> it is no longer viable?
> >>> <
> >>> It migrated into GPUs, where a single instruction may cause 32 (or 64) calculations
> >>> or memory references to 64 different cache lines, in 64 different pages,.....
> >>> The average memory reference takes something like 400 cycles, so the GPU
> >>> better have other things to do while it waits. So, switching to a new set of
> >>> threads every cycle dramatically improves throughput even if dragging out the
> >>> latency on a per thread basis.
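Roughly, the point is that with enough resident threads the issue slots stay
full even though each individual reference takes hundreds of cycles. A toy
model, with made-up numbers:

    /* Toy throughput model for latency hiding by thread count.
       L = memory latency in cycles, T = resident threads, one issue slot/cycle.
       Purely illustrative; not a model of any particular GPU. */
    #include <stdio.h>

    int main(void) {
        const int L = 400;                 /* cycles per memory reference */
        for (int T = 50; T <= 800; T *= 2) {
            /* a waiting thread can issue at most once every L cycles, so
               T threads can cover at most T of every L issue slots */
            double util = (T >= L) ? 1.0 : (double)T / L;
            printf("T=%3d threads -> issue-slot utilization %.0f%%\n", T, util * 100);
        }
        return 0;
    }

With L=400 the model only reaches full utilization at 400 or more resident
threads, which is why these designs carry so many.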
> >> Hmm...
> >>
> >> Reminds me of an observation I made while fiddling around with stuff for
> >> my own ISA:
> > <
> >> If the CPU core ran at half the effective clock-speed, but only had to
> >> spend half as many cycles on cache misses, the relative impact on
> >> performance would be "surprisingly minor" for things like Doom (the
> >> clock speed drops but the IPC increases enough to compensate).
> As noted, this seems to be because Doom is mostly bound by memory
> accesses, spending the majority of its clock cycles waiting on L2 misses.
>
>
> Poking at it some more:
>
> If memory stays at the same speed, Doom is still fairly playable at
> around 12MHz or 16MHz, though at these speeds it transitions over into
> being mostly instruction-bound.
>
> Its behavior and properties seem to change slightly, and the "low/high
> detail" option becomes "actually relevant" (has a more obvious effect on
> framerate).
>
> A similar property seems to hold for Quake, which only seems to suffer a
> modest slowdown (relative to 50MHz) when running at 12MHz (still low
> single-digit framerates).
>
> GLQuake performance tanks pretty badly at 12MHz though (averages 0 fps,
> this scenario somewhat favors software-rendered Quake).
> > <
> > I have noticed several occurrences of similar merit::
> > <
> > As the length of the pipeline is decreased the pressure on the predictors
> > {branch, Jump, Call/Return} is reduced because the recovery multiplier is
> > smaller. Conversely, a 15-stage pipeline needs a predictor that has no more
> > than 1/3rd of the mispredicts of a 5-stage pipeline to achieve similar
> > misprediction recovery cycles.
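A quick back-of-the-envelope illustration of that trade-off, with invented
but plausible numbers:

    /* Recovery cost per branch ~= mispredict_rate * redirect_penalty, and the
       redirect penalty grows roughly with front-end depth. Numbers are invented
       purely to illustrate the 1/3rd relationship stated above. */
    #include <stdio.h>

    int main(void) {
        double cost5  = 0.06 * 4.0;   /* 5-stage:  6% mispredicts, ~4-cycle redirect */
        double cost15 = 0.02 * 12.0;  /* 15-stage: 2% mispredicts, ~12-cycle redirect */
        printf("5-stage:  %.2f cycles/branch lost\n", cost5);
        printf("15-stage: %.2f cycles/branch lost\n", cost15);
        return 0;
    }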
> OK. Branch misses don't appear to be too major of a problem with an
> 8-stage pipeline. The predictors I have mostly seem to work OK.
> > <
> > As the execution width becomes wider, one needs to service more AGENs
> > per cycle:: 1-wide needs 1-AGEN, 4-wide needs 2 AGENs, 6-wide needs
> > 3 AGENs--all these AGENs end up colliding at the cache ports and if you
> > can't service this many on a per cycle basis, you might want to reconsider
> > the width you are trying to achieve.
> OK, in my 3-wide core, there is only 1 memory port.
>
> So, I guess from the pattern above, the idea is that, as the number of
> lanes increases, one needs roughly n/2 memory ports to be worthwhile (as
> opposed to, say, doing a 6-wide core with 1 memory port, and the other 5
> lanes only able to do ALU ops).
<
MATRIX300 is DGEMM at its core, and this has 3 memrefs and 2 FOps
(unrolled 4 times), so the ADD-CMP-BC is amortized over 4 iterations, or
about 5.75 I/C (if your execution window can absorb an L1 DCache miss).
{Modern equivalent uses FMAC instead of FMUL and FADD.} We
specifically targeted this application (it was 1991) for our 6-wide
machine.
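For reference, a plain-C rendering of the kind of inner kernel being counted
there (a DAXPY-style loop, unrolled by four; the per-element 3 memrefs are
two loads and a store, the 2 FOps the multiply and add):

    /* MATRIX300-style inner kernel: y[i] += a * x[i], unrolled 4x.
       Per element: 2 loads + 1 store (3 memrefs) and FMUL + FADD (2 FOps);
       the ADD-CMP-BC loop overhead is shared by the four copies, giving
       roughly 23 instructions per 4 elements, i.e. ~5.75 per element.
       Remainder loop omitted for brevity. */
    static void daxpy4(double *y, const double *x, double a, int n) {
        for (int i = 0; i + 3 < n; i += 4) {
            y[i + 0] += a * x[i + 0];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
    }

Read that way, a 6-wide machine sustaining about 5.75 I/C is retiring
roughly one element of the kernel per cycle.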
> > <
> > As the AGENs per cycle goes up, the TLB must end up servicing all of them
> > on a per cycle basis, too.
> > <
> > As width goes up, fetching wide does not improve as much as expected
> > because one ends up taking a branch every 8-9 instructions. Somewhere
> > around 6-instruction ISSUEs per cycle is where one needs to issue inst-
> > ructions from a <predicted> taken basic block along with the instructions
> > that take you to the predicted basic block.
<
> This can be partly sidestepped by putting the TLB on the L1<->L2
> interface, so the TLB only gets invoked on L1 misses. The tradeoff is that
> L1 cache lines now need to keep track of both virtual and physical
> addresses, and double-mapping may introduce awkward cache-consistency issues.
>
> Putting the TLB inline with L1 access would likely mean needing more
> cycles to access the L1 cache.
<
You access the TLB at the same time you access the tags, and do the compare
while aligning the data from the cache read.
>
>
> Then again, another simpler workaround to the double-map cache
> consistency issue would be using non-hashed direct mapping, which
> basically eliminates the issue in cases where the size of the L1 cache
> is less than the page size (at the cost of a higher L1 miss rate).
<
We did 4-ports direct mapped in the 6-wide machine above.
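The "L1 no larger than a page" point can be made concrete with the index
arithmetic; a sketch with assumed sizes (not anyone's actual cache
parameters):

    /* Why a direct-mapped L1 no larger than a page has no synonym problem:
       the index bits then come entirely from the page offset, which is the
       same in the virtual and physical address. Sizes below are assumptions. */
    #include <stdint.h>

    #define PAGE_SIZE   4096u            /* 12 offset bits */
    #define LINE_SIZE   32u              /* 5 line-offset bits */
    #define CACHE_SIZE  4096u            /* direct-mapped, == page size */
    #define NUM_LINES   (CACHE_SIZE / LINE_SIZE)

    static uint32_t l1_index(uint32_t addr) {
        return (addr / LINE_SIZE) % NUM_LINES;   /* uses addr bits [11:5] */
    }
    /* Since bits [11:5] lie inside the page offset, l1_index(va) == l1_index(pa)
       for any mapping, so a line can never live at two different indexes. */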
> >>
> >>
> >> It seems like SMT could be done in theory, just it would effectively
> >> double the size of the register file (each hardware thread effectively
> >> banking all the registers).
> > <
> > The register file size is multiplied by the degree of SMT-ness.
> >
> Yeah.
>
> This seems like a possible issue with the idea.
<
Ya think ?!?
>
> Another possible issue is that a core running such an SMT configuration
> could suffer from a higher cache miss rate than a core running a single
> thread.
<
From sharing the caches! I notice the SPARC mentioned above has
individual caches for each thread.
>
> It could have a lower resource cost than dual-core, but to be similarly
> effective would likely mean roughly doubling the sizes of the L1 caches.
>
>
> A naive approach (single pipeline) would have the big drawback (besides
> the lower effective clock speed) that both threads would stall whenever
> either thread had an L1 miss. Avoiding this seems like it would require
> two separate pipelines, and a bit of additional multiplex capability.
<
You do not build multi-threaded cache-stalling pipelines.
You simply don't.
>
> In this case, it could likely make sense to only do the half-speed
> operation when both threads are running instructions, and if one thread
> is stalled on an L1 miss, then whichever is the non-stalled thread
> switches to full-speed operation.
>
> Another question would be whether the CPU tries to behave as if it has
> two independent cores, or if it is left up to the OS scheduler:
> Using SMT effectively would involve the OS scheduler setting up both threads
> at the same time, or (at its discretion) only scheduling a single thread
> (which runs at a higher effective clock speed).
>
>
> ...


Re: The Tera MTA

<017f9e4b-ab5b-471b-a4cc-11c074782713n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18883&group=comp.arch#18883

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ad4:56e4:: with SMTP id cr4mr18881399qvb.54.1626591828545;
Sun, 18 Jul 2021 00:03:48 -0700 (PDT)
X-Received: by 2002:a4a:98ce:: with SMTP id b14mr13532272ooj.69.1626591828285;
Sun, 18 Jul 2021 00:03:48 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 18 Jul 2021 00:03:48 -0700 (PDT)
In-Reply-To: <98aaf00d-0858-4a5c-bad1-1022cb3f5f9dn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:cc51:6590:e4c:98a2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:cc51:6590:e4c:98a2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me>
<98aaf00d-0858-4a5c-bad1-1022cb3f5f9dn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <017f9e4b-ab5b-471b-a4cc-11c074782713n@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 18 Jul 2021 07:03:48 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sun, 18 Jul 2021 07:03 UTC

On Saturday, July 17, 2021 at 8:45:41 PM UTC-6, MitchAlsup wrote:
> I notice the SPARC mentioned above has
> individual caches for each thread.

Surely that's because of Spectre and friends?

John Savard

Re: The Tera MTA

<sd0kkg$jcf$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18884&group=comp.arch#18884

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 09:18:40 +0200
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <sd0kkg$jcf$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
<scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
<scvp1v$dgh$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 18 Jul 2021 07:18:40 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d69c091ab0322b5f4a1fce3723a48993";
logging-data="19855"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX184lOljWZiuGrtd8yEbZlGgGzQIchK8w6Y="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:hIaN46aeOVWKZsPvCZAqBqyKL8c=
In-Reply-To: <scvp1v$dgh$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Sun, 18 Jul 2021 07:18 UTC

On 2021-07-18, BGB wrote:
> [snip]
> Another possible issue is that a core running such an SMT configuration
> could suffer from a higher cache miss rate than a core running a single
> thread.
>

I think that SMT is when you run several threads at once to make better
use of the available execution units. The Cray MTA architecture is
something else, as it does a context switch on each clock cycle.

If you look at the Cray Threadstorm CPU & variants, with 128 threads per
core, it is specifically designed for applications where there is little
or no locality in the data access pattern. Hence it is more or less
expected that each memory access will make a full round trip to DDR and
back. And the barrel processor design is optimized for hiding the
memory access latency (hence all the threads).

IIUC the Threadstorm even scrambles the memory addresses to achieve a
sort of load balancing across the different memory channels.
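A scrambler of that sort is easy to sketch; the mixing function and channel
count below are invented for illustration and are not the actual Threadstorm
scheme:

    /* Illustrative address scrambler: spread consecutive lines across channels
       so strided access patterns don't pile onto one DDR channel. */
    #include <stdint.h>

    #define NUM_CHANNELS 16u

    static unsigned channel_of(uint64_t addr) {
        uint64_t line = addr >> 6;            /* 64-byte granules */
        line ^= line >> 17;                   /* cheap bit mixing */
        line *= 0x9E3779B97F4A7C15ull;        /* multiplicative scramble */
        line ^= line >> 29;
        return (unsigned)(line % NUM_CHANNELS);
    }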

If I were to make a ray tracing processor, it would most likely be a
barrel processor. In ray tracing - especially when doing global
illumination & stochastic sampling (Monte Carlo) - you typically have
lots of unrelated memory references (acceleration tree walking, triangle
mesh and texture lookups, etc.) and hard-to-predict branches, AND you have
an almost infinitely parallel problem.

> It could have a lower resource cost than dual-core, but to be similarly
> effective would likely mean roughly doubling the sizes of the L1 caches.
>
>
> A naive approach (single pipeline) would have the big drawback (besides
> the lower effective clock speed), that both threads would stall whenever
> either thread had an L1 miss. Avoiding this seems like it would require
> two separate pipelines, and a bit of additional multiplex capability.
>
> In this case, it could likely make sense to only do the half-speed
> operation when both threads are running instructions, and if one thread
> is stalled on an L1 miss, then whichever is the non-stalled thread
> switches to full-speed operation.
>
> Another question would be whether the CPU tries to behave as if it has
> two independent cores, or if it is left up to the OS scheduler:
> Using SMT effectively involving the OS scheduler setting up both threads
> at the same time, or (at its discretion) only scheduling a single thread
> (which runs at a higher effective clock speed).
>

The Threadstorm seems to do thread management in user space at the
instruction level. There also seems to be a HW scheduler that picks
threads that are ready to execute (i.e. not waiting for a memory access
to complete).

/Marcus

Re: The Tera MTA

<05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18885&group=comp.arch#18885

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ad4:5ccc:: with SMTP id iu12mr19220428qvb.21.1626620016245;
Sun, 18 Jul 2021 07:53:36 -0700 (PDT)
X-Received: by 2002:a05:6808:1455:: with SMTP id x21mr19582047oiv.51.1626620016033;
Sun, 18 Jul 2021 07:53:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 18 Jul 2021 07:53:35 -0700 (PDT)
In-Reply-To: <sd0kkg$jcf$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:e9e9:9d8c:6208:62b2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:e9e9:9d8c:6208:62b2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me>
<sd0kkg$jcf$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 18 Jul 2021 14:53:36 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sun, 18 Jul 2021 14:53 UTC

On Sunday, July 18, 2021 at 1:18:43 AM UTC-6, Marcus wrote:

> I think that SMT is when you run several threads at once to make better
> use of the available execution units. The Cray MTA architecture is
> something else, as it does a context switch on each clock cycle.

Wouldn't saving all the registers to memory on each clock cycle
defeat the purpose of hiding memory latency?

A context switch is what happens when an interrupt or a subroutine
call happens within a single thread. When a barrel processor moves
to a new thread, it just switches to a different set of registers. Which
is exactly what SMT does. The difference between SMT and a barrel
processor is simply that SMT switches from one thread to another
randomly based on which execution units are available - although, to
ensure all threads get equal access to the CPU, I suspect there's still
a bias built in towards a barrel-like order. And SMT does switch
threads every single cycle.

John Savard

Re: The Tera MTA

<b2aefaf1-628c-4385-9149-afb72ce4aed2n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18886&group=comp.arch#18886

  copy link   Newsgroups: comp.arch
X-Received: by 2002:ac8:5f0d:: with SMTP id x13mr18436560qta.69.1626620212540;
Sun, 18 Jul 2021 07:56:52 -0700 (PDT)
X-Received: by 2002:a05:6808:d54:: with SMTP id w20mr19230969oik.175.1626620212327;
Sun, 18 Jul 2021 07:56:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 18 Jul 2021 07:56:52 -0700 (PDT)
In-Reply-To: <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:e9e9:9d8c:6208:62b2;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:e9e9:9d8c:6208:62b2
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me>
<sd0kkg$jcf$1@dont-email.me> <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b2aefaf1-628c-4385-9149-afb72ce4aed2n@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 18 Jul 2021 14:56:52 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sun, 18 Jul 2021 14:56 UTC

On Sunday, July 18, 2021 at 8:53:37 AM UTC-6, Quadibloc wrote:
> And SMT does switch
> threads every single cycle.

Except when it doesn't. After all, SMT with two threads might well
face the situation where the other thread has nothing to do, and
as well, it's often combined with out-of-order, which means that
having two different instructions from the same thread execute
in successive cycles is not impossible or even that unlikely.

But while it doesn't always switch threads, SMT is capable of
switching threads every cycle; it isn't as if it has to spend five
cycles in each thread before it can switch.

John Savard

Re: The Tera MTA

<sd1hps$2es$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18887&group=comp.arch#18887

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 08:36:26 -0700
Organization: A noiseless patient Spider
Lines: 57
Message-ID: <sd1hps$2es$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
<scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
<scvp1v$dgh$1@dont-email.me> <sd0kkg$jcf$1@dont-email.me>
<05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 18 Jul 2021 15:36:28 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="5cd92180df2ef113a2d597c640ecb986";
logging-data="2524"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/lCDkuTfUWcdnuu6267wZs8XRvEdR8TPU="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:YbpWdaKZEvu3QwlYNK5HtHL4fkY=
In-Reply-To: <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Sun, 18 Jul 2021 15:36 UTC

On 7/18/2021 7:53 AM, Quadibloc wrote:
> On Sunday, July 18, 2021 at 1:18:43 AM UTC-6, Marcus wrote:
>
>> I think that SMT is when you run several threads at once to make better
>> use of the available execution units. The Cray MTA architecture is
>> something else, as it does a context switch on each clock cycle.
>
> Wouldn't saving all the registers to memory on each clock cycle
> defeat the purpose of hiding memory latency?

Perhaps you are misunderstanding how the Tera MTA worked. Registers are
not saved to memory on every cycle. The CPU had 128 register sets, and
it switched which one to use every cycle. It could afford such a large
register file because it is still only accessing one set per instruction
and one instruction per cycle. It can afford the die space because it
had no d-cache (to avoid cache coherence scaling problems).
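In other words, the per-cycle "switch" is just selecting a different register
bank; nothing moves to or from memory. A minimal sketch of that structure in
C (sizes and names are illustrative, not the MTA's actual organization):

    /* Barrel-processor register file as one big array of per-thread banks.
       Choosing the thread for this cycle chooses the bank. */
    #include <stdint.h>

    #define NUM_STREAMS  128
    #define NUM_REGS      32

    typedef struct {
        uint64_t r[NUM_STREAMS][NUM_REGS];
    } regfile;

    static uint64_t read_reg(const regfile *rf, int stream, int reg) {
        return rf->r[stream][reg];   /* one access per cycle, indexed by stream */
    }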

> A context switch is what happens when an interrupt or a subroutine
> call happens within a single thread.

Typically, yes. Perhaps the term "context switch" isn't a good
description, but the idea is that it is doing work for a different
thread each cycle.

> When a barrel processor moves
> to a new thread, it just switches to a different set of registers.

Right.

> Which
> is exactly what SMT does.

Perhaps. See below.

> The difference between SMT and a barrel
> processor is simply that SMT switches from one thread to another
> randomly based on which execution units are available - although, to
> ensure all threads get equal access to the CPU, I suspect there's still
> a bias built in towards a barrel-like order. And SMT does switch
> threads every single cycle.

Did you leave out the word "not" after "does" above? There is no
requirement for an SMT to switch on every cycle. Most don't.

And in any event, your description is not necessarily accurate. An
SMT, if it is on a superscalar CPU, may execute instructions from
different threads in the same clock cycle. This allows, for example, an
FP-intensive thread to run simultaneously with an integer-only thread
with minimal interference (not counting cache contention).

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: The Tera MTA

<06a1b3d0-4830-46c0-b555-ae0c41ee55f5n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18889&group=comp.arch#18889

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:620a:4004:: with SMTP id h4mr20579701qko.370.1626626971421;
Sun, 18 Jul 2021 09:49:31 -0700 (PDT)
X-Received: by 2002:aca:5cd7:: with SMTP id q206mr14919962oib.99.1626626971194;
Sun, 18 Jul 2021 09:49:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 18 Jul 2021 09:49:30 -0700 (PDT)
In-Reply-To: <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:646f:e929:9775:676;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:646f:e929:9775:676
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me>
<sd0kkg$jcf$1@dont-email.me> <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <06a1b3d0-4830-46c0-b555-ae0c41ee55f5n@googlegroups.com>
Subject: Re: The Tera MTA
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 18 Jul 2021 16:49:31 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sun, 18 Jul 2021 16:49 UTC

On Sunday, July 18, 2021 at 9:53:37 AM UTC-5, Quadibloc wrote:
> On Sunday, July 18, 2021 at 1:18:43 AM UTC-6, Marcus wrote:
>
> > I think that SMT is when you run several threads at once to make better
> > use of the available execution units. The Cray MTA architecture is
> > something else, as it does a context switch on each clock cycle.
<
> Wouldn't saving all the registers to memory on each clock cycle
> defeat the purpose of hiding memory latency?
<
It has room for all threads in the register file, so no saving is necessary.
>
> A context switch is what happens when an interrupt or a subroutine
> call happens within a single thread. When a barrel processor moves
> to a new thread, it just switches to a different set of registers. Which
> is exactly what SMT does. The difference between SMT and a barrel
> processor is simply that SMT switches from one thread to another
> randomly based on which execution units are available - although, to
> ensure all threads get equal access to the CPU, I suspect there's still
> a bias built in towards a barrel-like order. And SMT does switch
> threads every single cycle.
<
As long as there is no saving of registers to "context switch", you are
correct: the barrel is an ordered list, while with SMT anyone who can run
can get picked to run. HEP was the latter rather than the former.
>
> John Savard

Re: The Tera MTA

<2021Jul18.184756@mips.complang.tuwien.ac.at>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18891&group=comp.arch#18891

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 16:47:56 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 36
Message-ID: <2021Jul18.184756@mips.complang.tuwien.ac.at>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com> <scuq5c$h7h$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="d131848ce95877394fb656ffd8f210e9";
logging-data="2245"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+cp5RhHU4XGz2zVSfscOkp"
Cancel-Lock: sha1:QqTvIiYkm/tbu9Shl5B2dBqfJlk=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 18 Jul 2021 16:47 UTC

Tom Gardner <spamjunk@blueyonder.co.uk> writes:
>> In any case, was this a valid approach, or is there a good reason why
>> it is no longer viable?
>
>I don't know whether a one word answer is correct or
>sufficient: "Oracle".

It seems to me that Oracle was quite patient with SPARC, which had
been losing ground to Intel and AMD for a long time. The SPARC T4,
T5, M7, M8 may technically have been competitive (I have not used one
of them, but at least the clock rates appear competitive), but it
seems that in the marketplace they were too late to save SPARC, so
Oracle finally took the decision that they (and earlier Sun) also
could have taken earlier.

We have an UltraSparc T1, which is so slow for single threads (even
without slowdown from competing threads) that the fact that it has 8
cores with 4 threads each does not make up for it. E.g., on Gforth 0.7.0 I
measured:

 sieve  bubble  matrix    fib
 2.114   2.665   1.494  1.912  UltraSparc T1 1GHz; gcc-4.0.2
 0.176   0.244   0.100  0.308  Athlon 64 X2 4400+; gcc-4.0.4
 12      10.9    14.9   6.2    factor

So the UltraSparc T1 is 6.2-14.9 times slower, but has only 4 times
more cores than the Athlon 64 X2. Does the multi-threading make up
for that? I doubt it. No SPEC CPU2000 or CPU2006 rate results were
submitted with the UltraSparc T1, so apparently the performance was
nothing to write home about. CPU2006 results were submitted for the
UltraSparc T2, however.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: The Tera MTA

<2021Jul18.194522@mips.complang.tuwien.ac.at>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18892&group=comp.arch#18892

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 17:45:22 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 14
Message-ID: <2021Jul18.194522@mips.complang.tuwien.ac.at>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com> <scuq5c$h7h$1@dont-email.me> <0afc2a0a-b3e6-4525-9fa3-a33bc287a6d5n@googlegroups.com> <51548316-5543-41d9-8ec6-822b861ea6afn@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="d131848ce95877394fb656ffd8f210e9";
logging-data="2245"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QV4AU2qKDbxApg2IUGDVL"
Cancel-Lock: sha1:JYRkx3oBEhUfxmBo9chW8CI6Dws=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 18 Jul 2021 17:45 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>A quick web search shows that Oracle is still selling
>SPARC-based servers, based on the M8 processor,
>which has 32 cores and 256 threads, so 8 threads per
>core, so apparently the approach hasn't been given up
>yet.

They are still selling the M8, but canceled further development as soon
as the M8 was complete.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: The Tera MTA

<2021Jul18.194647@mips.complang.tuwien.ac.at>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18893&group=comp.arch#18893

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 17:46:47 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 18
Message-ID: <2021Jul18.194647@mips.complang.tuwien.ac.at>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com> <127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me> <1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me> <98aaf00d-0858-4a5c-bad1-1022cb3f5f9dn@googlegroups.com> <017f9e4b-ab5b-471b-a4cc-11c074782713n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="d131848ce95877394fb656ffd8f210e9";
logging-data="2245"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX183Kgud+08p3fYDzeG1ngmg"
Cancel-Lock: sha1:rKvHpqBtyxuf1UjR3p+V3F1MccU=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 18 Jul 2021 17:46 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>On Saturday, July 17, 2021 at 8:45:41 PM UTC-6, MitchAlsup wrote:
>> I notice the SPARC mentioned above has
>> individual caches for each thread.
>
>Surely that's because of Spectre and friends?

Spectre was only publicly revealed in 2018; the last Oracle SPARC
was released in 2017. Other cache side-channel attacks: possible, but
unlikely.

My guess is that the reason is that accessing only one slice of a
largish cache takes less power than accessing the whole cache.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: The Tera MTA

<sd1rss$30l$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18894&group=comp.arch#18894

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 13:28:41 -0500
Organization: A noiseless patient Spider
Lines: 180
Message-ID: <sd1rss$30l$1@dont-email.me>
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com>
<scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com>
<scvp1v$dgh$1@dont-email.me> <sd0kkg$jcf$1@dont-email.me>
<05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
<sd1hps$2es$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 18 Jul 2021 18:28:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="79d1ca3378643581b91c4335e953a407";
logging-data="3093"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+bG5hbipi9gMVAhEV8JsFH"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:IBE93KsbvhaKLa+DQn2VvJnRI+w=
In-Reply-To: <sd1hps$2es$1@dont-email.me>
Content-Language: en-US
 by: BGB - Sun, 18 Jul 2021 18:28 UTC

On 7/18/2021 10:36 AM, Stephen Fuld wrote:
> On 7/18/2021 7:53 AM, Quadibloc wrote:
>> On Sunday, July 18, 2021 at 1:18:43 AM UTC-6, Marcus wrote:
>>
>>> I think that SMT is when you run several threads at once to make better
>>> use of the available execution units. The Cray MTA architecture is
>>> something else, as it does a context switch on each clock cycle.
>>
>> Wouldn't saving all the registers to memory on each clock cycle
>> defeat the purpose of hiding memory latency?
>
> Perhaps you are misunderstanding how the Tera MTA worked.  Registers are
> not saved to memory on every cycle.  The CPU had 128 register sets, and
> it switched which one to use every cycle.  It could afford such a large
> register file because it is still only accessing one set per instruction
> and one instruction per cycle.  It can afford the die space because it
> had no d-cache (to avoid cache coherence scaling problems).
>

For a larger number of threads, possibly the register file would become
more like a specialized L1 cache.

Otherwise, for the class of FPGA I am targeting, a register file large
enough to handle this many threads would be implausible (roughly
comparable in size to the L2 cache).

But, say, if I did dual-thread, or possibly had the dual-thread mode
split the current 64 GPRs into 2 sets of 32 GPRs, it would be more viable.

>
>> A context switch is what happens when an interrupt or a subroutine
>> call happens within a single thread.
>
> Typically, yes.  Perhaps the term "context switch" isn't a good
> description, but the idea is that it is doing work for a different
> thread each cycle.
>

It is likely there would be SMT running multiple threads in hardware,
and then a more traditional scheduler which saves off the current
thread(s) and loads up new thread(s).

It could use some checks to see whether the threads could be safely run
in parallel, say for example, if both are required to be running in 32
GPR mode, both exist within the same virtual address space, ...

It is also likely that interrupts would immediately switch from SMT mode
to Scalar mode, so from the POV of the scheduler:
SP-1 is at R15, SP-2 at R47;
There would be PC-1/2, GBR-1/2, TBR-1/2, ...

Registers like TTB, MMCR, ... could be treated as shared state.

>
>> When a barrel processor moves
>> to a new thread, it just switches to a different set of registers.
>
> Right.
>
>> Which
>> is exactly what SMT does.
>
> Perhaps.  See below.
>
>
>> The difference between SMT and a barrel
>> processor is simply that SMT switches from one thread to another
>> randomly based on which execution units are available - although, to
>> ensure all threads get equal access to the CPU, I suspect there's still
>> a bias built in towards a barrel-like order. And SMT does switch
>> threads every single cycle.
>
> Did you leave out the word "not" after "does" above?  There is no
> requirement for an SMT to switch on every cycle.  Most don't.
>
> And in any event, your description is not necessarily accurate.  And
> SMT, if it is on a superscaler CPU may execute instructions from
> different threads in the same clock cycle.  This allows, for example, an
> FP intensive thread to run simultaneously with an integer only thread
> with minimal interference (not counting cache contention)
>

This is also possible.

Handling both threads in the same cycle would, at least in my case,
require adding multiple ports to the L1 caches.

For the I$, it would likely make more sense in this case to have a
bigger I$ that effectively splits in half in SMT mode (behaving like two
separate L1 caches), but in Scalar mode, behaves like a single 2-way
set-associative cache.

For the D$, this would likely mean implementing support for a full
dual-port mode, and/or using split merge. The full dual-port mode would
probably effectively just be two L1's glued together, probably with a
"bridge" which could allow for cache-consistency checking, and the
caches would be able to function as a set-associative cache when only
one port is in use.

Though, another option for dual-port is to split the L1 into two small
L0 caches, which use a shared L1 array. I had considered such a design
to allow using a larger L1 at higher clock speeds, but didn't do so
because in current use it would perform worse than the existing cache.

If both of these were done, then two threads could potentially both
execute in parallel. This would make more sense with a dual-pipeline
design though.

Likely, each pipeline would have its own ALU-1, though possibly
Thread-2's Lane-2 ALU borrows the Lane-3 ALU.
Thread-1: ALU1A, ALU-2
Thread-2: ALU2A, ALU-3

If either thread needs to use all 3 ALU's (rare), it may trigger an
interlock with the other pipeline.

The other option being that each thread has dedicated ALUs for Lanes 1
and 2, but the Lane-3 ALU is shared.

Possibly the FPU would also be shared with a similar scheme.

It is possible that threads could also execute two parallel memory
requests (using both ports), though in SMT mode, this would likely
require an interlock if the other thread also tries to access memory.

Another option is to do it like the GPRs, and using dual-port memory
access disallows running the thread in SMT mode. I would likely need to
add some way to flag this via the PE/COFF / PEL headers.

It is possible, though, that in single-threaded mode the pipeline could
be widened (to 5 or 6 lanes) by ganging both pipelines. The 5-lane case
assumes that given 3-wide bundles are infrequent, the SMT pipeline
operates as 2x 2-wide pipelines which share Lane 3.

So, say:
SMT-Capable Thread (WEX3W):
R0..R31, 3 lanes, 1 memory port, ...
Single-Issue Thread (WEX5W):
R0..R63, 5 lanes, 2 memory ports, ...

So, the 5W configuration is:
Lane1A, Lane1B, Lane2A, Lane2B, Lane3

Or, a 6W configuration as:
Lane1A, Lane1B, Lane2A, Lane2B, Lane3A, Lane3B

Though, all this does seem a fair bit complicated and kinda expensive.

Going this route would likely mean a single core eats pretty much an
entire XC7A100T. All this is drastic enough that I would probably
also fork the core for this.

A much simpler option would be instead to do SMT by having an internal
state machine, say:
00: Single-Thread Mode (Baseline)
01: Single-Thread Mode (ISR, Return to SMT on RTE)
10: SMT, Thread-1
11: SMT, Thread-2

This state machine just sort of rapidly switches from one thread to
another within a single pipeline (and the two threads are separated
mostly by using special case logic in the instruction decoder and similar).
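A rough C model of that state machine, just to pin down the intent (the
encodings follow the list above; the transition details are assumptions):

    /* Toy model of the 2-bit mode/thread-select state described above. */
    enum smt_state {
        ST_SINGLE      = 0x0,  /* single-thread mode (baseline) */
        ST_SINGLE_ISR  = 0x1,  /* single-thread mode (ISR, return to SMT on RTE) */
        ST_SMT_THREAD1 = 0x2,  /* SMT, thread 1 issues this cycle */
        ST_SMT_THREAD2 = 0x3   /* SMT, thread 2 issues this cycle */
    };

    static enum smt_state next_state(enum smt_state s, int irq, int rte) {
        if (irq && (s == ST_SMT_THREAD1 || s == ST_SMT_THREAD2))
            return ST_SINGLE_ISR;                      /* interrupts drop to scalar */
        if (s == ST_SINGLE_ISR && rte) return ST_SMT_THREAD1;  /* resume SMT */
        if (s == ST_SMT_THREAD1) return ST_SMT_THREAD2;        /* alternate threads */
        if (s == ST_SMT_THREAD2) return ST_SMT_THREAD1;
        return s;
    }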

However, this simpler approach is not likely to offer much of any real
performance advantage over single-threaded operation.

Workloads like Doom might do OK, but my testing (modeling behavior via
emulation) implies that the simple approach would interact very poorly
with things like my GL rasterizer.

....

Re: The Tera MTA

<memo.20210718212312.10680F@jgd.cix.co.uk>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18896&group=comp.arch#18896

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: The Tera MTA
Date: Sun, 18 Jul 2021 21:23 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <memo.20210718212312.10680F@jgd.cix.co.uk>
References: <2021Jul18.184756@mips.complang.tuwien.ac.at>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="df4caffcb0baaaba70013c0f58a74a08";
logging-data="20460"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19hXYtfC4/4MMmDCQbaRweIJKgwfoIToBA="
Cancel-Lock: sha1:xM63kUgrEnC9P1qCLHPkrKPDaYc=
 by: John Dallman - Sun, 18 Jul 2021 20:23 UTC

In article <2021Jul18.184756@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> It seems to me that Oracle was quite patient with SPARC, which had
> been losing ground to Intel and AMD for a long time. The SPARC T4,
> T5, M7, M8 may technically have been competetive (I have not used
> one of them, but at least the clock rates appear competetive), but it
> seems that in the marketplace they were too late to save SPARC, so
> Oracle finally took the decision that they (and earlier Sun) also
> could have taken earlier.

Sun and Oracle were unwilling or unable to invest in development on the
scale needed to compete with Intel and AMD. They didn't even stay
approximately one generation behind, as POWER has managed.

It seems that Oracle's plan in buying Sun was vertical integration:
selling Oracle database and hardware specifically tuned to run it quickly.
However, it turned out that the tuning had largely already been done, and
there was no low-hanging fruit available.

> We have an UltraSparc T1, which is so slow for single threads (even
> without slowdown from competing threads) that the fact that it has 8
> cores with 4 threads does not make up for it.

The T1 seems to have been designed as a specialised web-serving processor,
when that job required a lot less horsepower than it does today (it dates
from 2005). It wasn't much use for anything else.

I have access to a machine with SPARC M7 cores at 4.27GHz. An Apple Mac
Mini M1 at 3.2GHz has about 2.5 times its single-thread performance with
the mathematical modeller I work on.

John

Re: The Tera MTA

<54348333-d4b7-482e-9804-fda9777808a7n@googlegroups.com>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=18897&group=comp.arch#18897

  copy link   Newsgroups: comp.arch
X-Received: by 2002:a05:622a:34c:: with SMTP id r12mr19189731qtw.196.1626641905075;
Sun, 18 Jul 2021 13:58:25 -0700 (PDT)
X-Received: by 2002:a4a:d781:: with SMTP id c1mr15354824oou.23.1626641904849;
Sun, 18 Jul 2021 13:58:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 18 Jul 2021 13:58:24 -0700 (PDT)
In-Reply-To: <06a1b3d0-4830-46c0-b555-ae0c41ee55f5n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:6437:2f3c:fd50:a217;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:6437:2f3c:fd50:a217
References: <d8e86c4a-44a4-4db2-b92b-ddd9c966b9fdn@googlegroups.com>
<127e7a94-d498-4467-8834-c6b639515c3bn@googlegroups.com> <scv1hp$gh9$1@dont-email.me>
<1b43802a-3438-4aa8-bdc4-0b86f976eda9n@googlegroups.com> <scvp1v$dgh$1@dont-email.me>
<sd0kkg$jcf$1@dont-email.me> <05e6b305-5cae-4419-b33b-52b5c080621fn@googlegroups.com>
<06a1b3d0-4830-46c0-b555-ae0c41ee55f5n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <54348333-d4b7-482e-9804-fda9777808a7n@googlegroups.com>
Subject: Re: The Tera MTA
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 18 Jul 2021 20:58:25 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Sun, 18 Jul 2021 20:58 UTC

On Sunday, July 18, 2021 at 10:49:32 AM UTC-6, MitchAlsup wrote:
> On Sunday, July 18, 2021 at 9:53:37 AM UTC-5, Quadibloc wrote:

> > Wouldn't saving all the registers to memory on each clock cycle
> > defeat the purpose of hiding memory latency?

> It has room for all threads in the register file, so no saving is necessary.

Yes, which means that whatever it's doing every cycle isn't a
"context switch" in the sense with which I am familiar.

John Savard
