devel / comp.arch / The Computer of the Future

Subject (Author)
* The Computer of the Future (Quadibloc)
+* Re: The Computer of the Future (MitchAlsup)
|+- Re: The Computer of the Future (Brett)
|+* Re: The Computer of the Future (Quadibloc)
||`* Re: The Computer of the Future (MitchAlsup)
|| `- Re: The Computer of the Future (Scott Smader)
|`* Re: The Computer of the Future (Quadibloc)
| `* Re: The Computer of the Future (MitchAlsup)
|  +* Re: The Computer of the Future (Quadibloc)
|  |+* Re: The Computer of the Future (Quadibloc)
|  ||+* Re: The Computer of the Future (Quadibloc)
|  |||`* Re: The Computer of the Future (Quadibloc)
|  ||| `- Re: The Computer of the Future (Quadibloc)
|  ||`* Re: The Computer of the Future (MitchAlsup)
|  || +* Re: The Computer of the Future (Quadibloc)
|  || |`* Re: The Computer of the Future (Stefan Monnier)
|  || | `- Re: The Computer of the Future (MitchAlsup)
|  || `* Re: The Computer of the Future (Ivan Godard)
|  ||  +- Re: The Computer of the Future (Quadibloc)
|  ||  +* Re: The Computer of the Future (Tim Rentsch)
|  ||  |`* Re: The Computer of the Future (Terje Mathisen)
|  ||  | `* Re: The Computer of the Future (Tim Rentsch)
|  ||  |  `* Re: The Computer of the Future (Terje Mathisen)
|  ||  |   `* Re: The Computer of the Future (Tim Rentsch)
|  ||  |    `* Re: The Computer of the Future (Terje Mathisen)
|  ||  |     +* Re: The Computer of the Future (Tim Rentsch)
|  ||  |     |`* Re: The Computer of the Future (Terje Mathisen)
|  ||  |     | +* Re: The Computer of the Future (David Brown)
|  ||  |     | |+* Re: The Computer of the Future (Michael S)
|  ||  |     | ||`- Re: The Computer of the Future (Tim Rentsch)
|  ||  |     | |`* Re: The Computer of the Future (Ivan Godard)
|  ||  |     | | +- Re: The Computer of the Future (David Brown)
|  ||  |     | | +* Re: The Computer of the Future (BGB)
|  ||  |     | | |`* Re: The Computer of the Future (Terje Mathisen)
|  ||  |     | | | +* Re: The Computer of the Future (David Brown)
|  ||  |     | | | |`* Re: The Computer of the Future (Niklas Holsti)
|  ||  |     | | | | +- Re: The Computer of the Future (David Brown)
|  ||  |     | | | | `* Re: The Computer of the Future (Terje Mathisen)
|  ||  |     | | | |  `* Re: The Computer of the Future (Thomas Koenig)
|  ||  |     | | | |   `* Re: The Computer of the Future (David Brown)
|  ||  |     | | | |    +* Re: The Computer of the Future (Thomas Koenig)
|  ||  |     | | | |    |`- Re: The Computer of the Future (David Brown)
|  ||  |     | | | |    +* Re: The Computer of the Future (Niklas Holsti)
|  ||  |     | | | |    |`- Re: The Computer of the Future (David Brown)
|  ||  |     | | | |    `* Re: The Computer of the Future (Michael S)
|  ||  |     | | | |     +* Re: The Computer of the Future (David Brown)
|  ||  |     | | | |     |`* Re: The Computer of the Future (MitchAlsup)
|  ||  |     | | | |     | `* Re: The Computer of the Future (David Brown)
|  ||  |     | | | |     |  +- Re: The Computer of the Future (Thomas Koenig)
|  ||  |     | | | |     |  `- Re: The Computer of the Future (Quadibloc)
|  ||  |     | | | |     +- Re: The Computer of the Future (Quadibloc)
|  ||  |     | | | |     `- Re: The Computer of the Future (Quadibloc)
|  ||  |     | | | `* Re: The Computer of the Future (BGB)
|  ||  |     | | |  `* Re: The Computer of the Future (MitchAlsup)
|  ||  |     | | |   `- Re: The Computer of the Future (BGB)
|  ||  |     | | `* Re: The Computer of the Future (Quadibloc)
|  ||  |     | |  `- Re: The Computer of the Future (Ivan Godard)
|  ||  |     | +* Re: The Computer of the Future (Tim Rentsch)
|  ||  |     | |`* Re: The Computer of the Future (Terje Mathisen)
|  ||  |     | | `- Re: The Computer of the Future (BGB)
|  ||  |     | `- Re: The Computer of the Future (Quadibloc)
|  ||  |     `* Re: The Computer of the Future (Bill Findlay)
|  ||  |      +- Re: The Computer of the Future (Terje Mathisen)
|  ||  |      +* Re: The Computer of the Future (Michael S)
|  ||  |      |`* Re: The Computer of the Future (Michael S)
|  ||  |      | `* Re: The Computer of the Future (David Brown)
|  ||  |      |  +- Re: The Computer of the Future (Michael S)
|  ||  |      |  +* Re: The Computer of the Future (Tom Gardner)
|  ||  |      |  |+- Re: The Computer of the Future (Thomas Koenig)
|  ||  |      |  |`* Re: The Computer of the Future (David Brown)
|  ||  |      |  | `* Re: The Computer of the Future (Tom Gardner)
|  ||  |      |  |  `* Re: The Computer of the Future (David Brown)
|  ||  |      |  |   `* Re: The Computer of the Future (Niklas Holsti)
|  ||  |      |  |    `- Re: The Computer of the Future (David Brown)
|  ||  |      |  `- Re: The Computer of the Future (Andy Valencia)
|  ||  |      +* Re: The Computer of the Future (Tim Rentsch)
|  ||  |      |`- Re: The Computer of the Future (Bill Findlay)
|  ||  |      `* Re: The Computer of the Future (Quadibloc)
|  ||  |       `* Re: The Computer of the Future (Quadibloc)
|  ||  |        `* Re: The Computer of the Future (Quadibloc)
|  ||  |         `- Re: The Computer of the Future (Quadibloc)
|  ||  +* Re: The Computer of the Future (John Levine)
|  ||  |+- Re: The Computer of the Future (Anton Ertl)
|  ||  |`* Re: The Computer of the Future (Quadibloc)
|  ||  | +* Re: The Computer of the Future (Thomas Koenig)
|  ||  | |+- Re: The Computer of the Future (JimBrakefield)
|  ||  | |+* Re: The Computer of the Future (Quadibloc)
|  ||  | ||`* Re: The Computer of the Future (BGB)
|  ||  | || `* Re: The Computer of the Future (MitchAlsup)
|  ||  | ||  `* Re: The Computer of the Future (BGB)
|  ||  | ||   `- Re: The Computer of the Future (MitchAlsup)
|  ||  | |`* Re: The Computer of the Future (Terje Mathisen)
|  ||  | | +* Re: The Computer of the Future (Stephen Fuld)
|  ||  | | |+* FPGAs (was: The Computer of the Future) (Anton Ertl)
|  ||  | | ||+- Re: FPGAs (was: The Computer of the Future) (BGB)
|  ||  | | ||+* Re: FPGAs (was: The Computer of the Future) (JimBrakefield)
|  ||  | | |||`* Re: FPGAs (was: The Computer of the Future) (Michael S)
|  ||  | | ||| `* Re: FPGAs (was: The Computer of the Future) (JimBrakefield)
|  ||  | | |||  `* Re: FPGAs (was: The Computer of the Future) (Michael S)
|  ||  | | |||   +- Re: FPGAs (was: The Computer of the Future) (BGB)
|  ||  | | |||   +* Re: FPGAs (was: The Computer of the Future) (MitchAlsup)
|  ||  | | |||   +* Re: FPGAs (Terje Mathisen)
|  ||  | | |||   `* Re: FPGAs (was: The Computer of the Future) (Quadibloc)
|  ||  | | ||+- Re: FPGAs (was: The Computer of the Future) (Michael S)
|  ||  | | ||`* Re: FPGAs (was: The Computer of the Future) (MitchAlsup)
|  ||  | | |`- Re: The Computer of the Future (Terje Mathisen)
|  ||  | | `- Re: The Computer of the Future (Thomas Koenig)
|  ||  | +- Re: The Computer of the Future (Brian G. Lucas)
|  ||  | +- Re: The Computer of the Future (Ivan Godard)
|  ||  | `- Re: The Computer of the Future (MitchAlsup)
|  ||  `* Re: The Computer of the Future (Tom Gardner)
|  |`* Re: The Computer of the Future (MitchAlsup)
|  `* Re: The Computer of the Future (Ivan Godard)
+* Re: The Computer of the Future (Thomas Koenig)
`- Re: The Computer of the Future (JimBrakefield)

The Computer of the Future

<b94f72eb-d747-47a4-85cd-d4c351cfcc5fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23484&group=comp.arch#23484

 by: Quadibloc - Tue, 15 Feb 2022 10:48 UTC

I noticed that in another thread I might have seemed to have contradicted myself.

So I will clarify.

In the near term, in two or three years, I think that it's entirely possible that we
will have dies that combine four "GBOoO" performance cores with sixteen in-
order efficiency cores, and chips that have four of those dies in a package, to
give good performance on both number-crunching and database workloads.

In the _longer_ term, though, when Moore's Law finally runs out of steam,
and chips are even bigger... instead of putting more than 64 or 128 cores
on a chip, if that becomes possible with the most advanced silicon
process attainable... I think the number of cores will top out, and instead
of having 4,096 in-order cores on a chip, we'll see perhaps 64 out-of-order
cores (of modest size) instead for the efficiency core contingent.

That's because eventually they'll bump into constraints with memory
bandwidth, but there's still some headroom left.

I do think that eventually a Cray-like vector architecture is something that
should be considered as we look for ways to make chips more powerful.
After all, we've gone from MMX all the way to AVX-512 on the one hand,
and on the other hand, efforts have been made to make GPU computing
more versatile and flexible.

Today, some Intel chips slow down their clock rates when doing AVX-512
operations. This reduces, but does not eliminate, the performance
increase in going from AVX-256 to AVX-512.

What I'm thinking a chip of the future, aimed at the ultimate in high
performance might do is this:

It would have vector instructions similar to those of the Cray-I or its
successors.

These instructions would use floating point ALUs that run more slowly
than the regular main floating-point ALU on the chip, which are organized
into something that _somewhat_ resembles a GPU (but not exactly, so as
to be versatile and flexible enough to handle everything a vector
supercomputer can do).

To avoid the problem current Intel chips have of E-cores that can't handle
AVX-512 instructions, I think it might be sensible to take one leaf out of
Bulldozer.

Let's have a core complex that looks like this:

One performance core.
One long (Cray-like) vector unit.
Four to eight efficiency cores.

So if you _have_ vector instructions in your cores, when you switch from
the performance cores to the efficiency cores, you just leave the vector
unit turned on (instead of trying to copy the contents of its big vector
registers).
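
As a minimal sketch, that complex can be expressed as a configuration record, with the vector unit owned by the complex rather than by any one core; all names and counts here are illustrative assumptions, not a real product's:

    #include <stdio.h>

    /* Hypothetical core complex: the long vector unit is shared by the
     * whole complex, so moving a thread from the performance core to an
     * efficiency core does not copy the big vector registers anywhere. */
    struct core_complex {
        int performance_cores;   /* 1 "GBOoO" core                */
        int efficiency_cores;    /* 4 to 8 in-order cores         */
        int shared_vector_units; /* 1 long (Cray-like) unit       */
        int vector_elements;     /* e.g. 64 elements per register */
    };

    int main(void) {
        struct core_complex cc = { 1, 8, 1, 64 };
        printf("P=%d E=%d V=%d, %d-element vector registers (shared)\n",
               cc.performance_cores, cc.efficiency_cores,
               cc.shared_vector_units, cc.vector_elements);
        return 0;
    }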

John Savard

Re: The Computer of the Future

<858e7a72-fcc0-4bab-b087-28b9995c7094n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23497&group=comp.arch#23497

 by: MitchAlsup - Tue, 15 Feb 2022 15:42 UTC

On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
> I noticed that in another thread I might have seemed to have contradicted myself.
>
> So I will clarify.
>
> In the near term, in two or three years, I think that it's entirely possible that we
> will have dies that combine four "GBOoO" performance cores with sixteen in-
> order efficiency cores, and chips that have four of those dies in a package, to
> give good performance on both number-crunching and database workloads.
>
> In the _longer_ term, though, when Moore's Law finally runs out of steam,
> and chips are even bigger... instead of putting more than 64 or 128 cores
> on a chip, if that becomes possible with the most advanced silicon
> process attainable... I think the number of cores will top out, and instead
> of having 4,096 in-order cores on a chip, we'll see perhaps 64 out-of-order
> cores (of modest size) instead for the efficiency core contingent.
>
> That's because eventually they'll bump into constraints with memory
> bandwidth, but there's still some headroom left.
>
> I do think that eventually a Cray-like vector architecture is something that
> should be considered as we look for ways to make chips more powerful.
<
CRAYs were designed to consume memory bandwidth, something you said
will top out in the paragraph above: to consume as much memory bandwidth
as someone can afford to build, and to consume that bandwidth while
tolerating the ever-growing latency measured in cycles.
<
> After all, we've gone from MMX all the way to AVX-512 on the one hand,
These are not CRAY like.
> and on the other hand, efforts have been made to make GPU computing
> more versatile and flexible.
These are neither CRAY-like nor MMX/AVX-like,
and they grossly morph from generation to generation.
>
> Today, some Intel chips slow down their clock rates when doing AVX-512
> operations. This reduces, but does not eliminate, the performance
> increase in going from AVX-256 to AVX-512.
>
> What I'm thinking a chip of the future, aimed at the ultimate in high
> performance might do is this:
>
> It would have vector instructions similar to those of the Cray-I or its
> successors.
<
VVM provides everything CRAYs do, and allows the HW to organize itself
AVX fashion all from a scalar instruction set, and without dragging ever
larger register files around.
>
> These instructions would use floating point ALUs that run more slowly
> than the regular main floating-point ALU on the chip, which are organized
> into something that _somewhat_ resembles a GPU (but not exactly, so as
> to be versatile and flexible enough to handle everything a vector
> supercomputer can do).
<
You simply FAIL to understand the model of the GPU. They are not:: really
wide SIMD, they are really wide SIMT, almost as if thousands of "threads"
ran on a barrel scheduler. On cycle[k] they perform 32 instructions for
threads[m..m+31], on cycle[k+1] they perform 32 instructions from
threads[x..x+31]. Threads[m] and threads[x] can be from entirely different
"draw calls", running different code under different MMU tables, ...
All this "different" is there to tolerate the latency of memory. Your typical
LD may take 100 trips around the barrel. So you need other threads to
operate while waiting.
<
GPUs satisfy high IPC only on embarrassingly parallel calculation patterns;
patterns which do not contain branches.
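
A toy model of that barrel, to make the numbers concrete (warp count, width, and instruction mix here are illustrative assumptions; the 100-cycle load is from the post): with enough resident warps, the issue slot rarely waits on memory.

    #include <stdio.h>

    #define WARPS   64     /* resident warps (illustrative)            */
    #define WIDTH   32     /* threads issued together per cycle        */
    #define MEMLAT  100    /* trips around the barrel for one LD       */
    #define INSTRS  10     /* instructions per warp; one is a load     */

    int main(void) {
        int ready_at[WARPS] = {0};  /* cycle when each warp may issue  */
        int pc[WARPS] = {0};
        long cycles = 0, issued = 0, done = 0;

        while (done < WARPS) {
            int progressed = 0;
            for (int w = 0; w < WARPS && !progressed; w++) {
                if (pc[w] >= INSTRS || ready_at[w] > cycles) continue;
                /* pretend instruction 3 is the 100-cycle load */
                ready_at[w] = (pc[w] == 3) ? cycles + MEMLAT : cycles + 1;
                if (++pc[w] == INSTRS) done++;
                issued += WIDTH;
                progressed = 1;     /* one warp issues per cycle       */
            }
            cycles++;
        }
        printf("%ld thread-instructions in %ld cycles (%.1f per cycle)\n",
               issued, cycles, (double)issued / cycles);
        return 0;
    }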
>
> To avoid the problem current Intel chips have of E-cores that can't handle
> AVX-512 instructions, I think it might be sensible to take one leaf out of
> Bulldozer.
>
> Let's have a core complex that looks like this:
>
> One performance core.
> One long (Cray-like) vector unit.
> Four to eight efficiency cores.
>
> So if you _have_ vector instructions in your cores, when you switch from
> the performance cores to the efficiency cores, you just leave the vector
> unit turned on (instead of trying to copy the contents of its big vector
> registers).
>
> John Savard

Re: The Computer of the Future

<sugsth$u5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23518&group=comp.arch#23518

 by: Brett - Tue, 15 Feb 2022 18:54 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
>> I noticed that in another thread I might have seemed to have contradicted myself.
>>
>> So I will clarify.
>>
>> In the near term, in two or three years, I think that it's entirely possible that we
>> will have dies that combine four "GBOoO" performance cores with sixteen in-
>> order efficiency cores, and chips that have four of those dies in a package, to
>> give good performance on both number-crunching and database workloads.
>>
>> In the _longer_ term, though, when Moore's Law finally runs out of steam,
>> and chips are even bigger... instead of putting more than 64 or 128 cores
>> on a chip, if that becomes possible with the most advanced silicon
>> process attainable... I think the number of cores will top out, and instead
>> of having 4,096 in-order cores on a chip, we'll see perhaps 64 out-of-order
>> cores (of modest size) instead for the efficiency core contingent.
>>
>> That's because eventually they'll bump into constraints with memory
>> bandwidth, but there's still some headroom left.
>>
>> I do think that eventually a Cray-like vector architecture is something that
>> should be considered as we look for ways to make chips more powerful.
> <
> CRAYs were designed to consume memory bandwidth, something you said
> will top out in the paragraph above. To consume as much memory bandwidth
> as someone can afford to build. Consume this Bandwidth while tolerating
> the ever growing latency measured in cycles..
> <
>> After all, we've gone from MMX all the way to AVX-512 on the one hand,
> These are not CRAY like.
>> and on the other hand, efforts have been made to make GPU computing
>> more versatile and flexible.
> These are neither CRAY like, nor MMX-AVX like.
> And grossly morph generation to generation.
>>
>> Today, some Intel chips slow down their clock rates when doing AVX-512
>> operations. This reduces, but does not eliminate, the performance
>> increase in going from AVX-256 to AVX-512.
>>
>> What I'm thinking a chip of the future, aimed at the ultimate in high
>> performance might do is this:
>>
>> It would have vector instructions similar to those of the Cray-I or its
>> successors.
> <
> VVM provides everything CRAYs do, and allows the HW to organize itself
> AVX fashion all from a scalar instruction set, and without dragging ever
> larger register files around.
>>
>> These instructions would use floating point ALUs that run more slowly
>> than the regular main floating-point ALU on the chip, which are organized
>> into something that _somewhat_ resembles a GPU (but not exactly, so as
>> to be versatile and flexible enough to handle everything a vector
>> supercomputer can do).
> <
> You simply FAIL to understand the model of the GPU. They are not:: really
> wide SIMD, they are really wide SIMT almost as if thousands of "threads"
> on a barrel scheduler. On cycle[k] they perform 32 instructions for
> threads[m..m+31], on cycle[k+1] they perform 32 instructions from
> threads[x..x+31]. Threads[k] and thread[x] can be from entirely different
> "draw calls" with running different code under different MMU tables,...
> All this "different" is there to tolerate the latency of memory. Your typical
> LD may take 100 trips around the barrel. So you need other threads to
> operate while waiting.
> <
> GPUs satisfy high IPC only on embarrassingly parallel calculation patterns;
> patterns which do not contain branches.

Branches used to matter, but now the herd of chickens is so big that memory
bandwidth is the limit.

>> To avoid the problem current Intel chips have of E-cores that can't handle
>> AVX-512 instructions, I think it might be sensible to take one leaf out of
>> Bulldozer.
>>
>> Let's have a core complex that looks like this:
>>
>> One performance core.
>> One long (Cray-like) vector unit.
>> Four to eight efficiency cores.
>>
>> So if you _have_ vector instructions in your cores, when you switch from
>> the performance cores to the efficiency cores, you just leave the vector
>> unit turned on (instead of trying to copy the contents of its big vector
>> registers).
>>
>> John Savard
>

Re: The Computer of the Future

<7ab532a6-9110-46dc-8e41-a28babbf2bc1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23523&group=comp.arch#23523

 by: Quadibloc - Tue, 15 Feb 2022 20:07 UTC

On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:

> CRAYs were designed to consume memory bandwidth, something you said
> will top out in the paragraph above. To consume as much memory bandwidth
> as someone can afford to build. Consume this Bandwidth while tolerating
> the ever growing latency measured in cycles..

This puzzles me. The CRAY-I architecture worked like modern RISC architectures:
one did loads and stores from memory, and then arithmetic in the register file.
The idea was to do as much arithmetic within the register file as possible, in order
to get as much work as possible done within the constraint of the memory bandwidth.

> VVM provides everything CRAYs do, and allows the HW to organize itself
> AVX fashion all from a scalar instruction set, and without dragging ever
> larger register files around.

It's true that without an explicit register file, one is allowed to have implementations
of different sizes, which use a cache instead to conserve memory bandwidth. And
the SX-6, for example, didn't even have a cache.
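
That is roughly how length-agnostic vector ISAs behave today: software asks the hardware how many elements it will take this trip, so the binary never encodes a width. A scalar C sketch of the pattern, where get_hw_vl() is a stand-in for an ISA query such as RISC-V RVV's vsetvli (the 8-lane figure is an assumption):

    #include <stddef.h>

    /* Stand-in for an ISA query: "of the n remaining elements, how
     * many will you process this iteration?" Faked as an 8-lane HW. */
    static size_t get_hw_vl(size_t n) {
        const size_t VLMAX = 8;        /* implementation-dependent   */
        return n < VLMAX ? n : VLMAX;
    }

    /* y[i] += a*x[i], written so the code never encodes the width.  */
    void daxpy_vla(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; ) {
            size_t vl = get_hw_vl(n - i);  /* hardware picks width   */
            for (size_t j = 0; j < vl; j++)  /* one "vector op"      */
                y[i + j] += a * x[i + j];
            i += vl;
        }
    }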

I am thinking that a CRAY-style machine still needs something like VVM as well,
because the vector registers might have 64 or 256 elements, while

> You simply FAIL to understand the model of the GPU. They are not:: really
> wide SIMD, they are really wide SIMT almost as if thousands of "threads"
> on a barrel scheduler. On cycle[k] they perform 32 instructions for
> threads[m..m+31], on cycle[k+1] they perform 32 instructions from
> threads[x..x+31]. Threads[k] and thread[x] can be from entirely different
> "draw calls" with running different code under different MMU tables,...
> All this "different" is there to tolerate the latency of memory. Your typical
> LD may take 100 trips around the barrel. So you need other threads to
> operate while waiting.

Yes, that is an important point. Memory latency is a serious characteristic
of modern architectures, so an architecture that can take full advantage of
memory bandwidth despite memory latency is useful.

John Savard

Re: The Computer of the Future

<6ced8a79-deb8-421f-8f82-864c9fcfe56en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23528&group=comp.arch#23528

 by: MitchAlsup - Tue, 15 Feb 2022 20:48 UTC

On Tuesday, February 15, 2022 at 2:07:41 PM UTC-6, Quadibloc wrote:
> On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
>
> > CRAYs were designed to consume memory bandwidth, something you said
> > will top out in the paragraph above. To consume as much memory bandwidth
> > as someone can afford to build. Consume this Bandwidth while tolerating
> > the ever growing latency measured in cycles..
<
> This puzzles me. The CRAY-I architecture worked like modern RISC architectures:
> one did loads and stores from memory, and then arithmetic in the register file.
<
Memory was uncached and 14 cycles away (10 in the Cray 1S).
Vectors were used so the latency of uncached access was hidden under the
64 memory requests (which took 78 cycles of latency).
<
Can your proposed system read 4 different cache lines in 78 cycles ?
<
> The idea was to do as much arithmetic within the register file as possible, in order
> to get as much work as possible done within the constraint of the memory bandwidth.
<
The idea was to perform 64 or 128 arithmetic operations while waiting out
the 78 cycles until you can start arithmetic again.
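
Put as arithmetic, using the numbers above (64 elements, 78 cycles, 14-cycle scalar latency):

    #include <stdio.h>

    int main(void) {
        int vlen = 64;   /* elements per Cray vector register      */
        int vlat = 78;   /* cycles for the whole 64-element load   */
        int slat = 14;   /* cycles for one scalar (uncached) load  */

        /* One long access amortizes its latency over 64 operands,
         * instead of paying the full latency per operand.         */
        printf("%.2f cycles/element vectorized vs %d scalar\n",
               (double)vlat / vlen, slat);   /* ~1.22 vs 14        */
        return 0;
    }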
<
> > VVM provides everything CRAYs do, and allows the HW to organize itself
> > AVX fashion all from a scalar instruction set, and without dragging ever
> > larger register files around.
<
> It's true that without an explicit register file, one is allowed to have implementations
> of different sizes, which use a cache instead to conserve memory bandwidth. And
> the SX-6, for example, didn't even have a cache.
<
None of the Crays had a cache, and the S-register file of Cray-2 was a failure
(compilers could hardly program it)
<
Later Crays could have 2 LDs and a ST pending on memory at the same time
(192 individual references), and let us postulate that a modern CRAY would run
at 5 GHz. That is 120 GB/s per core. I will let you multiply by the number of cores
you want.
<
Vector machines used banked memory systems. The cores would spew out 3
references per cycle continuously. How many banks is your purported computer
architecture going to have? Unless it is way above 64 banks, you have no chance.
Most computers today ship with 1 or 2 DRAM DIMMs. This is where the volume is;
this is where you should be designing.
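
The arithmetic behind those two claims, with the 5 GHz clock postulated above; the 22-cycle bank busy time is an illustrative assumption chosen to show why 64 banks is the floor:

    #include <stdio.h>

    int main(void) {
        double ghz   = 5.0;   /* postulated modern-CRAY clock         */
        int    refs  = 3;     /* 2 LDs + 1 ST issued per cycle        */
        int    bytes = 8;     /* one 64-bit word per reference        */

        /* Sustained memory traffic per core */
        printf("%.0f GB/s per core\n", refs * bytes * ghz);   /* 120  */

        /* Each reference occupies a bank for several cycles, and 3 new
         * references start every cycle, so at least that many banks
         * must be busy simultaneously to keep up.                    */
        int bank_busy = 22;   /* cycles a bank is tied up (assumed)   */
        printf("needs at least %d banks\n", refs * bank_busy); /* 66  */
        return 0;
    }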
<
Vector machines came into existence for a narrow range of applications that require
high bandwidth memory and high FP calculation rates where the data sets had no
chance of fitting into cache.
<
Vector machines fell out of favor when memory latency got so large that the vector
size no longer covered memory latency. NEC hung in for a while by changing the
length of the vectors from 64->128 then to 256. At this point the vector register
file access time became problematic in pipelining.
<
Cray-like vectors had run their course.
>
> I am thinking that a CRAY-style machine still needs something like VVM as well,
> because the vector registers might have 64 or 256 elements, while
<
<
<
> > You simply FAIL to understand the model of the GPU. They are not:: really
> > wide SIMD, they are really wide SIMT almost as if thousands of "threads"
> > on a barrel scheduler. On cycle[k] they perform 32 instructions for
> > threads[m..m+31], on cycle[k+1] they perform 32 instructions from
> > threads[x..x+31]. Threads[k] and thread[x] can be from entirely different
> > "draw calls" with running different code under different MMU tables,...
> > All this "different" is there to tolerate the latency of memory. Your typical
> > LD may take 100 trips around the barrel. So you need other threads to
> > operate while waiting.
<
> Yes, that is an important point. Memory latency is a serious characteristic
> of modern architectures, so an architecture that can take full advantage of
> memory bandwidth despite memory latency is useful.
>
> John Savard

Re: The Computer of the Future

<suh4pk$3pd$4@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23531&group=comp.arch#23531

 by: Thomas Koenig - Tue, 15 Feb 2022 21:09 UTC

Quadibloc <jsavard@ecn.ab.ca> wrote:
> I noticed that in another thread I might have seemed to have contradicted myself.
>
> So I will clarify.
>
> In the near term, in two or three years, I think that it's entirely possible that we
> will have dies that combine four "GBOoO" performance cores with sixteen in-
> order efficiency cores, and chips that have four of those dies in a package, to
> give good performance on both number-crunching and database workloads.

Who actually needs number crunching?

I certainly do, but even in the company I work in (which is in
the chemical industry, so rather technical) the number of people
actually running code which depends on floating point execution
speed is rather small, probably in the low single-digit percent
range of all employees.

That does not mean that floating point is not important :-) but that
most users would not notice if they had a CPU with, let's say, a
reasonably efficient software emulation of floating point numbers.

Such a CPU would look horrible in SPECfp, but the savings from removing
floating point from a general-purpose CPU are probably not that great,
so it is not done; as an intensive user of floating point, I think
I have to be grateful for that.

Hmm, come to think of it, that is the first positive thing about
SPEC that occurred to me in quite a few years...

Re: The Computer of the Future

<8b79ec05-ff05-4907-9eac-1f25153100b4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23535&group=comp.arch#23535

 by: Scott Smader - Tue, 15 Feb 2022 21:52 UTC

On Tuesday, February 15, 2022 at 12:48:59 PM UTC-8, MitchAlsup wrote:
> On Tuesday, February 15, 2022 at 2:07:41 PM UTC-6, Quadibloc wrote:
> > On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
> >
> > > CRAYs were designed to consume memory bandwidth, something you said
> > > will top out in the paragraph above. To consume as much memory bandwidth
> > > as someone can afford to build. Consume this Bandwidth while tolerating
> > > the ever growing latency measured in cycles..
> <
> > This puzzles me. The CRAY-I architecture worked like modern RISC architectures:
> > one did loads and stores from memory, and then arithmetic in the register file.
> <
> Memory was uncached and 14 cycles away (10 in Cray 1-s)
> Vectors were used so the latency of unCached access was hidden under the
> 64 memory requests (took 78 cycles latency)
> <
> Can your proposed system read 4 different cache lines in 78 cycles ?
> <
> > The idea was to do as much arithmetic within the register file as possible, in order
> > to get as much work as possible done within the constraint of the memory bandwidth.
> <
> The idea was to perform 64 or 128 arithmetic operations done while waiting the 78
> cycles when yo can start arithmetic again.
> <
> > > VVM provides everything CRAYs do, and allows the HW to organize itself
> > > AVX fashion all from a scalar instruction set, and without dragging ever
> > > larger register files around.
> <
> > It's true that without an explicit register file, one is allowed to have implementations
> > of different sizes, which use a cache instead to conserve memory bandwidth. And
> > the SX-6, for example, didn't even have a cache.
> <
> None of the Crays had a cache, and the S-register file of Cray-2 was a failure
> (compilers could hardly program it)
> <
> Later Crays could have 2 LDs and a ST pending on memory at the same time
> (192 individual references), and let us postulate that a modern CRAY would run
> at 5 GHz. That is 120 GB/s per core. I will let you multiply by the number of cores
> you want.
> <
> Vector machines used banked memory systems. The cores would spew out 3
> references per cycle continuously. How many banks is your purported computer
> architecture going to have. Unless it is way above 64-banks, you have no chance.
> Most computers today ship with 1 or 2 DRAM DIMMs. This is where the volume
> this is where you should be designing.
> <
> Vector machines came into existence for a narrow range of applications that require
> high bandwidth memory and high FP calculation rates where the data sets had no
> chance of fitting into cache.
> <
> Vector machines fell out of favor when memory latency got so large that the vector
> size no longer covered memory latency. NEC hung in for a while by changing the
> length of the vectors from 64->128 then to 256. At this point the vector register
> file access time became problematic in pipelining.
> <
> Cray-like vectors had run their course.

Wow. That reads like a great teaser for an upcoming "The History of Computer Architecture" by Mitch Alsup, and I'm eager to read the whole story.

I hope it's something you might consider. In your spare time.

> >
> > I am thinking that a CRAY-style machine still needs something like VVM as well,
> > because the vector registers might have 64 or 256 elements, while
> <
> <
> <
> > > You simply FAIL to understand the model of the GPU. They are not:: really
> > > wide SIMD, they are really wide SIMT almost as if thousands of "threads"
> > > on a barrel scheduler. On cycle[k] they perform 32 instructions for
> > > threads[m..m+31], on cycle[k+1] they perform 32 instructions from
> > > threads[x..x+31]. Threads[k] and thread[x] can be from entirely different
> > > "draw calls" with running different code under different MMU tables,...
> > > All this "different" is there to tolerate the latency of memory. Your typical
> > > LD may take 100 trips around the barrel. So you need other threads to
> > > operate while waiting.
> <
> > Yes, that is an important point. Memory latency is a serious characteristic
> > of modern architectures, so an architecture that can take full advantage of
> > memory bandwidth despite memory latency is useful.
> >
> > John Savard

Re: The Computer of the Future

<2e266e3e-b633-4c2a-bd33-962cb675bb77n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23561&group=comp.arch#23561

 by: Quadibloc - Wed, 16 Feb 2022 10:02 UTC

On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
> On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:

> > After all, we've gone from MMX all the way to AVX-512 on the one hand,

> These are not CRAY like.

That's true. One of the problems you've noted with the current
vector architectures is that they keep changing the instruction
set in order to make the vectors bigger.

A Cray-like architecture makes the vectors _much_ bigger than even
in AVX-512, so in my naivete, I would have thought that this would
allow the constant changes to stop for a while.

As it became possible to put more and more transistors on a chip,
at first, the best way to make use of those extra transistors was
obvious. Give the computer the ability to do 16-bit arithmetic,
not just 8-bit arithmetic. Add floating-point in hardware.

Adding Cray-like vector instructions seemed to me like the final
natural step in this evolution. Existing vector instructions kept
getting wider, so this would take that to its limit.

This doesn't mean I expect to solve all supercomputing problems
that way. I don't claim to have a magic wand with which to solve the
memory latency issue.

However, today's microprocessors *do* have L3 caches that are as
big as the memory of the original Cray I. So, while that wouldn't
help in solving the problems that the supercomputers of today are
working on, it _would_ make more arithmetic operations available,
say, to video game writers.

I'm envisaging a chip that tries to increase memory bandwidth, but
only within bounds suitable to a "consumer" product. Just as heatsinks
have grown much bigger than people would have expected back in the
days of the 386 processor, I'm thinking we could go with having 512
data lines going out of a processor, with four signal levels to further
double memory bandwidth.
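
For scale, a back-of-envelope on that pin budget; the per-pin symbol rate is an assumption, since the post does not give one:

    #include <stdio.h>

    int main(void) {
        int    pins     = 512;   /* data lines out of the package      */
        double gbaud    = 8.0;   /* assumed symbol rate per pin, GHz   */
        int    bits_sym = 2;     /* four signal levels = 2 bits (PAM4) */

        double gbits = pins * gbaud * bits_sym;
        printf("%.0f Gbit/s = %.0f GB/s of raw pin bandwidth\n",
               gbits, gbits / 8);  /* 8192 Gbit/s = 1024 GB/s here     */
        return 0;
    }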

This is all predicated on the assumption that, given that lithography
has reached its ultimate limits, and so Moore's Law is over, people
are desperate for more performance. They don't know how to do
parallel programming well, and so they're desperate for things that
make parallelism more palatable - like out-of-order execution and
vectors.

GPU hardware is apparently the best way to get the most FLOPs on
a chip. It may be a bad fit for a Cray-like ISA, but the native GPU
design is a bad fit for programmers. And no two GPUs are alike.

Exactly how a modified GPU design aimed at simulating a Cray
or multiple Crays in parallel working on different problems might
look is not clear to me, but I presume that if one can put a bunch
of ALUs on a chip, and one can organize that to look like a GPU
or like a Xeon Phi (but with RISC instead of x86), it could also be
organized to look like something in between adapted to a
Cray-like instruction set.

Since Crays used 64-element vector registers for code in loops
that handled vectors with more than 64 elements, that these loops
might well be... augmented... by means of something looking a
bit like your VVM is also not beyond the bounds of imagination.
(But if you're using something like VVM, why have vector instructions?
Reducing decoding overhead!)
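
The loop shape being described is classic strip-mining: an arbitrary-length vector is carved into 64-element chunks that fit the machine's vector registers, and the VVM-like wrapper would manage the outer loop. A sketch:

    #define VREG_LEN 64   /* Cray-style vector register length */

    /* c[i] = a[i] + b[i] for arbitrary n, strip-mined into chunks
     * that each fit one set of 64-element vector registers. The
     * inner loop is what a single Cray vector-add instruction does;
     * the outer loop is what a VVM-like wrapper would manage.     */
    void vadd_stripmined(long n, const double *a, const double *b,
                         double *c) {
        for (long i = 0; i < n; i += VREG_LEN) {
            long vl = (n - i < VREG_LEN) ? n - i : VREG_LEN;
            for (long j = 0; j < vl; j++)  /* one vector instruction */
                c[i + j] = a[i + j] + b[i + j];
        }
    }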

Of course, though, my designs will have scalar floating-point
instructions, short vector instructions (sort of like AVX-256),
and long vector instructions (like a Cray)... because they're
intended to illustrate what an architecture burdened with a rather
large amount of legacy stuff carried over might look like. But because it
was designed on a clean sheet of paper, it only gets one
kind of short vectors to support, rather than several like an x86.

And there would be a somewhat VVM-like set of
vector of vector wrapper instructions that could be wrapped
around *any* of them.

Which combination does it make sense to use? Why, that's
outlined in the Application Notes for the particular device
implementing the ISA that you're using. So the same ISA
serves supercomputers, servers, desktop PCs, and smartphones,
software is tailored to where in this food chain it's being used,
but it shares as much as it can...

John Savard

Re: The Computer of the Future

<fb409a7e-e1a2-4eaf-8fbb-d697ac3f0febn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23574&group=comp.arch#23574

 by: MitchAlsup - Wed, 16 Feb 2022 18:25 UTC

On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
> On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
> > On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
>
> > > After all, we've gone from MMX all the way to AVX-512 on the one hand,
>
> > These are not CRAY like.
> That's true. One of the problems you've noted with the current
> vector architectures is that they keep changing the instruction
> set in order to make the vectors bigger.
>
> A Cray-like architecture makes the vectors _much_ bigger than even
> in AVX-512, so in my naivete, I would have thought that this would
> allow the constant changes to stop for a while.
<
Point of Order::
CRAY vectors are processed in a pipeline, 1, 2, 4 units of work per unit time.
AVX vectors are processed en masse <however wide> per unit time.
These are VASTLY different things.
<
>
> As it became possible to put more and more transistors on a chip,
> at first, the best way to make use of those extra transistors was
> obvious. Give the computer the ability to do 16-bit arithmetic,
> not just 8-bit arithmetic. Add floating-point in hardware.
>
> Adding Cray-like vector instructions seemed to me like the final
> natural step in this evolution. Existing vector instructions kept
> getting wider, so this would take that to its limit.
>
> This doesn't mean I expect to solve all supercomputing problems
> that way. I don't claim to have a magic wand with which to solve the
> memory latency issue.
>
> However, today's microprocessors *do* have L3 caches that are as
> big as the memory of the original Cray I.
<
But with considerably LOWER concurrency.
A CRAY might have 64 memory banks (NEC up to 245 banks)
   Each bank might take 5-10 cycles to perform 1 request,
   but there can be up to 64 requests being performed.
<
At best, a modern L3 can be doing 3 things:
Receiving write data,
Routing Data around the SRAM matrix,
Sending out read data.
<
There is nothing fundamental about the difference, but L3 caches are
not built to have the concurrency of CRAY's banked memory.
<
> So, while that wouldn't
> help in solving the problems that the supercomputers of today are
> working on, it _would_ make more arithmetic operations available,
> say, to video game writers.
<
In principle, yes; in practice, not so much.
>
> I'm envisaging a chip that tries to increase memory bandwidth, but
> only within bounds suitable to a "consumer" product. Just as heatsinks
> have grown much bigger than people would have expected back in the
> days of the 386 processor, I'm thinking we could go with having 512
> data lines going out of a processor. With four signal levels to further
> double memory bandwidth.
<
PCIe 6.0 uses a 16 GHz clock to send 4 bits per wire per cycle using
double data rate and PAM4 modulation, and achieves 64 GT/s per wire in
each direction. So 4 pins (true-complement out, true-complement in)
provide 8 GB/s out and 8 GB/s in.
<
Now, remember from yesterday our 120 GB/s per core. You will need
10 of these 4-pin links to support the inbound bandwidth and 5 to support
the outbound bandwidth.
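
The arithmetic behind those counts, as stated in this and the previous post (2 LDs inbound and 1 ST outbound at 8 bytes and 5 GHz give an 80 + 40 GB/s split of the 120 GB/s):

    #include <stdio.h>

    int main(void) {
        double wire_gts = 16.0 * 2 * 2; /* 16 GHz x DDR x PAM4 = 64 GT/s */
        double lane_gbs = wire_gts / 8; /* = 8 GB/s per direction        */

        double in_need  = 2 * 8 * 5.0;  /* 2 LD/cycle x 8 B x 5 GHz = 80 */
        double out_need = 1 * 8 * 5.0;  /* 1 ST/cycle x 8 B x 5 GHz = 40 */

        printf("%.0f GB/s per lane: %.0f lanes in, %.0f lanes out\n",
               lane_gbs, in_need / lane_gbs, out_need / lane_gbs);
        return 0;
    }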
<
But hey, if you want to provide 512 pins, I'm sure you can find some use
for this kind of bandwidth. {but try dealing with the heat.}
>
> This is all predicated on the assumption that, given that lithography
> has reached its ultimate limits, and so Moore's Law is over, people
> are desperate for more performance. They don't know how to do
> parallel programming well, and so they're desperate for things that
> make parallelism more palatable - like out-of-order execution and
> vectors.
<
More pins wiggling faster has always provided more bandwidth.
Being able to absorb the latency has always been the problem.
{That and paying for it: $$$ and heat}
>
> GPU hardware is apparently the best way to get the most FLOPs on
> a chip. It may be a bad fit for a Cray-like ISA, but the native GPU
> design is a bad fit for programmers. And no two GPUs are alike.
<
GPUs are evolving like CPUs were evolving from 1948 to 1980.
GPUs are being modified each generation in order to address
bad performance characteristics of last-generation GPUs.
Tolerance for the divergence found in ray tracing applications
is the modern addition, which required a pretty fundamental
change in how WARPs are organized and reorganized over
time. The Gen[-2] instruction set provided no concept of WARP
reorganization; we are just coming to grips with what needs
fixing in Gen[-1] while kicking Gen[0] out the door and designing
Gen[+1].
>
> Exactly how a modified GPU design aimed at simulating a Cray
> or multiple Crays in parallel working on different problems might
> look is not clear to me, but I presume that if one can put a bunch
> of ALUs on a chip, and one can organize that to look like a GPU
> or like a Xeon Phi (but with RISC instead of x86), it could also be
> organized to look like something in between adapted to a
> Cray-like instruction set.
>
> Since Crays used 64-element vector registers for code in loops
> that handled vectors with more than 64 elements, that these loops
> might well be... augmented... by means of something looking a
> bit like your VVM is also not beyond the bounds of imagination.
> (But if you're using something like VVM, why have vector instructions?
> Reducing decoding overhead!)
<
Exactly! Let each generation of HW give the maximum performance
it can while the application code remains constant.
<
Secondly: If you want wide vector performance, you need to be organized
around ¼, ½, or 1 cache line per clock out of the cache and back into the cache.
The width appropriate for one generation is not necessarily appropriate for
the next--so don't expose width through ISA.
<
Machines that can afford 4 FMACs per core will have enough area that
performing multiple iterations of a loop per cycle is an easily recognized
pattern. I happened to make this discovery considerably simpler with my
LOOP instruction.
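
Concretely, the pattern such hardware recognizes is an independent-iteration loop like the one below, shown hand-unrolled by four so a 4-FMAC machine could retire four iterations per cycle; the unrolling is only illustrative, since the point of a LOOP-style mechanism is that hardware widens the original scalar loop by itself:

    /* y[i] = y[i] + a*x[i]; each iteration is one FMA, and iterations
     * are independent, so a machine with 4 FMACs can run 4 at once. */
    void daxpy4(long n, double a, const double *x, double *y) {
        long i = 0;
        for (; i + 3 < n; i += 4) {    /* 4 iterations per "cycle"   */
            y[i+0] += a * x[i+0];
            y[i+1] += a * x[i+1];
            y[i+2] += a * x[i+2];
            y[i+3] += a * x[i+3];
        }
        for (; i < n; i++)             /* remainder                  */
            y[i] += a * x[i];
    }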
>
> Of course, though, my designs will have scalar floating-point
> instructions, short vector instructions (sort of like AVX-256),
> and long vector instructions (like a Cray)... because they're
> intended to illustrate what an architecture burdened with a
> rather large amount of legacy stuff carried over. But because it
> was designed on a clean sheet of paper, it only gets one
> kind of short vectors to support, rather than several like an x86.
>
> And there would be a somewhat VVM-like set of
> vector of vector wrapper instructions that could be wrapped
> around *any* of them.
<
Question: If you have VVM and VVM performs as well as CRAY
vectors running Matrix300, why have the CRAY vector state
or bloat your ISA with CRAY vector instructions?
>
> Which combination does it make sense to use? Why, that's
> outlined in the Application Notes for the particular device
> implementing the ISA that you're using. So the same ISA
> serves supercomputers, servers, desktop PCs, and smartphones,
> software is tailored to where in this food chain it's being used,
> but it shares as much as it can...
>
> John Savard

Re: The Computer of the Future

<1a8a324d-34b8-4c1e-876e-1a0cde795e3fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23598&group=comp.arch#23598

 by: Quadibloc - Wed, 16 Feb 2022 23:24 UTC

On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:

> > A Cray-like architecture makes the vectors _much_ bigger than even
> > in AVX-512, so in my naivete, I would have thought that this would
> > allow the constant changes to stop for a while.
> <
> Point of Order::
> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
> AVX vectors are processed en massé <however wide> per unit time.
> These are VASTLY different things.

Yes. However, a faster Cray-like machine can be implemented
with as many, or more, floating-point ALUs than an AVX-style vector unit.

So you could have, say, AVX-512 with eight 64-bit floats across, and
then switch to Cray in the next generation with sixteen ALUs, and then
stick with Cray with thirty-two ALUs in the generation after that.

> > However, today's microprocessors *do* have L3 caches that are as
> > big as the memory of the original Cray I.

> But with considerably LOWER concurrency.
> A CRAY might have 64 memory banks (NEC up to 245 banks)
>    Each bank might take 5-10 cycles to perform 1 request,
>    but there can be up to 64 requests being performed.

But if the cache is *on the same die*, having more wires connecting it
to the CPU isn't much of a problem?

> At best modern L3 can be doing 3:
> Receiving write data,
> Routing Data around the SRAM matrix,
> Sending out read data.
> <
> There is nothing fundamental about the difference, but L3 caches are
> not build to have the concurrency of CRAYs banked memory.

So we seem to be in agreement on this point.

> PCIe 6.0 uses 16 GHz clock to send 4 bits per wire per cycle using
> double data rate and PAM4 modulation; and achieves 64GTs per wire
> each direction. So 4 pins: true-comp out, true comp in: provide 8GB/s
> out and 8GB/s in.

Of course, PCIe 6.0 is a complicated protocol, while interfaces
like DDR 5 to DRAM are kept simple by comparison.

> But hey, if you want to provide 512 pins, I'm sure you can find some use
> for this kind of bandwidth. {but try dealing with the heat.}

Presumably chips implemented in the final evolution of 1nm or whatever
will run slightly cooler.

I had thought that the number of CPUs in the package was what governed
the heat, and using more pins for data would not be too bad. If that's not
true, then, yes, this would be *one* fatal objection to my concepts.

> More pins wiggling faster has always provided more bandwidth.
> Being able to absorb the latency has always been the problem.
> {That and paying for it: $$$ and heat}

Of course, the way to absorb latency is to do something else while
you're waiting. So now you need bandwidth for the first thing you
were doing, and now more bandwidth for the something else
you're doing so as to make the latency less relevant.

This sounds like a destructive paradox. But since latency is
fundamentally unfixable (until you make the transistors and
wires faster) while you can have more bandwidth if you pay for
it, the idea of having the amount of bandwidth you needed in
the first place, times eight or so, almost makes sense.
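
That "times eight" hand-wave can be made concrete with Little's law:
the concurrency you must keep in flight equals bandwidth times latency.
A quick back-of-envelope in C, where the bandwidth and latency figures
are illustrative assumptions, not measurements of any real part:

  #include <stdio.h>

  int main(void)
  {
      double bw_bytes_per_ns = 64.0;  /* assume 64 GB/s = 64 bytes/ns  */
      double latency_ns      = 80.0;  /* assume ~80 ns load from DRAM  */
      double line_bytes      = 64.0;  /* transfers are cache lines     */

      /* Little's law: concurrency = throughput * latency */
      double in_flight = bw_bytes_per_ns * latency_ns / line_bytes;
      printf("cache lines in flight: %.0f\n", in_flight);   /* 80 */
      return 0;
  }

So keeping 64 GB/s of pins busy at 80 ns of latency means having on the
order of 80 cache-line requests outstanding; the "something else" has to
be found that many requests deep, or the extra bandwidth goes unused.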

> > Exactly how a modified GPU design aimed at simulating a Cray
> > or multiple Crays in parallel working on different problems might
> > look is not clear to me, but I presume that if one can put a bunch
> > of ALUs on a chip, and one can organize that to look like a GPU
> > or like a Xeon Phi (but with RISC instead of x86), it could also be
> > organized to look like something in between adapted to a
> > Cray-like instruction set.

.....and a Cray-like instruction set could be like a later generation of
the Cray, with more and longer vector registers, and in other ways
it could move to being more GPU-like if that was needed to fix some
flaws.

> > Since Crays used 64-element vector registers for code in loops
> > that handled vectors with more than 64 elements, that these loops
> > might well be... augmented... by means of something looking a
> > bit like your VVM is also not beyond the bounds of imagination.
> > (But if you're using something like VVM, why have vector instructions?
> > Reducing decoding overhead!)

> Exactly! Let each generation of HW give the maximum performance
> it can while the application code remains constant.

I'm glad you approve of something...

> Secondly: If you want wide vector performance, you need to be organized
> around ¼, ½, or 1 cache line per clock out of the cache and back into the cache.
> The width appropriate for one generation is not necessarily appropriate for
> the next--so don't expose width through ISA.

Of course, IBM's take on a Cray-like architecture avoided that pitfall, by
excluding the vector width from the ISA spec, making it model-dependent,
so that's definitely possible.

> > Of course, though, my designs will have scalar floating-point
> > instructions, short vector instructions (sort of like AVX-256),
> > and long vector instructions (like a Cray)... because they're
> > intended to illustrate an architecture burdened with a
> > rather large amount of legacy stuff carried over. But because it
> > was designed on a clean sheet of paper, it only gets one
> > kind of short vectors to support, rather than several like an x86.

> > And there would be a somewhat VVM-like set of
> > vector of vector wrapper instructions that could be wrapped
> > around *any* of them.

> Question: If you have VVM and VVM performs as well as CRAY
> vectors running Matrix300, why have the CRAY vector state
> or bloat your ISA with CRAY vector instructions?

Now, that's a very good question.

Possible answer 1:
This is only included in the ISA because the spec is meant to
illustrate possibilities, and would be omitted in any real-world
CPU.

Possible answer 2:
The idea is that VVM wrapped around scalar floating-point
instructions works well for vectors that are "this" long;

VVM wrapped around AVX-style vector instructions works for
vectors that are 4x longer, in proportion to the number of floats
in a single AVX vector...

VVM wrapped around Cray-style vector instructions is intended
for vectors that are 64x longer than VVM wrapped around scalar
instructions.

Assume VVM around scalar handles vectors somewhat longer
than Cray without VVM. Then what we've got is a range of options,
each one adapted to how long your vectors happen to be. (And to
things like granularity, because using VVM around Cray for vectors
2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)

Possible answer 3:
And I might fail to implement VVM well enough to avoid some
associated overhead.

John Savard

Re: The Computer of the Future

<005ee5af-519a-4d45-93bd-87f4ab580c61n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23599&group=comp.arch#23599

 by: Quadibloc - Wed, 16 Feb 2022 23:32 UTC

On Wednesday, February 16, 2022 at 4:24:14 PM UTC-7, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:

> > Question: If you have VVM and VVM performs as well as CRAY
> > vectors running Matrix300, why have the CRAY vector state
> > or bloat your ISA with CRAY vector instructions?

> Now, that's a very good question.
>
> Possible answer 1:
> This is only included in the ISA because the spec is meant to
> illustrate possibilities, and would be omitted in any real-world
> CPU.
>
> Possible answer 2:
> The idea is that VVM wrapped around scalar floating-point
> instructions works well for vectors that are "this" long;
>
> VVM wrapped around AVX-style vector instructions works for
> vectors that are 4x longer, in proportion to the number of floats
> in a single AVX vector...
>
> VVM wrapped around Cray-style vector instructions is intended
> for vectors that are 64x longer than VVM wrapped around scalar
> instructions.
>
> Assume VVM around scalar handles vectors somewhat longer
> than Cray without VVM. Then what we've got is a range of options,
> each one adapted to how long your vectors happen to be. (And to
> things like granularity, because using VVM around Cray for vectors
> 2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
>
> Possible answer 3:
> And I might fail to implement VVM well enough to avoid some
> associated overhead.

Actually, there's _another_ point I would need to raise here.

In your design, VVM is the primary way to create vector operations.

So if additional ALU resources are brought into play for vector
computation, using VVM would invoke them.

On the other hand, I have a prejudice against instruction fusion,
as I feel it complicates decoding.

And so I'm thinking in terms of a "dumb" implementation of VVM.
It reduces loop overhead, it permits data forwarding and stuff like
that...

but it _doesn't_ tell the scalar floating-point instructions to start
using the ALUs that belong to the AVX instructions or the Cray
instructions.

So if you've got a 65,536-element vector, you can *choose* to
process it using VVM around scalar, VVM around AVX, or
VVM around Cray.

What this will *change* is how many instances of your program
can be floating around on your CPU at a given time running
concurrently - instead of some sitting idle on disk while a handful
are actually running.

The VVM around Cray program would be the one that could finish
faster in wall clock time if it happened to be given the run of the
entire CPU without having to share.
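
To make the choice concrete, here is the same long-vector update written
two of those ways in C. The intrinsic names are Intel's documented
AVX-512 ones; the Cray-style variant would look like the scalar loop,
with the hardware strip-mining it through 64-element vector registers:

  #include <immintrin.h>
  #include <stddef.h>

  /* One element per instruction: what VVM-around-scalar would wrap. */
  void axpy_scalar(double *y, const double *x, double a, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }

  /* Eight elements per instruction: what VVM-around-AVX would wrap.
     Assumes n is a multiple of 8 to keep the sketch short. */
  void axpy_avx512(double *y, const double *x, double a, size_t n)
  {
      __m512d va = _mm512_set1_pd(a);
      for (size_t i = 0; i < n; i += 8) {
          __m512d vx = _mm512_loadu_pd(&x[i]);
          __m512d vy = _mm512_loadu_pd(&y[i]);
          _mm512_storeu_pd(&y[i], _mm512_fmadd_pd(va, vx, vy));
      }
  }

The arithmetic is identical either way; only the loop overhead and the
resources occupied differ, which is what makes the choice a scheduling
question rather than a correctness one.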

John Savard

Re: The Computer of the Future

<8d7e830e-866e-4dcf-86bf-292c8b2850a9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23600&group=comp.arch#23600

 by: Quadibloc - Wed, 16 Feb 2022 23:39 UTC

On Wednesday, February 16, 2022 at 4:32:06 PM UTC-7, Quadibloc wrote:

> Actually, there's _another_ point I would need to raise here.
>
> In your design, VVM is the primary way to create vector operations.
>
> So if additional ALU resources are brought into play for vector
> computation, using VVM would invoke them.
>
> On the other hand, I have a prejudice against instruction fusion,
> as I feel it complicates decoding.
>
> And so I'm thinking in terms of a "dumb" implementation of VVM.
> It reduces loop overhead, it permits data forwarding and stuff like
> that...
>
> but it _doesn't_ tell the scalar floating-point instructions to start
> using the ALUs that belong to the AVX instructions or the Cray
> instructions.
>
> So if you've got a 65,536-element vector, you can *choose* to
> process it using VVM around scalar, VVM around AVX, or
> VVM around Cray.
>
> What this will *change* is how many instances of your program
> can be floating around on your CPU at a given time running
> concurrently - instead of some sitting idle on disk while a handful
> are actually running.
>
> The VVM around Cray program would be the one that could finish
> faster in wall clock time if it happened to be given the run of the
> entire CPU without having to share.

But surely, instead of being a _versatile_ design, that's a badly
flawed design!

Don't you just want the program code to say "I want to process
a 65,536-element vector", with the CPU then doing so in the
best way possible given whatever the load on the system happens
to be at a given time?

Now, *that's* a big advantage of a _proper_ implementation of
VVM.

(Nothing, of course, _prevents_ my bloated ISA, in its
top-end implementation, from having one. In which case, the
AVX-style and Cray-style vector instructions would be entirely
superfluous for _that_ implementation. But presumably VVM
is not so hard to implement that it would make any sense not
to do it that way in the first place.)

John Savard

Re: The Computer of the Future

<26b1a380-2609-4428-99b9-a12c4ae0e003n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23601&group=comp.arch#23601

 by: Quadibloc - Wed, 16 Feb 2022 23:50 UTC

On Wednesday, February 16, 2022 at 4:39:57 PM UTC-7, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 4:32:06 PM UTC-7, Quadibloc wrote:
>
> > Actually, there's _another_ point I would need to raise here.
> >
> > In your design, VVM is the primary way to create vector operations.
> >
> > So if additional ALU resources are brought into play for vector
> > computation, using VVM would invoke them.
> >
> > On the other hand, I have a prejudice against instruction fusion,
> > as I feel it complicates decoding.
> >
> > And so I'm thinking in terms of a "dumb" implementation of VVM.
> > It reduces loop overhead, it permits data forwarding and stuff like
> > that...
> >
> > but it _doesn't_ tell the scalar floating-point instructions to start
> > using the ALUs that belong to the AVX instructions or the Cray
> > instructions.
> >
> > So if you've got a 65,536-element vector, you can *choose* to
> > process it using VVM around scalar, VVM around AVX, or
> > VVM around Cray.
> >
> > What this will *change* is how many instances of your program
> > can be floating around on your CPU at a given time running
> > concurrently - instead of some sitting idle on disk while a handful
> > are actually running.
> >
> > The VVM around Cray program would be the one that could finish
> > faster in wall clock time if it happened to be given the run of the
> > entire CPU without having to share.
> But surely, instead of being a _versatile_ design, that's a badly
> flawed design!
>
> Don't you just want the program code to say "I want to process
> a 65,536-element vector", with the CPU then doing so in the
> best way possible given whatever the load on the system happens
> to be at a given time?
>
> Now, *that's* a big advantage of a _proper_ implementation of
> VVM.
>
> (Nothing, of course, _prevents_ my bloated ISA, in its
> top-end implementation, from having one. In which case, the
> AVX-style and Cray-style vector instructions would be entirely
> superfluous for _that_ implementation. But presumably VVM
> is not so hard to implement that it would make any sense not
> to do it that way in the first place.)

In my overly complicated ISA, though, even with a good VVM
implementation, where VVM-around-scalar float instructions
would be able to use the Cray-style ALUs, they would *not* be able to
touch or disturb the state of the Cray-style vector registers.

So those vector instructions would miss out on one possible
resource for buffer storage, possibly having a slight impact
on their performance, which means that these different types
of really long vector instructions would still have _some_
difference.

John Savard

Re: The Computer of the Future

<69d5859e-fdd2-4c59-904c-f6e3cefdad5dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23602&group=comp.arch#23602

 by: Quadibloc - Wed, 16 Feb 2022 23:59 UTC

On Wednesday, February 16, 2022 at 4:50:17 PM UTC-7, Quadibloc wrote:

> In my overly complicated ISA, though, even with a good VVM
> implementation, where VVM-around-scalar float instructions
> would be able to use the Cray-style ALUs, they would *not* be able to
> touch or disturb the state of the Cray-style vector registers.

Oh, and by the way: to prevent waste in extreme low-end
implementations, where there is only a scalar floating-point ALU,
and the AVX-like and Cray-like vector registers are faked, being
located in DRAM...

it should be noted in the ISA description that when AVX-like
vector instructions or Cray-like vector instructions are placed
within a VVM wrapper, *the associated register contents are
not guaranteed to be anything in particular on exit*.

They may get trashed, or the CPU may not bother to touch them,
depending on whether or not using them helps.

John Savard

Re: The Computer of the Future

<d95a6275-9137-4958-80b6-16189bbaaa4fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23603&group=comp.arch#23603

 by: JimBrakefield - Thu, 17 Feb 2022 00:14 UTC

On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
> I noticed that in another thread I might have seemed to have contradicted myself.
>
> So I will clarify.
>
> In the near term, in two or three years, I think that it's entirely possible that we
> will have dies that combine four "GBOoO" performance cores with sixteen in-
> order efficiency cores, and chips that have four of those dies in a package, to
> give good performance on both number-crunching and database workloads.
>
> In the _longer_ term, though, when Moore's Law finally runs out of steam,
> and chips are even bigger... instead of putting more than 64 or 128 cores
> on a chip, if that becomes possible with the most advanced silicon
> process attainable... I think the number of cores will top out, and instead
> of having 4,096 in-order cores on a chip, we'll see perhaps 64 out-of-order
> cores (of modest size) instead for the efficiency core contingent.
>
> That's because eventually they'll bump into constraints with memory
> bandwidth, but there's still some headroom left.
>
> I do think that eventually a Cray-like vector architecture is something that
> should be considered as we look for ways to make chips more powerful.
> After all, we've gone from MMX all the way to AVX-512 on the one hand,
> and on the other hand, efforts have been made to make GPU computing
> more versatile and flexible.
>
> Today, some Intel chips slow down their clock rates when doing AVX-512
> operations. This reduces, but does not eliminate, the performance
> increase in going from AVX-256 to AVX-512.
>
> What I'm thinking a chip of the future, aimed at the ultimate in high
> performance might do is this:
>
> It would have vector instructions similar to those of the Cray-I or its
> successors.
>
> These instructions would use floating point ALUs that run more slowly
> than the regular main floating-point ALU on the chip, which are organized
> into something that _somewhat_ resembles a GPU (but not exactly, so as
> to be versatile and flexible enough to handle everything a vector
> supercomputer can do).
>
> To avoid the problem current Intel chips have of E-cores that can't handle
> AVX-512 instructions, I think it might be sensible to take one leaf out of
> Bulldozer.
>
> Let's have a core complex that looks like this:
>
> One performance core.
> One long (Cray-like) vector unit.
> Four to eight efficiency cores.
>
> So if you _have_ vector instructions in your cores, when you switch from
> the performance cores to the efficiency cores, you just leave the vector
> unit turned on (instead of trying to copy the contents of its big vector
> registers).
>
> John Savard

Since the topic is "computers of the future", and Intel has opened up
their Tofino product, it could happen that Ethernet switches become
"vector processors" of Ethernet frames? E.g. the data is shipped to
whichever switch can process it at pipeline rates? That uses existing
infrastructure and augments it with specialized high-data-rate processing.

Re: The Computer of the Future

<5b38cb28-2409-453d-9ea0-c00779ecbb8fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23605&group=comp.arch#23605

 by: MitchAlsup - Thu, 17 Feb 2022 00:54 UTC

On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
> > On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>
> > > A Cray-like architecture makes the vectors _much_ bigger than even
> > > in AVX-512, so in my naivete, I would have thought that this would
> > > allow the constant changes to stop for a while.
> > <
> > Point of Order::
> > CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
> > AVX vectors are processed en masse <however wide> per unit time.
> > These are VASTLY different things.
<
> Yes. However, a faster Cray-like machine can be implemented
> with as many floating-point ALUs as an AVX-style vector unit, or more.
>
> So you could have, say, AVX-512 with eight 64-bit floats across, and
> then switch to Cray in the next generation with sixteen ALUs, and then
> stick with Cray with thirty-two ALUs in the generation after that.
<
Yes, but no matter how wide the CRAY implementation was, the register
file seen by instructions had the same shape.
<
Every time AVX grows in width SW gets a different model to program to.
<
That is the difference: CRAY exposed a common model, AVX exposed
an ever-changing model.
<
> > > However, today's microprocessors *do* have L3 caches that are as
> > > big as the memory of the original Cray I.
>
> > But with considerably LOWER concurrency.
> > A CRAY might have 64 memory banks (NEC up to 245 banks)
> > .....Each bank might take 5-10 cycles to perform 1 request
> > .....but there can be up to 64 requests being performed.
> But if the cache is *on the same die*, having more wires connecting it
> to the CPU isn't much of a problem?
> > At best modern L3 can be doing 3:
> > Receiving write data,
> > Routing Data around the SRAM matrix,
> > Sending out read data.
> > <
> > There is nothing fundamental about the difference, but L3 caches are
> > not built to have the concurrency of CRAY's banked memory.
<
> So we seem to be in agreement on this point.
<
> > PCIe 6.0 uses 16 GHz clock to send 4 bits per wire per cycle using
> > double data rate and PAM4 modulation; and achieves 64 GT/s per wire
> > each direction. So 4 pins: true-comp out, true comp in: provide 8GB/s
> > out and 8GB/s in.
<
> Of course, PCIe 6.0 is a complicated protocol, while interfaces
> like DDR5 to DRAM are kept simple by comparison.
<
> > But hey, if you want to provide 512 pins, I'm sure you can find some use
> > for this kind of bandwidth. {but try dealing with the heat.}
<
> Presumably chips implemented in the final evolution of 1nm or whatever
> will run slightly cooler.
<
Logic may, pins will not.
>
> I had thought that the number of CPUs in the package was what governed
> the heat, and using more pins for data would not be too bad. If that's not
> true, then, yes, this would be *one* fatal objection to my concepts.
<
For the power cost of sending 1 byte from DRAM to the pins of the
processor, you can read ~{64 to 128} bytes from nearby on-die SRAM.
<
> > More pins wiggling faster has always provided more bandwidth.
> > Being able to absorb the latency has always been the problem.
> > {That and paying for it: $$$ and heat}
<
> Of course, the way to absorb latency is to do something else while
> you're waiting. So now you need bandwidth for the first thing you
> were doing, and now more bandwidth for the something else
> you're doing so as to make the latency less relevant.
>
> This sounds like a destructive paradox. But since latency is
> fundamentally unfixable (until you make the transistors and
> wires faster) while you can have more bandwidth if you pay for
> it, the idea of having the amount of bandwidth you needed in
> the first place, times eight or so, almost makes sense.
<
> > > Exactly how a modified GPU design aimed at simulating a Cray
> > > or multiple Crays in parallel working on different problems might
> > > look is not clear to me, but I presume that if one can put a bunch
> > > of ALUs on a chip, and one can organize that to look like a GPU
> > > or like a Xeon Phi (but with RISC instead of x86), it could also be
> > > organized to look like something in between adapted to a
> > > Cray-like instruction set.
<
> ....and a Cray-like instruction set could be like a later generation of
> the Cray, with more and longer vector registers, and in other ways
> it could move to being more GPU-like if that was needed to fix some
> flaws.
<
Since you mentioned later CRAYs--they do Gather/Scatter LDs/STs:
64 addresses to 64 different memory locations in 64 clocks per port,
and 3 of these ports.
<
> > > Since Crays used 64-element vector registers for code in loops
> > > that handled vectors with more than 64 elements, that these loops
> > > might well be... augmented... by means of something looking a
> > > bit like your VVM is also not beyond the bounds of imagination.
> > > (But if you're using something like VVM, why have vector instructions?
> > > Reducing decoding overhead!)
>
> > Exactly! Let each generation of HW give the maximum performance
> > it can while the application code remains constant.
> I'm glad you approve of something...
> > Secondly: If you want wide vector performance, you need to be organized
> > around ¼, ½, or 1 cache line per clock out of the cache and back into the cache.
> > The width appropriate for one generation is not necessarily appropriate for
> > the next--so don't expose width through ISA.
> Of course, IBM's take on a Cray-like architecture avoided that pitfall, by
> excluding the vector width from the ISA spec, making it model-dependent,
> so that's definitely possible.
> > > Of course, though, my designs will have scalar floating-point
> > > instructions, short vector instructions (sort of like AVX-256),
> > > and long vector instructions (like a Cray)... because they're
> > > intended to illustrate an architecture burdened with a
> > > rather large amount of legacy stuff carried over. But because it
> > > was designed on a clean sheet of paper, it only gets one
> > > kind of short vectors to support, rather than several like an x86.
>
> > > And there would be a somewhat VVM-like set of
> > > vector of vector wrapper instructions that could be wrapped
> > > around *any* of them.
>
> > Question: If you have VVM and VVM performs as well as CRAY
> > vectors running Matrix300, why have the CRAY vector state
> > or bloat your ISA with CRAY vector instructions?
<
> Now, that's a very good question.
>
> Possible answer 1:
> This is only included in the ISA because the spec is meant to
> illustrate possibilities, and would be omitted in any real-world
> CPU.
>
> Possible answer 2:
> The idea is that VVM wrapped around scalar floating-point
> instructions works well for vectors that are "this" long;
<
The width of the SIMD engine is hidden from the expression of
use found in the code. Byte loops would tend to run 64 iterations
per cycle, HalfWord loops 32 iterations per cycle, Word loops
16 iterations per cycle, DoubleWord loops 8 iterations per cycle.
<
At this width, you are already at the limit of cache access, and likely
DRAM, so I don't foresee a need for more than 1 cache line of
calculation: you can't feed it from memory.
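
The point in C, as a sketch: nothing in the source is width-specific,
so the iteration rates above are properties of the implementation, not
of the program:

  #include <stddef.h>

  /* The same binary can run this one byte per cycle on a narrow core,
     or a whole cache line (64 iterations) per cycle on a wide one. */
  void bytecopy(char *dst, const char *src, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          dst[i] = src[i];
  }
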
>
> VVM wrapped around AVX-style vector instructions works for
> vectors that are 4x longer, in proportion to the number of floats
> in a single AVX vector...
<
VVM supplants any need for AVX contortions. VVM supplies the
entire memory reference and calculation set of AVX with just 2
instructions.
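
For readers who haven't followed the My 66000 threads: the two
instructions are VEC and LOOP, which bracket an ordinary scalar loop
body. The shape is roughly as below; the operands are elided because
the exact syntax here is improvised for illustration, not Mitch's
actual assembly:

      VEC   ...      ; enter vectorized loop, name the loop-carried regs
      LDD   ...      ; ordinary scalar load
      FMAC  ...      ; ordinary scalar multiply-accumulate
      STD   ...      ; ordinary scalar store
      LOOP  ...      ; increment, test, branch back

Everything between the two brackets is the existing scalar ISA, which
is how the AVX-sized opcode explosion is avoided.
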
>
> VVM wrapped around Cray-style vector instructions is intended
> for vectors that are 64x longer than VVM wrapped around scalar
> instructions.
<
VVM supplants any need for CRAY-like vectors without the register file
and without a zillion vector instructions.
>
> Assume VVM around scalar handles vectors somewhat longer
> than Cray without VVM. Then what we've got is a range of options,
> each one adapted to how long your vectors happen to be. (And to
> things like granularity, because using VVM around Cray for vectors
> 2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
<
VVM does not expose width or depth to software--this is the point you
are missing. Thus, SW has a common model over time. Users can go
about optimizing the hard parts of their applications rather than
things like byte copy; byte copy runs just as fast as AVX copy on My 66000.
<
Why add all the gorp when you can wave your hands and make all
the cruft vanish?
>
> Possible answer 3:
> And I might fail to implement VVM well enough to avoid some
> associated overhead.
<
Not my problem.
>
> John Savard


Re: The Computer of the Future

<66d1cc8e-c8b9-4ecb-be59-fee1ab1da715n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23606&group=comp.arch#23606

 by: MitchAlsup - Thu, 17 Feb 2022 01:03 UTC

On Wednesday, February 16, 2022 at 5:32:06 PM UTC-6, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 4:24:14 PM UTC-7, Quadibloc wrote:
> > On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
>
> > > Question: If you have VVM and VVM performs as well as CRAY
> > > vectors running Matrix300, why have the CRAY vector state
> > > or bloat your ISA with CRAY vector instructions?
>
> > Now, that's a very good question.
> >
> > Possible answer 1:
> > This is only included in the ISA because the spec is meant to
> > illustrate possibilities, and would be omitted in any real-world
> > CPU.
> >
> > Possible answer 2:
> > The idea is that VVM wrapped around scalar floating-point
> > instructions works well for vectors that are "this" long;
> >
> > VVM wrapped around AVX-style vector instructions works for
> > vectors that are 4x longer, in proportion to the number of floats
> > in a single AVX vector...
> >
> > VVM wrapped around Cray-style vector instructions is intended
> > for vectors that are 64x longer than VVM wrapped around scalar
> > instructions.
> >
> > Assume VVM around scalar handles vectors somewhat longer
> > than Cray without VVM. Then what we've got is a range of options,
> > each one adapted to how long your vectors happen to be. (And to
> > things like granularity, because using VVM around Cray for vectors
> > 2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
> >
> > Possible answer 3:
> > And I might fail to implement VVM well enough to avoid some
> > associated overhead.
<
> Actually, there's _another_ point I would need to raise here.
>
> In your design, VVM is the primary way to create vector operations.
>
> So if additional ALU resources are brought into play for vector
> computation, using VVM would invoke them.
>
> On the other hand, I have a prejudice against instruction fusion,
> as I feel it complicates decoding.
<
VVM loops are not fused. So, you have nothing to worry about.
A VVM loop is a contract of what the loop must perform, and
what the register state needs to be after the loop completes.
>
> And so I'm thinking in terms of a "dumb" implementation of VVM.
> It reduces loop overhead, it permits data forwarding and stuff like
> that...
<
>
> but it _doesn't_ tell the scalar floating-point instructions to start
> using the ALUs that belong to the AVX instructions or the Cray
> instructions.
<
Because there ARE NO AVX or CRAY instructions, the programmer never got
misinformed as to which way to do this or that. HW gets to make those
choices on an implementation-by-implementation basis. The old SW just
keeps getting faster and remains near optimal across many generations.
>
> So if you've got a 65,536-element vector, you can *choose* to
> process it using VVM around scalar, VVM around AVX, or
> VVM around Cray.
<
Just code it up, let the compiler VVM the inner loops, and smile all
the way to the bank.
>
> What this will *change* is how many instances of your program
> can be floating around on your CPU at a given time running
> concurrently - instead of some sitting idle on disk while a handful
> are actually running.
<
You are not sharing these VVM computational resources across core
boundaries. Bulldozer pretty much demonstrated you don't want to do
that.
>
> The VVM around Cray program would be the one that could finish
> faster in wall clock time if it happened to be given the run of the
> entire CPU without having to share.
<
Doing CRAY-1 stuff, VVM should be just as efficient as CRAY.
In fact it can perform Gather/Scatter as a side effect, since it
vectorizes loops, not instructions.
<
For the addition of 2 instructions, you get a shot at CRAY performance,
a shot at AVX performance, and you did not waste 1000 opcodes to
get there.
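
The gather point in C, as a sketch: under loop vectorization the indexed
load below simply becomes a gather, with no dedicated gather opcode in
the ISA:

  #include <stddef.h>

  void gather_add(double *dst, const double *src,
                  const size_t *idx, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          dst[i] += src[idx[i]];  /* indexed load: a gather when vectorized */
  }

With an exposed-SIMD ISA this loop needs explicit gather instructions,
re-added at every vector width; as a vectorized loop it is just a loop.
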
>
> John Savard

Re: The Computer of the Future

<e86a5f19-7855-4754-b8ef-b4d2466a683an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23608&group=comp.arch#23608

 by: Quadibloc - Thu, 17 Feb 2022 05:09 UTC

On Wednesday, February 16, 2022 at 6:03:09 PM UTC-7, MitchAlsup wrote:

> For the addition of 2 instructions, you get a shot at CRAY performance,
> a shot at AVX performance, and you did not waste 1000 opcodes to
> get there.

I think I see where I've gone wrong now.

I've assumed that VVM might be difficult to implement, so I needed
an instruction set that would serve smaller-scale implementations.

And, even worse, I've assumed that because VVM lacks the big
vector registers of a Cray, it's constrained to do operations in memory,
which, of course, exacts a huge penalty. But that's silly: a loop with
Cray instructions in it will take input from memory and put output to
memory, while putting intermediate results in vector registers... and
a VVM loop would do exactly the same thing, except using ordinary
registers.

I mean, either operand forwarding works, or you've got problems
that have to be solved before you can be doing anything with vectors.
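
A sketch of that last point; what matters is only where the
intermediate lives:

  #include <stddef.h>

  void scale_bias(double *y, const double *x,
                  double s, double b, size_t n)
  {
      for (size_t i = 0; i < n; i++) {
          double t = x[i] - b;  /* intermediate: a scalar register here,
                                   a vector register on a Cray; never
                                   memory in either case */
          y[i] = t * s;         /* consumed straight off the forwarding
                                   path in a VVM loop */
      }
  }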

John Savard

Re: The Computer of the Future

<sul08g$1qnd$3@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=23611&group=comp.arch#23611

 by: Terje Mathisen - Thu, 17 Feb 2022 08:16 UTC

MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
>> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
>>> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>>
>>>> A Cray-like architecture makes the vectors _much_ bigger than even
>>>> in AVX-512, so in my naivete, I would have thought that this would
>>>> allow the constant changes to stop for a while.
>>> <
>>> Point of Order::
>>> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
>>> AVX vectors are processed en masse <however wide> per unit time.
>>> These are VASTLY different things.
> <
>> Yes. However, a faster Cray-like machine can be implemented
>> with as many floating-point ALUs as an AVX-style vector unit, or more.
>>
>> So you could have, say, AVX-512 with eight 64-bit floats across, and
>> then switch to Cray in the next generation with sixteen ALUs, and then
>> stick with Cray with thirty-two ALUs in the generation after that.
> <
> Yes, but no matter how wide the CRAY implementation was, the register
> file seen by instructions had the same shape.
> <
> Every time AVX grows in width SW gets a different model to program to.
> <
> That is the difference: CRAY exposed a common model, AVX exposed
> an ever-changing model.

This is the main advantage of a two-stage model, as used by C#/DotNet/JVM:

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: The Computer of the Future

<sul0ea$f3s$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=23612&group=comp.arch#23612

 by: Terje Mathisen - Thu, 17 Feb 2022 08:19 UTC

MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
>> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
>>> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>>
>>>> A Cray-like architecture makes the vectors _much_ bigger than even
>>>> in AVX-512, so in my naivete, I would have thought that this would
>>>> allow the constant changes to stop for a while.
>>> <
>>> Point of Order::
>>> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
>>> AVX vectors are processed en masse <however wide> per unit time.
>>> These are VASTLY different things.
> <
>> Yes. However, a faster Cray-like machine can be implemented
>> with as many floating-point ALUs as an AVX-style vector unit, or more.
>>
>> So you could have, say, AVX-512 with eight 64-bit floats across, and
>> then switch to Cray in the next generation with sixteen ALUs, and then
>> stick with Cray with thirty-two ALUs in the generation after that.
> <
> Yes, but no matter how wide the CRAY implementation was, the register
> file seen by instructions had the same shape.
> <
> Every time AVX grows in width SW gets a different model to program to.
> <
> That is the difference: CRAY exposed a common model, AVX exposed
> an ever-changing model.

(Hit Send too quickly. :-()
With a two-stage model like Mill/DotNet/JVM you can have vector
operations in your C# source code that get turned into the best/widest
available instruction set at JIT/AOT time on the first run.

This supposedly works today; I haven't tried it yet myself.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: The Computer of the Future

<sul480$jt6$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23613&group=comp.arch#23613

 by: BGB - Thu, 17 Feb 2022 09:24 UTC

On 2/15/2022 3:09 PM, Thomas Koenig wrote:
> Quadibloc <jsavard@ecn.ab.ca> schrieb:
>> I noticed that in another thread I might have seemed to have contradicted myself.
>>
>> So I will clarify.
>>
>> In the near term, in two or three years, I think that it's entirely possible that we
>> will have dies that combine four "GBOoO" performance cores with sixteen in-
>> order efficiency cores, and chips that have four of those dies in a package, to
>> give good performance on both number-crunching and database workloads.
>
> Who actually needs number crunching?
>
> I certainly do, even in the company I work in (which is in
> the chemical industry, so rather technical) the number of people
> actually running code which depends on floating point execution
> speed is rather small, probably in the low single digit percent
> range of all employees.
>

Probably depends. IME, integer operations tend to dominate, but in some
workloads, floating point tends to make its presence known.

> That does not mean that floating point is not important :-) but that
> most users would not notice if they had a CPU with, let's say, a
> reasonably efficient software emulation of floating point numbers.
>

Probably depends on "reasonably efficient".
Say ~20 cycles: probably only a minority of programs will notice.
Say ~500 cycles: probably nearly everything will notice.

One big factor is having integer operations which are larger than the
floating-point values being worked with.
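
As an illustration of why that factor matters: the core of a binary32
multiply fits comfortably in 64-bit integer registers, because the
24x24-bit significand product is only 48 bits wide. A sketch, handling
normal numbers only, with truncation instead of proper rounding and no
NaN/Inf/subnormal cases:

  #include <stdint.h>

  static uint32_t f32_mul_core(uint32_t a, uint32_t b)
  {
      uint32_t sign = (a ^ b) & 0x80000000u;
      int32_t  exp  = (int32_t)((a >> 23) & 0xFF)
                    + (int32_t)((b >> 23) & 0xFF) - 127;
      uint64_t ma   = (a & 0x007FFFFFu) | 0x00800000u;  /* implicit 1 */
      uint64_t mb   = (b & 0x007FFFFFu) | 0x00800000u;
      uint64_t p    = ma * mb;         /* 48-bit product, one multiply */
      if (p & (1ull << 47)) {          /* renormalize [2,4) to [1,2)   */
          p >>= 1;
          exp++;
      }
      return sign | ((uint32_t)exp << 23)
                  | (uint32_t)((p >> 23) & 0x007FFFFFu);
  }

With only 32-bit integer registers the same product takes a chain of
widening multiplies and shifts, which is part of how the cycle count
creeps from the ~20 end of the range toward the ~500 end.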

As noted, having an FPU which does ADD/SUB/MUL and a few conversion ops
and similar is for the most part "sufficient" for most practical uses.

Granted, having all the rest could be "better", but is more expensive.

> Such a CPU would look horrible in SPECfp, and the savings from removing
> floating point from a general purpose CPU are probably not that great so
> it is not done, and I think that as an intensive user of floating point,
> I have to be grateful for that.
>

Quick look, at least in my case, the FPU costs less than the L1 D$.

For a 1-wide core, it may be tempting to omit the FPU and MMU for cost
reasons. For a bigger core, omitting them may not be worthwhile if
anything actually uses them.

Or, at least within the limits of an FPU which is cost-cut enough
that correct rounding of the LSB is a bit hit or miss.

> Hmm, come to think of it, that is the first positive thing about
> SPEC that occurred to me in quite a few years...

....

Re: The Computer of the Future

<suljdm$7dq$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23619&group=comp.arch#23619

 by: Ivan Godard - Thu, 17 Feb 2022 13:43 UTC

On 2/16/2022 10:25 AM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>> On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
>>> On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
>>
>>>> After all, we've gone from MMX all the way to AVX-512 on the one hand,
>>
>>> These are not CRAY like.
>> That's true. One of the problems you've noted with the current
>> vector architectures is that they keep changing the instruction
>> set in order to make the vectors bigger.
>>
>> A Cray-like architecture makes the vectors _much_ bigger than even
>> in AVX-512, so in my naivete, I would have thought that this would
>> allow the constant changes to stop for a while.
> <
> Point of Order::
> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
> AVX vectors are processed en massé <however wide> per unit time.
> These are VASTLY different things.
> <
>>
>> As it became possible to put more and more transistors on a chip,
>> at first, the best way to make use of those extra transistors was
>> obvious. Give the computer the ability to do 16-bit arithmetic,
>> not just 8-bit arithmetic. Add floating-point in hardware.
>>
>> Adding Cray-like vector instructions seemed to me like the final
>> natural step in this evolution. Existing vector instructions kept
>> getting wider, so this would take that to its limit.
>>
>> This doesn't mean I expect to solve all supercomputing problems
>> that way. I don't claim to have a magic wand with which to solve the
>> memory latency issue.
>>
>> However, today's microprocessors *do* have L3 caches that are as
>> big as the memory of the original Cray I.
> <
> But with considerably LOWER concurrency.
> A CRAY might have 64 memory banks (NEC up to 245 banks)
> .....Each bank might take 5-10 cycles to perform 1 request
> .....but there can be up to 64 requests being performed.
> <
> At best modern L3 can be doing 3:
> Receiving write data,
> Routing Data around the SRAM matrix,
> Sending out read data.
> <
> There is nothing fundamental about the difference, but L3 caches are
> not built to have the concurrency of CRAY's banked memory.
> <
> < So, while that wouldn't
>> help in solving the problems that the supercomputers of today are
>> working on, it _would_ make more arithmetic operations available,
>> say, to video game writers.
> <
> In principle, yes; in practice, not so much.
>>
>> I'm envisaging a chip that tries to increase memory bandwidth, but
>> only within bounds suitable to a "consumer" product. Just as heatsinks
>> have grown much bigger than people would have expected back in the
>> days of the 386 processor, I'm thinking we could go with having 512
>> data lines going out of a processor. With four signal levels to further
>> double memory bandwidth.
> <
> PCIe 6.0 uses a 16 GHz clock to send 4 bits per wire per cycle, using
> double data rate and PAM4 modulation, and achieves 64 GT/s per wire
> in each direction. So 4 pins (true/complement out, true/complement in)
> provide 8 GB/s out and 8 GB/s in.
> <
> Now, remember from yesterday our 120 GB/s per core. You will need
> 10 of these 4-pin lanes to support the inbound bandwidth and 5 to
> support the outbound bandwidth.
> <
> But hey, if you want to provide 512 pins, I'm sure you can find some
> use for this kind of bandwidth. {but try dealing with the heat.}

If you don't really need the bandwidth, but have the pin count in the
socket, can't you get less heat by, say, driving the pins eight at a
time at an eighth the clock? (please forgive my HW ignorance)
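
To make the quoted arithmetic concrete, here is a minimal C sketch of
both calculations: the PCIe 6.0 per-lane rate and the CRAY bank
concurrency. The rates and counts are the ones quoted above; the
512-pin budget is Quadibloc's hypothetical, and the 4-pins-per-lane
framing follows Mitch's true/complement pairing:

/* Back-of-the-envelope check of the figures quoted in this post.
 * Assumed: 16 GHz clock, double data rate, PAM4 (2 bits per symbol),
 * so 4 bits per wire per clock; a "lane" here is 4 pins, one
 * differential pair in each direction. */
#include <stdio.h>

int main(void)
{
    double gts_per_wire = 16.0 * 2.0 * 2.0;   /* 64 GT/s: clock x DDR x PAM4 */
    double gbs_per_dir  = gts_per_wire / 8.0; /* 8 GB/s per lane, each way   */
    int pins  = 512;                          /* Quadibloc's data-pin budget */
    int lanes = pins / 4;                     /* 128 four-pin lanes          */

    printf("per lane: %.0f GT/s = %.0f GB/s each direction\n",
           gts_per_wire, gbs_per_dir);
    printf("%d pins = %d lanes = %.0f GB/s each direction\n",
           pins, lanes, lanes * gbs_per_dir); /* 1024 GB/s each way */

    /* Bank concurrency from the same post: 64 banks, each busy for
     * 5-10 cycles per request, sustain roughly 64/8 requests per
     * cycle, versus the ~3 concurrent operations of a typical L3. */
    printf("banked memory: ~%.0f requests/cycle sustained\n", 64.0 / 8.0);
    return 0;
}

Run as-is it prints 8 GB/s per lane each way and 1024 GB/s each way
for the 512-pin case, which is the kind of bandwidth (and heat) Mitch
is pointing at.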

Re: The Computer of the Future

<suljuo$it1$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23620&group=comp.arch#23620

 by: Ivan Godard - Thu, 17 Feb 2022 13:52 UTC

On 2/16/2022 5:03 PM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 5:32:06 PM UTC-6, Quadibloc wrote:
>> On Wednesday, February 16, 2022 at 4:24:14 PM UTC-7, Quadibloc wrote:
>>> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
>>
>>>> Question: If you have VVM and VVM performs as well as CRAY
>>>> vectors running Matrix300, why have the CRAY vector state
>>>> or bloat your ISA with CRAY vector instructions?
>>
>>> Now, that's a very good question.
>>>
>>> Possible answer 1:
>>> This is only included in the ISA because the spec is meant to
>>> illustrate possibilities, and would be omitted in any real-world
>>> CPU.
>>>
>>> Possible answer 2:
>>> The idea is that VVM wrapped around scalar floating-point
>>> instructions works well for vectors that are "this" long;
>>>
>>> VVM wrapped around AVX-style vector instructions works for
>>> vectors that are 4x longer, in proportion to the number of floats
>>> in a single AVX vector...
>>>
>>> VVM wrapped around Cray-style vector instructions is intended
>>> for vectors that are 64x longer than VVM wrapped around scalar
>>> instructions.
>>>
>>> Assume VVM around scalar handles vectors somewhat longer
>>> than Cray without VVM. Then what we've got is a range of options,
>>> each one adapted to how long your vectors happen to be. (And to
>>> things like granularity, because using VVM around Cray for vectors
>>> 2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
>>>
>>> Possible answer 3:
>>> And I might fail to implement VVM well enough to avoid some
>>> associated overhead.
> <
>> Actually, there's _another_ point I would need to raise here.
>>
>> In your design, VVM is the primary way to create vector operations.
>>
>> So if additional ALU resources are brought into play for vector
>> computation, using VVM would invoke them.
>>
>> On the other hand, I have a prejudice against instruction fusion,
>> as I feel it complicates decoding.
> <
> VVM loops are not fused. So, you have nothing to worry about.
> A VVM loop is a contract of what the loop must perform, and
> what the register state needs to be after the loop completes.
>>
>> And so I'm thinking in terms of a "dumb" implementation of VVM.
>> It reduces loop overhead, it permits data forwarding and stuff like
>> that...
> <
>>
>> but it _doesn't_ tell the scalar floating-point instructions to start
>> using the ALUs that belong to the AVX instructions or the Cray
>> instructions.
> <
> Because there ARE NO AVX or CRAY instructions, the programmer never
> gets misinformed as to which way to do this or that. HW gets to make
> those choices on an implementation-by-implementation basis. The old
> SW just keeps getting faster and remains near optimal across many
> generations.
>>
>> So if you've got a 65,536-element vector, you can *choose* to
>> process it using VVM around scalar, VVM around AVX, or
>> VVM around Cray.
> <
> Just code it up, let the compiler VVM the inner loops, and smile all
> the way to the bank.
>>
>> What this will *change* is how many instances of your program
>> can be floating around on your CPU at a given time running
>> concurrently - instead of some sitting idle on disk while a handful
>> are actually running.
> <
> You are not sharing these VVM computational resources across core
> boundaries. Bulldozer pretty much demonstrated you don't want to do
> that.
>>
>> The VVM around Cray program would be the one that could finish
>> faster in wall clock time if it happened to be given the run of the
>> entire CPU without having to share.
> <
> Doing CRAY-1 stuff, VVM should be just as efficient as CRAY.
> In fact it can perform Gather/Scatter as a side effect of the fact
> that it vectorizes LOOPs, not instructions.
> <
> For the addition of 2 instructions, you get a shot at CRAY performance
> and a shot at AVX performance, and do not waste 1000 opcodes to
> get there.

Yes - when the hardware can recognize that a loop is VVMable and doesn't
fall back to doing it scalar. It's clear that a lot of simple
(micro-benchmark) loops can be recognized with acceptably complex
recognizer logic. My concern about VVM is that the efforts at
auto-vectorization (which I do realize is not the same as VVM) have
shown that simple-enough loops are not as common in the wild as they
are in benchmarks.

I think the problem is really a linguistic one: our programming
languages have caused us to think about our programs in control flow
terms, when they are more naturally (and more efficiently) thought of
in dataflow terms.

Hence, streamers.
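
For concreteness, the kind of loop under discussion is ordinary scalar
code; under VVM the loop itself is the contract, not any vector
instruction. A hedged C sketch (plain portable C, not My 66000 code;
daxpy is just the conventional example, not taken from the post):

/* One scalar multiply-add per iteration. Under VVM the two added
 * instructions bracket the compiled loop; a small core runs it an
 * element at a time, while a wider implementation runs the same
 * binary at whatever rate its ALUs and cache ports allow. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Ivan's concern above is precisely that real inner loops are rarely
this clean, which makes the recognizer logic the hard part.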

Re: The Computer of the Future

<sulk70$it1$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23621&group=comp.arch#23621

 by: Ivan Godard - Thu, 17 Feb 2022 13:56 UTC

On 2/17/2022 1:24 AM, BGB wrote:
> On 2/15/2022 3:09 PM, Thomas Koenig wrote:
>> Quadibloc <jsavard@ecn.ab.ca> schrieb:
>>> I noticed that in another thread I might have seemed to have
>>> contradicted myself.
>>>
>>> So I will clarify.
>>>
>>> In the near term, in two or three years, I think that it's
>>> entirely possible that we will have dies that combine four "GBOoO"
>>> performance cores with sixteen in-order efficiency cores, and chips
>>> that have four of those dies in a package, to give good performance
>>> on both number-crunching and database workloads.
>>
>> Who actually needs number crunching?
>>
>> I certainly do, but even in the company I work for (which is in
>> the chemical industry, so rather technical) the number of people
>> actually running code which depends on floating-point execution
>> speed is rather small, probably in the low single-digit percent
>> range of all employees.
>>
>
> Probably depends. IME, integer operations tend to dominate, but in some
> workloads, floating point tends to make its presence known.
>
>
>> That does not mean that floating point is not important :-) but that
>> most users would not notice if they had a CPU with, let's say, a
>> reasonably efficient software emulation of floating point numbers.
>>
>
> Probably depends on "reasonably efficient".
>   Say, ~ 20 cycles, probably only a minority of programs will notice.
>   Say, ~ 500 cycles, probably nearly everything will notice.
>
> One big factor is having integer operations which are larger than the
> floating-point values being worked with.
>
>
> As noted, having an FPU which does ADD/SUB/MUL and a few conversion
> ops and the like is, for the most part, "sufficient" for practical uses.
>
> Granted, having all the rest could be "better", but is more expensive.

That's all Mill has.
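
To put rough numbers behind BGB's cycle estimates, a binary32 multiply
in software is only a handful of integer operations once the machine
has 64-bit integer registers. A hedged sketch follows (my
illustration, not any particular soft-float library; zeros,
subnormals, NaN/Inf, exponent overflow, and correct rounding are
deliberately omitted):

/* Software binary32 multiply using 64-bit integers. The 24x24-bit
 * mantissa product fits in one 64-bit multiply, which illustrates
 * BGB's point that integer operations wider than the float format
 * are the big cost factor. Truncates instead of rounding. */
#include <stdint.h>

uint32_t f32_mul(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t  e    = (int32_t)((a >> 23) & 0xFF) - 127
                  + (int32_t)((b >> 23) & 0xFF) - 127;
    uint64_t ma   = (a & 0x7FFFFFu) | 0x800000u;  /* implicit leading 1 */
    uint64_t mb   = (b & 0x7FFFFFu) | 0x800000u;
    uint64_t m    = ma * mb;                      /* 48-bit product */

    if (m & (1ull << 47)) { m >>= 1; e++; }       /* renormalize to 1.x */
    return sign | ((uint32_t)(e + 127) << 23)
                | ((uint32_t)(m >> 23) & 0x7FFFFFu);
}

With only 32-bit integer ops the multiply becomes a widening-multiply
sequence, and adding all the special cases is how a full routine
drifts toward the slow end of the range BGB sketches.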

Re: The Computer of the Future

<jwvk0dt4maw.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23624&group=comp.arch#23624

 by: Stefan Monnier - Thu, 17 Feb 2022 14:19 UTC

> Cray instructions in it will take input from memory and put output to
> memory, while putting intermediate results in vector registers... and
> a VVM loop would do exactly the same thing, except using ordinary
> registers.

There is the important difference that Cray vector registers are
significantly bigger than My 66000 registers, so really if you look at
the ISA itself, VVM matches the behavior of those CPUs that had
in-memory vectors rather than vector registers.

IIRC none of those vector processors used caches, so maybe VVM can be
compared to Cray-style in-register vectors except that the vector
registers are stored in the L1 cache (though it might not give quite as
much read bandwidth, admittedly).

Stefan
