Newsgroups: comp.arch
Subject: Re: The Computer of the Future
From: MitchAl...@aol.com (MitchAlsup)
Date: Wed, 16 Feb 2022 16:54:43 -0800 (PST)
Message-ID: <5b38cb28-2409-453d-9ea0-c00779ecbb8fn@googlegroups.com>
In-Reply-To: <1a8a324d-34b8-4c1e-876e-1a0cde795e3fn@googlegroups.com>
References: <b94f72eb-d747-47a4-85cd-d4c351cfcc5fn@googlegroups.com>
 <858e7a72-fcc0-4bab-b087-28b9995c7094n@googlegroups.com>
 <2e266e3e-b633-4c2a-bd33-962cb675bb77n@googlegroups.com>
 <fb409a7e-e1a2-4eaf-8fbb-d697ac3f0febn@googlegroups.com>
 <1a8a324d-34b8-4c1e-876e-1a0cde795e3fn@googlegroups.com>
 by: MitchAlsup - Thu, 17 Feb 2022 00:54 UTC

On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
> > On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>
> > > A Cray-like architecture makes the vectors _much_ bigger than even
> > > in AVX-512, so in my naivete, I would have thought that this would
> > > allow the constant changes to stop for a while.
> > <
> > Point of Order::
> > CRAY vectors are processed in a pipeline: 1, 2, or 4 units of work per unit time.
> > AVX vectors are processed en masse <however wide> per unit time.
> > These are VASTLY different things.
<
> Yes. However, a faster Cray-like machine can be implemented
> with as many, or more, floating ALUs than an AVX-style vector unit.
>
> So you could have, say, AVX-512 with eight 64-bit floats across, and
> then switch to Cray in the next generation with sixteen ALUs, and then
> stick with Cray with thirty-two ALUs in the generation after that.
<
Yes, but no matter how wide the CRAY implementation was, the register
file seen by instructions had the same shape.
<
Every time AVX grows in width SW gets a different model to program to.
<
That is the difference: CRAY exposed a common model, while AVX exposes
an ever-changing model.
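To make the contrast concrete, a sketch in C (function names are mine;
remainder handling is omitted and n is assumed a multiple of the SIMD
width): the same loop must be rewritten for each SIMD generation, while
the width-free form never changes.

    #include <immintrin.h>
    #include <stddef.h>

    /* SSE2 model: 2 doubles per instruction. */
    void add_sse2(double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(a + i);
            __m128d vb = _mm_loadu_pd(b + i);
            _mm_storeu_pd(a + i, _mm_add_pd(va, vb));
        }
    }

    /* AVX-512 model: 8 doubles per instruction -- same loop, new code. */
    void add_avx512(double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 8) {
            __m512d va = _mm512_loadu_pd(a + i);
            __m512d vb = _mm512_loadu_pd(b + i);
            _mm512_storeu_pd(a + i, _mm512_add_pd(va, vb));
        }
    }

    /* Common model: one expression of the loop, any implementation width. */
    void add_common(double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] += b[i];
    }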
<
> > > However, today's microprocessors *do* have L3 caches that are as
> > > big as the memory of the original Cray I.
>
> > But with considerably LOWER concurrency.
> > A CRAY might have 64 memory banks (NEC up to 256 banks)
> > .....Each bank might take 5-10 cycles to perform 1 request
> > .....but there can be up to 64 requests being performed.
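The arithmetic behind that concurrency claim, as a sketch (only the
figures quoted above are used): sustained throughput of a banked memory
is the bank count divided by the bank-busy time.

    #include <stdio.h>

    int main(void)
    {
        int banks = 64;                        /* figure quoted above */
        for (int busy = 5; busy <= 10; busy += 5) {
            double refs = (double)banks / busy;
            printf("%2d-cycle banks -> %4.1f requests retired per cycle\n",
                   busy, refs);
        }
        return 0;
    }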
> But if the cache is *on the same die*, having more wires connecting it
> to the CPU isn't much of a problem?
> > At best a modern L3 can be doing 3 things at once:
> > Receiving write data,
> > Routing Data around the SRAM matrix,
> > Sending out read data.
> > <
> > There is nothing fundamental about the difference, but L3 caches are
> > not built to have the concurrency of CRAY's banked memory.
<
> So we seem to be in agreement on this point.
<
> > PCIe 6.0 uses a 16 GHz clock to send 4 bits per wire per cycle using
> > double data rate and PAM4 modulation, and achieves 64 GT/s per wire
> > in each direction. So 4 pins: true-comp out, true-comp in: provide 8 GB/s
> > out and 8 GB/s in.
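As a sanity check on those numbers (a sketch; only figures from the
paragraph above are used):

    #include <stdio.h>

    int main(void)
    {
        double clock_ghz = 16.0;   /* PCIe 6.0 clock                   */
        double ddr       = 2.0;    /* transfers per clock (DDR)        */
        double pam4      = 2.0;    /* bits per transfer (PAM4)         */
        double gbit = clock_ghz * ddr * pam4;  /* 64 Gbit/s per wire   */
        printf("%.0f Gbit/s per wire = %.0f GB/s each direction\n",
               gbit, gbit / 8.0);
        return 0;
    }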
<
> Of course, PCIe 6.0 is a complicated protocol, while interfaces
> like DDR5 to DRAM are kept simple by comparison.
<
> > But hey, if you want to provide 512 pins, I'm sure you can find some use
> > for this kind of bandwidth. {but try dealing with the heat.}
<
> Presumably chips implemented in the final evolution of 1nm or whatever
> will run slightly cooler.
<
Logic may, pins will not.
>
> I had thought that the number of CPUs in the package was what governed
> the heat, and using more pins for data would not be too bad. If that's not
> true, then, yes, this would be *one* fatal objection to my concepts.
<
For the power cost of sending 1 byte from DRAM to the pins of the processor,
you can read ~{64 to 128} bytes from nearby on-die SRAM.
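Scaled up, that ratio dominates any copy that misses on-die storage. A
sketch (the 64-128x ratio is from the paragraph above; the absolute
2 pJ/byte SRAM figure is an assumed ballpark, not from the post):

    #include <stdio.h>

    int main(void)
    {
        double sram_pj = 2.0;              /* assumed pJ/byte, on-die  */
        double mb      = 1024.0 * 1024.0;  /* bytes in 1 MB            */
        printf("1 MB from on-die SRAM: ~%.1f uJ\n", sram_pj * mb / 1e6);
        printf("1 MB from DRAM pins:   ~%.0f to %.0f uJ\n",
               64.0 * sram_pj * mb / 1e6, 128.0 * sram_pj * mb / 1e6);
        return 0;
    }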
<
> > More pins wiggling faster has always provided more bandwidth.
> > Being able to absorb the latency has always been the problem.
> > {That and paying for it: $$$ and heat}
<
> Of course, the way to absorb latency is to do something else while
> you're waiting. So now you need bandwidth for the first thing you
> were doing, and now more bandwidth for the something else
> you're doing so as to make the latency less relevant.
>
> This sounds like a destructive paradox. But since latency is
> fundamentally unfixable (until you make the transistors and
> wires faster) while you can have more bandwidth if you pay for
> it, the idea of having the amount of bandwidth you needed in
> the first place, times eight or so, almost makes sense.
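The "times eight or so" intuition is Little's law: bytes in flight =
bandwidth x latency. A sketch with assumed round numbers (both figures
are mine, not from the post):

    #include <stdio.h>

    int main(void)
    {
        double bw_bytes_per_ns = 64.0;   /* assumed 64 GB/s target     */
        double latency_ns      = 100.0;  /* assumed DRAM round trip    */
        double in_flight = bw_bytes_per_ns * latency_ns;
        printf("%.0f bytes in flight = %.0f cache lines of 64 B\n",
               in_flight, in_flight / 64.0);
        return 0;
    }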
<
> > > Exactly how a modified GPU design aimed at simulating a Cray
> > > or multiple Crays in parallel working on different problems might
> > > look is not clear to me, but I presume that if one can put a bunch
> > > of ALUs on a chip, and one can organize that to look like a GPU
> > > or like a Xeon Phi (but with RISC instead of x86), it could also be
> > > organized to look like something in between adapted to a
> > > Cray-like instruction set.
<
> ....and a Cray-like instruction set could be like a later generation of
> the Cray, with more and longer vector registers, and in other ways
> it could move to being more GPU-like if that was needed to fix some
> flaws.
<
Since you mentioned later CRAYs--they do Gather/Scatter LDs/STs.
64 addresses to 64 different memory locations in 64 clocks per port,
with 3 of these ports.
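In C terms, gather/scatter semantics look like this (a sketch of the
semantics only, not Cray code):

    /* Each port of such a machine retires roughly one element per
       clock, hence 64 elements in ~64 clocks per port.              */
    void gather(double dst[64], const double *base, const long idx[64])
    {
        for (int i = 0; i < 64; i++)
            dst[i] = base[idx[i]];        /* 64 independent addresses */
    }

    void scatter(double *base, const long idx[64], const double src[64])
    {
        for (int i = 0; i < 64; i++)
            base[idx[i]] = src[i];
    }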
<
> > > Since Crays used 64-element vector registers for code in loops
> > > that handled vectors with more than 64 elements, that these loops
> > > might well be... augmented... by means of something looking a
> > > bit like your VVM is also not beyond the bounds of imagination.
> > > (But if you're using something like VVM, why have vector instructions?
> > > Reducing decoding overhead!)
>
> > Exactly! Let each generation of HW give the maximum performance
> > it can while the application code remains constant.
> I'm glad you approve of something...
> > Secondly: If you want wide vector performance, you need to be organized
> > around ¼, ½, or 1 cache line per clock out of the cache and back into the cache.
> > The width appropriate for one generation is not necessarily appropriate for
> > the next--so don't expose width through ISA.
> Of course, IBM's take on a Cray-like architecture avoided that pitfall, by
> excluding the vector width from the ISA spec, making it model-dependent,
> so that's definitely possible.
> > > Of course, though, my designs will have scalar floating-point
> > > instructions, short vector instructions (sort of like AVX-256),
> > > and long vector instructions (like a Cray)... because they're
> > > intended to illustrate what an architecture burdened with a
> > > rather large amount of carried-over legacy stuff would look like. But because it
> > > was designed on a clean sheet of paper, it only gets one
> > > kind of short vectors to support, rather than several like an x86.
>
> > > And there would be a somewhat VVM-like set of
> > > vector of vector wrapper instructions that could be wrapped
> > > around *any* of them.
>
> > Question: If you have VVM and VVM performs as well as CRAY
> > vectors running Matrix300, why have the CRAY vector state
> > or bloat your ISA with CRAY vector instructions?
<
> Now, that's a very good question.
>
> Possible answer 1:
> This is only included in the ISA because the spec is meant to
> illustrate possibilities, and would be omitted in any real-world
> CPU.
>
> Possible answer 2:
> The idea is that VVM wrapped around scalar floating-point
> instructions works well for vectors that are "this" long;
<
The width of the SIMD engine is hidden from the expression of
use found in the code. Byte loops would tend to run 64 iterations
per cycle, HalfWord loops 32 iterations per cycle, Word loops
16 iterations per cycle, and DoubleWord loops 8 iterations per cycle.
<
At this width, you are already at the limit of Cache Access, likely
DRAM, so I don't foresee a need for more than 1 cache line of
calculation: you can't feed it from memory.
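In source-code terms the point is that nothing below names a width; the
rates above are properties of one implementation, not of the code (a
sketch; function names are mine):

    #include <stddef.h>
    #include <stdint.h>

    /* Per the rates above, one implementation might retire ~64
       iterations/cycle of the byte loop but ~8 iterations/cycle of
       the DoubleWord loop -- same source, same instructions.        */
    void copy_bytes(uint8_t *to, const uint8_t *from, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            to[i] = from[i];
    }

    void add_doubles(double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] += b[i];
    }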
>
> VVM wrapped around AVX-style vector instructions works for
> vectors that are 4x longer, in proportion to the number of floats
> in a single AVX vector...
<
VVM supplants any need for AVX contortions. VVM supplies the
entire memory reference and calculation set of AVX with just 2
instructions.
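As described, the two instructions bracket an otherwise scalar loop. A
sketch of what gets vectorized, in C (the VEC/LOOP placement in the
comments reflects the post's claim; the code itself is ordinary C):

    #include <stddef.h>

    /* DAXPY: under VVM, per the post, the compiler marks the top of
       the loop with VEC and closes it with LOOP; the scalar body in
       between is what the hardware runs at full width. No vector
       register file, no per-width intrinsics.                       */
    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }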
>
> VVM wrapped around Cray-style vector instructions is intended
> for vectors that are 64x longer than VVM wrapped around scalar
> instructions.
<
VVM supplants any need for CRAY-like vectors without the register file
and without a zillion vector instructions.
>
> Assume VVM around scalar handles vectors somewhat longer
> than Cray without VVM. Then what we've got is a range of options,
> each one adapted to how long your vectors happen to be. (And to
> things like granularity, because using VVM around Cray for vectors
> 2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
<
VVM does not expose width or depth to software--this is the point you
are missing. Thus, SW has a common model over time. Users can go
about optimizing the hard parts of their applications rather than
things like byte copy. Byte copy runs just as fast as AVX copy on My 66000.
<
Why add all the gorp when you can wave your hands and make all
the cruft vanish?
>
> Possible answer 3:
> And I might fail to implement VVM well enough to avoid some
> associated overhead.
<
Not my problem.
>
> John Savard
