Short Vectors Versus Long Vectors


<v06vdb$17r2v$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37943&group=comp.arch#37943

Newsgroups: comp.arch

by: Lawrence D'Oliveiro - Tue, 23 Apr 2024 00:29 UTC

Adding the typical kind of vector-processing instructions to an
instruction set inevitably leads to a combinatorial explosion in the
number of opcodes. This kind of thing makes a mockery of the “R” in
“RISC”.

Interesting to see that the RISC-V folks are staying off this path;
instead, they are reviving an old idea from Seymour Cray’s original
machines that bear his name: a vector pipeline. Instead of being limited
to processing 4 or 8 operands at a time, the Cray machines could operate
(sequentially, but rapidly) on variable-length vectors of up to 64
elements with a single setup sequence. RISC-V seems to make the limit on
vector length an implementation choice, with a value of 32 being mentioned
in the spec.

The way it avoids having separate instructions for each combination of
operand types is to have operand-type registers as part of the vector
unit. This way, only a small number of instructions is required to set up
all the combinations of operand/result types. You then give it a kick in
the guts and off it goes.

Detailed spec here:
<https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.
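
The variable-length idea described above can be modelled in plain C. This is only a minimal sketch, not RISC-V code: `VLMAX`, `set_vl`, and `vec_add` are hypothetical names chosen for illustration, the value 32 is an assumption, and the inner loop merely stands in for a single vector instruction. In the actual extension the length request is made with the `vsetvli` instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 32  /* assumed implementation limit on elements per vector op */

/* Models the "set vector length" step: request `remaining` elements and
   get back how many the (modelled) hardware will handle in this pass. */
static size_t set_vl(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

/* c[i] = a[i] + b[i] for n elements, one vector-length chunk per pass. */
void vec_add(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    while (n > 0) {
        size_t vl = set_vl(n);
        for (size_t k = 0; k < vl; k++)  /* stands in for one vector add */
            c[k] = a[k] + b[k];
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
    }
}
```

The same loop body works unchanged whatever value `VLMAX` takes, which is the point of leaving the vector length to the implementation.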

Re: Short Vectors Versus Long Vectors

<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37946&group=comp.arch#37946

From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 02:14:32 +0000
Organization: Rocksolid Light
Message-ID: <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me>

by: MitchAlsup1 - Tue, 23 Apr 2024 02:14 UTC

Lawrence D'Oliveiro wrote:

> Adding the typical kind of vector-processing instructions to an
> instruction set inevitably leads to a combinatorial explosion in the
> number of opcodes. This kind of thing makes a mockery of the “R” in
> “RISC”.

It does indeed make a mockery of the R in RISC.

> Interesting to see that the RISC-V folks are staying off this path;
> instead, they are reviving an old idea from Seymour Cray’s original
> machines that bear his name: a vector pipeline. Instead of being limited
> to processing 4 or 8 operands at a time, the Cray machines could operate
> (sequentially, but rapidly) on variable-length vectors of up to 64
> elements with a single setup sequence. RISC-V seems to make the limit on
> vector length an implementation choice, with a value of 32 being mentioned
> in the spec.

CRAY machines stayed "in style" as long as memory latency remained smaller
than the length of a vector (64 cycles) and fell out of favor when the cores
got fast enough that memory could no longer keep up.

I wish them well, but I expect it will not work out as they desire.

> The way it avoids having separate instructions for each combination of
> operand types is to have operand-type registers as part of the vector
> unit. This way, only a small number of instructions is required to set up
> all the combinations of operand/result types. You then give it a kick in
> the guts and off it goes.

> Detailed spec here:
> <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.

On the other hand, My 66000 has support for both SIMD and CRAY-like vectors
and the ISA contains only 6-bits of state supporting vectorization and
exactly 2 instructions--one that gives HW a register it can use in the
"loop" and the LOOP instruction that performs the ADD-CMP-BC functionality.
{{Not 2 for every kind of vectorized instruction, 2 total instructions}}

There is no 4KB of register file (context switch overhead),
there is no need for Gather/Scatter, stride memory references,
there is no masking register,
the OS can use vectorization for small fast loops without overhead,
the compiler does not have to solve memory address aliasing,
cache activities are modified to suit vector workloads,
exotic HW can execute across multiple lanes (as desired),
simple HW can "do it all" in a 1-wide pipeline,
the debugger presents scalar code to the coder,
and exceptions remain precise (for those that care),
and the exception handler(s) sees only scalar code.

Re: Short Vectors Versus Long Vectors

<v078td$1df76$4@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37950&group=comp.arch#37950

From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 03:11:41 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 8
Message-ID: <v078td$1df76$4@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>

by: Lawrence D'Oliveiro - Tue, 23 Apr 2024 03:11 UTC

On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

> CRAY machines stayed "in style" as long as memory latency remained
> smaller than the length of a vector (64 cycles) and fell out of favor
> when the cores got fast enough that memory could no longer keep up.

So why would conventional short vectors work better, then? Surely the
latency discrepancy would be even worse for them.

Re: Short Vectors Versus Long Vectors

<2024Apr23.082238@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=37952&group=comp.arch#37952

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 06:22:38 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 81
Message-ID: <2024Apr23.082238@mips.complang.tuwien.ac.at>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me>

by: Anton Ertl - Tue, 23 Apr 2024 06:22 UTC

Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>
>> CRAY machines stayed "in style" as long as memory latency remained
>> smaller than the length of a vector (64 cycles) and fell out of favor
>> when the cores got fast enough that memory could no longer keep up.

Mitch Alsup repeatedly makes this claim without giving any
justification. Your question may shed some light on that.

>So why would conventional short vectors work better, then? Surely the
>latency discrepancy would be even worse for them.

Thinking about it, they probably don't work better. They just don't
work worse, so why spend area on 4096-bit vector registers like the
Cray-1 did when 128-512-bit vector registers work just as well? Plus,
they have 200 or so of these registers, so 4096-bit registers would be
really expensive. How many vector registers does the Cray-1 (and its
successors) have?

On modern machines OoO machinery bridges the latency gap between the
L2 cache, maybe even the L3 cache and the core for data-parallel code.
For the latency gap to main memory there are the hardware prefetchers,
and they use the L1 or L2 cache as intermediate buffer, while the
Cray-1 and followons use vector registers.

So what's the benefit of using vector/SIMD instructions at all rather
than doing it with scalar code? A SIMD instruction that replaces n
scalar instructions consumes fewer resources for instruction fetching,
decoding, register renaming, administering the instruction in the OoO
engine, and in retiring the instruction.

So why not use SIMD instructions with longer vector registers? The
progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
suggests that this is happening, but with every doubling the cost in
area doubles but the returns are diminishing thanks to Amdahl's law.
So at some point you stop. Intel introduced AVX-512 for Larrabee (a
special-purpose machine), and now is backpedaling with desktop, laptop
and small-server CPUs (even though only the Golden/Raptor Cove cores
are enabled on the small-server CPUs) only supporting AVX, and with
AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
vector registers are already too costly for the benefit they give in
general-purpose computing.
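
The diminishing returns mentioned above follow directly from Amdahl's law. As a worked sketch (the function name `amdahl_speedup` and the 50% figure below are assumptions for illustration, not measurements):

```c
/* Amdahl's law: overall speedup when a fraction p of the run time
   (0 <= p <= 1) is accelerated by a factor k. */
double amdahl_speedup(double p, double k) {
    return 1.0 / ((1.0 - p) + p / k);
}
```

If, hypothetically, half of the run time vectorizes perfectly, then 2, 4, and 8 lanes give overall speedups of about 1.33, 1.60, and 1.78: each doubling of width buys less than the one before, while the area cost keeps doubling.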

Back to old-style vector processors. There have been machines that
supported longer vector registers and AFAIK also memory-to-memory
machines. The question is why have they not been the answer of the
vector-processor community to the problem of covering the latency? Or
maybe they have? AFAIK NEC SX has been available in some form even in
recent years, maybe still.

Anyway, after thinking about this, the reason behind Mitch Alsup's
statement is that in a

doall(load process store)

computation (like what SIMD is good at), the loads precede the
corresponding processing by the load latency (i.e., memory latency on
the Cray machines). If your OoO capabilities are limited (and I think
they are on the Cray machines), you cannot start the second iteration
of the doall loop before the processing step of the first iteration
has finished with the register. You can do a bit of software
pipelining and software register renaming by transforming this into

load1 doall(load2 process1 store1 load1 process2 store2)

but at some point you run out of vector registers.
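
The transformed schedule can be sketched in C with two modelled vector registers. This is an illustration only: `vreg`, `vload`, `vprocess`, `vstore`, and `doall_pipelined` are hypothetical names, the "process" step is an arbitrary stand-in (doubling), and ordinary C executes the steps sequentially, so the code shows the ordering of operations rather than any real overlap.

```c
#include <stddef.h>

#define VLEN 64  /* elements per modelled vector register */

typedef struct { double e[VLEN]; } vreg;  /* hypothetical vector register */

static void vload(vreg *r, const double *src) {
    for (size_t k = 0; k < VLEN; k++) r->e[k] = src[k];
}

static void vprocess(vreg *r) {            /* stand-in for the real work */
    for (size_t k = 0; k < VLEN; k++) r->e[k] = 2.0 * r->e[k];
}

static void vstore(double *dst, const vreg *r) {
    for (size_t k = 0; k < VLEN; k++) dst[k] = r->e[k];
}

/* load1 doall(load2 process1 store1 load1 process2 store2):
   the load for chunk c+1 is issued before chunk c is processed. */
void doall_pipelined(const double *in, double *out, size_t n_chunks) {
    vreg r[2];                              /* software register renaming */
    if (n_chunks == 0) return;
    vload(&r[0], in);                       /* ramp-up: first load */
    for (size_t c = 0; c < n_chunks; c++) {
        vreg *cur = &r[c & 1];
        vreg *next = &r[(c + 1) & 1];
        if (c + 1 < n_chunks)
            vload(next, in + (c + 1) * VLEN);  /* next load issued early */
        vprocess(cur);
        vstore(out + c * VLEN, cur);
    }
}
```

Covering a longer latency would mean issuing loads further ahead, which needs more entries in `r`; that is the register pressure the post refers to.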

One thing that comes to mind is tracking individual parts of the
vector registers, which allows starting the next iteration as soon
as the first part of the vector register no longer has any readers.
However, it's probably not that far off in complexity to tracking
shorter vector registers in an OoO engine. And if you support
exceptions (the Crays probably don't), this becomes messy, while with
short vector registers it's easier to implement the (ISA)
architecture.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Short Vectors Versus Long Vectors

<v07rkp$1h6fk$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37953&group=comp.arch#37953

From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 08:31:21 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <v07rkp$1h6fk$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>

by: Lawrence D'Oliveiro - Tue, 23 Apr 2024 08:31 UTC

On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:

> So what's the benefit of using vector/SIMD instructions at all rather
> than doing it with scalar code?

On the original Cray machines, I read somewhere the benefit of using the
vector versions over the scalar ones was a net positive for a vector
length as low as 2.

> If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration of
> the doall loop before the processing step of the first iteration has
> finished with the register.

How would out-of-order execution help, anyway, given all the operations on
the vector elements are supposed to be identical? Unless it’s just greater
parallelism.

Re: Short Vectors Versus Long Vectors

<2024Apr23.144007@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=37954&group=comp.arch#37954

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 12:40:07 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 84
Message-ID: <2024Apr23.144007@mips.complang.tuwien.ac.at>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at> <v07rkp$1h6fk$1@dont-email.me>

by: Anton Ertl - Tue, 23 Apr 2024 12:40 UTC

Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:
>> If your OoO capabilities are limited (and I think
>> they are on the Cray machines), you cannot start the second iteration of
>> the doall loop before the processing step of the first iteration has
>> finished with the register.
>
>How would out-of-order execution help, anyway, given all the operations on
>the vector elements are supposed to be identical?

OoO does in hardware what software pipelining does in software: If I
have a loop

    for (i=0; i<n; i++)
        c[i] = a[i]+b[i];

the straightforward way to code this is:

L0:
    load tmp1 = a[i]
    load tmp2 = b[i]
    add tmp3 = tmp1,tmp2
    store c[i] = tmp3
    add i = i+1
    branch L0 if i<n

On an in-order CPU you then do things like loop unrolling, modulo
scheduling and modulo variable renaming to get a steady state like:

L0:
    store c[i], tmp1
    add tmp3 = tmp3,tmp4
    load tmp9 = a[i+4]
    load tmp10 = b[i+4]
    store c[i+1], tmp3
    add tmp5 = tmp5,tmp6
    load tmp1 = a[i+5]
    load tmp2 = b[i+5]
    store c[i+2], tmp5
    add tmp7 = tmp7,tmp8
    load tmp3 = a[i+6]
    load tmp4 = b[i+6]
    store c[i+3], tmp7
    add tmp9 = tmp9,tmp10
    load tmp5 = a[i+7]
    load tmp6 = b[i+7]
    store c[i+4], tmp9
    add tmp1 = tmp1,tmp2
    load tmp7 = a[i+8]
    load tmp8 = b[i+8]
    add i=i+5
    branch L0 if i<n-4

And that's just to cover a load latency of 4 cycles, assuming that the
machine can perform 2 loads and one store per cycle. And you have to
generate the ramp-up and ramp-down code, and for more complicated
loops it becomes more complicated.
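
The ramp-up and ramp-down code mentioned above is easiest to see in a much smaller case. The following is a minimal sketch in C, with a lookahead of just one iteration instead of four; `add_pipelined` is a hypothetical name, and a real compiler would perform this transformation on machine code rather than on source.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i], with the loads for iteration i+1 placed ahead of
   the add and store of iteration i (software pipelining, depth 1). */
void add_pipelined(const int *a, const int *b, int *c, size_t n) {
    if (n == 0) return;
    int t1 = a[0], t2 = b[0];              /* ramp-up: loads for i = 0 */
    for (size_t i = 0; i + 1 < n; i++) {   /* steady state */
        int n1 = a[i + 1];                 /* loads for the next iteration */
        int n2 = b[i + 1];
        c[i] = t1 + t2;                    /* add and store for this one */
        t1 = n1;                           /* "register renaming" by hand */
        t2 = n2;
    }
    c[n - 1] = t1 + t2;                    /* ramp-down: last add and store */
}
```

Even at depth 1 the loop needs a guard for `n == 0`, a prologue, and an epilogue, which hints at how the bookkeeping grows with the latency to be covered.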

By contrast, on an OoO machine the straightforward code just works
efficiently, and the hardware does the reordering and register
renaming (and the Golden Cove with its 0-cycle constant additions
eliminates even a part of the reason for loop unrolling). It creates
the ramp-up automatically, and, if the loop exit is predicted
correctly, even the ramp-down, and it overlaps the ramp-up (and
possibly the ramp-down) with adjacent code.

Back to the Crays: While the SIMD/vector semantics means that a
straightforward loop will process 64 elements rather than one before
the first load of the second iteration has to wait for the add of
the first iteration to finish, you still have to do some software
pipelining to get an overlap between that add and that load; the
longer the latency, the more software pipelining and (for register
renaming) the more registers you need.

In OoO the corresponding condition is when the OoO engine has consumed
all instances of one resource and has to wait for instructions to
finish to free these resources; ideally the hardware prefetcher avoids
that scenario, but in memory-bandwidth-limited situations it will
occur.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Short Vectors Versus Long Vectors

<1ba7eb4d351901d065977ac9b28171d0@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37955&group=comp.arch#37955

From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 17:34:12 +0000
Organization: Rocksolid Light
Message-ID: <1ba7eb4d351901d065977ac9b28171d0@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me>

by: MitchAlsup1 - Tue, 23 Apr 2024 17:34 UTC

Lawrence D'Oliveiro wrote:

> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

>> CRAY machines stayed "in style" as long as memory latency remained
>> smaller than the length of a vector (64 cycles) and fell out of favor
>> when the cores got fast enough that memory could no longer keep up.

> So why would conventional short vectors work better, then? Surely the
> latency discrepancy would be even worse for them.

Yes, the later NEC long vector machines grew their VRF up to 256 entries
per register.

As to why RISC-V went shorter I can only imagine they think vector codes
can be compiled properly for a quicker memory hierarchy (i.e., hit in
L1 or L2 caches).

Re: Short Vectors Versus Long Vectors

<c3a6a8910c1d5dc883c7fb40565088b6@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37956&group=comp.arch#37956

From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 17:49:25 +0000
Organization: Rocksolid Light
Message-ID: <c3a6a8910c1d5dc883c7fb40565088b6@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>

by: MitchAlsup1 - Tue, 23 Apr 2024 17:49 UTC

Anton Ertl wrote:

> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>>On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>
>>> CRAY machines stayed "in style" as long as memory latency remained
>>> smaller than the length of a vector (64 cycles) and fell out of favor
>>> when the cores got fast enough that memory could no longer keep up.

> Mitch Alsup repeatedly makes this claim without giving any
> justification. Your question may shed some light on that.

Consider a CRAY-like vector machine with 128-cycle main memory
and 64-entry VRF registers. If it only takes 64 cycles to send
out all the addresses, but takes 128 cycles to return, there is
no "chain slot"--chain slot only works when the memory latency
is shorter than vector length.

And without chain slot, vectors are not higher performing (by
much) compared to scalar operation. Vectors were a way of
appearing to perform one beat of work per cycle per active
function unit.
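
A toy cycle count makes the argument concrete. This sketch takes the condition stated above at face value (a chain slot exists only when memory latency is shorter than the vector length) and is not a model of any actual CRAY control logic; `load_then_add_cycles` and the latencies used in the example are assumptions.

```c
/* Cycles until a vector add that depends on a vector load has produced
   its last element, under the simplified chaining rule described above. */
unsigned load_then_add_cycles(unsigned vl, unsigned mem_latency,
                              unsigned add_latency) {
    unsigned load_done = mem_latency + vl;  /* last loaded element arrives */
    if (mem_latency < vl)                   /* chain slot available */
        return load_done + add_latency;     /* add trails the load by element */
    return load_done + add_latency + vl;    /* add waits for the whole load */
}
```

With 64-element vectors and an assumed 6-cycle adder, a 16-cycle memory gives 86 cycles for the pair, while a 128-cycle memory gives 262, so the same 64 results take about three times as long.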

>>So why would conventional short vectors work better, then? Surely the
>>latency discrepancy would be even worse for them.

Context switch latency...

> Thinking about it, they probably don't work better. They just don't
> work worse, so why spend area on 4096-bit vector registers like the
> Cray-1 did when 128-512-bit vector registers work just as well?

But do they work as well ??

> Plus,
> they have 200 or so of these registers, so 4096-bit registers would be
> really expensive. How many vector registers does the Cray-1 (and its
> successors) have?

> On modern machines OoO machinery bridges the latency gap between the
> L2 cache, maybe even the L3 cache and the core for data-parallel code.

Mc 88120 would run MATRIX 300 at just under 6 I/C with massive cache
misses (~33%).

> For the latency gap to main memory there are the hardware prefetchers,
> and they use the L1 or L2 cache as intermediate buffer, while the
> Cray-1 and followons use vector registers.

Opening yourself up to Spectre-like attacks.

> So what's the benefit of using vector/SIMD instructions at all rather
> than doing it with scalar code? A SIMD instruction that replaces n
> scalar instructions consumes fewer resources for instruction fetching,
> decoding, register renaming, administering the instruction in the OoO
> engine, and in retiring the instruction.

I can argue that SIMD is "just a waste of ISA encoding space".

Not to mention that the 512 version can only run a few SIMD instructions
at that width before thermally throttling itself.

> So at some point you stop. Intel introduced AVX-512 for Larrabee (a
> special-purpose machine), and now is backpedaling with desktop, laptop
> and small-server CPUs (even though only the Golden/Raptor Cove cores
> are enabled on the small-server CPUs) only supporting AVX, and with
> AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
> vector registers are already too costly for the benefit they give in
> general-purpose computing.

> Back to old-style vector processors. There have been machines that
> supported longer vector registers and AFAIK also memory-to-memory
> machines. The question is why have they not been the answer of the
> vector-processor community to the problem of covering the latency? Or
> maybe they have? AFAIK NEC SX has been available in some form even in
> recent years, maybe still.

> Anyway, after thinking about this, the reason behind Mitch Alsup's
> statement is that in a

> doall(load process store)

> computation (like what SIMD is good at), the loads precede the
> corresponding processing by the load latency (i.e., memory latency on
> the Cray machines). If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration
> of the doall loop before the processing step of the first iteration
> has finished with the register.

Unless the compiler can solve the memory aliasing problem.

> You can do a bit of software
> pipelining and software register renaming by transforming this into

> load1 doall(load2 process1 store1 load1 process2 store2)

> but at some point you run out of vector registers.

> One thing that comes to mind is tracking individual parts of the
> vector registers, which allows to starting the next iteration as soon
> as the first part of the vector register no longer has any readers.

A vector scoreboard anyone ??

> However, it's probably not that far off in complexity to tracking
> shorter vector registers in an OoO engine. And if you support
> exceptions (the Crays probably don't), this becomes messy, while with
> short vector registers it's easier to implement the (ISA)
> architecture.

All of which is solved with VVM. Consider::

    for( int64_t i = 0; i < max; i++ )
        a[i] = a[max-i];

This can be vectorized under VVM, the parts far from i = ½×max run
at vector speeds, those near i = ½×max run at scalar speeds, from
the same instruction sequence !! .....
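
The dependence pattern in that loop can be checked with a small C model. This is a sketch of the dependence argument only, not of how VVM hardware works: the lane count `W`, the function names, and the "all loads, then all stores" grouping are assumptions made for illustration. Note that the loop reads `a[max]`, so the array must hold `max + 1` elements.

```c
#include <stdint.h>

#define W 8  /* hypothetical number of lanes */

/* Reference: the scalar loop from the post. */
void reverse_copy_scalar(int64_t *a, int64_t max) {
    for (int64_t i = 0; i < max; i++)
        a[i] = a[max - i];
}

/* Runs W iterations as one group (all loads, then all stores) whenever the
   indices written by the group do not overlap the indices it reads;
   otherwise falls back to one iteration at a time.
   Returns the number of groups that ran W-wide. */
int64_t reverse_copy_chunked(int64_t *a, int64_t max) {
    int64_t wide_groups = 0;
    int64_t i = 0;
    while (i < max) {
        int64_t lo = i, hi = i + W - 1;          /* indices written */
        int64_t rlo = max - hi, rhi = max - lo;  /* indices read */
        int safe = (hi < max) && (hi < rlo || rhi < lo);
        if (safe) {
            int64_t tmp[W];
            for (int k = 0; k < W; k++) tmp[k] = a[max - (i + k)];
            for (int k = 0; k < W; k++) a[i + k] = tmp[k];
            wide_groups++;
            i += W;
        } else {
            a[i] = a[max - i];                   /* near the middle: scalar */
            i++;
        }
    }
    return wide_groups;
}
```

Both functions produce the same array contents; the chunked version only drops to single iterations where the read and write ranges meet around `max/2` (and at the tail, where fewer than `W` iterations remain).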

> - anton

Re: Short Vectors Versus Long Vectors

<263a9117de58a68c536dfc8bae6368e2@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37957&group=comp.arch#37957

From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 17:51:44 +0000
Organization: Rocksolid Light
Message-ID: <263a9117de58a68c536dfc8bae6368e2@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at> <v07rkp$1h6fk$1@dont-email.me>

by: MitchAlsup1 - Tue, 23 Apr 2024 17:51 UTC

Lawrence D'Oliveiro wrote:

> On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:

>> So what's the benefit of using vector/SIMD instructions at all rather
>> than doing it with scalar code?

> On the original Cray machines, I read somewhere the benefit of using the
> vector versions over the scalar ones was a net positive for a vector
> length as low as 2.

Somewhere in the neighborhood of 4-5 length vectors. There was a 3 cycle
decode delay as pipeline scheduling slots were reserved for the vector
writebacks.

>> If your OoO capabilities are limited (and I think
>> they are on the Cray machines), you cannot start the second iteration of
>> the doall loop before the processing step of the first iteration has
>> finished with the register.

> How would out-of-order execution help, anyway, given all the operations on
> the vector elements are supposed to be identical? Unless it’s just greater
> parallelism.

Out of order makes it easier to "run into" undiscovered dynamic dependency
free operations.

Re: Short Vectors Versus Long Vectors

<v098so$1rp16$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37959&group=comp.arch#37959

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 16:23:33 -0500
Organization: A noiseless patient Spider
Lines: 206
Message-ID: <v098so$1rp16$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Apr 2024 23:23:37 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e4c9793258b9d913f28848bbc0503c1c";
logging-data="1958950"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19PiXobpn+u9AATQmZSRLeg9tNgYH04Zsw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:LRX87wD7CKaf0eSTJQHnTsMVZ70=
In-Reply-To: <2024Apr23.082238@mips.complang.tuwien.ac.at>
Content-Language: en-US

by: BGB - Tue, 23 Apr 2024 21:23 UTC

On 4/23/2024 1:22 AM, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>
>>> CRAY machines stayed "in style" as long as memory latency remained
>>> smaller than the length of a vector (64 cycles) and fell out of favor
>>> when the cores got fast enough that memory could no longer keep up.
>
> Mitch Alsup repeatedly makes this claim without giving any
> justification. Your question may shed some light on that.
>
>> So why would conventional short vectors work better, then? Surely the
>> latency discrepancy would be even worse for them.
>
> Thinking about it, they probably don't work better. They just don't
> work worse, so why spend area on 4096-bit vector registers like the
> Cray-1 did when 128-512-bit vector registers work just as well? Plus,
> they have 200 or so of these registers, so 4096-bit registers would be
> really expensive. How many vector registers does the Cray-1 (and its
> successors) have?
>

Yeah.

Or if you can already saturate the RAM bandwidth with 128-bit vectors,
why go wider?...

Or, one may find that even the difference between 64 and 128 bit vectors
goes away once one's working set exceeds 1/3 to 1/2 the size of the L1
cache.

Meanwhile, it remains an issue that wider vectors are more expensive.

Though, unclear even if 64-bit is a clear win over 32-bit in terms of
performance. Arguably, many uses of 64-bit could have been served with a
primarily 32-bit machine that allows paired registers for things like
memory addressing and similar.

Though, OTOH, 64/128 allows unifying GPRs, FPU, and SIMD, into a single
register space.

Also, 32/64/128 bit splitting/pairing isn't really workable as it would
end up needing 8 or 12 register read ports (so, would be more expensive
than the "use 64-bit registers and effectively waste half the register
for 32-bit operations" option).

Well, unless the number of register ports remain constant (with 64-bit
ports), and the 32-bit registers are effectively faked (by splitting the
registers in half, and merging halves on write-back). But, there is
little obvious advantage to this over the "just waste half the register"
option (and it would be more expensive than just wasting half the register).

> On modern machines OoO machinery bridges the latency gap between the
> L2 cache, maybe even the L3 cache and the core for data-parallel code.
> For the latency gap to main memory there are the hardware prefetchers,
> and they use the L1 or L2 cache as intermediate buffer, while the
> Cray-1 and followons use vector registers.
>

On my current PC, while it hides latency, one is hard-pressed to exceed
roughly 4GB/sec of memory bandwidth (per core), though the overall
system memory bandwidth seems to be higher.

Say: 8C/16T (memcpy)
Each core has a local peak of ~ 4GB/s;
System seems to be ~ 12-16 GB/s
Seemingly ~ 6-8 GB/s per group of 4 cores.

Peak memcpy bandwidth (L1 local) being in the area of 24 GB/s.

Latency is hidden fairly well, granted, but doesn't make as big of a
difference if the task is bandwidth limited.
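
For anyone wanting to reproduce this kind of per-core figure, a rough sketch of a memcpy bandwidth probe in C follows. It is an assumption about how such numbers might be measured, not the method used for the figures above; the function names are hypothetical, and clock() measures CPU time with coarse resolution, so treat the result as approximate.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Convert bytes copied and elapsed seconds into GB/s (1 GB = 1e9 bytes).
   Returns 0.0 if the interval was too short to measure. */
double gb_per_sec(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / seconds / 1e9 : 0.0;
}

/* Copy `size` bytes `reps` times on the calling thread and report the rate.
   Returns 0.0 on allocation failure or an unmeasurably short run. */
double memcpy_bandwidth(size_t size, int reps)
{
    assert(size > 0 && reps > 0);
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (!src || !dst) { free(src); free(dst); return 0.0; }

    memset(src, 1, size);           /* touch pages before timing */
    memset(dst, 0, size);

    volatile unsigned char sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;  /* make each copy observably different */
        memcpy(dst, src, size);
        sink ^= dst[size - 1];      /* keep the copy from being optimized out */
    }
    clock_t t1 = clock();
    (void)sink;

    free(src);
    free(dst);
    return gb_per_sec((double)size * reps, (double)(t1 - t0) / CLOCKS_PER_SEC);
}
```

A buffer much larger than the last-level cache (say, memcpy_bandwidth(256u << 20, 8)) approximates the DRAM-limited case, while a buffer that fits in L1 approximates the cache-local peak.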

In my case, with my custom CPU core:
The elements are packaged in a way to make them easier to work with, for
either parallel or pipeline execution.

For the low-precision unit, it can work on all 4 at the same time, if 4
are available. This unit does Binary16 or (optionally) Binary32.

In the main FPU, the SIMD packaging allows the FPU to pipeline the
operations despite the FPU having too high of a latency to be pipelined
normally.

The advantage of SIMD would be reduced if the pipeline were long enough
to handle Binary64 values directly, but 6 EX stages would be asking a
bit much (further increasing either pipeline length or width having a
significant impact on the cost of the register-forwarding logic).

Arguably, one could have a separate (and longer) pipeline for FPU, but
this would add complexity with a shared register space.

> So why not use SIMD instructions with longer vector registers? The
> progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
> suggests that this is happening, but with every doubling the cost in
> area doubles but the returns are diminishing thanks to Amdahl's law.
> So at some point you stop. Intel introduced AVX-512 for Larrabee (a
> special-purpose machine), and now is backpedaling with desktop, laptop
> and small-server CPUs (even though only the Golden/Raptor Cove cores
> are enabled on the small-server CPUs) only supporting AVX, and with
> AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
> vector registers are already too costly for the benefit they give in
> general-purpose computing.
>

Yeah.

As I see it, in general, 128-bit SIMD seems to be the local optimum.
Both 256 and 512 end up having more drawbacks than merits as I see it.

Going outside of Load/Store adds a lot of hair for comparably little
benefit.

Like, technically, I could go Load-Op / Op-Store for a subset of
operations, as I ended up with the logic to support it. But, it doesn't
seem like it would bring enough benefit to really be worth it (would not
improve code-density as they require 64-bit encodings in my case, and in
most cases seem unlikely to bring a performance advantage either; and
given some limitations of the WEXifier, using them might actually make
performance worse by interfering with shuffle-and-bundle).

The main merit they would have is if the CPU were register-pressure
limited, but in my case, with 64 GPRs, this isn't really the case either.

> Anyway, after thinking about this, the reason behind Mitch Alsup's
> statement is that in a
>
> doall(load process store)
>
> computation (like what SIMD is good at), the loads precede the
> corresponding processing by the load latency (i.e., memory latency on
> the Cray machines). If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration
> of the doall loop before the processing step of the first iteration
> has finished with the register. You can do a bit of software
> pipelining and software register renaming by transforming this into
>
> load1 doall(load2 process1 store1 load1 process2 store2)
>
> but at some point you run out of vector registers.
>
> One thing that comes to mind is tracking individual parts of the
> vector registers, which allows starting the next iteration as soon
> as the first part of the vector register no longer has any readers.
> However, it's probably not that far off in complexity to tracking
> shorter vector registers in an OoO engine. And if you support
> exceptions (the Crays probably don't), this becomes messy, while with
> short vector registers it's easier to implement the (ISA)
> architecture.
>
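
The software-pipelined form quoted above can be sketched in C roughly as follows. This is only an illustration under assumptions: the vreg type, the vload/vprocess/vstore helpers, the vector length VL, and the "process" step (2x+1) are all hypothetical stand-ins for vector registers and vector instructions, and the sketch assumes the input and output arrays do not alias. Two register names (r1, r2) are alternated by hand, which is the "software register renaming" part.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 4 };                       /* hypothetical vector length */
typedef struct { double e[VL]; } vreg; /* stand-in for a vector register */

static vreg vload(const double *p)
{
    vreg v;
    for (int k = 0; k < VL; k++) v.e[k] = p[k];
    return v;
}

static vreg vprocess(vreg v)           /* placeholder computation */
{
    for (int k = 0; k < VL; k++) v.e[k] = 2.0 * v.e[k] + 1.0;
    return v;
}

static void vstore(double *p, vreg v)
{
    for (int k = 0; k < VL; k++) p[k] = v.e[k];
}

/* doall(load process store) */
void run_simple(const double *in, double *out, size_t nvec)
{
    for (size_t i = 0; i < nvec; i++)
        vstore(out + i * VL, vprocess(vload(in + i * VL)));
}

/* load1 doall(load2 process1 store1 load1 process2 store2) */
void run_pipelined(const double *in, double *out, size_t nvec)
{
    if (nvec == 0) return;
    assert(in != out);                 /* sketch assumes no aliasing */

    vreg r1 = vload(in), r2;           /* prologue: load1 */
    size_t i = 0;
    for (; i + 2 < nvec; i += 2) {
        r2 = vload(in + (i + 1) * VL);             /* load2 */
        vstore(out + i * VL, vprocess(r1));        /* process1 store1 */
        r1 = vload(in + (i + 2) * VL);             /* load1 */
        vstore(out + (i + 1) * VL, vprocess(r2));  /* process2 store2 */
    }
    vstore(out + i * VL, vprocess(r1));            /* epilogue */
    if (i + 1 < nvec)
        vstore(out + (i + 1) * VL, vprocess(vload(in + (i + 1) * VL)));
}
```

Each extra stage of overlap needs another live register name, which is where the "at some point you run out of vector registers" limit comes from.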

As can be noted, SIMD is easy to implement.

Main obvious drawback is the potential for combinatorial explosions of
instructions. One needs to keep a fairly careful watch over this.

Like, if one is faced with an NxN or NxM grid of possibilities, naive
strategy is to be like "I will define an instruction for every
possibility in the grid.", but this is bad. More reasonable to devise a
minimal set of instructions that will allow the operation to be done
within a reasonable number of instructions.

But, then again, I can also note that I axed things like packed-byte
operations and saturating arithmetic, which are pretty much de-facto in
packed-integer SIMD.

Likewise, a lot of the gaps are filled in with specialized converter and
helper ops. Even here, some conversion chains will require multiple
instructions.

Well, and if there is no practical difference between a scalar and SIMD
version of an instruction, may well just use the SIMD version for scalar.

....

> - anton

Re: Short Vectors Versus Long Vectors

<v09auq$1s2pn$6@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37961&group=comp.arch#37961

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 21:58:50 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 5
Message-ID: <v09auq$1s2pn$6@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me>
<1ba7eb4d351901d065977ac9b28171d0@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 23 Apr 2024 23:58:51 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e762e53ec9e8808879f7618c3ddfda81";
logging-data="1968951"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18VFVytWjJpjZ5/GbbZLk/s"
User-Agent: Pan/0.155 (Kherson; fc5a80b8)
Cancel-Lock: sha1:Be0MWWfTACEYOx391oZE0NrjOeY=

by: Lawrence D'Oliv - Tue, 23 Apr 2024 21:58 UTC

On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

> As to why RISC-V went shorter ...

They didn’t fix a length.

Re: Short Vectors Versus Long Vectors

<30dfc55476acb70cb78edd423bf2502b@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37962&group=comp.arch#37962

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 22:40:25 +0000
Organization: Rocksolid Light
Message-ID: <30dfc55476acb70cb78edd423bf2502b@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <1ba7eb4d351901d065977ac9b28171d0@www.novabbs.org> <v09auq$1s2pn$6@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2156611"; mail-complaints-to="usenet@i2pn2.org";
posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$Lx4kMMSjUciA/vPbq7v54.0Lb1TlWI6u.CLdYgy6.5/iXT3O0Rfc2
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8

by: MitchAlsup1 - Tue, 23 Apr 2024 22:40 UTC

Lawrence D'Oliveiro wrote:

> On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

>> As to why RISC-V went shorter ...

> They didn’t fix a length.

Nor do they want to have to save a page of VRF at context switch.

Re: Short Vectors Versus Long Vectors

<d46723273a62c22283387893da40e7e4@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37963&group=comp.arch#37963

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 22:39:44 +0000
Organization: Rocksolid Light
Message-ID: <d46723273a62c22283387893da40e7e4@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at> <v098so$1rp16$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2156611"; mail-complaints-to="usenet@i2pn2.org";
posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$LlBTxv7Vi.4oWjEdLXElg.Z/pyeZ9o.4XEpiJa9HFQ6U9Rd4arMLq

by: MitchAlsup1 - Tue, 23 Apr 2024 22:39 UTC

BGB wrote:

> On 4/23/2024 1:22 AM, Anton Ertl wrote:
>> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>>big snip>

> As can be noted, SIMD is easy to implement.

ADD/SUB is, MUL and DIV and SHIFTs and CMPs are not; especially when
MUL does 2n = n × n and DIV does 2n / n -> n (quotient) + n (remainder)

> Main obvious drawback is the potential for combinatorial explosions of
> instructions. One needs to keep a fairly careful watch over this.

> Like, if one is faced with an NxN or NxM grid of possibilities, naive
> strategy is to be like "I will define an instruction for every
> possibility in the grid.", but this is bad. More reasonable to devise a
> minimal set of instructions that will allow the operation to be done
> within a reasonable number of instructions.

> But, then again, I can also note that I axed things like packed-byte
> operations and saturating arithmetic, which are pretty much de-facto in
> packed-integer SIMD.

MANY SIMD algorithms need saturating arithmetic because they cannot do
b + b -> h and avoid the overflow. And they cannot do B + b -> h because
that would consume vast amounts of encoding space.

> Likewise, a lot of the gaps are filled in with specialized converter and
> helper ops. Even here, some conversion chains will require multiple
> instructions.

> Well, and if there is no practical difference between a scalar and SIMD
> version of an instruction, may well just use the SIMD version for scalar.

> ....

>> - anton

Re: Short Vectors Versus Long Vectors

<v09gmk$1te49$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37965&group=comp.arch#37965

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 18:36:49 -0500
Organization: A noiseless patient Spider
Lines: 99
Message-ID: <v09gmk$1te49$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>
<v098so$1rp16$1@dont-email.me>
<d46723273a62c22283387893da40e7e4@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Apr 2024 01:36:53 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9b128de1d320d65ee4a49c78ee8f6778";
logging-data="2013321"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19zMhC1KKeS+RUm1LHgVfzDIO+QWy+Xx3k="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:W0jlgXWGw56TtcjJwe26bdTbG1k=
Content-Language: en-US
In-Reply-To: <d46723273a62c22283387893da40e7e4@www.novabbs.org>

by: BGB - Tue, 23 Apr 2024 23:36 UTC

On 4/23/2024 5:39 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 4/23/2024 1:22 AM, Anton Ertl wrote:
>>> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>>>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>>> big snip>
>
>
>> As can be noted, SIMD is easy to implement.
>
> ADD/SUB is, MUL and DIV and SHIFTs and CMPs are not; especially when
> MUL does 2n = n × n and DIV does 2n / n -> n (quotient) + n (remainder)
>

MUL:
Have a few instructions, giving the Low, High-Signed, and High-Unsigned
results.

DIV:
Didn't bother with this.
Typically faked using multiply-by-reciprocal and taking the high result.

Something like MOD would need to be faked, but SIMD modulo doesn't
really tend to be a thing IME.

Division by a non-constant scalar/vector value will need a runtime call.
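
As a concrete instance of the multiply-by-reciprocal trick for a constant divisor, here is a small C sketch for one unsigned 16-bit lane divided by 10. The constant 0xCCCD is ceil(2^19 / 10); the function names are hypothetical and the per-lane loop merely stands in for a packed multiply-high instruction. The check function walks every 16-bit input, so the claim that it matches x / 10 is verified rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* x / 10 for an unsigned 16-bit lane: multiply by ceil(2^19 / 10) = 0xCCCD,
   take the high 16 bits of the 32-bit product, then shift right by 3. */
uint16_t div10_u16(uint16_t x)
{
    uint32_t product = (uint32_t)x * 0xCCCDu;  /* fits in 32 bits */
    uint16_t high = (uint16_t)(product >> 16); /* the "high result" */
    return (uint16_t)(high >> 3);
}

/* Same operation applied to four 16-bit lanes packed in a 64-bit value. */
uint64_t pdiv10_u16x4(uint64_t v)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(v >> (16 * lane));
        r |= (uint64_t)div10_u16(x) << (16 * lane);
    }
    return r;
}

/* Exhaustive check against ordinary integer division. */
void check_div10(void)
{
    for (uint32_t x = 0; x <= 0xFFFFu; x++)
        assert(div10_u16((uint16_t)x) == x / 10);
}
```

The extra shift by 3 is what a different constant divisor would change, along with the multiplier; a non-constant divisor has no precomputed multiplier, which is why it falls back to a runtime call.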

SHIFT:
Mostly faked using ALU shifts and masking.
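
A minimal C sketch of the shift-plus-mask approach, assuming four 16-bit lanes packed in a 64-bit register (the function names are hypothetical): shift the whole register with the ordinary ALU shifter, then mask off the bits that crossed a lane boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate a 16-bit pattern into all four lanes of a 64-bit value. */
static uint64_t splat16(uint64_t lane)
{
    return lane * 0x0001000100010001ull;
}

/* Packed logical shift right of four 16-bit lanes by n (0..15). */
uint64_t packed_shr_u16x4(uint64_t v, unsigned n)
{
    assert(n < 16);
    uint64_t mask = splat16(0xFFFFu >> n);   /* clear bits leaked from above */
    return (v >> n) & mask;
}

/* Packed shift left of four 16-bit lanes by n (0..15). */
uint64_t packed_shl_u16x4(uint64_t v, unsigned n)
{
    assert(n < 16);
    uint64_t mask = splat16((0xFFFFu << n) & 0xFFFFu); /* clear leaked low bits */
    return (v << n) & mask;
}
```

An arithmetic right shift needs one more step (re-inserting each lane's sign bit), which this sketch leaves out.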

CMPxx:
There are dedicated instructions for this.

>> Main obvious drawback is the potential for combinatorial explosions of
>> instructions. One needs to keep a fairly careful watch over this.
>
>> Like, if one is faced with an NxN or NxM grid of possibilities, naive
>> strategy is to be like "I will define an instruction for every
>> possibility in the grid.", but this is bad. More reasonable to devise
>> a minimal set of instructions that will allow the operation to be done
>> within a reasonable number of instructions.
>
>> But, then again, I can also note that I axed things like packed-byte
>> operations and saturating arithmetic, which are pretty much de-facto
>> in packed-integer SIMD.
>
> MANY SIMD algorithms need saturating arithmetic because they cannot do
> b + b -> h and avoid the overflow. And they cannot do B + b -> h because
> that would consume vast amounts of encoding space.
>

There are ways to fake it.

Though, granted, most end up involving extra instructions and 1 bit of
dynamic range.

Though, the main case where one can't spare any dynamic range is
typically packed byte, which I had skipped (in favor of faking
packed-byte scenarios using packed word).

But, could add, say:
PSHAR.W Rm, Rn //Packed Shift right 1 bit, arithmetic
PSHLR.W Rm, Rn //Packed Shift right 1 bit, logical
PSHAL.W Rm, Rn //Packed Shift left 1 bit, arithmetic saturate
PSHLL.W Rm, Rn //Packed Shift left 1 bit, logical saturate

In a naive case, one could fake a virtual PADDSS.W instruction as, say:
PSHAR.W R4, R16
PSHAR.W R5, R17
PADD.W R16, R17, R18
PSHAL.W R18, R2

These could more efficiently address both saturation, and 1-bit shift
(the most common case).
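
To show what that four-instruction sequence would compute, here is a per-lane C model. The semantics assumed for the proposed PSHAR.W (arithmetic shift right by 1) and PSHAL.W (shift left by 1 with signed saturation) are my reading of the mnemonics above, and the function names are hypothetical. The model also assumes int is at least 32 bits. The result can come out below the exact saturating sum by up to 2, since each operand drops its low bit before the add.

```c
#include <assert.h>
#include <stdint.h>

static int clamp16(int v)
{
    return v < -32768 ? -32768 : (v > 32767 ? 32767 : v);
}

/* Arithmetic shift right by 1 (floor division by 2), written portably. */
static int floor_half(int v)
{
    return v >= 0 ? v / 2 : -((-v + 1) / 2);
}

/* Reference: exact signed saturating 16-bit add. */
int16_t adds16_exact(int16_t a, int16_t b)
{
    return (int16_t)clamp16((int)a + (int)b);
}

/* Model of: PSHAR.W a; PSHAR.W b; PADD.W; PSHAL.W (one lane). */
int16_t adds16_faked(int16_t a, int16_t b)
{
    int half_sum = floor_half(a) + floor_half(b); /* cannot overflow 16 bits */
    return (int16_t)clamp16(2 * half_sum);        /* shift left 1, saturating */
}

/* Sampled check: the faked result is never above the exact one and is
   at most 2 below it. */
void check_adds16(void)
{
    for (int a = -32768; a <= 32767; a += 257)
        for (int b = -32768; b <= 32767; b += 263) {
            int d = adds16_exact((int16_t)a, (int16_t)b)
                  - adds16_faked((int16_t)a, (int16_t)b);
            assert(d >= 0 && d <= 2);
        }
}
```

For example, 3 + 5 gives 8 exactly but 6 through the faked path, while 32767 + 1 saturates to 32767 exactly and yields 32766 through the faked path.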

Shift left 1-bit (without saturation) can generally be handled with:
PADD.W R4, R4, R2
Or similar.

>> Likewise, a lot of the gaps are filled in with specialized converter
>> and helper ops. Even here, some conversion chains will require
>> multiple instructions.
>
>> Well, and if there is no practical difference between a scalar and
>> SIMD version of an instruction, may well just use the SIMD version for
>> scalar.
>
>> ....
>
>
>>> - anton

Re: Short Vectors Versus Long Vectors

<v09jfp$1tvga$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37966&group=comp.arch#37966

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 00:24:25 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 7
Message-ID: <v09jfp$1tvga$2@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>
<v098so$1rp16$1@dont-email.me>
<d46723273a62c22283387893da40e7e4@www.novabbs.org>
<v09gmk$1te49$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Apr 2024 02:24:26 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="13b1d2d556e95ece6587b97b13572e96";
logging-data="2031114"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/AF95wFf90zxhbVTg8TIBN"
User-Agent: Pan/0.155 (Kherson; fc5a80b8)
Cancel-Lock: sha1:jqNju2e4+KPHPd+8pECBT7He8aE=

by: Lawrence D'Oliv - Wed, 24 Apr 2024 00:24 UTC

On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

> DIV:
> Didn't bother with this.
> Typically faked using multiply-by-reciprocal and taking the high result.

Another Cray-ism! ;)

Re: Short Vectors Versus Long Vectors

<v09jim$1tvga$3@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=37967&group=comp.arch#37967

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 00:25:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <v09jim$1tvga$3@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me>
<1ba7eb4d351901d065977ac9b28171d0@www.novabbs.org>
<v09auq$1s2pn$6@dont-email.me>
<30dfc55476acb70cb78edd423bf2502b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Apr 2024 02:25:59 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="13b1d2d556e95ece6587b97b13572e96";
logging-data="2031114"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+ojqm1sMazmKyM4+8Qt7aU"
User-Agent: Pan/0.155 (Kherson; fc5a80b8)
Cancel-Lock: sha1:mrqWkwMhipVK7HB3u7LVY3xMHE0=

by: Lawrence D'Oliv - Wed, 24 Apr 2024 00:25 UTC

On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:

> Lawrence D'Oliveiro wrote:
>
>> On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:
>
>>> As to why RISC-V went shorter ...
>
>> They didn’t fix a length.
>
> Nor do they want to have to save a page of VRF at context switch.

But then, you don’t need a whole array of registers, do you: you just need
operand (one for each operand) and destination address registers, plus a
counter.

Re: Short Vectors Versus Long Vectors

<4041e83a5f551d5b01df22a26b15155d@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37968&group=comp.arch#37968

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 00:34:30 +0000
Organization: Rocksolid Light
Message-ID: <4041e83a5f551d5b01df22a26b15155d@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at> <v098so$1rp16$1@dont-email.me> <d46723273a62c22283387893da40e7e4@www.novabbs.org> <v09gmk$1te49$1@dont-email.me> <v09jfp$1tvga$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2164101"; mail-complaints-to="usenet@i2pn2.org";
posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$63CJsjKv/MHNm.NMl6yBBe/A.eSKv8bE.UAjZGEl6PObREWyLl6ia

by: MitchAlsup1 - Wed, 24 Apr 2024 00:34 UTC

Lawrence D'Oliveiro wrote:

> On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

>> DIV:
>> Didn't bother with this.
>> Typically faked using multiply-by-reciprocal and taking the high result.

> Another Cray-ism! ;)

Not IEEE 754 legal.

Re: Short Vectors Versus Long Vectors

<d07c9bad1880ce9188406797b6bcf4bf@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37969&group=comp.arch#37969

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 00:34:03 +0000
Organization: Rocksolid Light
Message-ID: <d07c9bad1880ce9188406797b6bcf4bf@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <1ba7eb4d351901d065977ac9b28171d0@www.novabbs.org> <v09auq$1s2pn$6@dont-email.me> <30dfc55476acb70cb78edd423bf2502b@www.novabbs.org> <v09jim$1tvga$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2164101"; mail-complaints-to="usenet@i2pn2.org";
posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$1NHsnGYlNmpeBCzqj0CrOO/E9B8XUsqdAB/QpyMDsZgVKbSbjKAh2

by: MitchAlsup1 - Wed, 24 Apr 2024 00:34 UTC

Lawrence D'Oliveiro wrote:

> On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:

>> Lawrence D'Oliveiro wrote:
>>
>>> On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:
>>
>>>> As to why RISC-V went shorter ...
>>
>>> They didn’t fix a length.
>>
>> Nor do they want to have to save a page of VRF at context switch.

> But then, you don’t need a whole array of registers, do you: you just need
> operand (one for each operand) and destination address registers, plus a
> counter.

If by 'you' you mean My 66000's VVM::
a) yes I avoid any SW visible register file
b) and I use the miss buffers as the VRF register file pool
c) they vanish on an interrupt or exception
d) the counter is the loop variable.

Re: Short Vectors Versus Long Vectors

<5b6b6ca26932be92ecc511d08173617a@www.novabbs.org>

https://www.novabbs.com/devel/article-flat.php?id=37970&group=comp.arch#37970

Path: i2pn2.org!.POSTED!not-for-mail
From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 00:37:11 +0000
Organization: Rocksolid Light
Message-ID: <5b6b6ca26932be92ecc511d08173617a@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at> <v098so$1rp16$1@dont-email.me> <d46723273a62c22283387893da40e7e4@www.novabbs.org> <v09gmk$1te49$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2164541"; mail-complaints-to="usenet@i2pn2.org";
posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$iNU1V1jxOd2NChvNvUF7cuq.i4cvnRuqplmeltxYDMDQ2JbZi5ytG

by: MitchAlsup1 - Wed, 24 Apr 2024 00:37 UTC

BGB wrote:

> On 4/23/2024 5:39 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>
>> MANY SIMD algorithms need saturating arithmetic because they cannot do
>> b + b -> h and avoid the overflow. And they cannot do B + b -> h because
>> that would consume vast amounts of encoding space.
>>

> There are ways to fake it.

> Though, granted, most end up involving extra instructions and 1 bit of
> dynamic range.

1-bit for ADD and SUB, but MUL and shifts require more than 1-bit.

Re: Short Vectors Versus Long Vectors

<hqmg2j1vbkf6suddfnsh3h3uhtkqqio4uk@4ax.com>

https://www.novabbs.com/devel/article-flat.php?id=37971&group=comp.arch#37971

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: quadib...@servername.invalid (John Savard)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 19:25:22 -0600
Organization: A noiseless patient Spider
Lines: 60
Message-ID: <hqmg2j1vbkf6suddfnsh3h3uhtkqqio4uk@4ax.com>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Apr 2024 03:25:24 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="0739b5a5b267f63942d6ef28bfb9babb";
logging-data="2054328"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+E4uIVKYb9dWguwpk/M2iiEtuon1vfXXM="
Cancel-Lock: sha1:v6od1lEnI0fUKieeBpClNrKrT0Q=
X-Newsreader: Forte Free Agent 3.3/32.846

by: John Savard - Wed, 24 Apr 2024 01:25 UTC

On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

>CRAY machines stayed "in style" as long as memory latency remained smaller
>than the length of a vector (64 cycles) and fell out of favor when the cores
>got fast enough that memory could no longer keep up.
>
>I wish them well, but I expect it will not work out as they desire.....

I know that you've said this about Cray-style vectors.

I had thought the cause was much simpler. As soon as chips like the
486 DX and then the Pentium II became available, a Cray-style machine
would have had to be implemented from smaller-scale integrated
circuits, so it would have been wildly uneconomic for the performance
it provided; it made much more sense to use off-the-shelf
microprocessors. Despite their shortcomings theoretically in
architectural terms compared to a Cray-style machine, they offered
vastly more FLOPS for the dollar.

After all, the reason the Cray I succeeded where the STAR-100 failed
was that it had those big vector registers - so it did calculations on
a register-to-register basis, rather than on a memory-to-memory basis.

That doesn't make it immune to considerations of memory bandwidth, but
that does mean that it was designed correctly for the circumstance
where memory bandwidth is an issue. So if you have the kind of
calculation to perform that is suited to a vector machine, wouldn't it
still be better to use a vector machine than a whole bunch of scalar
cores with no provision for vectors?

And if memory bandwidth issues make Cray-style vector machines
impractical, then wouldn't it be even worse for GPUs?

There are ways to increase memory bandwidth. Use HBM. Use static RAM.
Use graphics DRAM. The vector CPU of the last gasp of the Cray-style
architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.

Also, the original Cray I did useful work with a memory no larger than
many L3 caches these days. So a vector machine today wouldn't be as
fast as it would be if it could have, say, a 1024-bit wide data bus to
a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
even when throttled by memory bandwidth, isn't an improvement over an
ordinary CPU.

Of course, though, the question is, is it an improvement enough? If
most problems anyone would want to use a vector CPU for today do
involve a large amount of memory, used in a random fashion, so as to
fit poorly in cache, then it might well be that memory bandwidth would
mean that even with a vector architecture well suited to doing a lot
of work, the net result would be only a slight improvement over what
an ordinary CPU could do with the same memory bandwidth.

I would think that a chip is still useful if it can only provide an
improvement for some problems, and that there are ways to increase
memory bandwidth from what ordinary CPUs offer, making it seem likely
that Cray-style vectors are worth doing as a way to improve what a CPU
can do.

John Savard

Re: Short Vectors Versus Long Vectors

<v09oh8$1uppi$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=37972&group=comp.arch#37972


Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 20:50:31 -0500
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <v09oh8$1uppi$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>
<v098so$1rp16$1@dont-email.me>
<d46723273a62c22283387893da40e7e4@www.novabbs.org>
<v09gmk$1te49$1@dont-email.me> <v09jfp$1tvga$2@dont-email.me>
<4041e83a5f551d5b01df22a26b15155d@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Apr 2024 03:50:32 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9b128de1d320d65ee4a49c78ee8f6778";
logging-data="2058034"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Q6VAGR21YFeI9WLfacC7k2llRadF95D0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:jrxBPuZ2UwX4FPZ+fTqfg5JlWWE=
Content-Language: en-US
In-Reply-To: <4041e83a5f551d5b01df22a26b15155d@www.novabbs.org>

by: BGB - Wed, 24 Apr 2024 01:50 UTC

On 4/23/2024 7:34 PM, MitchAlsup1 wrote:
> Lawrence D'Oliveiro wrote:
>
>> On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:
>
>>> DIV:
>>> Didn't bother with this.
>>> Typically faked using multiply-by-reciprocal and taking the high result.
>
>> Another Cray-ism! ;)
>
> Not IEEE 754 legal.

The multiply and take high result:
This is for Packed Integer SIMD.
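
A rough sketch of the integer case, in Python rather than in any real SIMD ISA: an unsigned 16-bit lane is divided by a constant d by multiplying with a precomputed fixed-point reciprocal and keeping only the high part of the product. The 32-bit reciprocal width, the restriction to unsigned values, and the function names are assumptions made for illustration; nothing here describes the actual instruction being discussed.

```python
def magic_reciprocal(d):
    """Fixed-point reciprocal ceil(2**32 / d) for a divisor 1 <= d < 2**16."""
    if not 1 <= d < (1 << 16):
        raise ValueError("divisor must be non-zero and fit in 16 bits")
    return ((1 << 32) + d - 1) // d


def div_by_mul_high(n, m):
    """Quotient of an unsigned 16-bit n: keep the high part of n * m."""
    return (n * m) >> 32


m7 = magic_reciprocal(7)
print(div_by_mul_high(1000, m7))  # 142, same as 1000 // 7
```

With a reciprocal this wide the result matches n // d for every 16-bit n. A narrower reciprocal (say 16 bits, as a 16x16 multiply-high would give) is cheaper but can be off by one for some inputs.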

For Packed-Float SIMD:
Multiply by reciprocal.

There is an instruction to calculate an approximate reciprocal (say, for
dividing two FP-SIMD vectors), after which a person can either use the
result directly or use Newton-Raphson to get a more accurate version
(possibly using N-R to fix up the result of the division).

For some use-cases with Binary16, directly using the approximate version
may be "good enough". Algorithm for Binary16 being, roughly: 0x7800-Val.

If N-R is used, there is often some wonk needed with the first stage,
since the values often aren't close enough for N-R to reliably converge
if used directly (for whatever reason this may happen; it seems one
needs to be fairly close to the true reciprocal for N-R to be reliable).

But, for some algorithms, if one only needs "throwing darts at a
dartboard" levels of accuracy, the N-R can be skipped.

Note that the default case (if one just divides two FP-SIMD vectors in
C) will be handled using a runtime call (which will use N-R to find the
vector reciprocal and then multiply by it); so faster cases would
require either an inline ASM blob or an intrinsic.
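
A minimal sketch of the Binary16 trick described above, assuming positive, normal inputs only. The bit-level estimate is the "0x7800-Val" subtraction from the post; the refinement is the textbook Newton-Raphson step r' = r * (2 - x * r), done here in ordinary Python floats rather than in Binary16, so it only illustrates the idea. The function names are made up for the example.

```python
import struct


def half_bits(x):
    """Bit pattern of x encoded as an IEEE 754 binary16 value."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_from_bits(b):
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", b))[0]


def approx_recip(x):
    """Crude reciprocal estimate: subtract the bit pattern from 0x7800."""
    bits = half_bits(x)
    if not 0x0400 <= bits < 0x7800:
        raise ValueError("sketch only handles positive normal values below 32768")
    return half_from_bits(0x7800 - bits)


def refine(x, r, steps):
    """Apply Newton-Raphson steps r <- r * (2 - x * r) to the estimate r."""
    for _ in range(steps):
        r = r * (2.0 - x * r)
    return r


x = 1.5
r0 = approx_recip(x)   # 0.75, versus the true 0.666...
r2 = refine(x, r0, 2)  # about 0.6665 after two steps
print(r0, r2)
```

The raw estimate is exact for powers of two and worst near the middle of each binade, which is the "throwing darts at a dartboard" level of accuracy; each N-R step roughly squares the relative error.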

Re: Short Vectors Versus Long Vectors

<5ad43f26367ef2d5e8b3c298511ddf45@www.novabbs.org>


https://www.novabbs.com/devel/article-flat.php?id=37973&group=comp.arch#37973


Path: i2pn2.org!.POSTED!not-for-mail
From: mitchal...@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 02:00:10 +0000
Organization: Rocksolid Light
Message-ID: <5ad43f26367ef2d5e8b3c298511ddf45@www.novabbs.org>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <hqmg2j1vbkf6suddfnsh3h3uhtkqqio4uk@4ax.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="2170044"; mail-complaints-to="usenet@i2pn2.org";
posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Rslight-Site: $2y$10$IIUXQ5KyeQ94sLr0vJBcXepjHuYA1/BPagjjRiEASrsOW8dBmVFh.
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Spam-Checker-Version: SpamAssassin 4.0.0

by: MitchAlsup1 - Wed, 24 Apr 2024 02:00 UTC

John Savard wrote:

> On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:

>>CRAY machines stayed "in style" as long as memory latency remained smaller
>>than the length of a vector (64 cycles) and fell out of favor when the cores
>>got fast enough that memory could no longer keep up.
>>
>>I wish them well, but I expect it will not work out as they desire.....

> I know that you've said this about Cray-style vectors.

> I had thought the cause was much simpler. As soon as chips like the
> 486 DX and then the Pentium II became available, a Cray-style machine
> would have had to be implemented from smaller-scale integrated
> circuits, so it would have been wildly uneconomic for the performance
> it provided; it made much more sense to use off-the-shelf
> microprocessors. Despite their shortcomings theoretically in
> architectural terms compared to a Cray-style machine, they offered
> vastly more FLOPS for the dollar.

CRAY-XMP was done in MECL 10K gate arrays, offering 10K gates per chip.

> After all, the reason the Cray I succeeded where the STAR-100 failed
> was that it had those big vector registers - so it did calculations on
> a register-to-register basis, rather than on a memory-to-memory basis.

The CRAY-1 had much shorter setup sequences than the STAR.
Amdahl's law strikes again.

> That doesn't make it immune to considerations of memory bandwidth, but
> that does mean that it was designed correctly for the circumstance
> where memory bandwidth is an issue. So if you have the kind of
> calculation to perform that is suited to a vector machine, wouldn't it
> still be better to use a vector machine than a whole bunch of scalar
> cores with no provision for vectors?

Let us face facts: in the large, vector machines are DMA devices
that happen to mangle the data on the way through.

> And if memory bandwidth issues make Cray-style vector machines
> impractical, then wouldn't it be even worse for GPUs?

a) It is not pure BW but BW at a latency less than K. CRAY-1 was
about 16-cycles (DRAM) CRAY-1S was about 10 cycles (SRAM), XMP
was about 22 cycles, and YMP was about 32 cycles. CRAY-1 and -1S
had 1 port to memory, XMP and YMP had 2Rd and 1W to memory.

b) GPUs use threading to absorb the latency to memory (roughly 400
cycles), along with HW rasterizer, interpolator, texture access,
and an HW OS that can clean up a thread and launch a new thread in
about 8 cycles. That is: GPUs absorb latency by waiting in a way
that does not prevent others from making forward progress.

> There are ways to increase memory bandwidth. Use HBM. Use static RAM.
> Use graphics DRAM. The vector CPU of the last gasp of the Cray-style
> architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.

Even HBM has the latency of standard DRAM (with smaller command cycle
overheads), so a 5-GHz core using 20ns DRAM with infinite BW between
core and DRAM will still have the core see 100 cycles of latency.
Bandwidth alone does not solve latency-bound problems; latency alone
does not solve BW-bound problems.
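
The arithmetic behind that 100-cycle figure, as a small sketch. The comparison against a 64-element vector assumes one element is processed per cycle, which is a simplification; the cycle counts are the ones quoted in this post, and the function names are invented for the example.

```python
def latency_in_cycles(core_mhz, latency_ns):
    """Memory latency expressed in core clock cycles."""
    return core_mhz * latency_ns / 1000


def vector_hides_latency(latency_cycles, vector_length=64):
    """True if one vector operation lasts at least as long as a memory access."""
    return latency_cycles <= vector_length


print(latency_in_cycles(5000, 20))  # 100.0 cycles for a 5 GHz core and 20 ns DRAM
print(vector_hides_latency(16))     # True: the CRAY-1 figure quoted above
print(vector_hides_latency(100))    # False: the 5 GHz / 20 ns case
```

No amount of added bandwidth changes the first number; only a shorter access time or a slower clock does.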

> Also, the original Cray I did useful work with a memory no larger than
> many L3 caches these days. So a vector machine today wouldn't be as
> fast as it would be if it could have, say, a 1024-bit wide data bus to
> a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
> even when throttled by memory bandwidth, isn't an improvement over an
> ordinary CPU.

The 128K DW memory was used for number crunching, but CRAY-1 had an
I/O system that could consume as much BW as a core, so one could
write out the last chunk and read in the next chunk while the
current chunk was processing. And it was this I/O system that made
a CRAY-1 faster than its equivalent NEC machine (excepting on certain
benchmarks).

> Of course, though, the question is, is it an improvement enough? If
> most problems anyone would want to use a vector CPU for today do
> involve a large amount of memory, used in a random fashion, so as to
> fit poorly in cache, then it might well be that memory bandwidth would
> mean that even with a vector architecture well suited to doing a lot
> of work, the net result would be only a slight improvement over what
> an ordinary CPU could do with the same memory bandwidth.

In essence, if you can teach the compiler to block the numeric algorithm
to fit through (though not in) the cache(s), you can use a vector-style
CPU architecture.

> I would think that a chip is still useful if it can only provide an
> improvement for some problems, and that there are ways to increase
> memory bandwidth from what ordinary CPUs offer, making it seem likely
> that Cray-style vectors are worth doing as a way to improve what a CPU
> can do.

Everyone has to pin their hopes on something.

> John Savard

Re: Short Vectors Versus Long Vectors

<v09ra9$236pc$2@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=37974&group=comp.arch#37974


Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 02:38:02 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <v09ra9$236pc$2@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>
<v098so$1rp16$1@dont-email.me>
<d46723273a62c22283387893da40e7e4@www.novabbs.org>
<v09gmk$1te49$1@dont-email.me> <v09jfp$1tvga$2@dont-email.me>
<4041e83a5f551d5b01df22a26b15155d@www.novabbs.org>
<v09oh8$1uppi$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Apr 2024 04:38:02 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="13b1d2d556e95ece6587b97b13572e96";
logging-data="2202412"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18L7aMVWhH6fRz61uiKqv/R"
User-Agent: Pan/0.155 (Kherson; fc5a80b8)
Cancel-Lock: sha1:Qw+KsxBGvbeGbSxs+cwlpCOO1Y0=

by: Lawrence D'Oliv - Wed, 24 Apr 2024 02:38 UTC

On Tue, 23 Apr 2024 20:50:31 -0500, BGB wrote:

> There is an instruction to calculate an approximate reciprocal (say, for
> dividing two FP-SIMD vectors), at which a person can use Newton-Raphson
> to either get a more accurate version, or use it directly (possibly
> using N-R to fix up the result of the division).

Cray had that: an approximate-reciprocal instruction, use it twice to get
the full-accuracy result.

Re: Short Vectors Versus Long Vectors

<v09rsv$236so$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=37975&group=comp.arch#37975


Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ldo...@nz.invalid (Lawrence D'Oliveiro)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 02:47:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 20
Message-ID: <v09rsv$236so$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<hqmg2j1vbkf6suddfnsh3h3uhtkqqio4uk@4ax.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 24 Apr 2024 04:47:59 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="13b1d2d556e95ece6587b97b13572e96";
logging-data="2202520"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19HhVjU/K406s+MwURuflWi"
User-Agent: Pan/0.155 (Kherson; fc5a80b8)
Cancel-Lock: sha1:8PTohrcAaw9010+pnyE/MnMGmyM=

by: Lawrence D'Oliv - Wed, 24 Apr 2024 02:47 UTC

On Tue, 23 Apr 2024 19:25:22 -0600, John Savard wrote:

> After all, the reason the Cray I succeeded where the STAR-100 failed was
> that it had those big vector registers ...

An old Cray-1 manual mentions, among other things, sixty-four 64-bit
intermediate scalar “T” registers, and eight 64-element vector “V”
registers of 64 bits per element. That’s a lot of registers.

RISC-V has nothing like this, as far as I can tell. Right at the top of
the spec I linked earlier, it says:

The vector extension adds 32 architectural vector registers,
v0-v31 to the base scalar RISC-V ISA.

Each vector register has a fixed VLEN bits of state.

So, no “big vector registers” that I can see? It says that VLEN must be a
power of two no bigger than 2**16, which does sound like a lot, but then
the example they give only has VLEN = 128.
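
To put numbers on the comparison, here is a small sketch that totals the vector register state using only the figures quoted in this post (eight 64-element, 64-bit V registers on the Cray-1; 32 registers of VLEN bits each for RISC-V). It deliberately ignores everything else about either architecture, and the names are invented for the example.

```python
# Cray-1: 8 V registers x 64 elements x 64 bits per element.
CRAY1_V_BITS = 8 * 64 * 64


def rvv_register_file_bits(vlen, registers=32):
    """Total RISC-V vector register state in bits for a given VLEN."""
    return registers * vlen


print(CRAY1_V_BITS // 8)                             # 4096 bytes
print(rvv_register_file_bits(128) // 8)              # 512 bytes at VLEN = 128
print(rvv_register_file_bits(1024) == CRAY1_V_BITS)  # True
print(rvv_register_file_bits(2**16) // 8)            # 262144 bytes at the 2**16 limit
```

So the total state matches the Cray-1 at VLEN = 1024, while a single register only reaches the 4096 bits of one Cray V register at VLEN = 4096.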

Re: Short Vectors Versus Long Vectors

<v0a6ea$25g9k$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=37976&group=comp.arch#37976


Path: i2pn2.org!i2pn.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 05:47:54 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <v0a6ea$25g9k$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
<5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
<hqmg2j1vbkf6suddfnsh3h3uhtkqqio4uk@4ax.com>
Injection-Date: Wed, 24 Apr 2024 07:47:54 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="91a34b43642a05499f88b93fb0e8ec6c";
logging-data="2277684"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+cjMLZAex1nYxbzbx18By9McNVgbkN6ik="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:0Q0/7F0yHUvpq3A0k67qRZZTaUI=

by: Thomas Koenig - Wed, 24 Apr 2024 05:47 UTC

John Savard <quadibloc@servername.invalid> schrieb:
> On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:
>
>>CRAY machines stayed "in style" as long as memory latency remained smaller
>>than the length of a vector (64 cycles) and fell out of favor when the cores
>>got fast enough that memory could no longer keep up.
>>
>>I wish them well, but I expect it will not work out as they desire.....
>
> I know that you've said this about Cray-style vectors.
>
> I had thought the cause was much simpler. As soon as chips like the
> 486 DX and then the Pentium II became available,

The 486 came out in 1989.

> a Cray-style machine
> would have had to be implemented from smaller-scale integrated
> circuits, so it would have been wildly uneconomic for the performance
> it provided;

The Cray C90 came out in 1991. That was still considered economic
by the people who bought it :-)

The (low-level) competition for scientific computing at the time
was workstations.
