Message-ID:

Play Rogue, visit exotic locations, meet strange creatures and kill them.

devel / comp.arch / Re: Vector ISA Categorisation

On 2021-07-11, Quadibloc wrote:
> On Sunday, July 11, 2021 at 2:46:28 AM UTC-6, Marcus wrote:
>
>> That's why I made the MRISC32 ISA vector register size agnostic. It's
>> not very hard, but you need to keep it in mind.
>
> True. Basically, one designs vector memory-reference instructions that
> are indexed, and you have the last of those in a loop increment the index
> by the vector length, whatever it is. And a vector counter is also needed,
> which gets decremented by the vector length on each iteration, and
> when it goes less than the vector length, causes all the instructions to work
> on partial vectors.

On MRISC32 the vector loop control typically looks like this, given
that R1 = input array, R2 = output array and R3 = array length:

CPUID VL, Z, Z ; Get max VL
loop:
MIN VL, VL, R3 ; Set VL for this iteration
SUB R3, R3, VL ; Decrement # of array elements left
LDW V1, R1, #4
... ; Perform vector operations...
STW V1, R2, #4
LDEA R1, R1, VL*4 ; Increment input ptr
LDEA R2, R2, VL*4 ; Increment output ptr
BNZ R3, loop ; Continue if there are any elements left

Admittedly the loop logic is quite explicit, but I believe that it
usually drowns in the noise since the vector instructions will consume
most of the time (while leaving the front-end free so that the scalar
loop control instructions can run concurrently with the vector
operations).

(BTW, "CPUID" is planned to be replaced by a more generic "get
system/status register" instruction).

> What concerns me about this model, though, is that it does seem to limit
> what you can do with vectors; it seems to impose one particular model of
> vector calculations.
> > While Mitch Alsup's VVM doesn't restrict what you can do with vectors, it
> seems to keep the vectors in memory.

I think that Mitch Alsup's virtual vectors can solve more problems for
sure.

>
> Having the length of vectors known to the programmer, and having vector
> registers, seems to me to allow the most flexibility and the highest performance,
> even if it creates a serious issue of future-proofing.

Since the MRISC32 ISA mandates a minimum vector register size (currently
16 elements, or 512 bits), there are many situations where the vector
length can be defined by the programmer. This is particularly useful for
algorithms that require folding (horizontal operations), where you
generally need to know the length of a vector in order to know how many
folding operations to do ("unrolled" by the programmer for maximum
performance).

> My current inclination is
> simply to include *all three approaches* in the description of the hardware, until
> I learn more about these issues, and can be confident that one approach is the
> best. Performance is a particular emphasis that I have here, since this is really
> what vectors are _for_.
>
> John Savard
>

On Monday, July 12, 2021 at 5:57:13 AM UTC+1, Marcus wrote:
> On 2021-07-11, Quadibloc wrote:
> > On Sunday, July 11, 2021 at 2:46:28 AM UTC-6, Marcus wrote:
> >
> >> That's why I made the MRISC32 ISA vector register size agnostic. It's
> >> not very hard, but you need to keep it in mind.
> >
> > True. Basically, one designs vector memory-reference instructions that
> > are indexed, and you have the last of those in a loop increment the index
> > by the vector length, whatever it is.

you may both be fascinated to know that in SVP64 that loop increment
number is put through a hardware "REMAPper" (actually, up to 4 such
REMAPpers, one for each src and dest register), such that Horizontal
Reduction, in-place Butterfly and Matrix Multiply schedules may be
performed. in the case of MMult it's *five instructions*, two of which
are needed to zero the result registers. if that's not needed then it's
only 3 instructions for a full arbitrary-sized *in-place* matrix multiply.

> On MRISC32 the vector loop control typically looks like this, given
> that R1 = input array, R2 = output array and R3 = array length:
>
> CPUID VL, Z, Z ; Get max VL
> loop:
> MIN VL, VL, R3 ; Set VL for this iteration
> ...
> Admittedly the loop logic is quite explicit, but I believe that it
> usually drowns in the noise since the vector instructions will consume
> most of the time (while leaving the front-end free so that the scalar
> loop control instructions can run concurrently with the vector
> operations).

this i termed "Horizontal-first" Vectoring. i.e. for each instruction
the order of the operations travels *horizontally* along *all* elements
first, before moving on to the next instruction.

> I think that Mitch Alsup's virtual vectors can solve more problems for
> sure.

Mitch's VVM is what i would term "Vertical-first" Vectoring, with
*implicit* ability to horizontally batch elements together. i.e. it
implicitly assesses available resources then batches as many
elements in a given instruction as possible (which could only
be one) before moving on (Vertically) to the next instruction.

this actually imposes some limitations / assumptions (i only
just understood VVM enough last week, after 2 years, to be able
to say this), namely that a mixed read-write interaction between
scalar and vector registers does not seem to be possible.

MRISC32, SVP64 and RVV because they are all "Horizontal",
the instructions could (hypothetically in MRISC32's case?) use
a scalar as a source or destination (VEXTRACT, VSPLAT, even
arithmetic).

whereas for VVM it has been a *fundamental design principle*
that the entire loop will not write to a scalar register, because to
do so would prevent and prohibit Horizontal element-group-detection
> >
> > Having the length of vectors known to the programmer, and having vector
> > registers, seems to me to allow the most flexibility and the highest performance,
> > even if it creates a serious issue of future-proofing.

there are cases for both.

> Since the MRISC32 ISA mandates a minimum vector register size (currently
> 16 elements, or 512 bits), there are many situations where the vector
> length can be defined by the programmer.

you may be fascinated to know that both Broadcom VideoCore-IV and
NEC SX-Aurora have something called "Virtual" Vector Lanes. the *actual*
underlying hardware is limited to 4 in VC-IV but to the ISA it looks like 16,
and the SX-Aurora actual hardware is 16 but to the ISA the Maximum Vector
Length (MVL) appears to all intents and purposes to be 64.

in other words there is an internal hardware-level for-loop going on.

> This is particularly useful for
> algorithms that require folding (horizontal operations), where you
> generally need to know the length of a vector in order to know how many
> folding operations to do ("unrolled" by the programmer for maximum
> performance).

ok so there's two different approaches to Vector regfiles:

1) separate Vector numbers (r0-r31 usually)
2) "MMX-like" which effectively drops Vectorisation on top of the *scalar* regfile
[MMX used x87 FP regs as integer SIMD]

- RVV allows the *option* to drop Vector registers on top of the scalar
FP regfile, for embedded designs, i.e. allows hardware designers
to select *either* (1) *or* (2).
- SVP64 goes route (2)
- VVM goes route (1)
- MRISC32 appears to be going route (1)
- SX-Aurora and other Cray-style Vector ISAs went with (1)

in route (1) you *do not* - should not - need the programmer to
know the Vector Length. if that is a base assumption it is a
design "Red Flag", as Quadiblock points out, the future-proof
breaking is quite serious and should not be underestimated.

when going route (2) *then* you have some quite fascinating
properties, because you have to have an additional argument
to the "Set Vector Length" instruction (or other "harware config"
instruction) which defines exactly *how much* of the underlying
Scalar regfile is to be allocated to
registers-that-happen-to-be-numbered-as-Vectors.

*here* you can do tricks such as Horizontal operations through
careful scheduling of the size of individual Vector registers.
you can first define the Vector Registers to be one size
and location, perform the first parallel suite of Horizontal
Reduction. then *redefine* the Vector Registers to be another
that is directly suited to the second level of Horizontal Reduction
and so on, a la mapreduce.

in SVP64 we decided that Horizontal mapreduce is important
enough to actually define a REMAP schedule for it. some alternatives:

1) add actual explicit Vector Reduction instructions (NEC SX-Aurora)
this route is limited, but has advantages in that the accumulators
can be greater accuracy than standard registers can hold.

2) use predicate masks and indexed copying (quite wasteful,
quite a lot of Vector registers needed)

what we've done in SVP64 is to define a fixed (abstracted) schedule,
where the operations *will* be carried out in a set Tree-Reduce
order. reg[n] = OPERATION(reg[m], reg[p]) where n m and p each
vary according to the length of the Vector (no need for a fixed
power-of-two).

this ensures that even for non-commutative operations (divide,
subtract) the results are at least DEFINED, and therefore may
have actual practical uses.

so there's a number of features which can be used to categorise
Vector ISAs:

* Horizontal-first vs Vertical-first element scheduling
* Explicit Vector register numbering vs overloading of Scalar regfiles
* Explicit Vector Length vs "Architecturally-independent" Vector Length

these have really fundamental architectural implications, hilariously
though they can all use the exact same internal micro-architectural
layout: it's just the issue phase that is radically different.

On 7/12/2021 6:09 AM, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 5:57:13 AM UTC+1, Marcus wrote:
>> On 2021-07-11, Quadibloc wrote:

snip

>> I think that Mitch Alsup's virtual vectors can solve more problems for
>> sure.
>
> Mitch's VVM is what i would term "Vertical-first" Vectoring, with
> *implicit* ability to horizontally batch elements together. i.e. it
> implicitly assesses available resources then batches as many
> elements in a given instruction as possible (which could only
> be one) before moving on (Vertically) to the next instruction.
>
> this actually imposes some limitations / assumptions (i only
> just understood VVM enough last week, after 2 years, to be able
> to say this), namely that a mixed read-write interaction between
> scalar and vector registers does not seem to be possible.

I don't understand what you want here. Reading a scalar value from a
register within a VVM loop is explicitly provided for (appropriate bit
set in VEC instruction). The register is read the first time through
the loop, and the value reused.

For write interaction, what do you want to have happen? Do you want the
value to be written once at the end of a loop, with whatever is the most
recent value of some other register? That is certainly possible. Can
you give an example of where a write to a "scalar" register within a
loop is useful?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Vector ISA Categorisation

<f9331b8a-6f89-4abd-9052-a49585b7bf13n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18637&group=comp.arch#18637

copy link Newsgroups: comp.arch

X-Received: by 2002:ad4:57d1:: with SMTP id y17mr14350843qvx.44.1626105252014;
Mon, 12 Jul 2021 08:54:12 -0700 (PDT)
X-Received: by 2002:a9d:5f19:: with SMTP id f25mr29595490oti.206.1626105251752;
Mon, 12 Jul 2021 08:54:11 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 08:54:11 -0700 (PDT)
In-Reply-To: <schl4p$7om$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.176.255; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.176.255
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schl4p$7om$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f9331b8a-6f89-4abd-9052-a49585b7bf13n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Mon, 12 Jul 2021 15:54:12 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 45

by: luke.l...@gmail.com - Mon, 12 Jul 2021 15:54 UTC

On Monday, July 12, 2021 at 3:55:24 PM UTC+1, Stephen Fuld wrote:
> On 7/12/2021 6:09 AM, luke.l...@gmail.com wrote:

> > this actually imposes some limitations / assumptions (i only
> > just understood VVM enough last week, after 2 years, to be able
> > to say this), namely that a mixed read-write interaction between
> > scalar and vector registers does not seem to be possible.
> I don't understand what you want here. Reading a scalar value from a
> register within a VVM loop is explicitly provided for (appropriate bit
> set in VEC instruction). The register is read the first time through
> the loop, and the value reused.

scalar read is a given (easy)

> For write interaction, what do you want to have happen? Do you want the
> value to be written once at the end of a loop,

no. i expect full priority behaviour with scalar read and writes as first class citizens exactly as if the loop had no vector behaviour at all (which is the fallback position and the default implementation for low-resource implementations)

> with whatever is the most
> recent value of some other register?

no.

if it is read, i expect the value sequentially read. if it is written, i expect the next read occurrence to receive the last written calue as in the sequential order in which the instructions are executed.

no exceptions, no caveats, no restrictions.

if there is an interrupt (context switch) in the middle of the loop i expect the scalar register's contents to be *fully* up-to-date exactly as expected in any loop code.

> That is certainly possible. Can
> you give an example of where a write to a "scalar" register within a
> loop is useful?

FFMPEG MP3 DCT apply_window function. starts at line 124
https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/mpegaudiodsp_template.c

it performs multiple multiply-and-accumulate operations within the same loop.

it would be extremely annoying and inconvenient if not impossible to have to perform post loop analysis or have all intermediate sums written out to memory.

it is however natural to use multiple scalar registers as accumulators *within the vector loop*.

> For write interaction, what do you want to have happen? Do you want the
> value to be written once at the end of a loop, with whatever is the most
> recent value of some other register? That is certainly possible. Can you
> give an example of where a write to a "scalar" register within a loop
> is useful?

He's referring to things like "reduce" operations, e.g. compute the
product or the sum of all the elements of a vector.

Stefan

On 2021-07-12, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 5:57:13 AM UTC+1, Marcus wrote:
>> On 2021-07-11, Quadibloc wrote:
>>> On Sunday, July 11, 2021 at 2:46:28 AM UTC-6, Marcus wrote:
>>>
>>>> That's why I made the MRISC32 ISA vector register size agnostic. It's
>>>> not very hard, but you need to keep it in mind.
>>>
>>> True. Basically, one designs vector memory-reference instructions that
>>> are indexed, and you have the last of those in a loop increment the index
>>> by the vector length, whatever it is.
>
> you may both be fascinated to know that in SVP64 that loop increment
> number is put through a hardware "REMAPper" (actually, up to 4 such
> REMAPpers, one for each src and dest register), such that Horizontal
> Reduction, in-place Butterfly and Matrix Multiply schedules may be
> performed. in the case of MMult it's *five instructions*, two of which
> are needed to zero the result registers. if that's not needed then it's
> only 3 instructions for a full arbitrary-sized *in-place* matrix multiply.
>

That's impressive, and I'll have to look into it. So are you saying that
the REMAP will create controlled strides into the register file
(indexing vector elements)? It sounds like it could be similar to the
MRISC32 AGU (address generation unit) that can generate vector strides
against memory.

>> On MRISC32 the vector loop control typically looks like this, given
>> that R1 = input array, R2 = output array and R3 = array length:
>>
>> CPUID VL, Z, Z ; Get max VL
>> loop:
>> MIN VL, VL, R3 ; Set VL for this iteration
>> ...
>> Admittedly the loop logic is quite explicit, but I believe that it
>> usually drowns in the noise since the vector instructions will consume
>> most of the time (while leaving the front-end free so that the scalar
>> loop control instructions can run concurrently with the vector
>> operations).
>
> this i termed "Horizontal-first" Vectoring. i.e. for each instruction
> the order of the operations travels *horizontally* along *all* elements
> first, before moving on to the next instruction.
>
>> I think that Mitch Alsup's virtual vectors can solve more problems for
>> sure.
>
> Mitch's VVM is what i would term "Vertical-first" Vectoring, with
> *implicit* ability to horizontally batch elements together. i.e. it
> implicitly assesses available resources then batches as many
> elements in a given instruction as possible (which could only
> be one) before moving on (Vertically) to the next instruction.
>
> this actually imposes some limitations / assumptions (i only
> just understood VVM enough last week, after 2 years, to be able
> to say this), namely that a mixed read-write interaction between
> scalar and vector registers does not seem to be possible.
>
> MRISC32, SVP64 and RVV because they are all "Horizontal",
> the instructions could (hypothetically in MRISC32's case?) use
> a scalar as a source or destination (VEXTRACT, VSPLAT, even
> arithmetic).

Yes the MRISC32 can feed ("splat") a scalar onto a vector operation. The
scalar can either be a scalar register (R0..R31) or a 15-bit immediate
value (usually using my I15HL encoding - so either a sign-extended or a
sticky-left-shifted 14-bit value). For instance:

AND V3, V2, #0x7f800000
ADD V3, V3, R9

This is quite useful, and I'd say that in a Vector (not SIMD!)
architecture it's almost (*almost*) a must-have, since the cost of
a separate VSPLAT is quite high in comparison. I /think/ that the
hardware cost in a wide EU implementation for doing the scalar->vector
splat as part of scalar register fetch or forwarding is manageable since
it should just be a matter of line duplication + a MUX (vector or scalar
input), or similar.

I have planned a VEXTRACT instruction (it's fairly trivial, but I just
have not implemented it yet).

>
> whereas for VVM it has been a *fundamental design principle*
> that the entire loop will not write to a scalar register, because to
> do so would prevent and prohibit Horizontal element-group-detection
>
>>>
>>> Having the length of vectors known to the programmer, and having vector
>>> registers, seems to me to allow the most flexibility and the highest performance,
>>> even if it creates a serious issue of future-proofing.
>
> there are cases for both.
>
>> Since the MRISC32 ISA mandates a minimum vector register size (currently
>> 16 elements, or 512 bits), there are many situations where the vector
>> length can be defined by the programmer.
>
> you may be fascinated to know that both Broadcom VideoCore-IV and
> NEC SX-Aurora have something called "Virtual" Vector Lanes. the *actual*
> underlying hardware is limited to 4 in VC-IV but to the ISA it looks like 16,
> and the SX-Aurora actual hardware is 16 but to the ISA the Maximum Vector
> Length (MVL) appears to all intents and purposes to be 64.
>
> in other words there is an internal hardware-level for-loop going on.

Not sure, but this sounds like my plan for MRISC32, i.e. that to the
software the register size ("max vector length") will typically be
longer than the EU width. So e.g. with a 16-element register size you
could have a 1-wide or 4-wide ALU (or something else) - the software
model is the same. Similarly the "units of transaction" and register
ports could be 1-wide, 2-wide, 4-wide, ..., etc, but to the software
the vector length would typically be longer.

>> This is particularly useful for
>> algorithms that require folding (horizontal operations), where you
>> generally need to know the length of a vector in order to know how many
>> folding operations to do ("unrolled" by the programmer for maximum
>> performance).
>
> ok so there's two different approaches to Vector regfiles:
>
> 1) separate Vector numbers (r0-r31 usually)
> 2) "MMX-like" which effectively drops Vectorisation on top of the *scalar* regfile
> [MMX used x87 FP regs as integer SIMD]
>
> - RVV allows the *option* to drop Vector registers on top of the scalar
> FP regfile, for embedded designs, i.e. allows hardware designers
> to select *either* (1) *or* (2).
> - SVP64 goes route (2)
> - VVM goes route (1)
> - MRISC32 appears to be going route (1)
> - SX-Aurora and other Cray-style Vector ISAs went with (1)
>

Yes MRISC32 is mostly 1), ... but, well, it's is kind of a mix of
1) and 2).

Both FP and INT can live in either scalar registers (R0-R31) or vector
registers (V0-V31). So there are effectively two user visible register
files.

OTOH I included (optional) support for packed SIMD on the vector
element level, so that a single 32-bit vector element can be split into
2x16 or 4x8, "true SIMD" style (ugly, but simple and fairly cost
effective, and occasionally useful). And the same goes for the scalar
registers: each scalar register can be treated as 1x32, 2x16 or 4x8. In
other words, the packed SIMD stuff is not a function of the register
file or even the vector control - it's a function of each individual EU.

Note: In a future MRISC64 ISA, where all scalar registers and vector
elements are 64 bits wide, the packed SIMD functionality will be very
similar to MMX.

I'm not sure that this is a good idea, but so far it's been quite nice
to occasionally be able to do 4x8 integer arithmetic or 2x16 FP
arithmetic on scalar registers, for instance, and the vector logic
should be much simpler than RVV for instance.

> in route (1) you *do not* - should not - need the programmer to
> know the Vector Length. if that is a base assumption it is a
> design "Red Flag", as Quadiblock points out, the future-proof
> breaking is quite serious and should not be underestimated.

Believe me, I've worked with many SIMD ISA:s, and the lack of future-
proofness in those ISA:s is a real pain (particularly x86 - ARM was
kinder and stuck with 128-bit NEON for a long time).

The only situations that I've come across where you may want to define
the vector length up front as a programmer are:

1) When you're going to do horizontal operations with the help of
folding and it's likely that the data array size is < 8 X the machine
vector size (or so), in which case the folding steps should be unrolled
up-front for optimal performance. If you're working on arrays with 1000+
elements - just go ahead and code up the folding steps in regular
dynamic loop at the end.

2) When you have natural data packet sizes (say, interleaved
N-dimensional components such as [X,Y,Z,W,R,G,B,A] or a 4x4 matrix).

3) When you want to do control flow based on vector content, e.g.
checking the end-of-iteration criterion for a Mandelbrot iteration, in
which case you may not want to spend too much time iterating on a wide
vector when all elements but one satisfy your criterion.

Click here to read the complete article

On Friday, July 9, 2021 at 7:48:53 PM UTC-4, Ivan Godard wrote:
[snip]
> As for whether customers will be paranoid about gen/conAsm why would
> they be? They seem to have no problem with (for example) Java VM code as
> a distribution format, and a JIT producing binary across differing
> binary ISAs. Mill is no different.

First, you seem to be assuming that people are rational and willing
to take the time and effort to examine a proposal reasonably. I
suspect most decisions are not primarily data-driven.

(Even a rational person must be aware that FUD — justified or not —
will impact the adoption of a policy/technology, so irrational
perceptions must be considered when there is any kind of network
effect.)

Second, Java bytecode is not adopted as a universal software
distribution format (SDF). System software is typically not
distributed in a "portable" format. Selective download (or fat
binaries for physical media distribution) seem the common
method of providing (rather limited) support of multiple
machine binary formats.

Third, I do not think Java bytecode has strong system-level
support for caching compilation work. I do not know of any
theoretical obstacle to a JVM persistently caching such, but
I was under the impression that such was not common practice
even for business servers with expected reuse of software. (I
*think* Android's Davlik has some support for install-time
compilation.) For servers, this would impact performance after
restart but not typical performance (as the programs tend to
remain active and the JVM especially so).

Fourth, the Mill incorporates several unconventional features
(static scheduling, single address space, virtually addressed
caches, etc.) in addition to using a higher-level SDF. While I
perceive these as coordinated as a reasoned design, others
might feel such is a sign of design by committee.

(I disagree with the choice of static scheduling and feel that
more attention to core-internal (L1-inward) and inter-core
communication is desirable. However, I very much like the idea
of a higher-level SDF; I view such as a reasonable work-caching
choice compared to only having a choice between developer-
friendly format (source code) or architecture-specific, micro-
architecture-optimized format as well as providing some ability
to more transparently work around hardware bugs. [For a
conventional system, if a particular rarely used group of actions
resulted in a timing violation, a typical fix might be to
underclock the core. With a higher-level SDF, the bug might be
averted by software construction.])

> However, even those who want to do their own ROM are extremely unlikely
> to need or want to do anything in conAsm. It is near impossible to
> manually write anything significant in conAsm - this is not a machine
> wherein you drop into writing assembler for the fun of it. About the
> only people who ever deal with conAsm are the people who maintain the
> specializer.

This might also be FUD-inducing. People can be irrationally
(and rationally) uncomfortable with relying on compiler correctness.
If checking correctness becomes more difficult (beyond just
unfamiliarity of the assembly language), some people will balk.

Even on a conventional serial, static-register based architecture,
optimization can increase the difficulty for a human of mapping
compiler-generated assembly back to source code.

I suspect tools could help and formal proofs might comfort some
more rational people. However, even a correct formal proof of correct
translation of llvm front-end output (language definition) to genAsm
and genAsm to each conAsm would not assure unexpected
behavior whether llvm correctly handles the language definition but
programmers assume a different interpretation and the difference
does not manifest in conventional architectures or llvm mishandles
the language but the bug does not manifest in conventional
architectures. (I receive the impression that memory consistency
has historically been associated with such miscommunication.)

On 7/12/2021 9:31 AM, Stefan Monnier wrote:
>> For write interaction, what do you want to have happen? Do you want the
>> value to be written once at the end of a loop, with whatever is the most
>> recent value of some other register? That is certainly possible. Can you
>> give an example of where a write to a "scalar" register within a loop
>> is useful?
>
> He's referring to things like "reduce" operations, e.g. compute the
> product or the sum of all the elements of a vector.

Thank you. I was having a hard time trying to figure out to what Luke
was referring, as reduce operations were the subject of a discussion
here a few months ago. I thought he wanted some arcane functionality
that I didn't understand.

Reduce type operations are absolutely allowed in VVM, and work just as
you would expect, so I still don't understand Luke's objection.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Vector ISA Categorisation

<8cdfdab4-79b1-453b-b473-8cb5b743e909n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18644&group=comp.arch#18644

copy link Newsgroups: comp.arch

X-Received: by 2002:a0c:fde3:: with SMTP id m3mr127439qvu.55.1626111168698; Mon, 12 Jul 2021 10:32:48 -0700 (PDT)
X-Received: by 2002:a4a:95f2:: with SMTP id p47mr266229ooi.40.1626111168437; Mon, 12 Jul 2021 10:32:48 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!tr1.eu1.usenetexpress.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 10:32:48 -0700 (PDT)
In-Reply-To: <57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me> <sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me> <sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me> <sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me> <sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me> <scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com> <scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com> <9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com> <9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me> <0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me> <57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8cdfdab4-79b1-453b-b473-8cb5b743e909n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 12 Jul 2021 17:32:48 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 143

by: MitchAlsup - Mon, 12 Jul 2021 17:32 UTC

On Monday, July 12, 2021 at 8:09:09 AM UTC-5, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 5:57:13 AM UTC+1, Marcus wrote:
> > On 2021-07-11, Quadibloc wrote:
> > > On Sunday, July 11, 2021 at 2:46:28 AM UTC-6, Marcus wrote:
> > >
> > >> That's why I made the MRISC32 ISA vector register size agnostic. It's
> > >> not very hard, but you need to keep it in mind.
> > >
> > > True. Basically, one designs vector memory-reference instructions that
> > > are indexed, and you have the last of those in a loop increment the index
> > > by the vector length, whatever it is.
> you may both be fascinated to know that in SVP64 that loop increment
> number is put through a hardware "REMAPper" (actually, up to 4 such
> REMAPpers, one for each src and dest register), such that Horizontal
> Reduction, in-place Butterfly and Matrix Multiply schedules may be
> performed. in the case of MMult it's *five instructions*, two of which
> are needed to zero the result registers. if that's not needed then it's
> only 3 instructions for a full arbitrary-sized *in-place* matrix multiply.
> > On MRISC32 the vector loop control typically looks like this, given
> > that R1 = input array, R2 = output array and R3 = array length:
> >
> > CPUID VL, Z, Z ; Get max VL
> > loop:
> > MIN VL, VL, R3 ; Set VL for this iteration
> > ...
> > Admittedly the loop logic is quite explicit, but I believe that it
> > usually drowns in the noise since the vector instructions will consume
> > most of the time (while leaving the front-end free so that the scalar
> > loop control instructions can run concurrently with the vector
> > operations).
> this i termed "Horizontal-first" Vectoring. i.e. for each instruction
> the order of the operations travels *horizontally* along *all* elements
> first, before moving on to the next instruction.
> > I think that Mitch Alsup's virtual vectors can solve more problems for
> > sure.
> Mitch's VVM is what i would term "Vertical-first" Vectoring, with
> *implicit* ability to horizontally batch elements together. i.e. it
> implicitly assesses available resources then batches as many
> elements in a given instruction as possible (which could only
> be one) before moving on (Vertically) to the next instruction.
>
> this actually imposes some limitations / assumptions (i only
> just understood VVM enough last week, after 2 years, to be able
> to say this), namely that a mixed read-write interaction between
> scalar and vector registers does not seem to be possible.
>
> MRISC32, SVP64 and RVV because they are all "Horizontal",
> the instructions could (hypothetically in MRISC32's case?) use
> a scalar as a source or destination (VEXTRACT, VSPLAT, even
> arithmetic).
>
> whereas for VVM it has been a *fundamental design principle*
> that the entire loop will not write to a scalar register, because to
> do so would prevent and prohibit Horizontal element-group-detection
<
Does not have to write to a scalar register not does not write......
> > >
> > > Having the length of vectors known to the programmer, and having vector
> > > registers, seems to me to allow the most flexibility and the highest performance,
> > > even if it creates a serious issue of future-proofing.
> there are cases for both.
> > Since the MRISC32 ISA mandates a minimum vector register size (currently
> > 16 elements, or 512 bits), there are many situations where the vector
> > length can be defined by the programmer.
> you may be fascinated to know that both Broadcom VideoCore-IV and
> NEC SX-Aurora have something called "Virtual" Vector Lanes. the *actual*
> underlying hardware is limited to 4 in VC-IV but to the ISA it looks like 16,
> and the SX-Aurora actual hardware is 16 but to the ISA the Maximum Vector
> Length (MVL) appears to all intents and purposes to be 64.
>
> in other words there is an internal hardware-level for-loop going on.
> > This is particularly useful for
> > algorithms that require folding (horizontal operations), where you
> > generally need to know the length of a vector in order to know how many
> > folding operations to do ("unrolled" by the programmer for maximum
> > performance).
> ok so there's two different approaches to Vector regfiles:
>
> 1) separate Vector numbers (r0-r31 usually)
CRAY had only 8 vector registers.
> 2) "MMX-like" which effectively drops Vectorisation on top of the *scalar* regfile
> [MMX used x87 FP regs as integer SIMD]
3) Virtual Vector Registers.
>
> - RVV allows the *option* to drop Vector registers on top of the scalar
> FP regfile, for embedded designs, i.e. allows hardware designers
> to select *either* (1) *or* (2).
> - SVP64 goes route (2)
> - VVM goes route (1)
> - MRISC32 appears to be going route (1)
> - SX-Aurora and other Cray-style Vector ISAs went with (1)
>
> in route (1) you *do not* - should not - need the programmer to
> know the Vector Length. if that is a base assumption it is a
> design "Red Flag", as Quadiblock points out, the future-proof
> breaking is quite serious and should not be underestimated.
>
> when going route (2) *then* you have some quite fascinating
> properties, because you have to have an additional argument
> to the "Set Vector Length" instruction (or other "harware config"
> instruction) which defines exactly *how much* of the underlying
> Scalar regfile is to be allocated to
> registers-that-happen-to-be-numbered-as-Vectors.
>
> *here* you can do tricks such as Horizontal operations through
> careful scheduling of the size of individual Vector registers.
> you can first define the Vector Registers to be one size
> and location, perform the first parallel suite of Horizontal
> Reduction. then *redefine* the Vector Registers to be another
> that is directly suited to the second level of Horizontal Reduction
> and so on, a la mapreduce.
>
> in SVP64 we decided that Horizontal mapreduce is important
> enough to actually define a REMAP schedule for it. some alternatives:
>
> 1) add actual explicit Vector Reduction instructions (NEC SX-Aurora)
> this route is limited, but has advantages in that the accumulators
> can be greater accuracy than standard registers can hold.
>
> 2) use predicate masks and indexed copying (quite wasteful,
> quite a lot of Vector registers needed)
>
> what we've done in SVP64 is to define a fixed (abstracted) schedule,
> where the operations *will* be carried out in a set Tree-Reduce
> order. reg[n] = OPERATION(reg[m], reg[p]) where n m and p each
> vary according to the length of the Vector (no need for a fixed
> power-of-two).
>
> this ensures that even for non-commutative operations (divide,
> subtract) the results are at least DEFINED, and therefore may
> have actual practical uses.
>
> so there's a number of features which can be used to categorise
> Vector ISAs:
>
> * Horizontal-first vs Vertical-first element scheduling
> * Explicit Vector register numbering vs overloading of Scalar regfiles
> * Explicit Vector Length vs "Architecturally-independent" Vector Length
>
> these have really fundamental architectural implications, hilariously
> though they can all use the exact same internal micro-architectural
> layout: it's just the issue phase that is radically different.
>
> l.

Re: Vector ISA Categorisation

<626c1606-026d-402e-940b-8d96ce60d9b2n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18645&group=comp.arch#18645

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:620a:1728:: with SMTP id az40mr7082345qkb.366.1626111707961;
Mon, 12 Jul 2021 10:41:47 -0700 (PDT)
X-Received: by 2002:a05:6830:17d2:: with SMTP id p18mr154050ota.82.1626111707660;
Mon, 12 Jul 2021 10:41:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 10:41:47 -0700 (PDT)
In-Reply-To: <8cdfdab4-79b1-453b-b473-8cb5b743e909n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.177.0; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.177.0
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <8cdfdab4-79b1-453b-b473-8cb5b743e909n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <626c1606-026d-402e-940b-8d96ce60d9b2n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Mon, 12 Jul 2021 17:41:47 +0000
Content-Type: text/plain; charset="UTF-8"

by: luke.l...@gmail.com - Mon, 12 Jul 2021 17:41 UTC

On Monday, July 12, 2021 at 6:32:49 PM UTC+1, MitchAlsup wrote:

> Does not have to write to a scalar register not does not write......

ok then that's radically different, and yes, would allow scalars
to be used as accumulators, i.e. treat them as first-class citizens.

i did think it odd, i mean from a hazard perspective all that need be done
is that the current vector element creates a write hazard on the scalar register.

Paul A. Clayton <paaronclayton@gmail.com> schrieb:

>> However, even those who want to do their own ROM are extremely unlikely
>> to need or want to do anything in conAsm. It is near impossible to
>> manually write anything significant in conAsm - this is not a machine
>> wherein you drop into writing assembler for the fun of it. About the
>> only people who ever deal with conAsm are the people who maintain the
>> specializer.
>
> This might also be FUD-inducing. People can be irrationally
> (and rationally) uncomfortable with relying on compiler correctness.

As somebody who more or less regularly fixes wrong-code compiler
bugs, I concur. Your compiler will, of course, have bugs, both
bugs inherited from LLVM and bugs in the specializer.

> If checking correctness becomes more difficult (beyond just
> unfamiliarity of the assembly language), some people will balk.

Yep.

> Even on a conventional serial, static-register based architecture,
> optimization can increase the difficulty for a human of mapping
> compiler-generated assembly back to source code.

This is especially true of higher-level languages, where a
single line of code might be tranlated into dozens of assembler
instructions including calls to malloc and free.

What I tend to do in the gcc framework is to compile, without
debugging, using -fverbose-asm, then assemble the code with
debugging and single-step through it, so that the main debug
information does not interfere.

> I suspect tools could help and formal proofs might comfort some
> more rational people.

A reasonably performant compiler is far, far too complex for
formal proofs for any large amount of code.

> However, even a correct formal proof of correct
> translation of llvm front-end output (language definition) to genAsm
> and genAsm to each conAsm would not assure unexpected
> behavior whether llvm correctly handles the language definition but
> programmers assume a different interpretation and the difference
> does not manifest in conventional architectures or llvm mishandles
> the language but the bug does not manifest in conventional
> architectures.

People making wrong assumption about undefined behavior in C is
certainly the rule rather than the exception.

> (I receive the impression that memory consistency
> has historically been associated with such miscommunication.)

Pointer aliasing... it will be interesting to see what bugs
The Mill shakes out of LLVM.

Contrast this with IBM's experience with the AS/400 (now iSeries or
something else). For those who aren't familiar with this issue, AS/400
was IBM's successor to their System 38. Programs written for the S/38
were compiled into an intermediate form. This was then translated into
machine language, but the executable file still contained the
intermediate form. The AS/400 had a different ISA than the S/38. So
when a program compiled for a S/38 was first run on AS/400 ISA, it was
transparently retranslated into the native AS/400 ISA.

According to the AS/400 book, some customers complained that their brand
new, supposedly faster AS/400, ran their programs slower than the older
S/38, customer support told then to run it again, and of course it ran
faster so they were happy. They didn't even realize that the first time
it was run, they were incurring the cost of the retranslation!

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Vector ISA Categorisation

<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18648&group=comp.arch#18648

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:622a:1105:: with SMTP id e5mr207737qty.268.1626114629307;
Mon, 12 Jul 2021 11:30:29 -0700 (PDT)
X-Received: by 2002:a05:6808:1313:: with SMTP id y19mr95527oiv.37.1626114629037;
Mon, 12 Jul 2021 11:30:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 11:30:28 -0700 (PDT)
In-Reply-To: <schqua$k02$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.177.0; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.177.0
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Mon, 12 Jul 2021 18:30:29 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: luke.l...@gmail.com - Mon, 12 Jul 2021 18:30 UTC

On Monday, July 12, 2021 at 5:34:20 PM UTC+1, Marcus wrote:
> On 2021-07-12, luke.l...@gmail.com wrote:

> > you may both be fascinated to know that in SVP64 that loop increment
> > number is put through a hardware "REMAPper" (actually, up to 4 such
> > REMAPpers, one for each src and dest register), such that Horizontal
> > Reduction, in-place Butterfly and Matrix Multiply schedules may be
> > performed. in the case of MMult it's *five instructions*, two of which
> > are needed to zero the result registers. if that's not needed then it's
> > only 3 instructions for a full arbitrary-sized *in-place* matrix multiply.
> >
> That's impressive, and I'll have to look into it. So are you saying that
> the REMAP will create controlled strides into the register file
> (indexing vector elements)?

yes. a "normal" vector ISA would:

* compute a series of offset indices with actual code
* perform either an INDEXED MV (reg[reg[x]] = reg[y]) or an INDEXED LD
* compute the stuff sequentially
* perform yet more INDEXED MVs or LDs.

expensive.

> It sounds like it could be similar to the
> MRISC32 AGU (address generation unit) that can generate vector strides
> against memory.

an array of offsets? this is the "normal" way, and it costs both in terms of computing the offsets and in using memory for transfers.

in the case of FFT butterfly that's 3 loops with 3 indices/offsets, none of which are entirely linear.

in the case of Matrix Multiply, again it's 3 different indices.

with FFT and MM being used for such a ridiculous number of computer science algorithms including Reed Solomon (NTT, the Number Theory version of DFT) i figured, "what the heck, go for it"

> > MRISC32, SVP64 and RVV because they are all "Horizontal",
> > the instructions could (hypothetically in MRISC32's case?) use
> > a scalar as a source or destination (VEXTRACT, VSPLAT, even
> > arithmetic).
> Yes the MRISC32 can feed ("splat") a scalar onto a vector operation. The
> scalar can either be a scalar register (R0..R31) or a 15-bit immediate
> value (usually using my I15HL encoding - so either a sign-extended or a
> sticky-left-shifted 14-bit value).

nice. in SVP64 it's sv.addi RT.v, 0, #nnnn which is the pseudo op for load immediate, or if sv.addi RT.v, r9, #0 that would copy the sum (R9+0) into the whole vector.

btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.

> Not sure, but this sounds like my plan for MRISC32, i.e. that to the
> software the register size ("max vector length") will typically be
> longer than the EU width. So e.g. with a 16-element register size you
> could have a 1-wide or 4-wide ALU (or something else) - the software
> model is the same.

yes that's the one. and if the MVL is huge then to be honest i can't see that it wouldn't be future proof.

> Similarly the "units of transaction" and register
> ports could be 1-wide, 2-wide, 4-wide, ..., etc, but to the software
> the vector length would typically be longer.

yehyeh, exactly.

> Yes MRISC32 is mostly 1), ... but, well, it's is kind of a mix of
> 1) and 2).
>
> Both FP and INT can live in either scalar registers (R0-R31) or vector
> registers (V0-V31). So there are effectively two user visible register
> files.

urrr as long as the ISA can select either arbitrarily it shoukd be good :)

> OTOH I included (optional) support for packed SIMD on the vector
> element level, so that a single 32-bit vector element can be split into
> 2x16 or 4x8, "true SIMD" style (ugly, but simple and fairly cost
> effective, and occasionally useful).

don't do it. really. ok, if you do, then do it as "sub-vectors", which map onto Vulkan vec2, vec3 and vec4, respectively, but fir god's sake don't do "actual packed SIMD"

set VL=1 and SUBVL=3 yes.

actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.

> And the same goes for the scalar
> registers: each scalar register can be treated as 1x32, 2x16 or 4x8. In
> other words, the packed SIMD stuff is not a function of the register
> file or even the vector control - it's a function of each individual EU.

yes. if it's hidden behind a Vector ISA Frontend, SIMD *ALUs* are incredibly valuable. caveat: *predicated* SIMD ALUs are useful.

> Note: In a future MRISC64 ISA, where all scalar registers and vector
> elements are 64 bits wide, the packed SIMD functionality will be very
> similar to MMX.

unless it's part of the Vector ISA (SUBVL=1/2/3/4) you will regret it, i promise.

>
> I'm not sure that this is a good idea, but so far it's been quite nice
> to occasionally be able to do 4x8 integer arithmetic or 2x16 FP
> arithmetic on scalar registers, for instance, and the vector logic
> should be much simpler than RVV for instance.

RVV has Sub-Vectors. a single bit of predicate mask applies to each subvector.

what RVV is missing which makes it Absolute Hell for 3D (Vulkan Shaders) is swizzle. swizzle is i think something like 30% of all 3D Shader LD/MV operations, it's enormous.

(swizzle: a vec4 can be treated as SIMD BUT has lane-crossing, XYZW can be routed YYZX or ZYXW or any combination.)

swizzle encoding however is... expensive. most GPU ISAs are 64 bit minimum.

> The only situations that I've come across where you may want to define
> the vector length up front as a programmer are:
>
> 1) When you're going to do horizontal operations with the help of
> folding and it's likely that the data array size is < 8 X the machine
> vector size (or so), in which case the folding steps should be unrolled
> up-front for optimal performance. If you're working on arrays with 1000+
> elements - just go ahead and code up the folding steps in regular
> dynamic loop at the end.

this is why NEC's SX-Aurora just went with it and added Horizontal add, mul, or, and, xor etc.

> 2) When you have natural data packet sizes (say, interleaved
> N-dimensional components such as [X,Y,Z,W,R,G,B,A] or a 4x4 matrix).

and RADIX-2 butterfly, or fixed width DCT, yes. hence why we went with explicit VL as well in SVP64.

> 3) When you want to do control flow based on vector content, e.g.
> checking the end-of-iteration criterion for a Mandelbrot iteration, in
> which case you may not want to spend too much time iterating on a wide
> vector when all elements but one satisfy your criterion.

ah for that we added "Data-dependent Fail-First".

using Condition Codes testing in each element, if the test fails the *Vector Length is automatically truncated*. subsequent operations take place *at that altered length*.

Vertical-first Vector ISAs of course do not have to play this trick, they can just exit the loop at the first failed element condition.

look up SVE "Fail-first" and RVV "Speculative Load". i adapted that and applied it to data.

> In most other situations you're better off relying on the automatic
> vector length mechanisms provided by the ISA - and your code should
> automatically adapt to longer vector registers and be able to scale
> perfectly.

yes. except for strncpy or saturated audio processing this is fine. then the assumptions all go to hell. efforts to fix that inside the loop just slow the loop down. reduce the vector size likewise.

> >
> > when going route (2) *then* you have some quite fascinating
> > properties, because you have to have an additional argument
> > to the "Set Vector Length" instruction (or other "harware config"
> > instruction) which defines exactly *how much* of the underlying
> > Scalar regfile is to be allocated to
> > registers-that-happen-to-be-numbered-as-Vectors.
> >
> > *here* you can do tricks such as Horizontal operations through
> > careful scheduling of the size of individual Vector registers.
> > you can first define the Vector Registers to be one size
> > and location, perform the first parallel suite of Horizontal
> > Reduction. then *redefine* the Vector Registers to be another
> > that is directly suited to the second level of Horizontal Reduction
> > and so on, a la mapreduce.
> Doesn't that set limitations to how wide your register ports can be
> (i.e. how many Vector Elements you can read/write at once from/to your
> register file)? In 1) it's fairly trivial to e.g. let each vector
> register port be multiple elements wide (e.g. 4 elements per port).

we will be doing stratified regfiles. 0-31 are "normal", 32-127 are 4 way interleaved.

then we will add a massive cyclic shift register which, if you make the mistake of issuing an op "ADD r32 r33 r34" Instead.of "ADD r32 r36 r40" (note same modulo 4 there) the operands drop into the cyclic buffer and get shunted to an ALU that can cope. increases latency, we don't mind that.

Click here to read the complete article

Re: Vector ISA Categorisation

<4640781b-edc0-4a14-8192-d6b0732400ffn@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18649&group=comp.arch#18649

copy link Newsgroups: comp.arch

X-Received: by 2002:a0c:bf4b:: with SMTP id b11mr666556qvj.11.1626115250855;
Mon, 12 Jul 2021 11:40:50 -0700 (PDT)
X-Received: by 2002:a4a:52cc:: with SMTP id d195mr309516oob.66.1626115250579;
Mon, 12 Jul 2021 11:40:50 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 11:40:50 -0700 (PDT)
In-Reply-To: <626c1606-026d-402e-940b-8d96ce60d9b2n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <8cdfdab4-79b1-453b-b473-8cb5b743e909n@googlegroups.com>
<626c1606-026d-402e-940b-8d96ce60d9b2n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4640781b-edc0-4a14-8192-d6b0732400ffn@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 12 Jul 2021 18:40:50 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Mon, 12 Jul 2021 18:40 UTC

On Monday, July 12, 2021 at 12:41:49 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 6:32:49 PM UTC+1, MitchAlsup wrote:
>
> > Does not have to write to a scalar register not does not write......
<
> ok then that's radically different, and yes, would allow scalars
> to be used as accumulators, i.e. treat them as first-class citizens.
>
> i did think it odd, i mean from a hazard perspective all that need be done
> is that the current vector element creates a write hazard on the scalar register.
<
VVM essentially created a duplicate scalar register for each iteration of a loop.
<
But, this illusion has to collapse when an exception is raised, so that the
scalar registers all appear to have the appropriate scalar values to the
debugger or exception handler. This is the only case where the scalar
registers HAVE to be written. And it is essential for the exception handlers
not to need a model for exception handling of vectors different from that
of scalars.
<
So, when an exception is raised, in an iteration, all of the registers appropriate
to that iteration are written, the exception control transfer is performed, the
exception is handled, and control returns to a scalar execution model. The
rest of the loop is run in scalar mode and when the LOOP instruction is
encountered, data from the Thread header is used to find the VEC instruction
and reinstall the vectorized loop into the execution window.
<
VVM supports precise exceptions as if in a scalar manner.
<
>
> l.

Re: Vector ISA Categorisation

<a0e2018b-9bae-4549-855b-50ff92cacbc2n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18650&group=comp.arch#18650

copy link Newsgroups: comp.arch

X-Received: by 2002:ac8:5f46:: with SMTP id y6mr361606qta.10.1626116493000;
Mon, 12 Jul 2021 12:01:33 -0700 (PDT)
X-Received: by 2002:aca:5cd7:: with SMTP id q206mr171687oib.99.1626116492761;
Mon, 12 Jul 2021 12:01:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 12:01:32 -0700 (PDT)
In-Reply-To: <999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a0e2018b-9bae-4549-855b-50ff92cacbc2n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 12 Jul 2021 19:01:32 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: MitchAlsup - Mon, 12 Jul 2021 19:01 UTC

On Monday, July 12, 2021 at 1:30:30 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 5:34:20 PM UTC+1, Marcus wrote:
> > On 2021-07-12, luke.l...@gmail.com wrote:
>
> > > you may both be fascinated to know that in SVP64 that loop increment
> > > number is put through a hardware "REMAPper" (actually, up to 4 such
> > > REMAPpers, one for each src and dest register), such that Horizontal
> > > Reduction, in-place Butterfly and Matrix Multiply schedules may be
> > > performed. in the case of MMult it's *five instructions*, two of which
> > > are needed to zero the result registers. if that's not needed then it's
> > > only 3 instructions for a full arbitrary-sized *in-place* matrix multiply.
> > >
> > That's impressive, and I'll have to look into it. So are you saying that
> > the REMAP will create controlled strides into the register file
> > (indexing vector elements)?
> yes. a "normal" vector ISA would:
>
> * compute a series of offset indices with actual code
> * perform either an INDEXED MV (reg[reg[x]] = reg[y]) or an INDEXED LD
> * compute the stuff sequentially
> * perform yet more INDEXED MVs or LDs.
>
> expensive.
> > It sounds like it could be similar to the
> > MRISC32 AGU (address generation unit) that can generate vector strides
> > against memory.
> an array of offsets? this is the "normal" way, and it costs both in terms of computing the offsets and in using memory for transfers.
>
> in the case of FFT butterfly that's 3 loops with 3 indices/offsets, none of which are entirely linear.
>
> in the case of Matrix Multiply, again it's 3 different indices.
>
> with FFT and MM being used for such a ridiculous number of computer science algorithms including Reed Solomon (NTT, the Number Theory version of DFT) i figured, "what the heck, go for it"
<
Agreed, but I would phrase the situation differently. MM and FFT have "difficult" addressing
of memory, and if/when your architecture handles these AGEN problems with succinctness
and efficiency, not many other serious memory addressing problems will be encountered.
<
> > > MRISC32, SVP64 and RVV because they are all "Horizontal",
> > > the instructions could (hypothetically in MRISC32's case?) use
> > > a scalar as a source or destination (VEXTRACT, VSPLAT, even
> > > arithmetic).
> > Yes the MRISC32 can feed ("splat") a scalar onto a vector operation. The
> > scalar can either be a scalar register (R0..R31) or a 15-bit immediate
> > value (usually using my I15HL encoding - so either a sign-extended or a
> > sticky-left-shifted 14-bit value).
<
The VEC instruction identifies the registers with carry iteration to iteration dependencies
{Mostly only the index register}. A register used as a destination in a vectorized loop
is a vector register unless it was identified as a loop carrier, otherwise the operand
registers become scalar. So the registers in a loop are categorized into 3 {Scalar,
Vector, Carrier}. Scalars are read once, vectors are forward only, Carriers create
iteration to iteration dependencies. This scheme provides the data flow analysis
to run the loop SIMD-style while obeying scalar ordering rules and run at vector
rates.
<
In My 66000 ISA, VVM watches the registers as the loop is installed into the execution
window. Scalar registers are read from the RF and installed in the reservation stations
ONCE, hereafter they remain, not waiting for a dependency from iteration to iteration.
<
> nice. in SVP64 it's sv.addi RT.v, 0, #nnnn which is the pseudo op for load immediate, or if sv.addi RT.v, r9, #0 that would copy the sum (R9+0) into the whole vector.
>
> btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.
<
More of a symptom of the problem rather than a solution............
<
> > Not sure, but this sounds like my plan for MRISC32, i.e. that to the
> > software the register size ("max vector length") will typically be
> > longer than the EU width. So e.g. with a 16-element register size you
> > could have a 1-wide or 4-wide ALU (or something else) - the software
> > model is the same.
> yes that's the one. and if the MVL is huge then to be honest i can't see that it wouldn't be future proof.
> > Similarly the "units of transaction" and register
> > ports could be 1-wide, 2-wide, 4-wide, ..., etc, but to the software
> > the vector length would typically be longer.
> yehyeh, exactly.
> > Yes MRISC32 is mostly 1), ... but, well, it's is kind of a mix of
> > 1) and 2).
> >
> > Both FP and INT can live in either scalar registers (R0-R31) or vector
> > registers (V0-V31). So there are effectively two user visible register
> > files.
> urrr as long as the ISA can select either arbitrarily it shoukd be good :)
> > OTOH I included (optional) support for packed SIMD on the vector
> > element level, so that a single 32-bit vector element can be split into
> > 2x16 or 4x8, "true SIMD" style (ugly, but simple and fairly cost
> > effective, and occasionally useful).
> don't do it. really. ok, if you do, then do it as "sub-vectors", which map onto Vulkan vec2, vec3 and vec4, respectively, but fir god's sake don't do "actual packed SIMD"
>
> set VL=1 and SUBVL=3 yes.
>
> actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
<
A convert !!!!!!!!!!!!!!!!!!!!
<
> > And the same goes for the scalar
> > registers: each scalar register can be treated as 1x32, 2x16 or 4x8. In
> > other words, the packed SIMD stuff is not a function of the register
> > file or even the vector control - it's a function of each individual EU..
> yes. if it's hidden behind a Vector ISA Frontend, SIMD *ALUs* are incredibly valuable. caveat: *predicated* SIMD ALUs are useful.
> > Note: In a future MRISC64 ISA, where all scalar registers and vector
> > elements are 64 bits wide, the packed SIMD functionality will be very
> > similar to MMX.
> unless it's part of the Vector ISA (SUBVL=1/2/3/4) you will regret it, i promise.
> >
> > I'm not sure that this is a good idea, but so far it's been quite nice
> > to occasionally be able to do 4x8 integer arithmetic or 2x16 FP
> > arithmetic on scalar registers, for instance, and the vector logic
> > should be much simpler than RVV for instance.
<
> RVV has Sub-Vectors. a single bit of predicate mask applies to each subvector.
<
My 66000 is allowed to use its predicate instructions inside loops which
acts like a vector of predicates (above) but expressed exactly the same in
scalar code as in vector code. No ISA additions needed to do this..............
<
>
> what RVV is missing which makes it Absolute Hell for 3D (Vulkan Shaders) is swizzle. swizzle is i think something like 30% of all 3D Shader LD/MV operations, it's enormous.
>
> (swizzle: a vec4 can be treated as SIMD BUT has lane-crossing, XYZW can be routed YYZX or ZYXW or any combination.)
<
Or even XXYY or XXXZ
>
> swizzle encoding however is... expensive. most GPU ISAs are 64 bit minimum.
<
And swizzle is a solution to the problem of packing all the data into a single register !
<
So, if you don't want to swizzle, don't put more than one unit of data in a register.
<
> > The only situations that I've come across where you may want to define
> > the vector length up front as a programmer are:
> >
> > 1) When you're going to do horizontal operations with the help of
> > folding and it's likely that the data array size is < 8 X the machine
> > vector size (or so), in which case the folding steps should be unrolled
> > up-front for optimal performance. If you're working on arrays with 1000+
> > elements - just go ahead and code up the folding steps in regular
> > dynamic loop at the end.
> this is why NEC's SX-Aurora just went with it and added Horizontal add, mul, or, and, xor etc.
> > 2) When you have natural data packet sizes (say, interleaved
> > N-dimensional components such as [X,Y,Z,W,R,G,B,A] or a 4x4 matrix).
> and RADIX-2 butterfly, or fixed width DCT, yes. hence why we went with explicit VL as well in SVP64.
> > 3) When you want to do control flow based on vector content, e.g.
> > checking the end-of-iteration criterion for a Mandelbrot iteration, in
> > which case you may not want to spend too much time iterating on a wide
> > vector when all elements but one satisfy your criterion.
> ah for that we added "Data-dependent Fail-First".
>
> using Condition Codes testing in each element, if the test fails the *Vector Length is automatically truncated*. subsequent operations take place *at that altered length*.
<
Any transfer of control outside of a VVM loop terminates the rest of the loop.
>
> Vertical-first Vector ISAs of course do not have to play this trick, they can just exit the loop at the first failed element condition.
>
> look up SVE "Fail-first" and RVV "Speculative Load". i adapted that and applied it to data.
> > In most other situations you're better off relying on the automatic
> > vector length mechanisms provided by the ISA - and your code should
> > automatically adapt to longer vector registers and be able to scale
> > perfectly.
> yes. except for strncpy or saturated audio processing this is fine. then the assumptions all go to hell. efforts to fix that inside the loop just slow the loop down. reduce the vector size likewise.
> > >
> > > when going route (2) *then* you have some quite fascinating
> > > properties, because you have to have an additional argument
> > > to the "Set Vector Length" instruction (or other "harware config"
> > > instruction) which defines exactly *how much* of the underlying
> > > Scalar regfile is to be allocated to
> > > registers-that-happen-to-be-numbered-as-Vectors.
> > >
> > > *here* you can do tricks such as Horizontal operations through
> > > careful scheduling of the size of individual Vector registers.
> > > you can first define the Vector Registers to be one size
> > > and location, perform the first parallel suite of Horizontal
> > > Reduction. then *redefine* the Vector Registers to be another
> > > that is directly suited to the second level of Horizontal Reduction
> > > and so on, a la mapreduce.
> > Doesn't that set limitations to how wide your register ports can be
> > (i.e. how many Vector Elements you can read/write at once from/to your
> > register file)? In 1) it's fairly trivial to e.g. let each vector
> > register port be multiple elements wide (e.g. 4 elements per port).
> we will be doing stratified regfiles. 0-31 are "normal", 32-127 are 4 way interleaved.
>
> then we will add a massive cyclic shift register which, if you make the mistake of issuing an op "ADD r32 r33 r34" Instead.of "ADD r32 r36 r40" (note same modulo 4 there) the operands drop into the cyclic buffer and get shunted to an ALU that can cope. increases latency, we don't mind that.
> > I have toyed with, but rejected, ideas about treating the scalar
> > register file as a vector (e.g. transfer a range of one vector register
> > to a subset of the scalar registers, and vice versa).
> you have to have reaally thought about it.
> > > what we've done in SVP64 is to define a fixed (abstracted) schedule,
> > > where the operations *will* be carried out in a set Tree-Reduce
> > > order. reg[n] = OPERATION(reg[m], reg[p]) where n m and p each
> > > vary according to the length of the Vector (no need for a fixed
> > > power-of-two).
> > >
> > Ok - need to look into that. Given how important vector permutation
> > operations are to some algorithms, this sounds like it could be quite
> > a powerful feature.
> yes. we require the destination to be a Vector, which MUST contain the intermediate partial results. not "may", MUST. this allows precise interrupts to occur in the middle.
>
> l.

Click here to read the complete article

On Monday, July 12, 2021 at 2:03:59 PM UTC-4, Stephen Fuld wrote:
[snip]
> Contrast this with IBM's experience with the AS/400 (now iSeries or
> something else). For those who aren't familiar with this issue, AS/400
> was IBM's successor to their System 38. Programs written for the S/38
> were compiled into an intermediate form. This was then translated into
> machine language, but the executable file still contained the
> intermediate form. The AS/400 had a different ISA than the S/38. So
> when a program compiled for a S/38 was first run on AS/400 ISA, it was
> transparently retranslated into the native AS/400 ISA.

This is significantly different in circumstances; "nobody got fired for
buying IBM" cf. "Mill Computing? What is that?" and established base
customer already using the system cf. to new customers switching
from a conventional system.

> According to the AS/400 book, some customers complained that their brand
> new, supposedly faster AS/400, ran their programs slower than the older
> S/38, customer support told then to run it again, and of course it ran
> faster so they were happy. They didn't even realize that the first time
> it was run, they were incurring the cost of the retranslation!

Interface lock-in also tends to be more profitable. Although iSeries is
probably effectively a proprietary system (i.e., no one could produce a
system running iSeries software even if such should be legal), IBM still
seems to prefer POWER and zArch (e.g., not encouraging a move to
iSeries).

Intel/AMD (x86) and Microsoft (Windows/DOS) may be extreme
examples of the profitability of interface lock-in, but I do not think
the principle is inaccurate. I do not know how architecturally
neutral Mill Computing's genAsm is (such that success could
quickly draw competitors using such for software compatibility
without violating Mill architecture patents), but I suspect it is too
generic to provide strong lock-in. (Huge profitability attracts
investors [as potential] and consumers [fame/name recognition
and "too big to fail"]. Playing fair much less being honest and
generous seems not to be well-rewarded in the business world.)

Maybe I am getting crotchety and cynical. Maybe well-considered
systematic design will overcome the worse-is-better tendency
and long-term total-cost (even including externalities) will be used
for making decisions on a significant scale.

Re: Vector ISA Categorisation

<80be8c80-109d-4b85-9822-68fa19ee17ffn@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18652&group=comp.arch#18652

copy link Newsgroups: comp.arch

X-Received: by 2002:ad4:4bca:: with SMTP id l10mr605715qvw.50.1626118217549;
Mon, 12 Jul 2021 12:30:17 -0700 (PDT)
X-Received: by 2002:a9d:4c9a:: with SMTP id m26mr466183otf.110.1626118217285;
Mon, 12 Jul 2021 12:30:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 12:30:17 -0700 (PDT)
In-Reply-To: <a0e2018b-9bae-4549-855b-50ff92cacbc2n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.176.255; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.176.255
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com> <a0e2018b-9bae-4549-855b-50ff92cacbc2n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <80be8c80-109d-4b85-9822-68fa19ee17ffn@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Mon, 12 Jul 2021 19:30:17 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: luke.l...@gmail.com - Mon, 12 Jul 2021 19:30 UTC

On Monday, July 12, 2021 at 8:01:34 PM UTC+1, MitchAlsup wrote:

> Agreed, but I would phrase the situation differently. MM and FFT have "difficult" addressing
> of memory, and if/when your architecture handles these AGEN problems with succinctness
> and efficiency, not many other serious memory addressing problems will be encountered.

i also added an explicit "bitreverse" unit stride LD mode. again this was one of those "oink, what on earth are standard ISAs doing" moments. it made a dog's dinner mess of the ISA but i feel it's worth it to be able to do DFT and DCT in around 10 instructions.

> The VEC instruction identifies the registers with carry iteration to iteration dependencies
> {Mostly only the index register}. A register used as a destination in a vectorized loop
> is a vector register unless it was identified as a loop carrier, otherwise the operand
> registers become scalar.

my immediate reaction is that it may be much simpler, by using standard reg hazard tracking, treating all Vector elements as multi issue scalars. this *should* allow the situation to be reassessed after a write scalar is encountered, carrying on parallel execution.

where it gets complex is identifying the number of elements that are permitted to be run simultaneously. nominally (the fallback) is one: all are scalar in effect.

however i have an idea: an architectural "hint" instruction which the programmer tells the hardware *how many* elements may be run Horizontally without violating hazards. hardware may run *up to* that limit but not exceed it..

if the limit is set "unlimited" then fascinatingly it says "all elements may be processed vertically and that by definition is the standard Cray-style Horizontal-first execution.

then it is not necessary for the hardware to assess the situation: the compiler has *told* you "up to N elrments can be computed in parallel".

> <
> In My 66000 ISA, VVM watches the registers as the loop is installed into the execution
> window. Scalar registers are read from the RF and installed in the reservation stations
> ONCE, hereafter they remain, not waiting for a dependency from iteration to iteration.
> <
> > nice. in SVP64 it's sv.addi RT.v, 0, #nnnn which is the pseudo op for load immediate, or if sv.addi RT.v, r9, #0 that would copy the sum (R9+0) into the whole vector.
> >
> > btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.
> <
> More of a symptom of the problem rather than a solution............

it may have been added for doing bit-reversed LOADs. iota folloqed by bitrev followed by INDEXED load. and i have seen it used to synthesise VSCATTER from VGATHER (in RVV).

> > actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
> <
> A convert !!!!!!!!!!!!!!!!!!!!

i knew the SIMD harmful article, but it wasn't until i compiled 2 line IAXPY with AVX512 and it generated 340 instructions i realised the true horror.

each widening only nakes things worse not better by increasing latency because the elements have to all complete before the next (entire) batch can proceed. in the case of SIMD LD/ST that translates to being far, far worse performance even than the most basic Vector engine.

> My 66000 is allowed to use its predicate instructions inside loops which
> acts like a vector of predicates (above) but expressed exactly the same in
> scalar code as in vector code. No ISA additions needed to do this..............

i do like this.

On 7/12/2021 9:37 AM, Paul A. Clayton wrote:
> On Friday, July 9, 2021 at 7:48:53 PM UTC-4, Ivan Godard wrote:
> [snip]
>> As for whether customers will be paranoid about gen/conAsm why would
>> they be? They seem to have no problem with (for example) Java VM code as
>> a distribution format, and a JIT producing binary across differing
>> binary ISAs. Mill is no different.
>
> First, you seem to be assuming that people are rational and willing
> to take the time and effort to examine a proposal reasonably. I
> suspect most decisions are not primarily data-driven.
>
> (Even a rational person must be aware that FUD — justified or not —
> will impact the adoption of a policy/technology, so irrational
> perceptions must be considered when there is any kind of network
> effect.)
>
> Second, Java bytecode is not adopted as a universal software
> distribution format (SDF). System software is typically not
> distributed in a "portable" format. Selective download (or fat
> binaries for physical media distribution) seem the common
> method of providing (rather limited) support of multiple
> machine binary formats.
>
> Third, I do not think Java bytecode has strong system-level
> support for caching compilation work. I do not know of any
> theoretical obstacle to a JVM persistently caching such, but
> I was under the impression that such was not common practice
> even for business servers with expected reuse of software. (I
> *think* Android's Davlik has some support for install-time
> compilation.) For servers, this would impact performance after
> restart but not typical performance (as the programs tend to
> remain active and the JVM especially so).

I use Java as an example expecting it would be familiar, but it is not
the only example. For another: the AS400 family distributed in a virtual
ISA (and still does I believe). There are others less well known.

> Fourth, the Mill incorporates several unconventional features
> (static scheduling, single address space, virtually addressed
> caches, etc.) in addition to using a higher-level SDF. While I
> perceive these as coordinated as a reasoned design, others
> might feel such is a sign of design by committee.
>
> (I disagree with the choice of static scheduling and feel that
> more attention to core-internal (L1-inward) and inter-core
> communication is desirable. However, I very much like the idea
> of a higher-level SDF; I view such as a reasonable work-caching
> choice compared to only having a choice between developer-
> friendly format (source code) or architecture-specific, micro-
> architecture-optimized format as well as providing some ability
> to more transparently work around hardware bugs. [For a
> conventional system, if a particular rarely used group of actions
> resulted in a timing violation, a typical fix might be to
> underclock the core. With a higher-level SDF, the bug might be
> averted by software construction.])

Our attention to core communication exists but is NYF.

That's one of the reasons that Mill cores do *much* more sanity checking
that is common. Also, the ISA precludes a great many of the situations
in which compilers for legacy architectures screw up: no function
prelude/postludes for example

> Even on a conventional serial, static-register based architecture,
> optimization can increase the difficulty for a human of mapping
> compiler-generated assembly back to source code.
>
> I suspect tools could help and formal proofs might comfort some
> more rational people. However, even a correct formal proof of correct
> translation of llvm front-end output (language definition) to genAsm
> and genAsm to each conAsm would not assure unexpected
> behavior whether llvm correctly handles the language definition but
> programmers assume a different interpretation and the difference
> does not manifest in conventional architectures or llvm mishandles
> the language but the bug does not manifest in conventional
> architectures. (I receive the impression that memory consistency
> has historically been associated with such miscommunication.)
>

That's a problem for any architecture, witness the at times violent
disagreements here over Undefined Behavior. :-)

Re: Vector ISA Categorisation

<5c992582-0cde-4dc2-8e80-556de4f8eb26n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18655&group=comp.arch#18655

copy link Newsgroups: comp.arch

X-Received: by 2002:ac8:6b0f:: with SMTP id w15mr534150qts.366.1626119741880;
Mon, 12 Jul 2021 12:55:41 -0700 (PDT)
X-Received: by 2002:a9d:ba9:: with SMTP id 38mr535500oth.276.1626119741540;
Mon, 12 Jul 2021 12:55:41 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 12 Jul 2021 12:55:41 -0700 (PDT)
In-Reply-To: <80be8c80-109d-4b85-9822-68fa19ee17ffn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2914:c98d:387d:bfef;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2914:c98d:387d:bfef
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb6vfb$1ov$1@dont-email.me>
<sb70q1$fsg$2@newsreader4.netcologne.de> <sb912k$c4c$1@dont-email.me>
<sb99gi$1r5$1@newsreader4.netcologne.de> <sbh665$sht$1@dont-email.me>
<sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me> <sc12qv$8ka$1@dont-email.me>
<sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me> <sc5fh8$p7q$1@dont-email.me>
<sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me> <sc9iib$3ei$1@dont-email.me>
<scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com> <a0e2018b-9bae-4549-855b-50ff92cacbc2n@googlegroups.com>
<80be8c80-109d-4b85-9822-68fa19ee17ffn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5c992582-0cde-4dc2-8e80-556de4f8eb26n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 12 Jul 2021 19:55:41 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: MitchAlsup - Mon, 12 Jul 2021 19:55 UTC

On Monday, July 12, 2021 at 2:30:18 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 8:01:34 PM UTC+1, MitchAlsup wrote:
>
> > Agreed, but I would phrase the situation differently. MM and FFT have "difficult" addressing
> > of memory, and if/when your architecture handles these AGEN problems with succinctness
> > and efficiency, not many other serious memory addressing problems will be encountered.
> i also added an explicit "bitreverse" unit stride LD mode. again this was one of those "oink, what on earth are standard ISAs doing" moments. it made a dog's dinner mess of the ISA but i feel it's worth it to be able to do DFT and DCT in around 10 instructions.
> > The VEC instruction identifies the registers with carry iteration to iteration dependencies
> > {Mostly only the index register}. A register used as a destination in a vectorized loop
> > is a vector register unless it was identified as a loop carrier, otherwise the operand
> > registers become scalar.
<
> my immediate reaction is that it may be much simpler, by using standard reg hazard tracking, treating all Vector elements as multi issue scalars. this *should* allow the situation to be reassessed after a write scalar is encountered, carrying on parallel execution.
>
> where it gets complex is identifying the number of elements that are permitted to be run simultaneously. nominally (the fallback) is one: all are scalar in effect.
<
But here, the same vector code runs when there is memory aliasing as when there is not::
<
for( i = 0; i < MAX, i++ )
a[i] = b[i]*x+a[i-j];
>
for any j.
<
Thus the compiler does not have to solve this memory aliasing problem in order to convert
the loop into vector form. When j ~IN( 0..MAX ) it runs at full vector speed. When j = 1, the
loop runs at the latency of the FMAC unit plus one cycle. SAME code.
<
> however i have an idea: an architectural "hint" instruction which the programmer tells the hardware *how many* elements may be run Horizontally without violating hazards. hardware may run *up to* that limit but not exceed it.
<
Mostly the programmer does not know this unit of data.
>
> if the limit is set "unlimited" then fascinatingly it says "all elements may be processed vertically and that by definition is the standard Cray-style Horizontal-first execution.
<
Which comes with the compiler HAVING to solve the "memory is not aliased" problem.
>
> then it is not necessary for the hardware to assess the situation: the compiler has *told* you "up to N elrments can be computed in parallel".
> > <
> > In My 66000 ISA, VVM watches the registers as the loop is installed into the execution
> > window. Scalar registers are read from the RF and installed in the reservation stations
> > ONCE, hereafter they remain, not waiting for a dependency from iteration to iteration.
> > <
> > > nice. in SVP64 it's sv.addi RT.v, 0, #nnnn which is the pseudo op for load immediate, or if sv.addi RT.v, r9, #0 that would copy the sum (R9+0) into the whole vector.
> > >
> > > btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.
> > <
> > More of a symptom of the problem rather than a solution............
> it may have been added for doing bit-reversed LOADs. iota folloqed by bitrev followed by INDEXED load. and i have seen it used to synthesise VSCATTER from VGATHER (in RVV).
> > > actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
> > <
> > A convert !!!!!!!!!!!!!!!!!!!!
> i knew the SIMD harmful article, but it wasn't until i compiled 2 line IAXPY with AVX512 and it generated 340 instructions i realised the true horror.
>
> each widening only nakes things worse not better by increasing latency because the elements have to all complete before the next (entire) batch can proceed. in the case of SIMD LD/ST that translates to being far, far worse performance even than the most basic Vector engine.
> > My 66000 is allowed to use its predicate instructions inside loops which
> > acts like a vector of predicates (above) but expressed exactly the same in
> > scalar code as in vector code. No ISA additions needed to do this..............
> i do like this.
>
> l.

On 2021-07-12, luke.l...@gmail.com wrote:
> On Monday, July 12, 2021 at 5:34:20 PM UTC+1, Marcus wrote:
>> On 2021-07-12, luke.l...@gmail.com wrote:
>

[snip]

>
> in the case of FFT butterfly that's 3 loops with 3 indices/offsets, none of which are entirely linear.
>
> in the case of Matrix Multiply, again it's 3 different indices.
>
> with FFT and MM being used for such a ridiculous number of computer science algorithms including Reed Solomon (NTT, the Number Theory version of DFT) i figured, "what the heck, go for it"
>

Yes I know. I really have not gotten around to MM and FFT yet, but when
I do I'm sure I'll have to rethink (or add) some things.

>>> MRISC32, SVP64 and RVV because they are all "Horizontal",
>>> the instructions could (hypothetically in MRISC32's case?) use
>>> a scalar as a source or destination (VEXTRACT, VSPLAT, even
>>> arithmetic).
>> Yes the MRISC32 can feed ("splat") a scalar onto a vector operation. The
>> scalar can either be a scalar register (R0..R31) or a 15-bit immediate
>> value (usually using my I15HL encoding - so either a sign-extended or a
>> sticky-left-shifted 14-bit value).
>
> nice. in SVP64 it's sv.addi RT.v, 0, #nnnn which is the pseudo op for load immediate, or if sv.addi RT.v, r9, #0 that would copy the sum (R9+0) into the whole vector.
>
> btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.

That's what I use LDEA (LoaD Effective Address) for:

LDEA V3, [Z, #1] ; V3[k] = 0 + 1 * k
LDEA V4, [R2, R3] ; V4[k] = R2 + R3 * k
ITOF V5, V3, Z ; V5 = [0.0F, 1.0F, 2.0F, ...]

>
>> Not sure, but this sounds like my plan for MRISC32, i.e. that to the
>> software the register size ("max vector length") will typically be
>> longer than the EU width. So e.g. with a 16-element register size you
>> could have a 1-wide or 4-wide ALU (or something else) - the software
>> model is the same.
>
> yes that's the one. and if the MVL is huge then to be honest i can't see that it wouldn't be future proof.

The ISA currently limits the MVL to 2^31 elements (which would mean that
each vector register would be larger than the addressable memory) ;-)

In other words, there's no upper limit except for the practical limits
imposed by the micro architecture.

>
>> Similarly the "units of transaction" and register
>> ports could be 1-wide, 2-wide, 4-wide, ..., etc, but to the software
>> the vector length would typically be longer.
>
> yehyeh, exactly.
>
>> Yes MRISC32 is mostly 1), ... but, well, it's is kind of a mix of
>> 1) and 2).
>>
>> Both FP and INT can live in either scalar registers (R0-R31) or vector
>> registers (V0-V31). So there are effectively two user visible register
>> files.
>
> urrr as long as the ISA can select either arbitrarily it shoukd be good :)
>
>> OTOH I included (optional) support for packed SIMD on the vector
>> element level, so that a single 32-bit vector element can be split into
>> 2x16 or 4x8, "true SIMD" style (ugly, but simple and fairly cost
>> effective, and occasionally useful).
>
> don't do it. really. ok, if you do, then do it as "sub-vectors", which map onto Vulkan vec2, vec3 and vec4, respectively, but fir god's sake don't do "actual packed SIMD"
>
> set VL=1 and SUBVL=3 yes.
>
> actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
>

Yeah yeah, I know. It's a poor man's solution to a couple of problems. I
knew that for small data types (byte, half-word) I wanted compute
throughput that equals data throughput, and I knew that I wanted to
support certain modulo/limiting arithmetic operations and FP arithmetic
operations for smaller data types (especially true for the envisioned
64-bit version of the ISA).

In my limited wisdom I figured that the way to do that is to have
flexible EU:s that can do 4x8-, 2x16- or 1x32-bit operations, and the
easiest way to handle that in my ISA was to expose that functionality
by adding a "type" field to the instruction word.

The downside is that you get all those pesky alignment & tail issues
that come with packed SIMD ISA:s, but on the upside you get rid of the
data hazard issues since you're doing true vector on top - and since the
packed SIMD width is fixed at 32 bits regardless of the MVL we're also
future-proof.

But I agree - packed SIMD is a PITA, hence I'm not very proud of this
solution.

>> And the same goes for the scalar
>> registers: each scalar register can be treated as 1x32, 2x16 or 4x8. In
>> other words, the packed SIMD stuff is not a function of the register
>> file or even the vector control - it's a function of each individual EU.
>
> yes. if it's hidden behind a Vector ISA Frontend, SIMD *ALUs* are incredibly valuable. caveat: *predicated* SIMD ALUs are useful.
>
>> Note: In a future MRISC64 ISA, where all scalar registers and vector
>> elements are 64 bits wide, the packed SIMD functionality will be very
>> similar to MMX.
>
> unless it's part of the Vector ISA (SUBVL=1/2/3/4) you will regret it, i promise.

I'll have to evaluate some of the possible solutions. I'm not sure if
I'm going to do anything about it in MRISC32, but for MRISC64 I'll do
some thinking.

Actually, funny story: I started the work on MRISC32 to learn more about
CPU micro architecture design (I only had basic theoretical knowledge
from a university course back in the 90's), and I was just about to call
my ISA "VRISC" (as in Vector/RISC) when I learned about RISC-V, and
quickly changed the name to "MRISC". Had I known about RISC-V before I
started my CPU project I would probably not have invented my own ISA. In
hindsight I'm glad I did not blindly go with RISC-V.

>
>>
>> I'm not sure that this is a good idea, but so far it's been quite nice
>> to occasionally be able to do 4x8 integer arithmetic or 2x16 FP
>> arithmetic on scalar registers, for instance, and the vector logic
>> should be much simpler than RVV for instance.
>
> RVV has Sub-Vectors. a single bit of predicate mask applies to each subvector.
>
> what RVV is missing which makes it Absolute Hell for 3D (Vulkan Shaders) is swizzle. swizzle is i think something like 30% of all 3D Shader LD/MV operations, it's enormous.
>
> (swizzle: a vec4 can be treated as SIMD BUT has lane-crossing, XYZW can be routed YYZX or ZYXW or any combination.)
>
> swizzle encoding however is... expensive. most GPU ISAs are 64 bit minimum.

Yes, I only have byte-swizzle at the word level (so RGBA32 -> BGRA32 is
trivial, for instance). For vectors I currently only have
scatter/gather (which is at least better than packed SIMD w/o lane-
crossing swizzle), but I may have to come up with a more powerful
solution.

>
>
>> The only situations that I've come across where you may want to define
>> the vector length up front as a programmer are:
>>
>> 1) When you're going to do horizontal operations with the help of
>> folding and it's likely that the data array size is < 8 X the machine
>> vector size (or so), in which case the folding steps should be unrolled
>> up-front for optimal performance. If you're working on arrays with 1000+
>> elements - just go ahead and code up the folding steps in regular
>> dynamic loop at the end.
>
> this is why NEC's SX-Aurora just went with it and added Horizontal add, mul, or, and, xor etc.

Currently there's nothing in my ISA that requires "lane-crossing", which
makes it very simple to implement, and I like to see how far I can take
it.

Had the ISA demanded a pure vector model (iterative rather than multi-
element), like early vector machines, some things would be quite simple
to add, like horizontal reduce of certain operations, or mask based
skipping of elements etc.

However since I want to support both serial and semi-parallel element
processing, I have been reluctant to add "lane-crossing magic" just yet.

>
>> 2) When you have natural data packet sizes (say, interleaved
>> N-dimensional components such as [X,Y,Z,W,R,G,B,A] or a 4x4 matrix).
>
> and RADIX-2 butterfly, or fixed width DCT, yes. hence why we went with explicit VL as well in SVP64.
>
>> 3) When you want to do control flow based on vector content, e.g.
>> checking the end-of-iteration criterion for a Mandelbrot iteration, in
>> which case you may not want to spend too much time iterating on a wide
>> vector when all elements but one satisfy your criterion.
>
> ah for that we added "Data-dependent Fail-First".
>
> using Condition Codes testing in each element, if the test fails the *Vector Length is automatically truncated*. subsequent operations take place *at that altered length*.
>
> Vertical-first Vector ISAs of course do not have to play this trick, they can just exit the loop at the first failed element condition.
>
> look up SVE "Fail-first" and RVV "Speculative Load". i adapted that and applied it to data.

Click here to read the complete article

Re: Vector ISA Categorisation

<fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18698&group=comp.arch#18698

copy link Newsgroups: comp.arch

X-Received: by 2002:ad4:4a03:: with SMTP id m3mr4522428qvz.33.1626177369121;
Tue, 13 Jul 2021 04:56:09 -0700 (PDT)
X-Received: by 2002:a9d:5f19:: with SMTP id f25mr3486865oti.206.1626177368863;
Tue, 13 Jul 2021 04:56:08 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jul 2021 04:56:08 -0700 (PDT)
In-Reply-To: <scjk6o$rme$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=217.147.94.29; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 217.147.94.29
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me>
<sc12qv$8ka$1@dont-email.me> <sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me>
<sc5fh8$p7q$1@dont-email.me> <sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me>
<sc9iib$3ei$1@dont-email.me> <scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com> <scjk6o$rme$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Tue, 13 Jul 2021 11:56:09 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 193

by: luke.l...@gmail.com - Tue, 13 Jul 2021 11:56 UTC

On Tuesday, July 13, 2021 at 9:51:38 AM UTC+1, mbitsnbites wrote:

> > btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.
> That's what I use LDEA (LoaD Effective Address) for:
>
> LDEA V3, [Z, #1] ; V3[k] = 0 + 1 * k
> LDEA V4, [R2, R3] ; V4[k] = R2 + R3 * k
> ITOF V5, V3, Z ; V5 = [0.0F, 1.0F, 2.0F, ...]

this is standard "Unit Stride" mode. what about when you need
to do a bit-reversed LD, for FFT? the usual way to do that in
"standard" Vector ISAs is:

SETVL 8 ; length of 8 (3 bits)
IOTA v0 ; 0 1 3 .... 6 7
BITREVERSE v0, #3 ; 3-bit here matches the VL 0b110 -> 0b011 etc
LDEA v3, [Z, v0 << #3] # load double-words from 0 3 1 2 4 6 5 7

and other wonderful awfulness

> > actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
> >
> Yeah yeah, I know. It's a poor man's solution to a couple of problems. I
> knew that for small data types (byte, half-word) I wanted compute
> throughput that equals data throughput, and I knew that I wanted to
> support certain modulo/limiting arithmetic operations and FP arithmetic
> operations for smaller data types (especially true for the envisioned
> 64-bit version of the ISA).

this is perfectly fine and achievable... *WITHOUT* exposing the f***-wittery
that is SIMD to the actual users of the ISA.

* Predicated SIMD backends *WITHOUT* having a front-end ISA for users to stab themselves in the head with
* byte-level Write-Enable lines on the regfile (like there is on SRAMs, the regfile *being* SRAMs this is a no-brainer)

then:

- byte-level Vector instructions send bit-level predicate masks
through the predicated SIMD backend ALUs and on to the
regfiles.
- hword-level Vector instructions *DOUBLE-UP* the bit-level
predicate masks through the predicated SIMD backends
and pass TWO bits per element to the regfiles
- word-level Vector instructions send FOUR copies of the
per-element predicate mask through the SIMD backends
and pass FOUR bits per element to the write-enable lines
on the regfile
- dword sends 8 copies...

you get the idea.

> In my limited wisdom I figured that the way to do that is to have
> flexible EU:s that can do 4x8-, 2x16- or 1x32-bit operations, and the
> easiest way to handle that in my ISA was to expose that functionality
> by adding a "type" field to the instruction word.

which is perfectly fine, normal, and standard fare. except, please,
for your own sanity make that type field part of the *Vector* ISA
(controlled by VL) *NOT* go "hmm i have to provide people with
a way to stab themselves in the head".

seriously: read this article
https://www.sigarch.org/simd-instructions-considered-harmful/

if its significance does not sink in the first time, read it again and
again and again until it does.

all you need to do to "complete" the Vector ISA and not expose
the end-user to endless opportunities for self harm is:

* create an automatic Predicate Mask based on VL:
automatic_mask = (1<<VL)-1
* AND that with the masks going into the Predicate-aware SIMD
back-end ALUs

that's it.

that's all you need to do as a "first implementation".

more complex implementations you want some in-flight
buffers to analyse situations like this:

SETVL 7
mask = 0b00010110
ADDbyte v0, v1, v2, mask
BITNEG mask
ADDbyte v0, v3, v4, mask

here it should be clear that without "merging" of those
two adds, the byte-level Vector ADDs take 2 cycles,
each back-end SIMD ALU running empty in *EXACTLY*
the opposing spots.

it should be obvious therefore that it is possible to
merge the two ADDs, based on the mask being bit-level
inverted, into one single back-end SIMD ADD that runs
at 100% capacity.

> The downside is that you get all those pesky alignment & tail issues
> that come with packed SIMD ISA:s, but on the upside you get rid of the
> data hazard issues since you're doing true vector on top - and since the
> packed SIMD width is fixed at 32 bits regardless of the MVL we're also
> future-proof.

now all you need to do is do the cost analysis "is poisoning the ISA
and users irrevocably with packed SIMD worth it".

> > swizzle encoding however is... expensive. most GPU ISAs are 64 bit minimum.
> Yes, I only have byte-swizzle at the word level (so RGBA32 -> BGRA32 is
> trivial, for instance). For vectors I currently only have
> scatter/gather (which is at least better than packed SIMD w/o lane-
> crossing swizzle), but I may have to come up with a more powerful
> solution.

the "least cost" one is to have a MV.swizzle which takes a single
swizzle argument. full swizzle immediates for vec2/3/4 are *twelve*
(12) bytes. if you have 2 (one for src1, one for src2)
that's *twenty four* bit immediates.

on every single arithmetic operation.

oink.

> > look up SVE "Fail-first" and RVV "Speculative Load". i adapted that and applied it to data.
> Will do. Sounds to me like you'd have to do something akin to a permute
> operation though

not for FFIRST.LOAD, no.

> > yes. except for strncpy or saturated audio processing this is fine. then the assumptions all go to hell. efforts to fix that inside the loop just slow the loop down. reduce the vector size likewise.
> In what way is saturated audio processing a problem. Are you referring
> to IIR (feedback) filters?

audio clipping. let's say you want a highly optimised loop which
multiplies by a volume, but if it clips you want auto-adaptive volume.

the "usual" way would be to have a test inside the loop "is volume exceeded"
which of course slows the loop down

data-dependent condition codes would *truncate* the parallel processing
to the first failed (saturated) audio sample, and this can be detected
*outside* of the inner (element-based) parallel loop.

> > we will be doing stratified regfiles. 0-31 are "normal", 32-127 are 4 way interleaved.
> When you say "interleaved", I assume you mean that you're using four
> independently accessible register banks.

* QTY 4of 4R1W regfiles.
* read-ports
* read-level optional cyclic buffer
* QTY4 independent ALUs
* write-level optional cyclic buffer
* write-ports

any operation which does not align modulo 4 will end up with
some latency as it gets shunted into the cyclic buffer to
"align" with an ALU that can deal with that.

no need for massive 4x4 64-bit crossbars.

no need for 12R10W regfiles.

> > then we will add a massive cyclic shift register which, if you make the mistake of issuing an op "ADD r32 r33 r34" Instead.of "ADD r32 r36 r40" (note same modulo 4 there) the operands drop into the cyclic buffer and get shunted to an ALU that can cope. increases latency, we don't mind that.
> Will it not be a pain for the compiler register allocator to keep track
> of these rules to produce optimal code?

yep. tough titty. we're not going to be doing massive 128-entry 12R10W regfiles.
that's the other choice. does QTY 2 128-entry 64-bit regfiles with 12 read and
10 write ports sound like it's a good idea?

On 2021-07-13, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 9:51:38 AM UTC+1, mbitsnbites wrote:
>
>>> btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.
>> That's what I use LDEA (LoaD Effective Address) for:
>>
>> LDEA V3, [Z, #1] ; V3[k] = 0 + 1 * k
>> LDEA V4, [R2, R3] ; V4[k] = R2 + R3 * k
>> ITOF V5, V3, Z ; V5 = [0.0F, 1.0F, 2.0F, ...]
>
> this is standard "Unit Stride" mode. what about when you need
> to do a bit-reversed LD, for FFT? the usual way to do that in
> "standard" Vector ISAs is:
>
> SETVL 8 ; length of 8 (3 bits)
> IOTA v0 ; 0 1 3 .... 6 7
> BITREVERSE v0, #3 ; 3-bit here matches the VL 0b110 -> 0b011 etc
> LDEA v3, [Z, v0 << #3] # load double-words from 0 3 1 2 4 6 5 7
>
> and other wonderful awfulness

Yes, in MRISC32 you'd currently have to do:

LDI VL, #8
LDEA V1, [Z, #1] ; [0,1,2,...,7]
REV V1, V1 ; Bit reverse word
LSR V1, V1, #32-3 ; [0,4,2,...,7]
LDW V2, [R1, V1*4] ; Load words from R1 + [0,4,2,...,7]

Kinda clunky.

Alternative 1: Prepare the bit-reversed offset vector (3 instructions)
outside of your hot loop(s), in which case I wouldn't worry too much.

Alternative 2: Put the offset array in a data constant and load from
memory (preferably outside of hot loops). Would require 8 bytes of
constant data in this case.

As I said, I have yet to try to implement an optimal FFT or MM yet,
so I haven't analyzed the bottlenecks yet.

>
>>> actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
>>>
>> Yeah yeah, I know. It's a poor man's solution to a couple of problems. I
>> knew that for small data types (byte, half-word) I wanted compute
>> throughput that equals data throughput, and I knew that I wanted to
>> support certain modulo/limiting arithmetic operations and FP arithmetic
>> operations for smaller data types (especially true for the envisioned
>> 64-bit version of the ISA).
>
> this is perfectly fine and achievable... *WITHOUT* exposing the f***-wittery
> that is SIMD to the actual users of the ISA.
>
> * Predicated SIMD backends *WITHOUT* having a front-end ISA for users to stab themselves in the head with
> * byte-level Write-Enable lines on the regfile (like there is on SRAMs, the regfile *being* SRAMs this is a no-brainer)
>
> then:
>
> - byte-level Vector instructions send bit-level predicate masks
> through the predicated SIMD backend ALUs and on to the
> regfiles.
> - hword-level Vector instructions *DOUBLE-UP* the bit-level
> predicate masks through the predicated SIMD backends
> and pass TWO bits per element to the regfiles
> - word-level Vector instructions send FOUR copies of the
> per-element predicate mask through the SIMD backends
> and pass FOUR bits per element to the write-enable lines
> on the regfile
> - dword sends 8 copies...
>
> you get the idea.

Does not sound like rocket science.

>
>> In my limited wisdom I figured that the way to do that is to have
>> flexible EU:s that can do 4x8-, 2x16- or 1x32-bit operations, and the
>> easiest way to handle that in my ISA was to expose that functionality
>> by adding a "type" field to the instruction word.
>
> which is perfectly fine, normal, and standard fare. except, please,
> for your own sanity make that type field part of the *Vector* ISA
> (controlled by VL) *NOT* go "hmm i have to provide people with
> a way to stab themselves in the head".
>
> seriously: read this article
> https://www.sigarch.org/simd-instructions-considered-harmful/
>

I have. Multiple times. MRISC32 does SAXPY in fewer instructions than
RVV (because it has better addressing modes). I also have years of
experience writing ARM & x86 SIMD code, so I know the pain.

> if its significance does not sink in the first time, read it again and
> again and again until it does.
>
> all you need to do to "complete" the Vector ISA and not expose
> the end-user to endless opportunities for self harm is:
>
> * create an automatic Predicate Mask based on VL:
> automatic_mask = (1<<VL)-1
> * AND that with the masks going into the Predicate-aware SIMD
> back-end ALUs
>
> that's it.
>

Thanks. I will consider this at some point when I get around to it.

> that's all you need to do as a "first implementation".
>
> more complex implementations you want some in-flight
> buffers to analyse situations like this:
>
> SETVL 7
> mask = 0b00010110
> ADDbyte v0, v1, v2, mask
> BITNEG mask
> ADDbyte v0, v3, v4, mask
>
> here it should be clear that without "merging" of those
> two adds, the byte-level Vector ADDs take 2 cycles,
> each back-end SIMD ALU running empty in *EXACTLY*
> the opposing spots.
>
> it should be obvious therefore that it is possible to
> merge the two ADDs, based on the mask being bit-level
> inverted, into one single back-end SIMD ADD that runs
> at 100% capacity.
>
>> The downside is that you get all those pesky alignment & tail issues
>> that come with packed SIMD ISA:s, but on the upside you get rid of the
>> data hazard issues since you're doing true vector on top - and since the
>> packed SIMD width is fixed at 32 bits regardless of the MVL we're also
>> future-proof.
>
> now all you need to do is do the cost analysis "is poisoning the ISA
> and users irrevocably with packed SIMD worth it".
>
>>> swizzle encoding however is... expensive. most GPU ISAs are 64 bit minimum.
>> Yes, I only have byte-swizzle at the word level (so RGBA32 -> BGRA32 is
>> trivial, for instance). For vectors I currently only have
>> scatter/gather (which is at least better than packed SIMD w/o lane-
>> crossing swizzle), but I may have to come up with a more powerful
>> solution.
>
> the "least cost" one is to have a MV.swizzle which takes a single
> swizzle argument. full swizzle immediates for vec2/3/4 are *twelve*
> (12) bytes. if you have 2 (one for src1, one for src2)
> that's *twenty four* bit immediates.
>
> on every single arithmetic operation.
>
> oink.

I have given this some thought. My current SHUF instruction works well
in a 32-bit environment since my control word immediate is 13 bits,
which fits well in the 15-bit immediate field of the instruction.

For 64-bit (or more) you obviously need wider control words, and/or do
other tricks (like a not-fully-generic sizzle, specialized unary
instructions for common swizzle tasks, and simply pre-loading large
swizzle immediates into registers outside of hot loops). I have an
issue for MRISC64 where I ponder ideas:

https://github.com/mbitsnbites/mrisc64/issues/2

>
>>> look up SVE "Fail-first" and RVV "Speculative Load". i adapted that and applied it to data.
>> Will do. Sounds to me like you'd have to do something akin to a permute
>> operation though
>
> not for FFIRST.LOAD, no.
>
>>> yes. except for strncpy or saturated audio processing this is fine. then the assumptions all go to hell. efforts to fix that inside the loop just slow the loop down. reduce the vector size likewise.
>> In what way is saturated audio processing a problem. Are you referring
>> to IIR (feedback) filters?
>
> audio clipping. let's say you want a highly optimised loop which
> multiplies by a volume, but if it clips you want auto-adaptive volume.
>
> the "usual" way would be to have a test inside the loop "is volume exceeded"
> which of course slows the loop down
>

Ok got it.

> data-dependent condition codes would *truncate* the parallel processing
> to the first failed (saturated) audio sample, and this can be detected
> *outside* of the inner (element-based) parallel loop.
>
>>> we will be doing stratified regfiles. 0-31 are "normal", 32-127 are 4 way interleaved.
>> When you say "interleaved", I assume you mean that you're using four
>> independently accessible register banks.
>
> * QTY 4of 4R1W regfiles.
> * read-ports
> * read-level optional cyclic buffer
> * QTY4 independent ALUs
> * write-level optional cyclic buffer
> * write-ports
>
> any operation which does not align modulo 4 will end up with
> some latency as it gets shunted into the cyclic buffer to
> "align" with an ALU that can deal with that.
>
> no need for massive 4x4 64-bit crossbars.
>
> no need for 12R10W regfiles.
>
>>> then we will add a massive cyclic shift register which, if you make the mistake of issuing an op "ADD r32 r33 r34" Instead.of "ADD r32 r36 r40" (note same modulo 4 there) the operands drop into the cyclic buffer and get shunted to an ALU that can cope. increases latency, we don't mind that.
>> Will it not be a pain for the compiler register allocator to keep track
>> of these rules to produce optimal code?
>
> yep. tough titty. we're not going to be doing massive 128-entry 12R10W regfiles.
> that's the other choice. does QTY 2 128-entry 64-bit regfiles with 12 read and
> 10 write ports sound like it's a good idea?

Click here to read the complete article

Re: Vector ISA Categorisation

<863e6886-f580-4cce-aaef-ddf8d6baa4dfn@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18710&group=comp.arch#18710

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:620a:4004:: with SMTP id h4mr5135116qko.370.1626194380450;
Tue, 13 Jul 2021 09:39:40 -0700 (PDT)
X-Received: by 2002:aca:dbd6:: with SMTP id s205mr171555oig.155.1626194380195;
Tue, 13 Jul 2021 09:39:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jul 2021 09:39:39 -0700 (PDT)
In-Reply-To: <fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:5862:643b:ebd2:6621;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:5862:643b:ebd2:6621
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me>
<sc12qv$8ka$1@dont-email.me> <sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me>
<sc5fh8$p7q$1@dont-email.me> <sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me>
<sc9iib$3ei$1@dont-email.me> <scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com> <scjk6o$rme$1@dont-email.me>
<fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <863e6886-f580-4cce-aaef-ddf8d6baa4dfn@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 13 Jul 2021 16:39:40 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

by: MitchAlsup - Tue, 13 Jul 2021 16:39 UTC

On Tuesday, July 13, 2021 at 6:56:10 AM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, July 13, 2021 at 9:51:38 AM UTC+1, mbitsnbites wrote:
>
> > > btw i strongly suggest adding an "iota" instruction which puts the immedistes 0 1 2 3 .... into the vector. apparently this is very useful.
> > That's what I use LDEA (LoaD Effective Address) for:
> >
> > LDEA V3, [Z, #1] ; V3[k] = 0 + 1 * k
> > LDEA V4, [R2, R3] ; V4[k] = R2 + R3 * k
> > ITOF V5, V3, Z ; V5 = [0.0F, 1.0F, 2.0F, ...]
> this is standard "Unit Stride" mode. what about when you need
> to do a bit-reversed LD, for FFT? the usual way to do that in
> "standard" Vector ISAs is:
>
> SETVL 8 ; length of 8 (3 bits)
> IOTA v0 ; 0 1 3 .... 6 7
> BITREVERSE v0, #3 ; 3-bit here matches the VL 0b110 -> 0b011 etc
> LDEA v3, [Z, v0 << #3] # load double-words from 0 3 1 2 4 6 5 7
>
> and other wonderful awfulness
> > > actual explicit packed SIMD instructions, just, please, consider SIMD ISAs to be the programming equivalent of the worst kind of hell you could possibly inflict.
> > >
> > Yeah yeah, I know. It's a poor man's solution to a couple of problems. I
> > knew that for small data types (byte, half-word) I wanted compute
> > throughput that equals data throughput, and I knew that I wanted to
> > support certain modulo/limiting arithmetic operations and FP arithmetic
> > operations for smaller data types (especially true for the envisioned
> > 64-bit version of the ISA).
> this is perfectly fine and achievable... *WITHOUT* exposing the f***-wittery
> that is SIMD to the actual users of the ISA.
>
> * Predicated SIMD backends *WITHOUT* having a front-end ISA for users to stab themselves in the head with
> * byte-level Write-Enable lines on the regfile (like there is on SRAMs, the regfile *being* SRAMs this is a no-brainer)
>
> then:
>
> - byte-level Vector instructions send bit-level predicate masks
> through the predicated SIMD backend ALUs and on to the
> regfiles.
> - hword-level Vector instructions *DOUBLE-UP* the bit-level
> predicate masks through the predicated SIMD backends
> and pass TWO bits per element to the regfiles
> - word-level Vector instructions send FOUR copies of the
> per-element predicate mask through the SIMD backends
> and pass FOUR bits per element to the write-enable lines
> on the regfile
> - dword sends 8 copies...
>
> you get the idea.
> > In my limited wisdom I figured that the way to do that is to have
> > flexible EU:s that can do 4x8-, 2x16- or 1x32-bit operations, and the
> > easiest way to handle that in my ISA was to expose that functionality
> > by adding a "type" field to the instruction word.
> which is perfectly fine, normal, and standard fare. except, please,
> for your own sanity make that type field part of the *Vector* ISA
> (controlled by VL) *NOT* go "hmm i have to provide people with
> a way to stab themselves in the head".
>
> seriously: read this article
> https://www.sigarch.org/simd-instructions-considered-harmful/
>
> if its significance does not sink in the first time, read it again and
> again and again until it does.
>
> all you need to do to "complete" the Vector ISA and not expose
<
Instead of a vector ISA, My 66000 created two bookend instruction-modifiers
(VEC and LOOP) which define the HW semantics of the vectorized loop. The
instructions being vectorized are from the std ISA. So, we get complete
vectorization by adding exactly 2 instructions and never need any more.
<
> the end-user to endless opportunities for self harm is:
>
> * create an automatic Predicate Mask based on VL:
> automatic_mask = (1<<VL)-1
> * AND that with the masks going into the Predicate-aware SIMD
> back-end ALUs
>
> that's it.
>
> that's all you need to do as a "first implementation".
>
> more complex implementations you want some in-flight
> buffers to analyse situations like this:
>
> SETVL 7
> mask = 0b00010110
> ADDbyte v0, v1, v2, mask
> BITNEG mask
> ADDbyte v0, v3, v4, mask
>
> here it should be clear that without "merging" of those
> two adds, the byte-level Vector ADDs take 2 cycles,
> each back-end SIMD ALU running empty in *EXACTLY*
> the opposing spots.
>
> it should be obvious therefore that it is possible to
> merge the two ADDs, based on the mask being bit-level
> inverted, into one single back-end SIMD ADD that runs
> at 100% capacity.
> > The downside is that you get all those pesky alignment & tail issues
> > that come with packed SIMD ISA:s, but on the upside you get rid of the
> > data hazard issues since you're doing true vector on top - and since the
> > packed SIMD width is fixed at 32 bits regardless of the MVL we're also
> > future-proof.
> now all you need to do is do the cost analysis "is poisoning the ISA
> and users irrevocably with packed SIMD worth it".
> > > swizzle encoding however is... expensive. most GPU ISAs are 64 bit minimum.
> > Yes, I only have byte-swizzle at the word level (so RGBA32 -> BGRA32 is
> > trivial, for instance). For vectors I currently only have
> > scatter/gather (which is at least better than packed SIMD w/o lane-
> > crossing swizzle), but I may have to come up with a more powerful
> > solution.
> the "least cost" one is to have a MV.swizzle which takes a single
> swizzle argument. full swizzle immediates for vec2/3/4 are *twelve*
> (12) bytes. if you have 2 (one for src1, one for src2)
> that's *twenty four* bit immediates.
>
> on every single arithmetic operation.
<
Or you can avoid SIMD/SIMTing yourself into a corner and perform the swizzle
with register names.......
>
> oink.
> > > look up SVE "Fail-first" and RVV "Speculative Load". i adapted that and applied it to data.
> > Will do. Sounds to me like you'd have to do something akin to a permute
> > operation though
> not for FFIRST.LOAD, no.
> > > yes. except for strncpy or saturated audio processing this is fine. then the assumptions all go to hell. efforts to fix that inside the loop just slow the loop down. reduce the vector size likewise.
> > In what way is saturated audio processing a problem. Are you referring
> > to IIR (feedback) filters?
> audio clipping. let's say you want a highly optimised loop which
> multiplies by a volume, but if it clips you want auto-adaptive volume.
<
Multiplies by volume, then multiplies by the tone curve, and only at the end decides
to saturate or not.
>
> the "usual" way would be to have a test inside the loop "is volume exceeded"
> which of course slows the loop down
>
> data-dependent condition codes would *truncate* the parallel processing
> to the first failed (saturated) audio sample, and this can be detected
> *outside* of the inner (element-based) parallel loop.
> > > we will be doing stratified regfiles. 0-31 are "normal", 32-127 are 4 way interleaved.
> > When you say "interleaved", I assume you mean that you're using four
> > independently accessible register banks.
> * QTY 4of 4R1W regfiles.
> * read-ports
> * read-level optional cyclic buffer
> * QTY4 independent ALUs
> * write-level optional cyclic buffer
> * write-ports
>
> any operation which does not align modulo 4 will end up with
> some latency as it gets shunted into the cyclic buffer to
> "align" with an ALU that can deal with that.
>
> no need for massive 4x4 64-bit crossbars.
>
> no need for 12R10W regfiles.
> > > then we will add a massive cyclic shift register which, if you make the mistake of issuing an op "ADD r32 r33 r34" Instead.of "ADD r32 r36 r40" (note same modulo 4 there) the operands drop into the cyclic buffer and get shunted to an ALU that can cope. increases latency, we don't mind that.
> > Will it not be a pain for the compiler register allocator to keep track
> > of these rules to produce optimal code?
> yep. tough titty. we're not going to be doing massive 128-entry 12R10W regfiles.
> that's the other choice. does QTY 2 128-entry 64-bit regfiles with 12 read and
> 10 write ports sound like it's a good idea?
>
> l.

Re: Vector ISA Categorisation

<5deba2bb-fa46-43e7-a8f2-01bc5cffb519n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18719&group=comp.arch#18719

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:68c9:: with SMTP id d192mr6564123qkc.212.1626214549416;
Tue, 13 Jul 2021 15:15:49 -0700 (PDT)
X-Received: by 2002:a9d:4e0a:: with SMTP id p10mr4994166otf.329.1626214549261;
Tue, 13 Jul 2021 15:15:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.mixmin.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jul 2021 15:15:49 -0700 (PDT)
In-Reply-To: <863e6886-f580-4cce-aaef-ddf8d6baa4dfn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:3409:149f:aa05:d54e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:3409:149f:aa05:d54e
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me>
<sc12qv$8ka$1@dont-email.me> <sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me>
<sc5fh8$p7q$1@dont-email.me> <sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me>
<sc9iib$3ei$1@dont-email.me> <scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com> <scjk6o$rme$1@dont-email.me>
<fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com> <863e6886-f580-4cce-aaef-ddf8d6baa4dfn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5deba2bb-fa46-43e7-a8f2-01bc5cffb519n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Tue, 13 Jul 2021 22:15:49 +0000
Content-Type: text/plain; charset="UTF-8"

by: Quadibloc - Tue, 13 Jul 2021 22:15 UTC

On Tuesday, July 13, 2021 at 10:39:42 AM UTC-6, MitchAlsup wrote:

> Instead of a vector ISA, My 66000 created two bookend instruction-modifiers
> (VEC and LOOP) which define the HW semantics of the vectorized loop. The
> instructions being vectorized are from the std ISA. So, we get complete
> vectorization by adding exactly 2 instructions and never need any more.

I'm now working on laying groundwork for, at some later date, including something
vaguely resembling this in Concertina II.

http://www.quadibloc.com/arch/ct14int.htm

Since Concertina II is built around fetching 256-bit chunks of code, and decoding
the instructions in them in parallel, I had to add block header formats in which
it was possible to label instructions as "prefixed". Essentially, this would apply
to everything between VEC and LOOP. The idea is to delay decoding and execution
of the indicated instructions until the VEC instruction is processed, so that the
actions taken in response to the instructions can be modified.

Of course, in this particular case, only execution, and not decoding, is really
changed - unless one includes conversion to micro-ops as part of decoding,
though. Also, the mechanism is a general one.

There will be plenty of differences between what VVM will become after I get my
sticky paws on it than what it is in your design, I must confess. So if I give you credit
for its virtues, I will also naturally include a disclaimer that you are in no way to
blame for its flaws as I have changed it.

One I've noted: I will explicitly differentiate 'real registers' from 'dataflow nodes'
by allocating the first 1/4 of register specifier space to the former.

Another is that a branch instruction won't terminate the loop. If one is found between
my equivalents of VEC and LOOP, an illegal instruction exception will be thrown.

That's because I'm conceptualizing my version as building a dataflow machine
inside the CPU, and so branches don't belong. Of course, there is an exit condition,
but that will be put at the beginning of the sequence, along with the increment clause.

So the instruction you call VEC would get called something like BXDS (Build and
Execute Dataflow Sequence) and LOOP would just become END.

In other recent changes, I've made it possible to have a header that's only 16 bits
long, so as to return a block format with mostly 16-bit instructions for extra-compact
code; this also allowed adding a 48-bit header, so that the extended format instructions
with arbitrary mixing of 16, 32, 48, and 64-bit instructions no longer require a full
64 bits of overhead.

So both of these changes are aimed at allowing more compact code.

John Savard

Re: Vector ISA Categorisation

<f91f3a66-f4f6-4367-9c70-2d2ba5cf4a37n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=18721&group=comp.arch#18721

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:ad0d:: with SMTP id f13mr6440013qkm.453.1626215148090;
Tue, 13 Jul 2021 15:25:48 -0700 (PDT)
X-Received: by 2002:a05:6808:1313:: with SMTP id y19mr4850709oiv.37.1626215147908;
Tue, 13 Jul 2021 15:25:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 13 Jul 2021 15:25:47 -0700 (PDT)
In-Reply-To: <fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fa3c:a000:3409:149f:aa05:d54e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fa3c:a000:3409:149f:aa05:d54e
References: <sb6s70$dip$1@newsreader4.netcologne.de> <sb99gi$1r5$1@newsreader4.netcologne.de>
<sbh665$sht$1@dont-email.me> <sbubiu$unp$1@dont-email.me> <sbudg8$aje$1@dont-email.me>
<sc12qv$8ka$1@dont-email.me> <sc186o$gns$1@dont-email.me> <sc5cg5$a3p$1@dont-email.me>
<sc5fh8$p7q$1@dont-email.me> <sc8pjr$8ib$1@dont-email.me> <sc8uoc$2tc$1@dont-email.me>
<sc9iib$3ei$1@dont-email.me> <scac7e$ph4$1@dont-email.me> <63597d55-f5bd-42fc-bae3-38155d072128n@googlegroups.com>
<scan92$k7m$1@dont-email.me> <dc5c8894-e51d-46a8-b682-9784fb8ac205n@googlegroups.com>
<9020308c-08e6-4f4f-b29f-e4320c19b1c2n@googlegroups.com> <2336ffa3-df90-461a-a1cb-51147dfc504dn@googlegroups.com>
<9a596e40-0c21-4b4c-83b1-56c745dd199cn@googlegroups.com> <sceb52$b4t$1@dont-email.me>
<0e87d075-e620-4173-accc-e16e0adbba35n@googlegroups.com> <scgi37$u7v$1@dont-email.me>
<57a0784c-b114-460e-af96-9930e94441f3n@googlegroups.com> <schqua$k02$1@dont-email.me>
<999476a5-100e-49dc-9a06-4550a7c928f0n@googlegroups.com> <scjk6o$rme$1@dont-email.me>
<fbf6751b-6b88-4283-92ea-1fff4b7fe200n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f91f3a66-f4f6-4367-9c70-2d2ba5cf4a37n@googlegroups.com>
Subject: Re: Vector ISA Categorisation
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Tue, 13 Jul 2021 22:25:48 +0000
Content-Type: text/plain; charset="UTF-8"

by: Quadibloc - Tue, 13 Jul 2021 22:25 UTC

Subject	Author
Split register files	Thomas Koenig
Re: Split register files	Ivan Godard
Re: Split register files	Thomas Koenig
Re: Split register files	Brett
Re: Split register files	Thomas Koenig
Re: Split register files	Brett
Re: Split register files	Brett
Re: Split register files	Ivan Godard
Re: Split register files	Brett
Re: Split register files	Ivan Godard
Re: Split register files	Stefan Monnier
Re: Split register files	Ivan Godard
Re: Split register files	Stephen Fuld
Re: Split register files	Stefan Monnier
Rescue vs scratchpad (was: Split register files)	Stefan Monnier
Re: Rescue vs scratchpad (was: Split register files)	Ivan Godard
Re: Split register files	Brett
Re: Split register files	Ivan Godard
Re: Split register files	Brett
Re: Split register files	Ivan Godard
Re: Mill conAsm vs genAsm (was: Split register files)	Marcus
Re: Mill conAsm vs genAsm (was: Split register files)	Ivan Godard
Re: Mill conAsm vs genAsm (was: Split register files)	Quadibloc
Re: Mill conAsm vs genAsm (was: Split register files)	Ivan Godard
Re: Mill conAsm vs genAsm (was: Split register files)	MitchAlsup
Re: Mill conAsm vs genAsm (was: Split register files)	Quadibloc
Re: Mill conAsm vs genAsm (was: Split register files)	MitchAlsup
Re: Mill conAsm vs genAsm (was: Split register files)	Quadibloc
Re: Mill conAsm vs genAsm (was: Split register files)	Marcus
Re: Mill conAsm vs genAsm (was: Split register files)	Quadibloc
Re: Mill conAsm vs genAsm (was: Split register files)	Marcus
Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Stephen Fuld
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Stefan Monnier
Re: Vector ISA Categorisation	Stephen Fuld
Re: Vector ISA Categorisation	Marcus
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	mbitsnbites
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Marcus
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Ivan Godard
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Marcus
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Thomas Koenig
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	Ivan Godard
Re: Vector ISA Categorisation	Thomas Koenig
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	EricP
Re: Vector ISA Categorisation	Stefan Monnier
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	EricP
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Quadibloc
Re: Vector ISA Categorisation	Thomas Koenig
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Thomas Koenig
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	Terje Mathisen
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	MitchAlsup
Re: Vector ISA Categorisation	luke.l...@gmail.com
Re: Vector ISA Categorisation	MitchAlsup
Re: Mill conAsm vs genAsm (was: Split register files)	Quadibloc
Re: Mill conAsm vs genAsm (was: Split register files)	luke.l...@gmail.com
Re: Mill conAsm vs genAsm (was: Split register files)	Paul A. Clayton
Re: Mill conAsm vs genAsm	Stefan Monnier
Re: Split register files	Stefan Monnier
Re: Split register files	Thomas Koenig
Re: Split register files	John Dallman
Re: Split register files	Anton Ertl
Re: Split register files	Stefan Monnier
Re: Split register files	MitchAlsup