
devel / comp.arch / RISC-V vs. Aarch64

Subject (Author)
* RISC-V vs. Aarch64 (Anton Ertl)
+* Re: RISC-V vs. Aarch64 (MitchAlsup)
|+* Re: RISC-V vs. Aarch64 (Anton Ertl)
||`* Re: RISC-V vs. Aarch64 (MitchAlsup)
|| +- Re: RISC-V vs. Aarch64 (BGB)
|| `- Re: RISC-V vs. Aarch64 (Anton Ertl)
|+* Re: RISC-V vs. Aarch64 (Ivan Godard)
||+- Re: RISC-V vs. Aarch64 (robf...@gmail.com)
||+- Re: RISC-V vs. Aarch64 (MitchAlsup)
||`* Re: RISC-V vs. Aarch64 (Quadibloc)
|| `* Re: RISC-V vs. Aarch64 (Quadibloc)
||  `- Re: RISC-V vs. Aarch64 (Quadibloc)
|+* Re: RISC-V vs. Aarch64 (Marcus)
||+- Re: RISC-V vs. Aarch64 (BGB)
||`* Re: RISC-V vs. Aarch64 (MitchAlsup)
|| +- Re: RISC-V vs. Aarch64 (BGB)
|| `- Re: RISC-V vs. Aarch64 (Ivan Godard)
|`- Re: RISC-V vs. Aarch64 (MitchAlsup)
`* Re: RISC-V vs. Aarch64 (BGB)
 +* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |+- Re: RISC-V vs. Aarch64 (MitchAlsup)
 |+* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 ||+* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |||`* Re: RISC-V vs. Aarch64 (EricP)
 ||| `- Re: RISC-V vs. Aarch64 (Ivan Godard)
 ||+* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |||`* Re: RISC-V vs. Aarch64 (Ivan Godard)
 ||| `* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |||  `* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |||   `* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |||    `- Re: RISC-V vs. Aarch64 (Marcus)
 ||`* Re: RISC-V vs. Aarch64 (BGB)
 || `- Re: RISC-V vs. Aarch64 (MitchAlsup)
 |+* Re: RISC-V vs. Aarch64 (BGB)
 ||`* Re: RISC-V vs. Aarch64 (MitchAlsup)
 || `- Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |`* Re: RISC-V vs. Aarch64 (Marcus)
 | `* Re: RISC-V vs. Aarch64 (EricP)
 |  +* Re: RISC-V vs. Aarch64 (Marcus)
 |  |+* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  ||+* Re: RISC-V vs. Aarch64 (Niklas Holsti)
 |  |||+* Re: RISC-V vs. Aarch64 (Bill Findlay)
 |  ||||`- Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  |||`- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  ||`- Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  |+* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  ||+* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  |||`- Re: RISC-V vs. Aarch64 (BGB)
 |  ||+* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  |||`* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  ||| `- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  ||`* Re: RISC-V vs. Aarch64 (Marcus)
 |  || +* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  || |`* Re: RISC-V vs. Aarch64 (aph)
 |  || | +- Re: RISC-V vs. Aarch64 (Michael S)
 |  || | `* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  || |  `* Re: RISC-V vs. Aarch64 (robf...@gmail.com)
 |  || |   +* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |   |`- Re: RISC-V vs. Aarch64 (Tim Rentsch)
 |  || |   `* Re: RISC-V vs. Aarch64 (Terje Mathisen)
 |  || |    `* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  || |     `* Re: RISC-V vs. Aarch64 (Marcus)
 |  || |      `* Re: RISC-V vs. Aarch64 (Guillaume)
 |  || |       `* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        +- Re: RISC-V vs. Aarch64 (Marcus)
 |  || |        +* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |`* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        | `* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |  `* Re: RISC-V vs. Aarch64 (Thomas Koenig)
 |  || |        |   `* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |    `* Re: RISC-V vs. Aarch64 (EricP)
 |  || |        |     +* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        |     |`* Re: RISC-V vs. Aarch64 (EricP)
 |  || |        |     | `- Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        |     `* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |      `* Re: RISC-V vs. Aarch64 (EricP)
 |  || |        |       +- Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        |       `* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |        +* Re: RISC-V vs. Aarch64 (Brett)
 |  || |        |        |+* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        |        ||`- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |        |`- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |        `* Re: RISC-V vs. Aarch64 (Stephen Fuld)
 |  || |        |         `* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |          +* Re: RISC-V vs. Aarch64 (Stefan Monnier)
 |  || |        |          |`- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |          +* Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        |          |`* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |          | `- Re: RISC-V vs. Aarch64 (MitchAlsup)
 |  || |        |          +* Re: RISC-V vs. Aarch64 (Stephen Fuld)
 |  || |        |          |`- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |          `* Re: RISC-V vs. Aarch64 (EricP)
 |  || |        |           +* Re: RISC-V vs. Aarch64 (EricP)
 |  || |        |           |`* Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        |           | `* The type of Mill's belt's slots (Stefan Monnier)
 |  || |        |           |  +- Re: The type of Mill's belt's slots (MitchAlsup)
 |  || |        |           |  `* Re: The type of Mill's belt's slots (Ivan Godard)
 |  || |        |           |   `* Re: The type of Mill's belt's slots (Stefan Monnier)
 |  || |        |           |    `* Re: The type of Mill's belt's slots (Ivan Godard)
 |  || |        |           |     +* Re: The type of Mill's belt's slots (Stefan Monnier)
 |  || |        |           |     |`* Re: The type of Mill's belt's slots (Ivan Godard)
 |  || |        |           |     `* Re: The type of Mill's belt's slots (MitchAlsup)
 |  || |        |           `- Re: RISC-V vs. Aarch64 (Ivan Godard)
 |  || |        +* Re: RISC-V vs. Aarch64 (Guillaume)
 |  || |        `* Re: RISC-V vs. Aarch64 (Quadibloc)
 |  || `* MRISC32 vectorization (was: RISC-V vs. Aarch64) (Thomas Koenig)
 |  |`* Re: RISC-V vs. Aarch64 (Terje Mathisen)
 |  `- Re: RISC-V vs. Aarch64 (Quadibloc)
 +* Re: RISC-V vs. Aarch64 (Anton Ertl)
 `- Re: RISC-V vs. Aarch64 (aph)

RISC-V vs. Aarch64

<2021Dec24.163843@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=22397&group=comp.arch#22397
 by: Anton Ertl - Fri, 24 Dec 2021 15:38 UTC

I have recently (despite me) watched
<https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
compares some aspects of RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
<https://arxiv.org/pdf/1607.02318.pdf>

In this posting I look at the different approaches taken in A64 (aka
Aarch64) and RV64GC:

* RISC-V has simple instructions, e.g., with only one addressing mode
(reg+offset), and the base instruction set encodes each instruction
in 32 bits, but the C (compressed) extension adds a 16-bit
encoding for the most frequent instructions.

* A64 has fixed-length 32-bit instructions, but they can be more
complex: A64 has more addressing modes and additional instructions
like load pair and store pair; in particular, the A64 architects
seem to have had few concerns about the register read and write
ports needed per instruction; E.g., a store-pair instruction can
need four read ports, and a load pair instruction can need three
write ports (AFAIK).
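
To make the load/store-pair point concrete, here is a small C sketch (not
from the talk or the tech report): a 16-byte structure copy is the classic
case where A64 compilers typically emit a load-pair/store-pair, while
RV64GC needs two loads and two stores. The instructions in the comments
are typical compiler output, nothing mandated by either ISA.

  #include <stdint.h>
  #include <stdio.h>

  struct pair { uint64_t lo, hi; };

  /* On A64 this copy is typically one load pair + one store pair:
         ldp x8, x9, [x1]
         stp x8, x9, [x0]
     On RV64GC it is typically two ld and two sd instructions. */
  void copy_pair(struct pair *dst, const struct pair *src)
  {
      *dst = *src;
  }

  int main(void)
  {
      struct pair a = { 1, 2 }, b;
      copy_pair(&b, &a);
      printf("%llu %llu\n", (unsigned long long)b.lo, (unsigned long long)b.hi);
      return 0;
  }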

Celio argues that the instruction density of RV64GC is competitive,
and that the C extension is cheap to implement for a small
implementation.

It's pretty obvious that a small implementation of RV64G is smaller
than a small implementation of A64, and IIRC the cost of adding the C
extension to a small implementation of RV64G (to turn it into RV64GC)
is reported in the talk as being on the order of 700 transistors, so
still cheap. So you can get a small RV64GC cheaper than a small A64
implementation and have similar code density.

Celio also argues that instruction fusion can overcome the problem of

But I wonder how things turn out for larger implementations: Now
RV64GC needs twice as many decoders to decode, say, 32 bytes of
instructions, and then some additional hardware that selects which of
these decodes are valid (and of course all this extra hardware costs
energy).
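
To illustrate the boundary-selection problem, here is a throwaway C
sketch (not from the talk) of how instruction start points fall out of
the 16/32-bit encoding: a 16-bit parcel whose two low bits are 11 begins
a 32-bit instruction, anything else is a 16-bit C-extension instruction.
Note the serial dependence of each start on the previous one; a wide
decoder has to resolve that chain across the whole fetch block and then
discard the decodes that turn out not to start an instruction.

  #include <stdint.h>
  #include <stdio.h>

  /* Mark which 16-bit parcels of a fetch block start an instruction.
     In RV64GC a parcel whose low two bits are not 11 is a 16-bit
     instruction; low bits 11 start a 32-bit instruction (longer
     encodings are ignored here). */
  static void mark_starts(const uint16_t *parcel, int n, uint8_t *is_start)
  {
      int i = 0;
      while (i < n) {
          is_start[i] = 1;
          i += ((parcel[i] & 0x3) == 0x3) ? 2 : 1;
      }
  }

  int main(void)
  {
      /* c.addi x1,1 ; addi x1,x1,1 (32-bit, two parcels) ; c.nop */
      uint16_t fetch[4] = { 0x0085, 0x8093, 0x0010, 0x0001 };
      uint8_t start[4] = { 0 };
      mark_starts(fetch, 4, start);
      for (int i = 0; i < 4; i++)
          printf("parcel %d: %s\n", i, start[i] ? "start" : "continuation");
      return 0;
  }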

And with instruction fusion the end result of the decoding step
becomes even more complex.

OTOH, a high-performance A64 implementation probably wants some kind
of register port allocator (possibly with port requirements across
several cycles), so it has its own source of complexity. I wonder if
that's the reason why some high-performance ARMs have microcode
caches; I would normally have thought that microcode caches are
unnecessary for a fixed-length instruction format.

What do people more familiar with hardware design think?

A classic question is the classification of the architecture: Celio
claims that ARM is CISC (using an A32 Load-multiple instruction as
example). These questions are not decided by touchy-feely arguments,
but rather by instruction set properties. All RISCs are load/store
general-purpose instruction sets; even the A32 load/store-multiple
instructions can be considered a variant of that; one might argue that
accessing two pages in one instruction is not RISC, but by that
criterion RISC-V is not RISC (it allows page-crossing unaligned
accesses). One other common trait in RISCs is fixed-length 32-bit
instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
the C extension.

An interesting point in the talk is that zero-extension is a common
idiom in RISC-V; the talk does not explain this completely, but
apparently the 32-bit variants of instructions sign-extend (like Alpha
does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
values around in sign-extended form (including 32-bit unsigned
values).
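
A small C illustration of the consequence (just a sketch, not from the
talk): on RV64 a uint32_t argument arrives sign-extended in its 64-bit
register, so widening it to uint64_t needs an explicit zero-extension,
whereas on AMD64 and A64 the 32-bit operations have already cleared the
upper half. The code in the comments is typical compiler output, not
something the ABIs spell out instruction by instruction.

  #include <stdint.h>
  #include <stdio.h>

  /* The C semantics are identical everywhere; only the generated code
     differs. RV64 typically needs slli+srli (or zext.w with Zba) here,
     because the incoming uint32_t is held sign-extended; AMD64/A64
     typically need nothing extra. */
  uint64_t widen(uint32_t x)
  {
      return x;
  }

  int main(void)
  {
      uint32_t v = 0x80000000u;   /* looks negative as a 32-bit pattern */
      printf("%llu\n", (unsigned long long)widen(v));   /* 2147483648 */
      return 0;
  }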

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: RISC-V vs. Aarch64

<0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22400&group=comp.arch#22400
 by: MitchAlsup - Fri, 24 Dec 2021 18:17 UTC

On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
> I have recently (despite me) watched
> <https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
> compares some aspects RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
> AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
> <https://arxiv.org/pdf/1607.02318.pdf>
>
> In this posting I look at the different approaches taken in A64 (aka
> Aarch64) and RV64GC:
>
> * RISC-V has simple instructions, e.g., with only one addressing mode
> (reg+offset), and the base instruction set encodes each instruction
> in 32 bits, but with the C (compressed) extension adds a 16-bit
> encoding for the most frequent instructions.
>
> * A64 has fixed-length 32-bit instructions, but they can be more
> complex: A64 has more addressing modes and additional instructions
> like load pair and store pair; in particular, the A64 architects
> seem to have had few concerns about the register read and write
> ports needed per instruction; E.g., a store-pair instruction can
> need four read ports, and a load pair instruction can need three
> write ports (AFAIK).
>
> Celio argues that the instruction density of RV64GC is competetive,
> and that the C extension is cheap to implement for a small
> implementation.
<
My 66000 ISA has x86-64 instruction set density and is RISC with
a couple of additions--large constants (immediates, displacements,
or both).
>
> It's pretty obvious that a small implementation of RV64G is smaller
> than a small implementation of A64, and adding the C extension to a
> small implementation of RV64G (to turn it to RV64GC) is reported in
> the talk IIRC (it's on the order of 700 transistors, so still cheap),
> so you can get a small RV64GC cheaper than a small A64 implementation
> and have similar code density.
<
Once you have constructed the register file(s), the integer, memory, and
floating point units, the size of the fetcher and decoder is immaterial
(noise).
<
>
> Celio also argues that instruction fusion can overcome the problem of
<
Did you incompletely cut a paragraph ?
>
> But I wonder how things turn out for larger implementations: Now
> RV64GC needs twice as many decoders to decode, say, 32 bytes of
> instructions, and then some additional hardware that selects which of
> these decodes are valid (and of course all this extra hardware costs
> energy).
<
As I argue above, the size of decoders is immaterial. But here you are
using the word decoder where you should be using the word instruction
recognizer. A lot of the logic in a superScalar implementation in the
"decoder" area is flip-flops and wires. Once an instruction is recognized
and routed to where it needs to go, then there is a lot of pipeline setup
logic at least as big as the instruction recognizer.
>
> And with instruction fusion the end result of the decoding step
> becomes even more complex.
<
If the instruction fusing is only performed on back-to-back instructions
it is not very hard. If you break this rule, it becomes at least cubic and
at most NP-complete. So, don't break that rule.
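
A tiny C sketch of what the back-to-back rule buys (the opcode names and
fields below are made up for illustration, not My 66000 or RISC-V
encodings): restricting fusion to adjacent pairs keeps the check to one
linear pass over the decoded window.

  #include <stdio.h>

  /* Hypothetical decoded form: just enough fields to check that the
     second instruction of an adjacent pair consumes the first's result. */
  enum op { OP_SLLI, OP_ADD, OP_LD, OP_OTHER };
  struct insn { enum op op; int rd, rs1, rs2; };

  static int fusible_pair(const struct insn *a, const struct insn *b)
  {
      /* example rule: a shift-left feeding an add can become one shifted add */
      return a->op == OP_SLLI && b->op == OP_ADD &&
             (b->rs1 == a->rd || b->rs2 == a->rd);
  }

  int main(void)
  {
      struct insn win[3] = {
          { OP_SLLI, 5, 10, 0 },   /* r5 = r10 << k    */
          { OP_ADD,  5,  5, 11 },  /* r5 = r5 + r11    */
          { OP_LD,   6,  5, 0 },   /* r6 = mem[r5 + d] */
      };
      for (int i = 0; i + 1 < 3; i++)   /* one linear pass: adjacent pairs only */
          printf("insns %d+%d: %s\n", i, i + 1,
                 fusible_pair(&win[i], &win[i + 1]) ? "fusible" : "not fusible");
      return 0;
  }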
<
>
> OTOH, a high-performance A64 implementation probably wants some kind
> of register port allocator (possibly with port requirements across
> several cycles), so it has its own source of complexity. I wonder if
> that's the reason why some high-performance ARMs have microcode
> caches; I would normally have thought that microcode caches are
> unnecesary for a fixed-length instruction format.
>
> What do people more familiar with hardware design think?
<
½ of one, ½ of the other. You do what you think necessary to keep
the expensive HW occupied with work.
>
> A classic question is the classification of the architecture: Celio
> claims that ARM is CISC (using an A32 Load-multiple instruction as
> example). These questions are not decided by touchy-feely arguments,
> but rather by instruction set properties. All RISCs are load/store
> general-purpose instruction sets; even the A32 load/store-multiple
> instructions can be considered a variant of that; one might argue that
> accessing two pages in one instruction is not RISC, but by that
> criterion RISC-V is not RISC (it allows page-crossing unaligned
> accesses). One other common trait in RISCs is fixed-length 32-bit
> instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> the C extension.
<
a) misaligned access SHOULD be in every architecture; it is mandated by
some features of some languages. Aligned accesses are always higher
performing; but misaligned should not cost more than 1 extra cycle and
as few as zero extra cycles. Page crossings should be allowed--however
one can be in a situation where crossing of a memory boundary is
NOT ALLOWED {MTRR, different RWE on pages, wrap from positive to
negative,...}
<
b) It is a variant of LM and SM (Enter and EXIT) that makes My 66000
ISA size competitive with x86-64; taking what would have been dozens
of instructions and making them one. Since My 66000 has HW context
switch built in, there is already a path from the cache to the RF that
is multiple registers wide (minimum 4 registers). Might as well use
this path even in a 1-instruction-per-cycle implementation. Also note:
This path does not run through the LD/ST aligner.
<
My 66000 also contains switch instructions where the instruction
provides a register value of the switch index, the instruction contains
a limit to the table, and indexes the table feeding the result directly
to IP without data going through a register or the data cache. Should
the index be out-of-bounds control is delivered to default: !
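
For the software-minded, a rough C analogue of what such a switch
instruction does in one step (just a sketch, not the actual My 66000
semantics): bounds-check the index against the table limit, jump through
the table, and send out-of-range indices to the default case.

  #include <stdio.h>

  typedef void (*case_fn)(void);

  static void case0(void) { puts("case 0"); }
  static void case1(void) { puts("case 1"); }
  static void dflt(void)  { puts("default"); }

  static void do_switch(unsigned idx)
  {
      static const case_fn table[] = { case0, case1 };
      const unsigned limit = sizeof table / sizeof table[0];
      /* bounds check, then indirect jump; out-of-range goes to default */
      (idx < limit ? table[idx] : dflt)();
  }

  int main(void)
  {
      do_switch(1);   /* case 1  */
      do_switch(7);   /* default */
      return 0;
  }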
<
c) My 66000 has 32-bit fixed-length instruction-specifiers followed
by 1-4 doublewords of constants needed by the instruction. One can
<ahem> decode the instruction and derive pointers to these constants,
the number of the constants and a pointer to the next instruction in
30-total gates and 4-gates of delay.
<
There is NO JUSTIFIABLE reason that an instruction is not entirely
self-contained!
<
>
> An interesting point in the talk is that zero-extension is a common
> idiom in RISC-V; the talk does not explain this completely, but
> apparently the 32-bit variants of instructions sign-extend (like Alpha
> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> values around in sign-extended form (including 32-bit unsigned
> values).
<
My 66000 ISA has both ZE and SE plus the ability to smash a value
{which may contain more bits than its container} with ZE or SE.
<
My 66000 ABI passes everything around in 64-bit containers. The
filling of these containers performs SE or ZE as desired.
>
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: RISC-V vs. Aarch64

<2021Dec24.200638@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=22402&group=comp.arch#22402
 by: Anton Ertl - Fri, 24 Dec 2021 19:06 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>> Celio also argues that instruction fusion can overcome the problem of
>
>Did you incompletely cut a paragraph ?

Probably I switched to writing a different paragraph and then forgot
to return:-)

What I wanted to write:

|Celio also argues that instruction fusion can overcome the problem of
|latency chains of the simple instructions. E.g., he gives the
|example of an indexed load consisting of a shift, an add, and a load
|on RISC-V; instruction fusion can combine them into one instruction
|with the latency of one instruction.

But actually I don't think Celio discusses the latency directly; it's
just implicit in his talk.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: RISC-V vs. Aarch64

<67430066-07f1-4276-bcc9-c542eac2faefn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22403&group=comp.arch#22403
 by: MitchAlsup - Fri, 24 Dec 2021 19:36 UTC

On Friday, December 24, 2021 at 1:18:28 PM UTC-6, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
> >> Celio also argues that instruction fusion can overcome the problem of
> >
> >Did you incompletely cut a paragraph ?
> Probably I switched to writing a different paragraph and then forgot
> to return:-)
> What I wanted to write:
>
> |Celio also argues that instruction fusion can overcome the problem of
> |latency chains of the simple instructions. E.g., he gives the
> |example of an indexed load consisting of a shift, an add, and a load
> |on RISC-V; instruction fusion can combine them into one instruction
> |with the latency of one instruction.
<
You have still "eaten" the size of the 3 instructions in the I$, and I would argue
that it takes less power to directly decode LD Rd,[Rb,Ri<<s+disp] than to
decode {SL Rt,Rindex,#shift; ADD Rt,Rt,Rb; LD Rd,[Rt+disp]}
<
>
> But actually I don't think Celio discusses the latency directly, it's
> just implicit in his talk.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: RISC-V vs. Aarch64

<sq5dj1$1q9$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22406&group=comp.arch#22406
 by: BGB - Fri, 24 Dec 2021 21:20 UTC

On 12/24/2021 9:38 AM, Anton Ertl wrote:
> I have recently (despite me) watched
> <https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
> compares some aspects RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
> AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
> <https://arxiv.org/pdf/1607.02318.pdf>
>
> In this posting I look at the different approaches taken in A64 (aka
> Aarch64) and RV64GC:
>
> * RISC-V has simple instructions, e.g., with only one addressing mode
> (reg+offset), and the base instruction set encodes each instruction
> in 32 bits, but with the C (compressed) extension adds a 16-bit
> encoding for the most frequent instructions.
>

A 16/32 encoding can save ~ 40-60% IME vs fixed 32-bit instructions.

For example: I was recently experimenting again with a "Fix32" variant
of my ISA, but building the Boot ROM in Fix32 mode basically blows out
the ROM size (had to both omit some stuff and also expand the ROM size
to 48K in this test; vs the 16/32 ISA where it still fits in 32K).

I had experimented in the past, noting that 16/24/32 doesn't save enough
over 16/32 to justify the cost.

For the main ROM, to be able to boot Fix32 on a secondary core (and be
within the 32K limit), I had to put some of the initial Boot ASM in
Fix32 mode. It detects that it is on a secondary core and then goes into
a loop where it spins indefinitely and waits for a special interrupt
(with any secondary cores being "woken up" by a special inter-processor
interrupt post-boot).

I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
both modes are "very common" in practice.

More elaborate modes quickly run into diminishing returns territory; so,
things like auto-increment and similar are better off left out IMO.
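
For concreteness (a generic C example, not tied to any particular ISA),
the two modes above map directly onto the most common C access patterns:
a structure field is a (Reg,Disp) access and an array element is a
(Reg,Index) access, which an ISA without an indexed mode has to
synthesize with a shift and an add.

  #include <stdint.h>
  #include <stdio.h>

  struct node { int64_t pad; int64_t key; };

  /* p->key : load from [p + 8]     -> (Reg,Disp)
     a[i]   : load from [a + i*8]   -> (Reg,Index), scaled;
              without an indexed mode this becomes shift+add+load. */
  static int64_t sum(const int64_t *a, int n, const struct node *p)
  {
      int64_t s = p->key;
      for (int i = 0; i < n; i++)
          s += a[i];
      return s;
  }

  int main(void)
  {
      int64_t a[4] = { 1, 2, 3, 4 };
      struct node nd = { 0, 100 };
      printf("%lld\n", (long long)sum(a, 4, &nd));   /* 110 */
      return 0;
  }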

> * A64 has fixed-length 32-bit instructions, but they can be more
> complex: A64 has more addressing modes and additional instructions
> like load pair and store pair; in particular, the A64 architects
> seem to have had few concerns about the register read and write
> ports needed per instruction; E.g., a store-pair instruction can
> need four read ports, and a load pair instruction can need three
> write ports (AFAIK).
>

These are less of an issue if one assumes a minimum width for the core.
If the core is always at least 3-wide, this isn't an issue.

For a 1-wide core, there may need to be compromise. For example, my
Fix32 core mentioned previously is only 1-wide, so needed to omit the
MOV.X instruction. I did end up retaining Jumbo encodings, but they use
a slightly different mechanism from the 3-wide core.

I have noted I can pair a 3-wide core with a 1-wide core, and fit it
(barely) in the FPGA. Part of the issue is that I am paying a fairly steep
cost for the 64B burst transfers in the L2/DDR stage, but I kinda need
these to get sufficient memory bandwidth for a RAM-backed framebuffer to
work (otherwise the screen is broken garbage).

Ironically though, the 1-wide core isn't *that* much cheaper than the
3-wide core, despite being significantly less capable.

I could maybe make the core a bit cheaper by leaving out RV64I support,
which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
Arbitrary register as link-register; ...). Or, conversely, a core which
*only* does RV64I.

Still won't save much for the amount of LUTs eaten by the L1D$ and TLB
though (just these two are around 1/3 of the total LUT cost of the
1-wide core).

Meanwhile, things like the instruction decoder mostly seem to
"disappear into the noise" in this case.

> Celio argues that the instruction density of RV64GC is competetive,
> and that the C extension is cheap to implement for a small
> implementation.
>
> It's pretty obvious that a small implementation of RV64G is smaller
> than a small implementation of A64, and adding the C extension to a
> small implementation of RV64G (to turn it to RV64GC) is reported in
> the talk IIRC (it's on the order of 700 transistors, so still cheap),
> so you can get a small RV64GC cheaper than a small A64 implementation
> and have similar code density.
>

Yeah, by the time you have all the other stuff in 'G', the cost of the
'C' extension is probably negligible.

> Celio also argues that instruction fusion can overcome the problem of
>
> But I wonder how things turn out for larger implementations: Now
> RV64GC needs twice as many decoders to decode, say, 32 bytes of
> instructions, and then some additional hardware that selects which of
> these decodes are valid (and of course all this extra hardware costs
> energy).
>
> And with instruction fusion the end result of the decoding step
> becomes even more complex.
>

Instruction fusion seems like a needless complexity IMO. It would be
better if the ISA can be "dense" enough to make the fusion mostly
unnecessary. Though, "where to draw the line" is probably a subject of
debate.

Granted, it could probably be justified if one has already paid the
complexity cost needed to support superscalar (since it should be able
to use the same mechanism to deal with both).

I have come to the realization though that, because my cores lack many
of the tricks that fancier RISC-V cores use, my cores running in RISC-V
mode will at best have somewhat worse performance than something like SweRV.

If I were designing a "similar" ISA to RISC-V (with no status flags), I
would probably leave out full Compare-and-Branch instructions, and
instead have a few "simpler" conditional branches, say:
BEQZ reg, label //Branch if reg==0
BNEZ reg, label //Branch if reg!=0
BGEZ reg, label //Branch if reg>=0
BLTZ reg, label //Branch if reg<0

While conceptually, this doesn't save much, it would be cheaper to
implement in hardware. Relative compares could then use compare
instructions:
CMPx Rs, Rt, Rd
Where:
(Rs == Rt) => 0;
(Rs > Rt) => 1;
(Rs < Rt) => -1.

Though, one issue with a plain SUB is that it would not work correctly
for comparing integer values the same width as the machine registers (if
the difference is too large, the intermediate value will overflow).
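
A small C illustration of that failure mode (just a sketch, nothing
ISA-specific): taking the sign of the wrapped difference misorders
operands whose difference overflows, while a branch-free three-way
compare like the CMPx above does not.

  #include <stdint.h>
  #include <stdio.h>

  /* WRONG: sign of the (wrapped) difference; breaks for distant values.
     The cast through uint64_t keeps the wrap-around a hardware SUB would
     produce while avoiding C-level undefined behaviour. */
  static int cmp_by_sub(int64_t a, int64_t b)
  {
      int64_t d = (int64_t)((uint64_t)a - (uint64_t)b);
      return (d > 0) - (d < 0);
  }

  /* Correct three-way compare: -1, 0, or 1, like the CMPx sketched above. */
  static int cmp3(int64_t a, int64_t b)
  {
      return (a > b) - (a < b);
  }

  int main(void)
  {
      int64_t a = INT64_MIN, b = 1;
      printf("cmp_by_sub=%d cmp3=%d\n", cmp_by_sub(a, b), cmp3(a, b));
      /* cmp_by_sub claims a > b here; cmp3 correctly reports a < b */
      return 0;
  }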

> OTOH, a high-performance A64 implementation probably wants some kind
> of register port allocator (possibly with port requirements across
> several cycles), so it has its own source of complexity. I wonder if
> that's the reason why some high-performance ARMs have microcode
> caches; I would normally have thought that microcode caches are
> unnecesary for a fixed-length instruction format.
>

I wouldn't think this necessary; if I were implementing it I could do it
more like how I deal with 128-bit SIMD and similar in my existing core,
and map a single instruction to multiple lanes when needed.

> What do people more familiar with hardware design think?
>
> A classic question is the classification of the architecture: Celio
> claims that ARM is CISC (using an A32 Load-multiple instruction as
> example). These questions are not decided by touchy-feely arguments,
> but rather by instruction set properties. All RISCs are load/store
> general-purpose instruction sets; even the A32 load/store-multiple
> instructions can be considered a variant of that; one might argue that
> accessing two pages in one instruction is not RISC, but by that
> criterion RISC-V is not RISC (it allows page-crossing unaligned
> accesses). One other common trait in RISCs is fixed-length 32-bit
> instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> the C extension.
>

IMO, Load/Store is the big issue...

Load/Store allows decent performance from a simplistic pipeline.

As soon as one deviates from Load/Store, one has effectively thrown a
hand grenade into the mix (one is doomed to either needing a much more
complex decoder, OoO, or paying a cost in terms of a significant
increase in clock-cycle counts per instruction).

In comparison, variable-length instructions and misaligned memory access
are not nearly as destructive. Granted, they are not free either. I
suspect the L1 D$ in my case would be somewhat cheaper if it did not
need to deal with misaligned access.

However, besides "cheap core", it is also nice to be able to have "fast
LZ77 decoding" and similar, which is an area where misaligned memory
access pays off.
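
As a concrete example of what pays off there (a sketch of the usual
trick, nothing implementation-specific): the LZ77 match copy can move 8
bytes per step with possibly-misaligned word accesses instead of a byte
loop; a memcpy of a fixed small size compiles down to a single unaligned
load/store on targets that allow misaligned access.

  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>

  /* Copy a match of 'len' bytes in 8-byte chunks (rounding up), assuming
     the buffers have at least 8 bytes of slack and the regions do not
     overlap within a chunk; real LZ77 copies need extra care for short
     match distances. */
  static void copy_match(uint8_t *dst, const uint8_t *src, size_t len)
  {
      for (size_t i = 0; i < len; i += 8) {
          uint64_t w;
          memcpy(&w, src + i, 8);   /* possibly misaligned load  */
          memcpy(dst + i, &w, 8);   /* possibly misaligned store */
      }
  }

  int main(void)
  {
      uint8_t buf[32] = "abcdefgh";
      copy_match(buf + 9, buf + 1, 8);     /* deliberately misaligned src and dst */
      printf("%s\n", (char *)(buf + 9));   /* prints "bcdefgh" */
      return 0;
  }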

Page-crossing doesn't seem to be too big of an issue, since it is rare,
and can be handled with two TLB misses in a row if needed (only the
first TLB miss gets served; when the interrupt returns and the
instruction tries again, it causes another TLB miss).

> An interesting point in the talk is that zero-extension is a common
> idiom in RISC-V; the talk does not explain this completely, but
> apparently the 32-bit variants of instructions sign-extend (like Alpha
> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> values around in sign-extended form (including 32-bit unsigned
> values).
>


[Remainder of article truncated]
Re: RISC-V vs. Aarch64

<sq5dvo$443$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22408&group=comp.arch#22408
 by: BGB - Fri, 24 Dec 2021 21:27 UTC

On 12/24/2021 1:36 PM, MitchAlsup wrote:
> On Friday, December 24, 2021 at 1:18:28 PM UTC-6, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>>>> Celio also argues that instruction fusion can overcome the problem of
>>>
>>> Did you incompletely cut a paragraph ?
>> Probably I switched to writing a different paragraph and then forgot
>> to return:-)
>> What I wanted to write:
>>
>> |Celio also argues that instruction fusion can overcome the problem of
>> |latency chains of the simple instructions. E.g., he gives the
>> |example of an indexed load consisting of a shift, an add, and a load
>> |on RISC-V; instruction fusion can combine them into one instruction
>> |with the latency of one instruction.
> <
> You have still "eaten" the size of the 3 instructions in the I$, and I would argue
> that it takes less power to directly decode LD Rd,[Rb,Ri<<s+disp] than to
> decode {SL Rt,Rindex,#shift; ADD Rt,Rt,Rb; LD Rd,[Rt+disp]}
> <

Agreed.

Fusion seems like a band-aid, and it isn't free.

If it is handled as a special case in a superscalar decoder, even if the
fusion itself works and is "basically free", one may still pay
indirectly in terms of instructions which were not executed in parallel
but would have been executed in parallel had these other instructions
not been in the way.

>>
>> But actually I don't think Celio discusses the latency directly, it's
>> just implicit in his talk.
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: RISC-V vs. Aarch64

<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22411&group=comp.arch#22411
 by: MitchAlsup - Fri, 24 Dec 2021 22:06 UTC

On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> On 12/24/2021 9:38 AM, Anton Ertl wrote:
<snip>
>
> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
> both modes are "very common" in practice.
<
I added [Rbase+Rindex<<scale+disp] even if it is used only 2%-3% of the
time because it is expressive (more for FORTRAN needs than C needs) and it saves
instruction space. I provided ONLY the above two (2) in M88K and
later when doing x86-64 found out the error of my ways.
<
The least one can add is [Rbase+Rindex<<scale]
>
> More elaborate modes quickly run into diminishing returns territory; so,
> things like auto-increment and similar are better off left out IMO.
<
Agreed except for [Rbase+Rindex<<scale+disp]
<
<snip again>
> Ironically though, the 1-wide core isn't *that* much cheaper than the
> 3-wide core, despite being significantly less capable.
>
> I could maybe make the core a bit cheaper by leaving out RV64I support,
> which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
> Arbitrary register as link-register; ...). Or, conversely, a core which
> *only* does RV64I.
<
When Compare-and-Branch is correctly specified it can be performed
in 1 pipeline cycle. When specified without looking "at the gates", it
seldom is!
>
> Still wont save much for the amount of LUTs eaten by the L1D$ and TLB
> though (just these two are around 1/3 of the total LUT cost of the
> 1-wide core).
>
<snip>
> >
> > And with instruction fusion the end result of the decoding step
> > becomes even more complex.
> >
> Instruction fusion seems like a needless complexity IMO. It would be
> better if the ISA can be "dense" enough to make the fusion mostly
> unnecessary. Though, "where to draw the line" is probably a subject of
> debate.
<
Instruction fusing is worthwhile on a 1-wide machine with certain pipeline
organization to execute 2 instructions per cycle and a few other tricks.
<
In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
a) any instruction and an unconditional branch
b) any result delivering instruction and a conditional branch or predicate
consuming the result.
c) Store instruction followed by a calculation instruction such that the
total number of register reads is <= 3.
d) any non-branch instruction followed by an unconditional branch
instruction allows the branch to be taken at an execution cost of ZERO
cycles.
<
An Excel spreadsheet shows up to 30% performance advantage over a
strict 1-wide machine. {This is within spitting distance of the gain of
In Order 2-wide over In Order 1-wide with less than 20% of the cost}
>
> Granted, it could probably be justified if one has already paid the
> complexity cost needed to support superscalar (since it should be able
> to use the same mechanism to deal with both).
<
I justified it not because of SuperScalar infrastructure (which I do not have)
but because My 66000 ISA <practically> requires an instruction fetch buffer;
So, the vast majority of the time the instructions are present and simply
waiting for the pipeline to get around to them.
>
> I have come to the realization though that, because my cores lack many
> of the tricks that fancier RISC-V cores use, my cores running in RISC-V
> mode will at-best have somewhat worse performance than something like SweRV.
<
Yes, not being like the others has certain disadvantages, too.
>
>
>
> If I were designing a "similar" ISA to RISC-V (with no status flags), I
<
Before you head in this direction, I find RISC-V ISA rather dreadful and
only barely serviceable.
<
> would probably leave out full Compare-and-Branch instructions, and
> instead have a few "simpler" conditional branches, say:
> BEQZ reg, label //Branch if reg==0
> BNEZ reg, label //Branch if reg!=0
> BGEZ reg, label //Branch if reg>=0
> BLTZ reg, label //Branch if reg< 0
>
> While conceptually, this doesn't save much, it would be cheaper to
> implement in hardware.
<
Having done both, I can warn you that your assumption is filled with badly
formed misconceptions. From a purity standpoint you do have a point;
from a gate count perspective and a cycle time perspective you do not.
<
> Relative compares could then use compare
> instructions:
> CMPx Rs, Rt, Rd
> Where:
> (Rs==Rt) => 0;
> (Rs> Rt) => 1;
> (Rs< Rt) => -1.
>
> Though, one issue with a plain SUB is that it would not work correctly
> for comparing integer values the same width as the machine registers (if
> the difference is too large, the intermediate value will overflow).
<
Which is why one needs CMP instructions and not to rely on SUB to do 98%
of the work.
<
> > OTOH, a high-performance A64 implementation probably wants some kind
> > of register port allocator (possibly with port requirements across
> > several cycles), so it has its own source of complexity. I wonder if
> > that's the reason why some high-performance ARMs have microcode
> > caches; I would normally have thought that microcode caches are
> > unnecesary for a fixed-length instruction format.
> >
> I wouldn't think this necessary, if I were implementing it I could do it
> more like how I deal with 128-bit SIMD and similar in my existing core,
> and map a single instruction to multiple lanes when needed.
> > What do people more familiar with hardware design think?
> >
> > A classic question is the classification of the architecture: Celio
> > claims that ARM is CISC (using an A32 Load-multiple instruction as
> > example). These questions are not decided by touchy-feely arguments,
> > but rather by instruction set properties. All RISCs are load/store
> > general-purpose instruction sets; even the A32 load/store-multiple
> > instructions can be considered a variant of that; one might argue that
> > accessing two pages in one instruction is not RISC, but by that
> > criterion RISC-V is not RISC (it allows page-crossing unaligned
> > accesses). One other common trait in RISCs is fixed-length 32-bit
> > instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> > exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> > the C extension.
> >
> IMO, Load/Store is the big issue...
>
> Load/Store allows decent performance from a simplistic pipeline.
<
Ahem: 4-wide machines do not have simplistic pipelines, so the burden is
only on the lesser implementations and nothing on the higher performance
ones.
>
> As soon as one deviates from Load/Store, one has effectively thrown a
> hand grenade into the mix (one is doomed to either needing a much more
> complex decoder, OoO, or paying a cost in terms of a significant
> increase in clock-cycle counts per instruction).
>
There is the multi-fire reservation station trick used on Athlon and Opteron,
which pretty much makes the problem vanish.
>
> In comparison, variable-length instructions and misaligned memory access
> are not nearly as destructive. Granted, they are not free either. I
> suspect the L1 D$ in my case would be somewhat cheaper if it did not
> need to deal with misaligned access.
<
I would assert that they are better than FREE. They add more performance
than the added gates to do these things cost.
>
> However, besides "cheap core", it is also nice to be able to have "fast
> LZ77 decoding" and similar, which is an area where misaligned memory
> access pays off.
<
¿dynamic bitfields?
>
> Page-crossing doesn't seem to be too big of an issue, since it is rare,
> and can be handled with two TLB misses in a row if needed (only the
> first TLB miss gets served; when the interrupt returns and the
> instruction tries again, it causes another TLB miss).
<
It is only time, and 98%-99% of the time these crossings don't happen.
{98% in 4KB pages, 99% in 8KB pages}
<
> > An interesting point in the talk is that zero-extension is a common
> > idiom in RISC-V; the talk does not explain this completely, but
> > apparently the 32-bit variants of instructions sign-extend (like Alpha
> > does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> > values around in sign-extended form (including 32-bit unsigned
> > values).
> >
> It is a tradeoff.
>
> In BJX2, I went with signed values being kept in sign-extended form, and
> unsigned values kept in zero-extended form (and casts requiring explicit
> sign or zero extension of the results).
<
The tradeoff is even harder where the linker fills in the "upper" bits of a
register. If the underlying premise is sign extension, one COULD need 2
registers to hold the "same value" for 2 different large address constants.
<
If the underlying premise is zero extension, certain bit pasting ways using
+ need to be changed to use |.
<
Another reason not to make SW, of any sort, have to paste bits together to
make <larger> constants.
>
> This is a general rule even if there are only a minority of cases where
> it should matter.
<
More than you allude to.


[Remainder of article truncated]
Re: RISC-V vs. Aarch64

<6e7ed718-3d41-4b99-8294-bb3ff8a27b72n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22418&group=comp.arch#22418
 by: MitchAlsup - Sat, 25 Dec 2021 01:33 UTC

On Friday, December 24, 2021 at 4:06:25 PM UTC-6, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> > On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>

> > Ironically though, the 1-wide core isn't *that* much cheaper than the
> > 3-wide core, despite being significantly less capable.
> >
> > I could maybe make the core a bit cheaper by leaving out RV64I support,
> > which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
> > Arbitrary register as link-register; ...). Or, conversely, a core which
> > *only* does RV64I.
> <
> When Compare and Branch is correctly specified it can be performed
> in 1 pipeline cycle. Then specified without looking "at the gates" is
> seldom is !
> >
> > Still wont save much for the amount of LUTs eaten by the L1D$ and TLB
> > though (just these two are around 1/3 of the total LUT cost of the
> > 1-wide core).
<
For example: With the combined register file of My 66000 ISA, and in
consideration of a lower end design point, I changed the pipeline by
adding a stage between LD-Align and WriteBack--this stage is
appropriately known as WAIT!
<
WAIT allows me to pipeline FP operations into clock stages of
2-clocks each. This gains 5 more gates of delay; so, a 16 gate machine
gets 16+5+16 gates per FP clock cycle. This should make FP take fewer
cycles (seen at the main pipeline rate) for 1-wide implementations.
<
The integer pipeline looks like:         | EXECUTE | CACHE | ALIGN | WAIT |
The floating-point pipeline looks like:  |    EXECUTE1     |   EXECUTE2   |
>
One can put a comparison operation in WAIT so that for conditional
branches the comparison is ready by the time the branch needs it.
<
Most FP calculations have memory references interspersed with FP
calculation instructions, so ½ piping the FP unit causes little performance
loss and there are gains made in other places.
> >

Re: RISC-V vs. Aarch64

<sq675n$tht$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22420&group=comp.arch#22420
 by: Ivan Godard - Sat, 25 Dec 2021 04:37 UTC

On 12/24/2021 10:17 AM, MitchAlsup wrote:

> c) My 66000 has 32-bit fixed length instruction-specifiers followed
> by1-4 doublewords of constants needed by the instruction. One can
> <ahem> decode the instruction and derive pointers to these constants,
> the number of the constants and a pointer to the next instruction in
> 30-total gates and 4-gates of delay.
> <
> There is NO JUSTIFIABLE reason that an instruction is not entirely
> self-contained!

Really? I suppose that might be true for OOO that is emulating
sequential execution, but what about VLIW and other wide multi-issue?
Chopping off the variable and large parts so they can be recognized in
parallel lets the issue width no longer depend on how many constants you
have.

Re: RISC-V vs. Aarch64

<6a5ed3c8-7aea-4b0b-8c67-25695b586cbbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22422&group=comp.arch#22422
 by: robf...@gmail.com - Sat, 25 Dec 2021 08:24 UTC

The My66000 sounds great. I may borrow some ideas from it.

In Thor2021 if the constant is larger than 23 bits, then prefixes are used to extend it. The primary reason to do this was to keep instructions less than 64-bits wide, so the redundancy in the cache was limited. Cache lines overflow to support 16-bit alignment, with data crossing a cache line stored in two places. A second reason was that prefixes allow the constant to extend beyond 64-bits, to 128 or 256 bits. That should make it possible to do 256-bit wide operations with constants, if needed. 128-bit constant may be needed for the floating-point.
However, there is some room in the opcode space to include a slew of instructions that may use a 64-bit constant directly. So, I am debating with myself whether to support larger constants.

The current core has only a 128-bit wide data bus to the outside world, ROM and DRAM. Cache lines are burst loaded across this bus using a 4-word burst. Writes write one 64-bit word at a time across the bus.
In the path from the cache to the register file, loading a whole cache line’s worth of data would definitely improve performance for context switches. However, if the context is not in cache then performance would be limited by the external read. I may modify things to be able to store 128-bit register pairs at once across the external bus. So, I think an octet register load instruction, and a register pair store instruction, may help performance. These could be wrapped into micro-coded instructions to save or restore all registers. There is only a small data cache (32 kB), so with frequent access to external memory I think it may get just about as much mileage out of a pair load instruction.

It is interesting because, by making the register file access path wider in an FPGA, the depth remains the same because the file is made out of LUT RAMs. It costs more transistors and power. If the read/write path is eight registers wide then there could be 8 sets of 64 registers because LUT RAMs are 64-deep anyway.

Re: RISC-V vs. Aarch64

<sq6udp$hj3$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=22424&group=comp.arch#22424
 by: Thomas Koenig - Sat, 25 Dec 2021 11:14 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
><snip>
>>
>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>> both modes are "very common" in practice.
><
> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> it is expressive (more to FORTRAN needs than C needs) and it saves
> instruction space. I provided ONLY the above two (2) in M88K and
> later when doing x86-64 found out the error of my ways.
><
> The least one can add is [Rbase+Rindex<<scale]

Depending on how pressed for bits in opcodes one is, the scale
could also be implied by the size of the data, so a 64-bit word
would have either scale = 0 or scale = 3. I think HP-PA did that.

This would cover the most common case of linear array access.
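
In C terms, the implied scale is just the element size, which is all that plain linear array access needs; a small sketch:

  #include <stdint.h>

  /* a[i] for 64-bit elements: the index is scaled by the access size,
     so an implied scale of 3 (i.e. <<3) covers it with no extra
     encoding bits: [Rbase + Rindex<<3].                              */
  static int64_t load_elem(const int64_t *base, uint64_t i)
  {
      uintptr_t addr = (uintptr_t)base + (i << 3);
      return *(const int64_t *)addr;
  }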

>>
>> More elaborate modes quickly run into diminishing returns territory; so,
>> things like auto-increment and similar are better off left out IMO.
><
> Agreed except for [Rbase+Rindex<<scale+disp]

In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
So, any offset in the range from -2^31 to 2^31-1 needs two 32-bit
words.

Looking at the (probably more common) case where the offset can fit
into (for example) 16 bits, a good alternative might be

LEA Rx,[Rbase+Rindex<<scale]
LD Rx,[Rx+offset]

which also uses two words and does not use an extra register.
The use case would then be pretty small, so I would probably
tend to leave it out.

Re: RISC-V vs. Aarch64

<sq7fls$237$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22426&group=comp.arch#22426

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 08:08:28 -0800
Organization: A noiseless patient Spider
Lines: 28
Message-ID: <sq7fls$237$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sq6udp$hj3$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Dec 2021 16:08:28 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="49fb671ecae1fabc97e23bc3ec74929a";
logging-data="2151"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+QLNv7Pz6ga5QYfYrOMR7X"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:SfobE58w/ShtbwxlLt9B9+fIxQQ=
In-Reply-To: <sq6udp$hj3$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Ivan Godard - Sat, 25 Dec 2021 16:08 UTC

On 12/25/2021 3:14 AM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>> <snip>
>>>
>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>> both modes are "very common" in practice.
>> <
>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>> it is expressive (more to FORTRAN needs than C needs) and it saves
>> instruction space. I provided ONLY the above two (2) in M88K and
>> later when doing x86-64 found out the error of my ways.
>> <
>> The least one can add is [Rbase+Rindex<<scale]
>
> Depending on how pressed for bits in opcodes one is, the scale
> could also be implied in the size of the data, so a 64-bit word
> would have either scale = 0 or scale == 3. I think HP-PA did that.
>
> This would cover the most common case of linear array access.

Mill does that. Pretty much all the cases that don't work involve
picking a field out of an indexed element of an array of structs, and
nearly all of those require a shift large enough that it wouldn't fit in
the scale encoding of ISAs like x86 anyway.
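
A sketch of that case, assuming 64-bit members (the stride, not any particular ISA, is the point):

  #include <stdint.h>

  struct rec { int64_t k, v, x, y; };   /* 32-byte stride, i.e. a shift of 5 */

  /* a[i].v needs base + (i<<5) + 8.  The implied scale of 3 (the access
     size) does not cover it, and a shift of 5 does not fit x86's 1/2/4/8
     scale field either, so the shift becomes a separate instruction.    */
  static int64_t pick_field(const struct rec *a, uint64_t i)
  {
      return a[i].v;
  }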

Re: RISC-V vs. Aarch64

<2021Dec25.174017@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=22428&group=comp.arch#22428

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 16:40:17 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 38
Message-ID: <2021Dec25.174017@mips.complang.tuwien.ac.at>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com> <2021Dec24.200638@mips.complang.tuwien.ac.at> <67430066-07f1-4276-bcc9-c542eac2faefn@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="70510cc33546f9e44e24374914364b6e";
logging-data="32179"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18+0+NLrhzZZHRNq3BxRpPO"
Cancel-Lock: sha1:xVT5fdnbpHIu/2UnpRCKrcde5q8=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sat, 25 Dec 2021 16:40 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Friday, December 24, 2021 at 1:18:28 PM UTC-6, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>> >> Celio also argues that instruction fusion can overcome the problem of=20
>> >
>> >Did you incompletely cut a paragraph ?
>> Probably I switched to writing a different paragraph and then forgot
>> to return:-)
>> What I wanted to write:
>>
>> |Celio also argues that instruction fusion can overcome the problem of
>> |latency chains of the simple instructions. E.g., he gives the
>> |example of an indexed load consisting of a shift, an add, and a load
>> |on RISC-V; instruction fusion can combine them into one instruction
>> |with the latency of one instruction.
><
>You have still "eaten" the size of the 3 instructions in the I$

Celio argues, and supports it with numbers, that RV64GC is competitive
with A64 in code size.

>and I would argue
>that it takes less power to directly decode LD Rd,[Rb,Ri<<s+disp] than to
>decode {SL Rt,Rindex,#shift; ADD Rt,Rt,Rb; LD Rd,[Rt+disp]}

Yes, energy is certainly an issue. The question is how do RV64GC
decoders compare to A64 decoders for similar-performance cores on
their usual instruction mix.

It seems to me that the RV64GC might benefit more from a uop cache.
And given that high-performance A64 implementations have one,
competing RV64GC implementations will probably have one, too.
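
For concreteness, the indexed-load example at the source level, with one plausible lowering per ISA in the comments (illustrative, not taken from Celio's slides):

  #include <stdint.h>

  static int64_t indexed_load(const int64_t *p, uint64_t i)
  {
      /* RV64 base ISA: three dependent instructions that a fusing
         decoder can merge into a single uop:
             slli t0, a1, 3
             add  t0, a0, t0
             ld   a0, 0(t0)
         A64: one instruction using the scaled-register addressing mode:
             ldr  x0, [x0, x1, lsl #3]                                 */
      return p[i];
  }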

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: RISC-V vs. Aarch64

<c7IxJ.93263$IB7.84845@fx02.iad>

https://www.novabbs.com/devel/article-flat.php?id=22429&group=comp.arch#22429

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx02.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me> <59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sq6udp$hj3$1@newsreader4.netcologne.de> <sq7fls$237$1@dont-email.me>
In-Reply-To: <sq7fls$237$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 40
Message-ID: <c7IxJ.93263$IB7.84845@fx02.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 25 Dec 2021 17:00:56 UTC
Date: Sat, 25 Dec 2021 12:00:43 -0500
X-Received-Bytes: 2655
 by: EricP - Sat, 25 Dec 2021 17:00 UTC

Ivan Godard wrote:
> On 12/25/2021 3:14 AM, Thomas Koenig wrote:
>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>>> <snip>
>>>>
>>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>>> both modes are "very common" in practice.
>>> <
>>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>>> it is expressive (more to FORTRAN needs than C needs) and it saves
>>> instruction space. I provided ONLY the above two (2) in M88K and
>>> later when doing x86-64 found out the error of my ways.
>>> <
>>> The least one can add is [Rbase+Rindex<<scale]
>>
>> Depending on how pressed for bits in opcodes one is, the scale
>> could also be implied in the size of the data, so a 64-bit word
>> would have either scale = 0 or scale == 3. I think HP-PA did that.
>>
>> This would cover the most common case of linear array access.
>
> Mill does that. Pretty much all the cases that don't work involve
> picking a field out of an indexed element of an array of structs, and
> nearly all of those require a shift large enough that it wouldn't fit in
> the scale encoding of ISAs like x86 anyway.

Picking the imaginary field out of a complex number vector requires
the full [Rbase+Rindex<<scale+disp], but it also requires a separate
scale field that allows 2x the largest FP size.

I was considering leaving holes for a future FP128 data type, which
implies a LEA instruction that allows scales of 1,2,4,8,16,32.
Rounding the scale field up to 3 bits adds 64 and 128.

This would probably sit well with that whole quaternion and octonion crowd.
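
The same case in C; the 16-byte stride and the 8-byte field offset are just the double-complex layout:

  #include <complex.h>
  #include <stdint.h>

  /* cimag(v[i]) needs [Rbase + Rindex<<4 + 8]: a scale of 2x the largest
     FP size plus a displacement, exactly the combination described above. */
  static double imag_part(const double complex *v, uint64_t i)
  {
      return cimag(v[i]);
  }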

Re: RISC-V vs. Aarch64

<sq7kbd$ums$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22431&group=comp.arch#22431

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 09:28:13 -0800
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <sq7kbd$ums$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sq6udp$hj3$1@newsreader4.netcologne.de> <sq7fls$237$1@dont-email.me>
<c7IxJ.93263$IB7.84845@fx02.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 25 Dec 2021 17:28:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="49fb671ecae1fabc97e23bc3ec74929a";
logging-data="31452"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+pjezpLGBCfPaUH2guU+Lb"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:dtKaKGQhfYDUUEdFtzbEQydj9MM=
In-Reply-To: <c7IxJ.93263$IB7.84845@fx02.iad>
Content-Language: en-US
 by: Ivan Godard - Sat, 25 Dec 2021 17:28 UTC

On 12/25/2021 9:00 AM, EricP wrote:
> Ivan Godard wrote:
>> On 12/25/2021 3:14 AM, Thomas Koenig wrote:
>>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>>>> <snip>
>>>>>
>>>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency,
>>>>> and
>>>>> both modes are "very common" in practice.
>>>> <
>>>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>>>> it is expressive (more to FORTRAN needs than C needs) and it saves
>>>> instruction space. I provided ONLY the above two (2) in M88K and
>>>> later when doing x86-64 found out the error of my ways.
>>>> <
>>>> The least one can add is [Rbase+Rindex<<scale]
>>>
>>> Depending on how pressed for bits in opcodes one is, the scale
>>> could also be implied in the size of the data, so a 64-bit word
>>> would have either scale = 0 or scale == 3.  I think HP-PA did that.
>>>
>>> This would cover the most common case of linear array access.
>>
>> Mill does that. Pretty much all the cases that don't work involve
>> picking a field out of an indexed element of an array of structs, and
>> nearly all of those require a shift large enough that it wouldn't fit
>> in the scale encoding of ISAs like x86 anyway.
>
> Picking the imaginary field out of a complex number vector requires
> the full [Rbase+Rindex<<scale+disp] but also requires a separate
> scale field that allows *2 of the largest FP size.
>
> I was considering leaving holes for future FP128 data type so that
> implies a LEA instruction that allows scale of 1,2,4,8,16,32.
> Rounding up the scale field size to 3 bits adds 64,128.
>
> This would probably sit well with that whole quaternion and octonion crowd.

Yep, that's a case where scaling by size doesn't work. However, random
indices into complex arrays are rare; usually you are indexing over a
loop. Sometimes the compiler can build the scaling into the control
variables. When it can't, it just puts the shift into the prior bundle.
On a wide issue machine that's free.

In our code samples, the only place where it makes a difference is in
sporadic references outside of loops: "A[F()].f". Those are rare enough
that we decided to save the entropy.
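
The loop case in C, roughly; once the scaling folds into the induction variable there is nothing left to scale per access:

  #include <complex.h>

  /* The compiler walks a pointer with a constant 16-byte stride instead
     of recomputing base + i*16 each iteration, so no run-time scaling
     remains in the loop body.                                          */
  static double sum_imag(const double complex *v, int n)
  {
      double s = 0.0;
      for (const double complex *p = v; p != v + n; p++)
          s += cimag(*p);
      return s;
  }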

Re: RISC-V vs. Aarch64

<2021Dec25.181011@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=22432&group=comp.arch#22432

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 17:10:11 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 116
Message-ID: <2021Dec25.181011@mips.complang.tuwien.ac.at>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="70510cc33546f9e44e24374914364b6e";
logging-data="7648"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19bjX+rKltvwz0NR5eh1aXK"
Cancel-Lock: sha1:CL8hUWX+qYnvN+/mOlR+Se1o7uI=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sat, 25 Dec 2021 17:10 UTC

BGB <cr88192@gmail.com> writes:
>> * A64 has fixed-length 32-bit instructions, but they can be more
>> complex: A64 has more addressing modes and additional instructions
>> like load pair and store pair; in particular, the A64 architects
>> seem to have had few concerns about the register read and write
>> ports needed per instruction; E.g., a store-pair instruction can
>> need four read ports, and a load pair instruction can need three
>> write ports (AFAIK).
>>
>
>These are less of an issue if one assumes a minimum width for the core.
>If the core is always at least 3-wide, this isn't an issue.

Why? Sure, you could have one instruction port that supports up to
four reads and the others only support two, and one that supports
three writes, and the others only support one, but that would still be
8 register reads and 5 writes per cycle, and my impression was that
having many register ports is really expensive.

There is also the option of reducing register port requirements
through the bypasses, but what do you do if there is not enough
bypassed data?

>Instruction fusion seems like a needless complexity IMO. It would be
>better if the ISA can be "dense" enough to make the fusion mostly
>unnecessary.

Well, Celio's position is that RISC-V gets code density through the C
extension and uop density through instruction fusion.

In the past we have seen many cases where it was more beneficial to
solve things in microarchitecture rather than architecture: caches
(rather than explicit fast memory), OoO rather than EPIC, dynamic
branch prediction rather than delay slots and static branch
prediction, bypasses rather than transport-triggered architecture,
microarchitectural memory disambiguation rather than IA-64's ALAT.

Some of the benefits of putting features in the microarchitecture are
that

1) You can easily remove them (even within one implementation, by
using the appropriate chicken bit), while architectural features are
forever.

2) When you add them, existing code can benefit from them, while
architectural features need at least a recompilation.

RISC seemed to be the converse case of exposing things in the
architecture that, say, a VAX implementation does in the
microarchitecture, but one could also see VAX as another case of
exposing the microarchitecture in the architecture: the
microarchitecture was microcoded (because the microcode store was
faster than main memory at the time; but that's actually another
example of explicit fast memory mentioned above), so architects
designed instruction sets that made heavy use of microcode.

>If I were designing a "similar" ISA to RISC-V (with no status flags), I
>would probably leave out full Compare-and-Branch instructions, and
>instead have a few "simpler" conditional branches, say:
> BEQZ reg, label //Branch if reg==0
> BNEZ reg, label //Branch if reg!=0
> BGEZ reg, label //Branch if reg>=0
> BLTZ reg, label //Branch if reg< 0
>
>While conceptually, this doesn't save much, it would be cheaper to
>implement in hardware. Relative compares could then use compare
>instructions:
> CMPx Rs, Rt, Rd
>Where:
> (Rs==Rt) => 0;
> (Rs> Rt) => 1;
> (Rs< Rt) => -1.
>
>Though, one issue with a plain SUB is that it would not work correctly
>for comparing integer values the same width as the machine registers (if
>the difference is too large, the intermediate value will overflow).

Take a look at how Mitch Alsup solved this in the 88000.
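
For reference, the -1/0/+1 result quoted above has a simple overflow-free formulation in C, whatever the 88000 actually does:

  #include <stdint.h>

  /* Three-way compare without a full-width subtract, so it cannot
     overflow even when the operands span the whole register range. */
  static int64_t cmp3(int64_t a, int64_t b)
  {
      return (int64_t)(a > b) - (int64_t)(a < b);
  }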

>However, besides "cheap core", it is also nice to be able to have "fast
>LZ77 decoding" and similar, which is an area where misaligned memory
>access pays off.

If you know that you are doing misaligned accesses, composing a
possibly misaligned load from two loads and a combining instruction
(or a few) looks like a simpler approach.
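
A sketch of the explicit composition for a 64-bit load (little-endian assumed; the final shift-and-or is the "combining instruction"):

  #include <stdint.h>

  /* Possibly-misaligned 64-bit load built from two aligned loads plus a
     shift-based combine (little-endian).  When the address happens to be
     aligned, the second load and the combine are not needed.           */
  static uint64_t load_u64_any(const uint8_t *p)
  {
      uintptr_t a = (uintptr_t)p;
      const uint64_t *w = (const uint64_t *)(a & ~(uintptr_t)7);
      unsigned sh = (unsigned)(a & 7) * 8;
      if (sh == 0)
          return w[0];
      return (w[0] >> sh) | (w[1] << (64 - sh));
  }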

I think the reason why everyone has switched to supporting misaligned
accesses is that there is too much software out there that needs it.

Also, you really want it for SIMD stuff (and your LZ77 example is
actually an example of that), so you need to implement it anyway, and
can just as well extend it to the other memory accesses.

>Page-crossing doesn't seem to be too big of an issue, since it is rare,
>and can be handled with two TLB misses in a row if needed (only the
>first TLB miss gets served; when the interrupt returns and the
>instruction tries again, it causes another TLB miss).

If you deal with TLB misses through interrupts, you have to make sure
that the first translation is still in the TLB when the second miss
handler returns, or have some other way to guarantee eventual progress.

>In BJX2, I went with signed values being kept in sign-extended form, and
>unsigned values kept in zero-extended form

That seems like a natural way to do it. However, I am not sure if
that works well in the presence of type mismatches between caller and
callee in production C code. At least I imagine that's the reason why
ABIs force one kind of extension, and then do the appropriate
extension for the defined type at the usage site if necessary.
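
A tiny illustration of why the forced convention matters when the high bit is set (not any particular ABI's rule):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint32_t v = 0x80000000u;                     /* high bit set */
      int64_t  as_signed   = (int64_t)(int32_t)v;   /* sign-extended image */
      uint64_t as_unsigned = (uint64_t)v;           /* zero-extended image */
      /* The two conventions leave different 64-bit register contents;
         re-extending for the defined type at the usage site makes the
         result independent of which form arrived.                     */
      printf("%llx %llx\n", (unsigned long long)as_signed,
                            (unsigned long long)as_unsigned);
      return 0;
  }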

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: RISC-V vs. Aarch64

<2deffe34-6906-49d8-8657-62f785e02a73n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22434&group=comp.arch#22434

X-Received: by 2002:a05:620a:1721:: with SMTP id az33mr4561253qkb.93.1640455605200;
Sat, 25 Dec 2021 10:06:45 -0800 (PST)
X-Received: by 2002:aca:646:: with SMTP id 67mr8389991oig.175.1640455604980;
Sat, 25 Dec 2021 10:06:44 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 25 Dec 2021 10:06:44 -0800 (PST)
In-Reply-To: <sq675n$tht$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7149:e490:e0cd:51ab;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7149:e490:e0cd:51ab
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>
<sq675n$tht$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2deffe34-6906-49d8-8657-62f785e02a73n@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 25 Dec 2021 18:06:45 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 20
 by: MitchAlsup - Sat, 25 Dec 2021 18:06 UTC

On Friday, December 24, 2021 at 10:37:14 PM UTC-6, Ivan Godard wrote:
> On 12/24/2021 10:17 AM, MitchAlsup wrote:
>
> > c) My 66000 has 32-bit fixed length instruction-specifiers followed
> > by1-4 doublewords of constants needed by the instruction. One can
> > <ahem> decode the instruction and derive pointers to these constants,
> > the number of the constants and a pointer to the next instruction in
> > 30-total gates and 4-gates of delay.
> > <
> > There is NO JUSTIFIABLE reason that an instruction is not entirely
> > self-contained!
<
> Really? I suppose that might be true for OOO that is emulating
> sequential execution, but what about VLIW and other wide multi-issue?
> Chopping off the variable and large parts so they can be recognized in
> parallel lets the issue width no longer depend on how many constants you
> have.
<
IIUC: The packet of instructions on a Mill obeys the concept of being
"self-contained". It may have bits hither and yon, but they all show up
from a single fetch.

Re: RISC-V vs. Aarch64

<5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22435&group=comp.arch#22435

X-Received: by 2002:a05:6214:e41:: with SMTP id o1mr9636193qvc.63.1640455956234;
Sat, 25 Dec 2021 10:12:36 -0800 (PST)
X-Received: by 2002:a05:6830:1445:: with SMTP id w5mr7889649otp.112.1640455955984;
Sat, 25 Dec 2021 10:12:35 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 25 Dec 2021 10:12:35 -0800 (PST)
In-Reply-To: <sq6udp$hj3$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7149:e490:e0cd:51ab;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7149:e490:e0cd:51ab
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sq6udp$hj3$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 25 Dec 2021 18:12:36 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 48
 by: MitchAlsup - Sat, 25 Dec 2021 18:12 UTC

On Saturday, December 25, 2021 at 5:14:03 AM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> ><snip>
> >>
> >> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
> >> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
> >> both modes are "very common" in practice.
> ><
> > I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> > it is expressive (more to FORTRAN needs than C needs) and it saves
> > instruction space. I provided ONLY the above two (2) in M88K and
> > later when doing x86-64 found out the error of my ways.
> ><
> > The least one can add is [Rbase+Rindex<<scale]
> Depending on how pressed for bits in opcodes one is, the scale
> could also be implied in the size of the data, so a 64-bit word
> would have either scale = 0 or scale == 3. I think HP-PA did that.
<
Without "+disp" I did this in M88K.
>
> This would cover the most common case of linear array access.
> >>
> >> More elaborate modes quickly run into diminishing returns territory; so,
> >> things like auto-increment and similar are better off left out IMO.
> ><
> > Agreed except for [Rbase+Rindex<<scale+disp]
> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
> So, any offset in the range between -2^31 to 2^31-1 needs two 32-bit
> words.
<
My 66000 has:
a) [Rbase+disp16] that fits in one 32-bit word
b) [Rbase+Rindex<<scale] that fits in one 32-bit word
c) [Rbase+Rindex<<scale+disp32] that fits in two 32-bit words
d) [Rbase+Rindex<<scale+disp64] that fits in three 32-bit words
<
Displacements are sign extended so the problem above is moot.
>
> Looking at the (probably more common) case where the offset can fit
> into (for example) 16 bits, a good alternative might be
>
> LEA Rx,[Rbase+Rindex<<scale]
> LD Rx,[Rx+offset]
>
> which also uses to words and does not use an extra register.
> The use case would then be pretty small so that I would probably
> tend to leave it out.

Re: RISC-V vs. Aarch64

<sq7p9m$s3o$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22437&group=comp.arch#22437

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 10:52:38 -0800
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <sq7p9m$s3o$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sq6udp$hj3$1@newsreader4.netcologne.de>
<5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Dec 2021 18:52:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="49fb671ecae1fabc97e23bc3ec74929a";
logging-data="28792"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+HlVHsDZ8w+0c5FgMKrfrh"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:3H2iNm0nQLq/Z4A7Yt6CjQfbabg=
In-Reply-To: <5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Sat, 25 Dec 2021 18:52 UTC

On 12/25/2021 10:12 AM, MitchAlsup wrote:
> On Saturday, December 25, 2021 at 5:14:03 AM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>>> <snip>
>>>>
>>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>>> both modes are "very common" in practice.
>>> <
>>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>>> it is expressive (more to FORTRAN needs than C needs) and it saves
>>> instruction space. I provided ONLY the above two (2) in M88K and
>>> later when doing x86-64 found out the error of my ways.
>>> <
>>> The least one can add is [Rbase+Rindex<<scale]
>> Depending on how pressed for bits in opcodes one is, the scale
>> could also be implied in the size of the data, so a 64-bit word
>> would have either scale = 0 or scale == 3. I think HP-PA did that.
> <
> Without "+disp" I did this in M88K.
>>
>> This would cover the most common case of linear array access.
>>>>
>>>> More elaborate modes quickly run into diminishing returns territory; so,
>>>> things like auto-increment and similar are better off left out IMO.
>>> <
>>> Agreed except for [Rbase+Rindex<<scale+disp]
>> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
>> So, any offset in the range between -2^31 to 2^31-1 needs two 32-bit
>> words.
> <
> My 66000 has:
> a) [Rbase+disp16] that fits in one 32-bit word
> b) [Rbase+Rindex<<scale] that fits in one 32-bit word
> c) [Rbase+Rindex<<scale+disp32] that fits in two 32-bit words
> d) [Rbase+Rindex<<scale+disp64] that fits in three 32-bit words
> <

vs:
[Rbase]
[Rbase+disp]
[Rbase+Rindex(<<scale)]
[Rbase+Rindex(<<scale)+disp]
<all above + Rpredicate>

where disp is 0/1/2/4 bytes, R's are belt width (4 bits on Silver),
scale is one bit, opCode varies by slot - 11 bits typical. Total 16-56 bits.

I don't see the use of disp64, though I suppose it falls out for free
from your encoding of wide literals.

BTW, virtually all cases in our corpus are either disp=0 or disp=8, i.e.
16 or 24 bits total (plus index/scale/predicate if any).

Re: RISC-V vs. Aarch64

<sq7pqh$vcj$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22439&group=comp.arch#22439

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 13:01:34 -0600
Organization: A noiseless patient Spider
Lines: 117
Message-ID: <sq7pqh$vcj$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sq6udp$hj3$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 25 Dec 2021 19:01:37 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ecc0660471340972c1c701cec1d2adc3";
logging-data="32147"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19uTHqQ8QYxssaSuzXL6Srm"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:pVAB4c2G5hJSUWttSikFN6XrvmU=
In-Reply-To: <sq6udp$hj3$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Sat, 25 Dec 2021 19:01 UTC

On 12/25/2021 5:14 AM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>> <snip>
>>>
>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>> both modes are "very common" in practice.
>> <
>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>> it is expressive (more to FORTRAN needs than C needs) and it saves
>> instruction space. I provided ONLY the above two (2) in M88K and
>> later when doing x86-64 found out the error of my ways.
>> <
>> The least one can add is [Rbase+Rindex<<scale]
>
> Depending on how pressed for bits in opcodes one is, the scale
> could also be implied in the size of the data, so a 64-bit word
> would have either scale = 0 or scale == 3. I think HP-PA did that.
>
> This would cover the most common case of linear array access.
>

This is also how it works on BJX2.
In most cases, the scale is the element size.

The main exceptions are:
PC, GBR, or TBR is encoded as a base register (byte scale);
R1 is used as an index, which is interpreted as an unscaled R0.

It is possible to encode an explicit scale and (Rb+Ri*Sc+Disp) mode via
an Op64 encoding, but I have left this as optional. I may split these
into two different sub-cases, where the ability to encode a scaled
displacement is treated as separate from the ability to do so with a
non-zero displacement field.

Current scales are 1, 2, 4, and 8.
The MOV.X (128-bit) load/store ops use a scale of 8, which works well
for loading/storing a register pair (primary use case), but is a little
annoying for indexing an array of 128-bit elements (requires a 1-bit
left shift for the index); but this situation is infrequent.
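
The wrinkle in C, with an assumed 16-byte element layout (not BJX2 specifics):

  #include <stdint.h>

  typedef struct { uint64_t lo, hi; } u128;   /* 16-byte element */

  /* With the load/store scale fixed at 8 (one 64-bit word), indexing an
     array of 128-bit elements needs the index doubled first, i.e. the
     extra 1-bit left shift mentioned above:
         address = base + ((i << 1) << 3)                              */
  static u128 load_x(const u128 *base, uint64_t i)
  {
      const uint64_t *w = (const uint64_t *)base;
      u128 r = { w[i << 1], w[(i << 1) + 1] };
      return r;
  }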

>>>
>>> More elaborate modes quickly run into diminishing returns territory; so,
>>> things like auto-increment and similar are better off left out IMO.
>> <
>> Agreed except for [Rbase+Rindex<<scale+disp]
>
> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
> So, any offset in the range between -2^31 to 2^31-1 needs two 32-bit
> words.
>
> Looking at the (probably more common) case where the offset can fit
> into (for example) 16 bits, a good alternative might be
>
> LEA Rx,[Rbase+Rindex<<scale]
> LD Rx,[Rx+offset]
>
> which also uses to words and does not use an extra register.
> The use case would then be pretty small so that I would probably
> tend to leave it out.

Such is the issue...

A case which occurs fairly infrequently and is trivially encoded as a
2-op sequence with only a single cycle of penalty doesn't make a
particularly strong case.

My usual heuristics are based on:
Is it common?
Is it expensive to encode otherwise?
Does it appear cheap or free as a result of an existing mechanism?
...

So, particularly common cases will end up with more specialized encodings.

Things which are useful or otherwise expensive to implement in software
may get special instructions.

Likewise for things which trivially map to an existing unit or similar
(and have enough of a use-case to justify adding them).

In my case, my decoder works in two sub-stages:
Lookup the internal opcode numbers, form type, ... based on instruction
bits;
Unpack the instruction as given in the instruction form.

The former seems to primarily depend on which and how many bits are
considered in the lookup, and how many bits of output are produced. It
does not seem to care about how many patterns are assigned within this
space.

The latter case mostly cares about how many possible ways exist to
decode the instruction.

Overall decoder cost is moderate. For a 3-wide decoder with both BJX2
and RISC-V support, I am looking at around 4K LUTs. This is less than
either the Lane 1 ALU (6K LUT) or the L1 D$ (8K LUT).

The Lane 2/3 ALUs are a lot smaller, though; they also do less,
handling only integer arithmetic.

Lane 1 ALU does:
Arithmetic ops;
Compare;
Various format conversions;
CLZ / CTZ;
UTX1 decode;
...

Re: RISC-V vs. Aarch64

<sq816p$9ak$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22445&group=comp.arch#22445

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 15:07:35 -0600
Organization: A noiseless patient Spider
Lines: 435
Message-ID: <sq816p$9ak$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 25 Dec 2021 21:07:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ecc0660471340972c1c701cec1d2adc3";
logging-data="9556"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Q0uri8EWWFFVSF4LEnVGl"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:DnfalMq/evfi4ic67J7P8G53Fu4=
In-Reply-To: <59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 25 Dec 2021 21:07 UTC

On 12/24/2021 4:06 PM, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>
>>
>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>> both modes are "very common" in practice.
> <
> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> it is expressive (more to FORTRAN needs than C needs) and it saves
> instruction space. I provided ONLY the above two (2) in M88K and
> later when doing x86-64 found out the error of my ways.
> <
> The least one can add is [Rbase+Rindex<<scale]

There is an implicit scale applied to the index, but in my case it is
the element size in the 32-bit encoding.

In BJX2, there is a [Rbase + Rindex<<Sc + Disp] mode, but:
It is optional;
It requires an Op64 encoding;
My compiler isn't smart enough to utilize it effectively.

It is a little cheaper to support:
[Rbase + Rindex<<Sc]
Encoded as:
[Rbase + Rindex<<Sc + 0]

This doesn't require any additional machinery, but then I need to
distinguish these cases (both use the same encoding; in the latter the
displacement is just zero). It is partly a matter of definition, but it
also has a minor impact on how I define things in terms of the CPUID
feature flags and similar.

>>
>> More elaborate modes quickly run into diminishing returns territory; so,
>> things like auto-increment and similar are better off left out IMO.
> <
> Agreed except for [Rbase+Rindex<<scale+disp]
> <

Granted, it is in a gray area.

Though, lacking this case is not nearly as adverse as lacking
(Reg,Index) or (Reg,Index*Sc) ...

> <snip again>
>> Ironically though, the 1-wide core isn't *that* much cheaper than the
>> 3-wide core, despite being significantly less capable.
>>
>> I could maybe make the core a bit cheaper by leaving out RV64I support,
>> which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
>> Arbitrary register as link-register; ...). Or, conversely, a core which
>> *only* does RV64I.
> <
> When Compare and Branch is correctly specified it can be performed
> in 1 pipeline cycle. Then specified without looking "at the gates" is
> seldom is !

In the present form, the determination is made in EX1, and the branch is
initiated in EX2 (invalidating whatever is in EX1 at the time). Pushing
it later would likely require using an interlock or similar (so as to
not allow any instructions into the pipeline until the branch direction
can be determined).

The RISC-V style compare-and-branch instructions at present come with a
fairly steep cost in terms of LUTs and timing, as otherwise the results
of the compare would not be used until 1 cycle later.

>>
>> Still wont save much for the amount of LUTs eaten by the L1D$ and TLB
>> though (just these two are around 1/3 of the total LUT cost of the
>> 1-wide core).
>>
> <snip>
>>>
>>> And with instruction fusion the end result of the decoding step
>>> becomes even more complex.
>>>
>> Instruction fusion seems like a needless complexity IMO. It would be
>> better if the ISA can be "dense" enough to make the fusion mostly
>> unnecessary. Though, "where to draw the line" is probably a subject of
>> debate.
> <
> Instruction fusing is worthwhile on a 1-wide machine with certain pipeline
> organization to execute 2 instructions per cycle and a few other tricks.
> <
> In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
> a) any instruction and an unconditional branch
> b) any result delivering instruction and a conditional branch or predicate
> consuming the result.
> c) Store instruction followed by a calculation instruction such that the
> total number of register reads is <= 3.
> d) any non-branch instruction followed by an unconditional branch
> instruction allow the branch to be taken at execution cost of ZERO
> cycles.
> <
> An eXcel spreadsheet shows up to 30% performance advantage over a
> strict 1-wide machine. {This is within spitting distance of the gain of
> In Order 2-wide over In Order 1-wide with less than 20% of the cost}

OK. My existing core does not support (or perform) any sort of
instruction fusion.

Things like instruction fetch / PC-step are determined in the 'IF'
stage, but this is done in terms of instruction bits and mode flags,
rather than recognizing any instructions in particular.

It seems like one could fetch a block of instructions and then include
an "offset" based on how many instructions were executed from the prior
cycle; however this seems more complicated than the current approach,
and for a machine with 96 bit bundles would likely require fetching in
terms of 256 bit blocks rather than the 96 bits corresponding to the
current bundle (since one likely wouldn't know the exact position of the
bundle, or how far to adjust PC, until ID1 or similar).

>>
>> Granted, it could probably be justified if one has already paid the
>> complexity cost needed to support superscalar (since it should be able
>> to use the same mechanism to deal with both).
> <
> I justified it not because of SuperScalar infrastructure (which I do not have)
> But because My 66000 ISA <practically> requires an instruction fetch buffer;
> So, the vast majority of the time the instructions are present and simply
> waiting for the pipeline to get around to them.

OK. I am still using an approach more like:
Send PC to the I$ during PF;
Get a 96 bit bundle and PC_step during IF;
Advance PC and feed back into I$ (PF for next cycle).

This works, but requires things to be known exactly during IF, and does
not deal well with any uncertainty here.

Similarly, trying to stick either fusion or superscalar logic into the
IF stage seems problematic (cost, and if there are any false positives,
the pipeline state is basically hosed).

>>
>> I have come to the realization though that, because my cores lack many
>> of the tricks that fancier RISC-V cores use, my cores running in RISC-V
>> mode will at-best have somewhat worse performance than something like SweRV.
> <
> Yes, not being like the others has certain disadvantages, too.

Yeah. I can make the BJX2 core run RISC-V, but not "well". To do better
would require doing a core specifically for running RISC-V, but then
there is little purpose to do so apart from interop with my existing
cores on a common bus.

My recent "minicore" also aims to support RISC-V, but is at least
"slightly less absurd" in that it is closer to the common featureset of
BJX2 and RISC-V (it is just sort of a crappy subset of both ISAs). Its
performance isn't likely to be all that impressive though in either case.

>>
>>
>>
>> If I were designing a "similar" ISA to RISC-V (with no status flags), I
> <
> Before you head in this direction, I find RISC-V ISA rather dreadful and
> only barely serviceable.
> <

Possibly, there are things I am not a fan of.

Main reasons I am bothering with it at all:
It is much more popular;
It is supported by GCC;
It (sorta) maps onto my existing pipeline...
Or at least RV64I and RV64IC.
Pretty much everything beyond this is not really 1:1.

However, there are issues:
Pure RISC-V isn't really able to operate the BJX2 core;
BGBCC doesn't yet have a RISC-V target (or ELF object files);
...

This basically limits cross-ISA interaction at present mostly to ASM blobs.

I have yet to work out specifics for how I will do cross ISA interfacing
to a sufficient degree to actually load up RISC-V binaries on the BJX2
core (it can already be done in theory, but they can't do a whole lot as
of yet).

The harder case would be bare-metal, which is basically what I would
need to have any real hope of a Linux port or similar; but this would
require the ability to link object files between compilers and across
ISA boundaries.

The simpler route would just sort of be to load RISC-V ELF binaries
into TestKern.

Trying to port the Linux kernel to build on BGBCC is likely no-go.

Though, even if I did so, most existing Linux on RISC-V ports assume
RV64G or RV64GC, which is a bit out-of-reach at present.

>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware.
> <
> Having done both, I can warn you that your assumption is filled with badly
> formed misconceptions. From a purity standpoint you do have a point;
> from a gate count perspective and a cycle time perspective you do not.
> <


Click here to read the complete article
Re: RISC-V vs. Aarch64

<f298dcfe-49ad-4a92-8e24-78b290897b0en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22446&group=comp.arch#22446

X-Received: by 2002:a05:620a:2001:: with SMTP id c1mr7852404qka.374.1640468755546;
Sat, 25 Dec 2021 13:45:55 -0800 (PST)
X-Received: by 2002:a9d:7443:: with SMTP id p3mr8127249otk.331.1640468755290;
Sat, 25 Dec 2021 13:45:55 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 25 Dec 2021 13:45:55 -0800 (PST)
In-Reply-To: <sq7p9m$s3o$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7149:e490:e0cd:51ab;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7149:e490:e0cd:51ab
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sq6udp$hj3$1@newsreader4.netcologne.de>
<5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com> <sq7p9m$s3o$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f298dcfe-49ad-4a92-8e24-78b290897b0en@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 25 Dec 2021 21:45:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 80
 by: MitchAlsup - Sat, 25 Dec 2021 21:45 UTC

On Saturday, December 25, 2021 at 12:52:41 PM UTC-6, Ivan Godard wrote:
> On 12/25/2021 10:12 AM, MitchAlsup wrote:
> > On Saturday, December 25, 2021 at 5:14:03 AM UTC-6, Thomas Koenig wrote:
> >> MitchAlsup <Mitch...@aol.com> schrieb:
> >>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> >>> <snip>
> >>>>
> >>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
> >>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
> >>>> both modes are "very common" in practice.
> >>> <
> >>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> >>> it is expressive (more to FORTRAN needs than C needs) and it saves
> >>> instruction space. I provided ONLY the above two (2) in M88K and
> >>> later when doing x86-64 found out the error of my ways.
> >>> <
> >>> The least one can add is [Rbase+Rindex<<scale]
> >> Depending on how pressed for bits in opcodes one is, the scale
> >> could also be implied in the size of the data, so a 64-bit word
> >> would have either scale = 0 or scale == 3. I think HP-PA did that.
> > <
> > Without "+disp" I did this in M88K.
> >>
> >> This would cover the most common case of linear array access.
> >>>>
> >>>> More elaborate modes quickly run into diminishing returns territory; so,
> >>>> things like auto-increment and similar are better off left out IMO.
> >>> <
> >>> Agreed except for [Rbase+Rindex<<scale+disp]
> >> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
> >> So, any offset in the range between -2^31 to 2^31-1 needs two 32-bit
> >> words.
> > <
> > My 66000 has:
> > a) [Rbase+disp16] that fits in one 32-bit word
> > b) [Rbase+Rindex<<scale] that fits in one 32-bit word
> > c) [Rbase+Rindex<<scale+disp32] that fits in two 32-bit words
> > d) [Rbase+Rindex<<scale+disp64] that fits in three 32-bit words
> > <
> vs:
> [Rbase]
> [Rbase+disp]
> [Rbase+Rindex(<<scale)]
> [Rbase+Rindex(<<scale)+disp]
> <all above + Rpredicate>
>
> where disp is 0/1/2/4 bytes, R's are belt width (4 bits on Silver),
> scale is one bit, opCode varies by slot - 11 bits typical. Total 16-56 bits.
>
> I don't see the use of disp64, though I suppose it falls out for free
> from your encoding of wide literals.
<
There are a couple of other things:
Rbase == 0 means use IP as the base register {Dispψ, Disp32 }
Rindex == 0 means index = 0.
Rbase == 0 AND Disp64 means the 64-bit constant is an absolute address
.....{and can be indexed if desired}.
<
So, DISP64 is used for absolute addressing anywhere in the address space.
>
> BTW, virtually all cases in our corpus are either disp=0 or disp=8, i.e.
> 16 or 24 bits total (plus index/scale/predicate if any).
<
I am seeing 2% DISP64
Dispψ means Disp field does not exist in instruction.
<
A surprising number of STs contain constants to be deposited in memory
<
ST #5,[SP+32]

Re: RISC-V vs. Aarch64

<b7f6864e-8624-4486-abbf-b80492faa61fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22447&group=comp.arch#22447

X-Received: by 2002:a05:620a:454a:: with SMTP id u10mr8494321qkp.605.1640469546914;
Sat, 25 Dec 2021 13:59:06 -0800 (PST)
X-Received: by 2002:a05:6808:1914:: with SMTP id bf20mr8808193oib.7.1640469546692;
Sat, 25 Dec 2021 13:59:06 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 25 Dec 2021 13:59:06 -0800 (PST)
In-Reply-To: <sq7pqh$vcj$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7149:e490:e0cd:51ab;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7149:e490:e0cd:51ab
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sq6udp$hj3$1@newsreader4.netcologne.de>
<sq7pqh$vcj$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b7f6864e-8624-4486-abbf-b80492faa61fn@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 25 Dec 2021 21:59:06 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 85
 by: MitchAlsup - Sat, 25 Dec 2021 21:59 UTC

On Saturday, December 25, 2021 at 1:01:39 PM UTC-6, BGB wrote:
> On 12/25/2021 5:14 AM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
<snip>
> Such is the issue...
>
> A case which will occur fairly infrequently and is trivially encoded in
> a 2 op sequence with only a single cycle of penalty, doesn't make a
> particularly strong case.
>
>
>
> My usual heuristics are based on:
> Is it common?
> Is it expensive to encode otherwise?
> Does it appear cheap or free as a result of an existing mechanism?
> ...
A good list.
>
> So, particularly common cases will end up with more specialized encodings.
>
> Things which are useful or otherwise expensive to implement in software
> may get special instructions.
<
A switch() instruction ended up in My 66000 ISA. This instruction transfers
control to an address calculated via a table in memory, without consuming a
register and without passing the data through the data cache (switch tables
are accessed through the I$), and the index is range checked (transferring
control to default: in the out-of-range condition).
<
This made the cut because there are enough switches in typical code, AND the
table access should not consume a register, AND the table data should not
pollute the D$, and we can use the branch adder in the PARSE stage of the
pipeline to perform the arithmetic and fetch instructions.
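
Roughly the sequence the dedicated instruction replaces, sketched with a function-pointer table (the range check plus the table fetch are the point; done this way the table ties up a register and goes through the D$):

  #include <stdint.h>
  #include <stdio.h>

  static void case0(void) { puts("0"); }
  static void case1(void) { puts("1"); }
  static void case2(void) { puts("2"); }
  static void dflt(void)  { puts("default"); }

  static void (*const table[3])(void) = { case0, case1, case2 };

  /* Unsigned compare gives the range check (out of range -> default),
     then a table fetch and an indirect transfer.                      */
  static void do_switch(uint64_t i)
  {
      if (i >= 3) { dflt(); return; }
      table[i]();
  }

  int main(void) { do_switch(1); do_switch(7); return 0; }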
<
ENTER and EXIT made the cut because these can do the work of multiple
instructions (up to 19 under the My 66000 ABI) while avoiding pollution of
registers, AND I can put the preserved registers in a place where SW cannot
access the data--eliminating ROP attacks. ENTER and EXIT are faster than
actually executing instructions to perform the same amount of work because
registers can be moved 4-8 per cycle by the HW context-switch machinery.
>
> Likewise for things which trivially map to an existing unit or similar
> (and have enough of a use-case to justify adding them).
>
>
> In my case, my decoder works in two sub-stages:
> Lookup the internal opcode numbers, form type, ... based on instruction
> bits;
> Unpack the instruction as given in the instruction form.
<
My decoding strategy is to look at 6 bits (the Major OpCode) and use this to
determine the instruction format {1-operand, 2-operand, 3-operand, memory ref,
shifts, and predicate}. Registers can be routed directly to RF decode ports
in smaller implementations.
<
DECODE (after PARSE) then reads the RF, performs forwarding, binds
CoIssued instructions, calculates branch target addresses, and controls
the FETCH address multiplexer. Operands and OpCodes are routed to
appropriate calculation units.
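
A sketch of the first step as a table lookup (formats, bit positions and table contents are illustrative, not the actual My 66000 encoding):

  #include <stdint.h>

  enum fmt { F_1OP, F_2OP, F_3OP, F_MEM, F_SHIFT, F_PRED };

  /* 64-entry table indexed by the 6-bit major opcode; contents invented
     for illustration.  Unlisted entries default to F_1OP (value 0).   */
  static const uint8_t fmt_table[64] = {
      [0x00] = F_MEM,  [0x01] = F_MEM,   [0x08] = F_3OP,
      [0x10] = F_2OP,  [0x18] = F_SHIFT, [0x38] = F_PRED,
  };

  static enum fmt decode_format(uint32_t insn)
  {
      return (enum fmt)fmt_table[insn >> 26];   /* top 6 bits assumed */
  }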
>
> The former seems to primarily depend on which and how many bits are
> considered in the lookup, and how many bits of output are produced. It
> does not seem to care about how many patterns are assigned within this
> space.
>
> The latter case mostly cares about how many possible ways exist to
> decode the instruction.
>
> Overall decoder cost is moderate. For a 3-wide decoder with both BJX2
> and RISC-V support, I am looking at around 4K LUTs. This is less than
> either the Lane 1 ALU (6K LUT) or the L1 D$ (8K LUT).
>
>
> Though, the Lane 2/3 ALUs are a lot smaller, but they deal with less.
> They only do integer arithmetic.
>
> Lane 1 ALU does:
> Artihmetic ops;
> Compare;
> Various format conversions;
> CLZ / CTZ;
> UTX1 decode;
> ...

Re: RISC-V vs. Aarch64

<7c3e3c44-b788-4d26-bf3d-c54671f10119n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22448&group=comp.arch#22448

X-Received: by 2002:a05:622a:389:: with SMTP id j9mr10242451qtx.504.1640470989346;
Sat, 25 Dec 2021 14:23:09 -0800 (PST)
X-Received: by 2002:a9d:bf7:: with SMTP id 110mr8311288oth.94.1640470989032;
Sat, 25 Dec 2021 14:23:09 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 25 Dec 2021 14:23:08 -0800 (PST)
In-Reply-To: <sq816p$9ak$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7149:e490:e0cd:51ab;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7149:e490:e0cd:51ab
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sq816p$9ak$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7c3e3c44-b788-4d26-bf3d-c54671f10119n@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 25 Dec 2021 22:23:09 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 181
 by: MitchAlsup - Sat, 25 Dec 2021 22:23 UTC

On Saturday, December 25, 2021 at 3:07:40 PM UTC-6, BGB wrote:
> On 12/24/2021 4:06 PM, MitchAlsup wrote:
> > On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
> >> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> > <snip>
> > Instruction fusing is worthwhile on a 1-wide machine with certain pipeline
> > organization to execute 2 instructions per cycle and a few other tricks.
> > <
> > In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
> > a) any instruction and an unconditional branch
> > b) any result delivering instruction and a conditional branch or predicate
> > consuming the result.
> > c) Store instruction followed by a calculation instruction such that the
> > total number of register reads is <= 3.
> > d) any non-branch instruction followed by an unconditional branch
> > instruction allow the branch to be taken at execution cost of ZERO
> > cycles.
> > <
> > An Excel spreadsheet shows up to 30% performance advantage over a
> > strict 1-wide machine. {This is within spitting distance of the gain of
> > In Order 2-wide over In Order 1-wide with less than 20% of the cost}
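<
For illustration, pairing rules like (a)-(d) above could be checked at decode
time roughly as in the C sketch below. The descriptor fields are invented, and
the rule-(b) dependence check ("consuming the result") is omitted for brevity:

    #include <stdbool.h>

    /* Hypothetical decoded-instruction summary; field names invented. */
    struct uop {
        bool is_branch, is_cond_branch, is_store, writes_reg;
        int  reg_reads;                 /* register read ports consumed */
    };

    /* Can instructions i0 and i1 (in program order) co-issue? */
    static bool can_coissue(const struct uop *i0, const struct uop *i1)
    {
        if (i1->is_branch && !i1->is_cond_branch)    /* rules (a) and (d) */
            return true;
        if (i0->writes_reg && i1->is_cond_branch)    /* rule (b), no dep check */
            return true;
        if (i0->is_store && !i1->is_branch &&
            i0->reg_reads + i1->reg_reads <= 3)      /* rule (c) */
            return true;
        return false;
    }
<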
> OK. My existing core does not support (or perform) any sort of
> instruction fusion.
>
>
> Things like instruction fetch / PC-step are determined in the 'IF'
> stage, but this is done in terms of instruction bits and mode flags,
> rather than recognizing any instructions in particular.
>
> It seems like one could fetch a block of instructions and then include
> an "offset" based on how many instructions were executed from the prior
> cycle; however this seems more complicated than the current approach,
<
Once your instruction is no longer fixed at 32 bits, you find yourself asking:
well, how many bits should I fetch every cycle? And you find that 4 words
are appropriate for a low-end machine, 8-16 for a larger machine. In other
words, you have committed yourself to having an instruction buffer.
<
Once you have an instruction buffer, you can decode every instruction in
the buffer and develop a unary pointer to the next instruction (30 total gates,
4 gates of delay). So, now if you pick a given instruction you can pick a
second instruction 6 gates later, pick instructions 3 and 4 four gates later,
pick 5,6,7,8 another 4 gates later, ... You simply skip over the constants
during PARSE; you access the constants from something that looks surprisingly
like an RF in DECODE, performing forwarding, ...
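
A rough software analogue of that boundary-finding step is below; parcels_of()
is a hypothetical length decoder, and the real PARSE logic works on all parcels
in parallel rather than serially:

    #include <stdint.h>

    /* Hypothetical: number of 32-bit parcels this instruction occupies,
       counting any inline constants. */
    extern unsigned parcels_of(uint32_t first_parcel);

    /* Record where each instruction starts in the buffer, skipping the
       constant parcels; returns how many instructions were found. */
    static unsigned find_starts(const uint32_t *ib, unsigned nparcels,
                                unsigned *starts, unsigned max)
    {
        unsigned n = 0, i = 0;
        while (i < nparcels && n < max) {
            starts[n++] = i;
            i += parcels_of(ib[i]);     /* skip over any constants */
        }
        return n;
    }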
<
> and for a machine with 96 bit bundles would likely require fetching in
> terms of 256 bit blocks rather than the 96 bits corresponding to the
> current bundle (since one likely wouldn't know the exact position of the
> bundle, or how far to adjust PC, until ID1 or similar).
<
Once you have an instruction buffer, you can scan the IB forward looking for
branches; so, you can fetch the target prior to executing the branch and
perform 1-cycle branches (sometimes zero-cycle branches/calls/returns).
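
A minimal sketch of that scan-ahead idea, with hypothetical is_branch() and
branch_target() helpers standing in for the decode logic:

    #include <stdbool.h>
    #include <stdint.h>

    extern bool     is_branch(uint32_t parcel);               /* hypothetical */
    extern uint64_t branch_target(uint64_t pc, uint32_t parcel);

    /* Find the first branch in the buffer and return its target so
       FETCH can start on it before the branch reaches execute. */
    static bool find_branch_target(const uint32_t *ib, unsigned nparcels,
                                   uint64_t pc, uint64_t *target)
    {
        for (unsigned i = 0; i < nparcels; i++) {
            if (is_branch(ib[i])) {
                *target = branch_target(pc + 4u * i, ib[i]);
                return true;            /* start fetching *target now */
            }
        }
        return false;
    }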

>
>
>
> Trying to port the Linux kernel to build on BGBCC is likely no-go.
>
> Though, even if I did so, most existing Linux on RISC-V ports assume
> RV64G or RV64GC, which is a bit out-of-reach at present.
> >> would probably leave out full Compare-and-Branch instructions, and
> >> instead have a few "simpler" conditional branches, say:
> >> BEQZ reg, label //Branch if reg==0
> >> BNEZ reg, label //Branch if reg!=0
> >> BGEZ reg, label //Branch if reg>=0
> >> BLTZ reg, label //Branch if reg< 0
> >>
> >> While conceptually, this doesn't save much, it would be cheaper to
> >> implement in hardware.
> > <
> > Having done both, I can warn you that your assumption is filled with badly
> > formed misconceptions. From a purity standpoint you do have a point;
> > from a gate count perspective and a cycle time perspective you do not.
> > <
> It would on average cost more clock cycles, but it seems like:
> Detect 0 on input (already have this to detect Inf);
> Look at sign bit;
> Initiate branch signal.
>
> Should be cheaper (in terms of latency) than:
> Subtract values;
> Use carry, zero, and sign bits of result to determine branch;
> Initiate branch signal.
<
Yes, that is the trick here. The compare-branch does not need a full carry
chain--it only needs a degenerate carry chain--and this degenerate chain
is at most ½ the gate delay of a full carry chain. So, the logic ends up as
an XOR-greatest-bit-AND structure.
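
For the unsigned case, that structure can be written out in C as a software
rendering (not the hardware netlist; __builtin_clzll is GCC/Clang-specific):

    #include <stdint.h>

    /* a < b (unsigned) without forming a full difference: XOR marks the
       bits where a and b differ, the most significant such bit decides
       the comparison, and a single AND against b reads out the answer. */
    static int ult(uint64_t a, uint64_t b)
    {
        uint64_t diff = a ^ b;
        if (diff == 0)
            return 0;                                    /* equal */
        uint64_t msb = 1ull << (63 - __builtin_clzll(diff));
        return (b & msb) != 0;                           /* b has the 1 there */
    }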
<
Now, if (when) you CoIssue CMP and BB, you set up the CMP logic so you
can determine whether the branch will be taken prior to the point you can
generate the output of the CMP instruction itself. You use the intermediate
state of the CMP instruction to drive the BB instruction. And not only do you
execute 2 instructions in 1 cycle, you have already fetched the target
instruction, so you are ready to insert either the sequential or the target
instruction into DECODE in that same clock.
>
<snip>
> >>
> >> Load/Store allows decent performance from a simplistic pipeline.
> > <
> > Ahem: 4-wide machines do not have simplistic pipelines, so the burden is
> > only on the lesser implementations and nothing on the higher performance
> > ones.
<
> I am thinking mostly of lower-end implementations.
>
> In theory, one could split the complex addressing modes into multiple
> uOps or similar, and then execute them end-to-end. But, otherwise, one
> has a problem.
<
For a variety of reasons, you don't want to do that. FORTRAN code will be
accessing the stack interspersed with accesses to common block arrays.
So, you will want:
<
LD Ri,[SP+offset]
LD Rx,[IP+Rj<<3+offset]
<
To be able to "run" down the memory pipeline back-to-back continuously.
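<
An illustrative C analogue of that access pattern (a real compiler may keep
the accumulator in a register; volatile is used here only to force the stack
slot for the sake of the example):

    /* The accumulator lives at [SP+offset]; the statically allocated
       array is reached via [IP+Rj<<3+disp]-style accesses, and the two
       interleave in the inner loop, so both load forms must stream
       through the memory pipeline back to back. */
    double table[1024];                   /* plays the role of a COMMON block */

    double sum_table(int n)
    {
        volatile double acc = 0.0;        /* forced into a stack slot */
        for (int j = 0; j < n; j++)
            acc += table[j];              /* [IP + j<<3 + disp] style load */
        return acc;
    }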
<
<snip>
> > More than you allude to.
> Possibly.
>
>
> In general, code needs to keep values in the correct form because there
> are cases where it matters:
> Load/Store ops ended up using 33 bit displacements;
> Some operations always operate on full 64-bit inputs;
> ...
>
> Going the x86-64 / A64 route effectively requires doubling up nearly
> every operation with both 32-bit and 64-bit forms; rather than leaving
> most of the ISA as 64-bit except in cases where 64b is too costly and/or
> 32-bit semantics are required (such as to make C code work as expected).
>
>
> One thing I have observed is that one can get wonky results from C code
> if 'int' values are allowed to go outside of the expected range, more so
> when load/store gets involved (the compiler can save and reload a value
> and then have the value differ).
<
My 66000 has instructions to smash values in registers back into their prescribed
ranges:
SL Rchar,Rlong,<8:0>
SL Rhalf,Rlong,<16:0>
SL Rword,Rlong,<32:0>
<
But these are degenerate subsets of real bit manipulation instructions.
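<
In C terms, the rough equivalent is a narrowing cast and re-widen, assuming
the three SL forms above sign-extend from bit 7/15/31 (an assumption, not
something stated here):

    #include <stdint.h>

    /* Smash a 64-bit register value back into the prescribed range. */
    static int64_t smash8 (int64_t x) { return (int8_t) x; }
    static int64_t smash16(int64_t x) { return (int16_t)x; }
    static int64_t smash32(int64_t x) { return (int32_t)x; }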
>
> So, I ended up with operations like "ADDS.L" and "ADDU.L" whose basic
> sole purpose is to do ADD and similar in ways which produce results
> which wrap in the expected ways in the case of integer overflow.
>
> ...

Re: RISC-V vs. Aarch64

<sq8grs$qfc$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22454&group=comp.arch#22454

 by: Ivan Godard - Sun, 26 Dec 2021 01:34 UTC

On 12/25/2021 1:45 PM, MitchAlsup wrote:
> On Saturday, December 25, 2021 at 12:52:41 PM UTC-6, Ivan Godard wrote:
>> On 12/25/2021 10:12 AM, MitchAlsup wrote:
>>> On Saturday, December 25, 2021 at 5:14:03 AM UTC-6, Thomas Koenig wrote:
>>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>>>> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>>>>>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
>>>>> <snip>
>>>>>>
>>>>>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>>>>>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>>>>>> both modes are "very common" in practice.
>>>>> <
>>>>> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
>>>>> it is expressive (more to FORTRAN needs than C needs) and it saves
>>>>> instruction space. I provided ONLY the above two (2) in M88K and
>>>>> later when doing x86-64 found out the error of my ways.
>>>>> <
>>>>> The least one can add is [Rbase+Rindex<<scale]
>>>> Depending on how pressed for bits in opcodes one is, the scale
>>>> could also be implied in the size of the data, so a 64-bit word
>>>> would have either scale = 0 or scale == 3. I think HP-PA did that.
>>> <
>>> Without "+disp" I did this in M88K.
>>>>
>>>> This would cover the most common case of linear array access.
>>>>>>
>>>>>> More elaborate modes quickly run into diminishing returns territory; so,
>>>>>> things like auto-increment and similar are better off left out IMO.
>>>>> <
>>>>> Agreed except for [Rbase+Rindex<<scale+disp]
>>>> In your ISA, you specified the disp as either a 32 or a 64-bit quantity.
>>>> So, any offset in the range between -2^31 to 2^31-1 needs two 32-bit
>>>> words.
>>> <
>>> My 66000 has:
>>> a) [Rbase+disp16] that fits in one 32-bit word
>>> b) [Rbase+Rindex<<scale] that fits in one 32-bit word
>>> c) [Rbase+Rindex<<scale+disp32] that fits in two 32-bit words
>>> d) [Rbase+Rindex<<scale+disp64] that fits in three 32-bit words
>>> <
>> vs:
>> [Rbase]
>> [Rbase+disp]
>> [Rbase+Rindex(<<scale)]
>> [Rbase+Rindex(<<scale)+disp]
>> <all above + Rpredicate>
>>
>> where disp is 0/1/2/4 bytes, R's are belt width (4 bits on Silver),
>> scale is one bit, opCode varies by slot - 11 bits typical. Total 16-56 bits.
>>
>> I don't see the use of disp64, though I suppose it falls out for free
>> from your encoding of wide literals.
> <
> There are a couple of other things:
> Rbase == 0 means use IP as the base register {Dispψ, Disp32 }

Which removes an address register. vs: Rbase can be a ptr off the belt
or any of 8 specRegs

> Rindex == 0 means index = 0.

vs no Rindex at all (different main opCode)

> Rbase == 0 AND Disp64 means the 64-bit constant is an absolute address
> ....{and can be indexed if desired}.
> <
> So, DISP64 is used for absolute addressing anywhere in the address space.

Absolute addressing is useless in PIC code

>>
>> BTW, virtually all cases in our corpus are either disp=0 or disp=8, i.e.
>> 16 or 24 bits total (plus index/scale/predicate if any).
> <
> I am seeing 2% DISP64

For what, other than absolute address? Which you shouldn't :-)

> Dispψ means Disp field does not exist in instruction.

Shows as "disp<Greek phi>" for my reader - ???

> <
> A surprising amount of STs contain constants to be deposited in memory
> <
> ST #5,[SP+32]

That's cute :-)
