Re: RISC-V vs. Aarch64

Re: RISC-V vs. Aarch64

<sq9j91$aog$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=22456&group=comp.arch#22456

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-eb03-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sun, 26 Dec 2021 11:22:09 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sq9j91$aog$1@newsreader4.netcologne.de>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sq816p$9ak$1@dont-email.me>
<7c3e3c44-b788-4d26-bf3d-c54671f10119n@googlegroups.com>
Injection-Date: Sun, 26 Dec 2021 11:22:09 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-eb03-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:eb03:0:7285:c2ff:fe6c:992d";
logging-data="11024"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sun, 26 Dec 2021 11:22 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> FORTRAN code will be
> accessing the stack interspersed with accesses to common block arrays.

There are several types of Fortran applications that still use
COMMON blocks these days. You can assume that a new program
no longer uses COMMON; it is too error-prone and clumsy.

There are programs with a more-or-less fixed complexity, which ran
for minutes or hours on the mainframes and minis of the past and which
now run in seconds on a low-powered laptop, with most of the time
spent on program startup. I maintain a couple of these at work,
and nobody cares about efficiency there.

Then, there are larger packages from the past which are too
expensive (money- or hour-wise) to rewrite and which depend on COMMON
blocks for what passed for dynamic memory management in pre-Fortran
90 days. The typical style there is to pass around workspaces as
extra arguments, so there is little direct access to the COMMON block.
Look at Lapack for examples of the calling sequence.

New code is written with more concern for OpenMP, OpenACC, Coarrays,
and the object-oriented features of Fortran, and uses allocatable
variables extensively.

What you wrote was certainly true 30 years ago, maybe 20 years ago.
It should no longer be a consideration for a new architecture.

Re: RISC-V vs. Aarch64

<osednQpEs8AXzVX8nZ2dnUU78IvNnZ2d@supernews.com>

https://www.novabbs.com/devel/article-flat.php?id=22457&group=comp.arch#22457

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.de!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!border2.nntp.ams1.giganews.com!nntp.giganews.com!buffer2.nntp.ams1.giganews.com!nntp.supernews.com!news.supernews.com.POSTED!not-for-mail
NNTP-Posting-Date: Sun, 26 Dec 2021 05:22:50 -0600
Sender: Andrew Haley <aph@zarquon.pink>
From: aph...@littlepinkcloud.invalid
Subject: Re: RISC-V vs. Aarch64
Newsgroups: comp.arch
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
User-Agent: tin/1.9.2-20070201 ("Dalaruan") (UNIX) (Linux/4.18.0-305.25.1.el8_4.x86_64 (x86_64))
Message-ID: <osednQpEs8AXzVX8nZ2dnUU78IvNnZ2d@supernews.com>
Date: Sun, 26 Dec 2021 05:22:50 -0600
Lines: 18
X-Trace: sv3-fuQLPvuiq0RdxPkXsFEdMZbBLanaRNbThU7liS8dE6tv3WZg9JhltzJyO/OgF2Rkep4IHvuUwGdO/gl!t1tR9R+CC8KcybzlZ4RTa60OUhR14fOo+K9y4zJzbRANde8gsOiARRRuqo2tij+STrZtK+JvdMyb!5irU3N4u
X-Complaints-To: www.supernews.com/docs/abuse.html
X-DMCA-Complaints-To: www.supernews.com/docs/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
X-Original-Bytes: 1716

by: aph...@littlepinkcloud.invalid - Sun, 26 Dec 2021 11:22 UTC

BGB <cr88192@gmail.com> wrote:
>
> If I were designing a "similar" ISA to RISC-V (with no status flags), I
> would probably leave out full Compare-and-Branch instructions, and
> instead have a few "simpler" conditional branches, say:
> BEQZ reg, label //Branch if reg==0
> BNEZ reg, label //Branch if reg!=0
> BGEZ reg, label //Branch if reg>=0
> BLTZ reg, label //Branch if reg< 0
>
> While conceptually, this doesn't save much, it would be cheaper to
> implement in hardware.

Just an aside: A64 has all of these, in addition to conditional
branches depending on the flags. I didn't see any mention of that in
Celio et al.

Andrew.

Re: RISC-V vs. Aarch64

<90c858e0-cadd-4484-8ece-3246b34a9741n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22467&group=comp.arch#22467

X-Received: by 2002:a05:620a:199a:: with SMTP id bm26mr10568958qkb.542.1640550051194;
Sun, 26 Dec 2021 12:20:51 -0800 (PST)
X-Received: by 2002:a05:6808:4d2:: with SMTP id a18mr10876490oie.99.1640550051099;
Sun, 26 Dec 2021 12:20:51 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 26 Dec 2021 12:20:50 -0800 (PST)
In-Reply-To: <sq8grs$qfc$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:81ac:9808:f6a0:d948;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:81ac:9808:f6a0:d948
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sq6udp$hj3$1@newsreader4.netcologne.de>
<5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com> <sq7p9m$s3o$1@dont-email.me>
<f298dcfe-49ad-4a92-8e24-78b290897b0en@googlegroups.com> <sq8grs$qfc$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <90c858e0-cadd-4484-8ece-3246b34a9741n@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 26 Dec 2021 20:20:51 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 17

by: MitchAlsup - Sun, 26 Dec 2021 20:20 UTC

On Saturday, December 25, 2021 at 7:34:56 PM UTC-6, Ivan Godard wrote:
> On 12/25/2021 1:45 PM, MitchAlsup wrote:

> > Dispψ means Disp field does not exist in instruction.
<
> Shows as "disp<Greek phi>" for my reader - ???
<
The other character I could have used was: ϕ
<
> > <
> > A surprising amount of STs contain constants to be deposited in memory
> > <
> > ST #5,[SP+32]
<
> That's cute :-)
<
Brian gets credit for adding this.

Re: RISC-V vs. Aarch64

<sqaivi$e79$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22468&group=comp.arch#22468

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sun, 26 Dec 2021 14:23:11 -0600
Organization: A noiseless patient Spider
Lines: 318
Message-ID: <sqaivi$e79$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 26 Dec 2021 20:23:14 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b39450aad085f07d940960b62d661ee6";
logging-data="14569"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/nMRh/zewBHMZlWzuWN9Di"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:PipPKaomkeEC8Ec9nTDFcWAnK1Y=
In-Reply-To: <2021Dec25.181011@mips.complang.tuwien.ac.at>
Content-Language: en-US

by: BGB - Sun, 26 Dec 2021 20:23 UTC

On 12/25/2021 11:10 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>> complex: A64 has more addressing modes and additional instructions
>>> like load pair and store pair; in particular, the A64 architects
>>> seem to have had few concerns about the register read and write
>>> ports needed per instruction; E.g., a store-pair instruction can
>>> need four read ports, and a load pair instruction can need three
>>> write ports (AFAIK).
>>>
>>
>> These are less of an issue if one assumes a minimum width for the core.
>> If the core is always at least 3-wide, this isn't an issue.
>
> Why? Sure, you could have one instruction port that supports up to
> four reads and the others only support two, and one that supports
> three writes, and the others only support one, but that would still be
> 8 register reads and 5 writes per cycle, and my impression was that
> having many register ports is really expensive.
>

If you have a multi-lane core, it already has the needed register ports.
You then spread a single instruction across 2 or 3 lanes.

Say, for example, in my core:
Each instruction natively has 2 read ports, 1 write port;
There are also instructions which read from more registers.

If you use an instruction which reads from the destination register,
this automatically eats Lane 3. If you use an instruction which operates
on 128-bit values, it eats all 3 lanes.

This was also partly why I ended up going primarily with a 3 lane
design. It allowed bundling a lot more cases in primarily 2-lane
operation (the 3rd lane mostly a source of spare register ports and the
occasional ALU op); and it had enough register ports to allow for
128-bit SIMD.
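As a rough illustration of spreading one instruction across lanes, here is a minimal C sketch. It assumes (as a simplification of the description above) that every lane contributes exactly 2 read ports and 1 write port, and that an instruction occupies as many lanes as it needs ports. The helper `lanes_needed` is hypothetical, and it ignores BJX2-specific rules such as 128-bit operations occupying all three lanes.

```c
#include <assert.h>

/* Assumed per-lane register-file ports (simplified model). */
enum { READS_PER_LANE = 2, WRITES_PER_LANE = 1 };

/* Hypothetical helper: number of lanes an instruction occupies if it
   borrows the read/write ports of neighbouring lanes. */
int lanes_needed(int reads, int writes)
{
    assert(reads >= 0 && writes >= 0);
    int by_reads  = (reads + READS_PER_LANE - 1) / READS_PER_LANE;    /* ceil */
    int by_writes = (writes + WRITES_PER_LANE - 1) / WRITES_PER_LANE; /* ceil */
    int lanes = by_reads > by_writes ? by_reads : by_writes;
    if (lanes == 0)
        lanes = 1;  /* every instruction occupies at least one lane */
    return lanes;
}
```

Under these assumptions, a plain 2R/1W op takes one lane, a store pair counted as 4 reads takes two, a load pair with writeback counted as 3 writes takes all three, and a post-increment load (1 read, 2 writes) takes two lanes, the same as a separate load and add.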

Meanwhile, if I do a 1-wide subset of my ISA, a number of instructions
are no longer usable because this subset is limited to a 3R+1W register
file (going to 2R+1W would no longer allow for register-indexed load/store).

Seemingly, despite the differences, for the CPU core itself, with a
similar baseline feature-set (MMU and FPU, ...), the 3-wide core is only
~ 150% the size of the 1-wide core; but somewhat more capable (and also
faster).

A bigger issue ATM seems to be that, with 64B L2 transfers, ~15K LUTs
go to the DDR controller and L2 cache (the L2 also eats most of the
Block RAM).

However, it increases memory bandwidth, and allows for putting the video
memory in external DRAM (this only works effectively if the bandwidth is
high enough to be able to keep up with the raster; otherwise the screen
contents turn into garbage).

The VRAM module operates as its own cache, working in terms of 256-bit
tiles (8x8|4x4 pixels), and in turn takes another 4K LUTs. It also
contains logic to mimic the original MMIO interface.

(So, DDR+L2+VRAM is already ~ 1/3 of the total LUT budget).
Throw in another 2K for the PCM and FM Synth audio modules.

....

So, it is sorta like at present:
XC7A100: Can fit a 3-wide core, and a second 1-wide core;
Or, can use narrower DDR/L2, and fit 2x 3-wide cores,
but, need to omit a few features to fit 2x 3-wide.
XC7S50: Can fit a 3-wide core, need to reduce some settings.
XC7S25: Can shoe-horn in a 1-wide (RISC style) core.
Though, an ISA more like RV32I makes more sense on an XC7S25.

Don't have a board ATM with an XC7A35 (such as an Arty A7-35T); should
be able to fit a 1-wide core though as it is a little bigger than the
XC7S25.

The Arty A7-100T is basically the same FPGA as the Nexys A7 which I am
still using (just with a bigger RAM module, but fewer built-in
connectors; I really like having built-in VGA/PS2/SDcard slot though).

> There is also the option of reducing register port requirements
> through the bypasses, but what do you do if there is not enough
> bypassed data?
>

My ISA has side-channels, but mostly used for control registers and a
few special registers affected by the ISR mechanism. The CRs are mostly
tied up with internal state within the core.

A few could have arguably been better as GPRs; for example, both ARM and
RISC-V had put the Global and Link registers in GPR space. Personally, I
would assume sticking with one or two dedicated link registers.

Tried to come up with something here, but it ended up fairly similar to
the RISC-V layout (well, except that I would assume making the ABI
register assignments more contiguous).

>> Instruction fusion seems like a needless complexity IMO. It would be
>> better if the ISA can be "dense" enough to make the fusion mostly
>> unnecessary.
>
> Well, Celio's position is that RISC-V gets code density through the C
> extension and uop density through instruction fusion.
>

It can do this, but it would be better if fusion were unnecessary.

It likely punishes low-end implementations (with poor performance) more
than it helps (by saving LUTs/gates).

Like, it is a design which helps "very low end" implementations, at the
expense of other "moderately low-end" implementations, which need to
implement "higher end" functionality to compensate for deficiencies.

I generally agree with the 'C' extension though.
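To make "uop density through instruction fusion" concrete, here is a minimal C sketch of a decoder-side check that fuses a RISC-V style `lui rd, hi` / `addi rd, rd, lo` pair into a single load-32-bit-constant uop. The instruction struct, the opcode names and `try_fuse_lui_addi` are hypothetical: this is not a real RISC-V decoder, it ignores encodings and RV64 sign-extension, and it only shows the kind of pattern matching a fusing front end has to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy instruction representation (hypothetical, not real encodings). */
enum opcode { OP_LUI, OP_ADDI, OP_LI32 /* fused macro-op */ };

struct insn {
    enum opcode op;
    int rd, rs1;
    int32_t imm;  /* LUI: 20-bit upper immediate; ADDI: 12-bit signed */
};

/* Returns 1 and fills *out if "lui rd,hi ; addi rd,rd,lo" can be fused. */
int try_fuse_lui_addi(const struct insn *a, const struct insn *b,
                      struct insn *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    if (a->op != OP_LUI || b->op != OP_ADDI)
        return 0;
    /* The ADDI must read and overwrite the LUI result; otherwise the
       intermediate value stays architecturally visible. */
    if (b->rs1 != a->rd || b->rd != a->rd)
        return 0;
    /* Unsigned arithmetic: a negative ADDI immediate wraps correctly. */
    uint32_t value = ((uint32_t)a->imm << 12) + (uint32_t)b->imm;
    out->op = OP_LI32;
    out->rd = a->rd;
    out->rs1 = 0;
    out->imm = (int32_t)value;  /* two's-complement wrap assumed */
    return 1;
}
```

When the check succeeds, the pair occupies one uop slot; when the `addi` writes a different register, the `lui` result stays live and the pair cannot be fused.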

> In the past we have seen many cases where it was more beneficial to
> solve things in microarchitecture rather than architecture: caches
> (rather than explicit fast memory), OoO rather than EPIC, dynamic
> branch prediction rather than delay slots and static branch
> prediction, bypasses rather than transport-triggered architecture,
> microarchitectural memory disambiguation rather than IA-64's ALAT.
>
> Some of the benefits of putting features in the microarchitecture are
> that
>
> 1) You can easily remove them (even within one implementation, by
> using the appropriate chicken bit), while architectural features are
> forever.
>
> 2) When you add them, existing code can benefit from them, while
> architectural features need at least a recompilation.
>
> RISC seemed to be the converse case of exposing things in the
> architecture that, say, a VAX implementation does in the
> microarchitecture, but one could also see VAX as another case of
> exposing the microarchitecture in the architecture: the
> microarchitecture was microcoded (because the microcode store was
> faster than main memory at the time; but that's actually another
> example of explicit fast memory mentioned above), so architects
> designed instruction sets that made heavy use of microcode.
>

As can be noted, I am coming at it from the perspective of having ended
up with a DSP flavored hybrid VLIW/EPIC style ISA (albeit with
variable-sized bundles; more like Xtensa or Hexagon).

In my case, if you write:
OP2 | OP1
Or:
OP3 | OP2 | OP1

It is basically telling the assembler to encode these instructions to
run in parallel...

When compiling C code, the compiler is more speculative, and tries to
shuffle and bundle instructions, but "kinda sucks at doing so".

>> If I were designing a "similar" ISA to RISC-V (with no status flags), I
>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware. Relative compares could then use compare
>> instructions:
>> CMPx Rs, Rt, Rd
>> Where:
>> (Rs==Rt) => 0;
>> (Rs> Rt) => 1;
>> (Rs< Rt) => -1.
>>
>> Though, one issue with a plain SUB is that it would not work correctly
>> for comparing integer values the same width as the machine registers (if
>> the difference is too large, the intermediate value will overflow).
>
> Take a look at how Mitch Alsup solved this in the 88000.
>

It does work as-is; it would just be nicer if it were cheaper and had
less impact on timing.

But, I think the idea was that one only needs to figure out the
resultant sign and carry bits, rather than necessarily waiting for the
full result.

May or may not look into this.
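The overflow problem and the "sign and carry bits" idea can be illustrated in C. This is a minimal sketch assuming 64-bit two's-complement registers: `cmp3` produces the -1/0/1 result described for the proposed CMPx, `naive_lt` takes only the sign bit of the wrapped difference (which fails when the subtraction overflows), and `flags_lt` recovers signed less-than from the sign (N) and overflow (V) of the subtraction, the classic N != V rule. For unsigned compares the borrow plays the same role. All names are illustrative and not taken from any ISA.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way compare: 0 if a == b, 1 if a > b, -1 if a < b. */
static int cmp3(int64_t a, int64_t b)
{
    return (a > b) - (a < b);
}

/* Naive: sign bit of the wrapped 64-bit difference (no UB: unsigned). */
static int naive_lt(int64_t a, int64_t b)
{
    uint64_t diff = (uint64_t)a - (uint64_t)b;
    return (int)(diff >> 63);
}

/* Signed a < b from only the sign (N) and overflow (V) of a - b. */
static int flags_lt(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t diff = ua - ub;
    uint64_t n = diff >> 63;
    /* Overflow: operands differ in sign and result sign differs from a. */
    uint64_t v = ((ua ^ ub) & (ua ^ diff)) >> 63;
    return (int)(n ^ v);
}

/* Worked example: INT64_MAX vs -1. The true answer is "greater". */
void compare_example(void)
{
    int64_t a = INT64_MAX, b = -1;
    assert(cmp3(a, b) == 1);
    assert(naive_lt(a, b) == 1);  /* wrong: the difference overflowed */
    assert(flags_lt(a, b) == 0);  /* correct */
}
```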

>> However, besides "cheap core", it is also nice to be able to have "fast
>> LZ77 decoding" and similar, which is an area where misaligned memory
>> access pays off.
>
> If you know that you are doing misaligned accesses, composing a
> possibly misaligned load from two loads and a combining instruction
> (or a few) looks like a simpler approach.
>

It can be faked easily enough, but as can be noted, the cycle cost of
faking it is higher than that of native hardware support.
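For reference, the software approach being discussed (two aligned loads plus a combining step) can be sketched in C. This assumes a little-endian machine that only supports aligned 8-byte loads, modelled here by the hypothetical helper `load_aligned64`; the combine step is the shift-and-OR that an explicit instruction sequence would perform. It reads the whole second aligned word, so the buffer must be readable up to the next 8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models an aligned 8-byte little-endian load (hypothetical helper). */
static uint64_t load_aligned64(const uint8_t *mem, size_t addr)
{
    assert(addr % 8 == 0);  /* the only kind of load the machine allows */
    uint64_t w = 0;
    for (int i = 7; i >= 0; i--)
        w = (w << 8) | mem[addr + (size_t)i];
    return w;
}

/* Possibly misaligned 8-byte load built from two aligned loads. */
uint64_t load_misaligned64(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)7;            /* round down to alignment */
    unsigned shift = (unsigned)(addr & 7) * 8;
    uint64_t lo = load_aligned64(mem, base);
    if (shift == 0)
        return lo;                              /* aligned: one load suffices */
    uint64_t hi = load_aligned64(mem, base + 8);
    /* Low bytes come from the first word, the rest from the next one. */
    return (lo >> shift) | (hi << (64 - shift));
}
```

Even in this form, the combine step depends on both loads, so an in-order core without hardware support pays for the extra instructions plus any load-use interlock.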


Re: RISC-V vs. Aarch64

<2021Dec27.091535@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=22479&group=comp.arch#22479

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Mon, 27 Dec 2021 08:15:35 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 121
Message-ID: <2021Dec27.091535@mips.complang.tuwien.ac.at>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at> <sqaivi$e79$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="49bb987570c8ec7ca42a64081be61d96";
logging-data="13400"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/NNLQi7upnHWglZCCSWZyr"
Cancel-Lock: sha1:C1PHIumTypftnFKxhClIfRmnSOo=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Mon, 27 Dec 2021 08:15 UTC

BGB <cr88192@gmail.com> writes:
>On 12/25/2021 11:10 AM, Anton Ertl wrote:
>> BGB <cr88192@gmail.com> writes:
>>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>>> complex: A64 has more addressing modes and additional instructions
>>>> like load pair and store pair; in particular, the A64 architects
>>>> seem to have had few concerns about the register read and write
>>>> ports needed per instruction; E.g., a store-pair instruction can
>>>> need four read ports, and a load pair instruction can need three
>>>> write ports (AFAIK).
>>>>
>>>
>>> These are less of an issue if one assumes a minimum width for the core.
>>> If the core is always at least 3-wide, this isn't an issue.
>>
>> Why? Sure, you could have one instruction port that supports up to
>> four reads and the others only support two, and one that supports
>> three writes, and the others only support one, but that would still be
>> 8 register reads and 5 writes per cycle, and my impression was that
>> having many register ports is really expensive.
>>
>
>If you have a multi-lane core, it already has the needed register ports.
>You then spread a single instruction across 2 or 3 lanes.

Ok, but then these instructions reduce the utilization of the lanes.
A pre/post-increment load does not provide increased throughput over a
load and an add on such an implementation.

>> Well, Celio's position is that RISC-V gets code density through the C
>> extension and uop density through instruction fusion.
....
>Like, it is a design which helps "very low end" implementations, at the
>expense of other "moderately low-end" implementations, which need to
>implement "higher end" functionality to compensate for deficiencies.

Yes. The question is how much extra complexity this functionality
costs for moderately low-end implementations.

One can consider this the cost of having a wide-range architecture
rather than an architecture designed for a narrower range. In the
case of A64 the idea probably was also that they have A32/T32 for the
very low end, so they were able to design A64 for a narrower range.

>> If you know that you are doing misaligned accesses, composing a
>> possibly misaligned load from two loads and a combining instruction
>> (or a few) looks like a simpler approach.
>>
>
>It can be faked easily enough, but as can be noted, the cycle cost of
>faking it is higher than that of native hardware support.

Why should that be so? The work necessary for doing it explicitly is
the same. At least for the load case. For unaligned stores I expect
that a write-combining store buffer can do some things that are pretty
far from what any ISA that allows only aligned stores has done
explicitly yet; and current (possibly unaligned) stores are probably a
good interface to that hardware structure.

>In cases where one wants unaligned access on hardware which only does
>aligned access, this is where a keyword like "__unaligned" comes into
>play. This allows using fast (direct) access for aligned loads/stores,
>and a slower (faked) approach for misaligned load/store.

Yes, you would need to specify the unaligned accesses in programs.
However, given that most programs have been and are developed on
hardware where that has not been necessary, programmers are unlikely
to get that right, so allowing all memory accesses (exception: for
communicating between threads) to be unaligned is the way to go and
has won.

>In this case, both RISC-V and BJX2 assume unaligned load/store
>operations for basic types, so no issue here (and there is no direct
>equivalent of MOV.X in RISC-V).

What is MOV.X?

>>> In BJX2, I went with signed values being kept in sign-extended form, and
>>> unsigned values kept in zero-extended form
>>
>> That seems like a natural way to do it. However, I am not sure if
>> that works well in the presence of type mismatches between caller and
>> callee in production C code. At least I imagine that's the reason why
>> ABIs force one kind of extension, and then do the appropriate
>> extension for the defined type at the usage site if necessary.
>>
>
>As long as the function has a prototype, the call/return will coerce the
>types into the form expected by the called function (as usual).
>
>If no prototype is present, it follows traditional rules:
> Small integer types are promoted to their largest machine form;
> Float is promoted to Double;
> ...
>
>For missing prototypes, kinda ended up (mostly) going with 'long':
> C90 says 'int', but this truncates pointers, 'long' does not;
> Also, 'long' doesn't seem to break anything here.

Pointers would be a problem for the RISC-V and AMD64 approaches, too,
because the sign/zero-extension would trample over the high 32 bits.
The compiler also notices the problems with pointers, because
operations like dereferencing don't work on integer types.

I am more thinking of problems like having an int variable in one
function, and passing that to a separately compiled function that
expects an unsigned (and similar cases). On RISC-V the callee will
zero-extend the passed value, on AMD64 the caller will zero-extend the
passed value, in your approach the caller will sign-extend the value
and pass it, and the callee will assume that the value has been
zero-extended.
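The mismatch scenario can be modelled in C by treating a 64-bit register as a `uint64_t`. This is a minimal sketch under the assumptions in the paragraph above: the caller passes an `int` sign-extended (the BJX2 convention for signed values), while the separately compiled callee declared the parameter `unsigned` and either trusts the upper 32 bits to be zero or re-extends the value itself. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Caller side: an int is passed sign-extended to 64 bits. */
uint64_t pass_int_signext(int32_t v)
{
    return (uint64_t)(int64_t)v;
}

/* Callee that trusts the register to be zero-extended already and uses
   all 64 bits directly (e.g. as an index or in 64-bit arithmetic). */
uint64_t callee_trusts_zext(uint64_t reg)
{
    return reg;
}

/* Callee that re-extends the low 32 bits itself before use. */
uint64_t callee_reextends(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg;
}

/* Passing -1: only the re-extending callee sees 4294967295. */
void mismatch_example(void)
{
    uint64_t reg = pass_int_signext(-1);
    assert(callee_trusts_zext(reg) == 0xFFFFFFFFFFFFFFFFULL);
    assert(callee_reextends(reg) == 0xFFFFFFFFULL);
}
```

For non-negative values both callees agree; the difference only shows up when the sign bit of the 32-bit value is set.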

> C99 makes missing prototypes invalid anyways, so...

This does not make programs with missing (or mismatched) prototypes go
away.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: RISC-V vs. Aarch64

<sqcvc7$e0n$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22489&group=comp.arch#22489

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Mon, 27 Dec 2021 12:07:01 -0600
Organization: A noiseless patient Spider
Lines: 251
Message-ID: <sqcvc7$e0n$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 27 Dec 2021 18:07:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8aa7ed5a85a0d9a3d655c87683959d4a";
logging-data="14359"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19X+Ck83gHc7iPQzBUjRVJV"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:4x5Abr++uVA4H2qj/xSm+EXNUFo=
In-Reply-To: <2021Dec27.091535@mips.complang.tuwien.ac.at>
Content-Language: en-US

by: BGB - Mon, 27 Dec 2021 18:07 UTC

On 12/27/2021 2:15 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> On 12/25/2021 11:10 AM, Anton Ertl wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>>>> complex: A64 has more addressing modes and additional instructions
>>>>> like load pair and store pair; in particular, the A64 architects
>>>>> seem to have had few concerns about the register read and write
>>>>> ports needed per instruction; E.g., a store-pair instruction can
>>>>> need four read ports, and a load pair instruction can need three
>>>>> write ports (AFAIK).
>>>>>
>>>>
>>>> These are less of an issue if one assumes a minimum width for the core.
>>>> If the core is always at least 3-wide, this isn't an issue.
>>>
>>> Why? Sure, you could have one instruction port that supports up to
>>> four reads and the others only support two, and one that supports
>>> three writes, and the others only support one, but that would still be
>>> 8 register reads and 5 writes per cycle, and my impression was that
>>> having many register ports is really expensive.
>>>
>>
>> If you have a multi-lane core, it already has the needed register ports.
>> You then spread a single instruction across 2 or 3 lanes.
>
> Ok, but then these instructions reduce the utilization of the lanes.
> A pre/post-increment load does not provide increased throughput over a
> load and an add on such an implementation.
>

Yes, generally...

One could decode a post-increment op as two ops internally.
MOV.B (R4)+, R5
Decoded as:
ADD 1, R4 | MOV.B (R4), R5

Though, encoding this directly in BJX2 (at the ISA level) would not be
allowed, as it violates the sequential/parallel rule (bundles are not
allowed when the result of executing the instructions sequentially would
differ from them being executed in parallel).
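The sequential/parallel rule can be checked statically. Below is a minimal, conservative C sketch, assuming the sequential order is the order in which the instructions are written and that each op has at most one register destination. A bundle is rejected if a later op reads a register that an earlier op writes (the results would differ) or if two ops write the same register. The `struct op` layout and register numbering are hypothetical, not BJX2's actual assembler.

```c
#include <assert.h>

#define NO_REG (-1)  /* op writes no general-purpose register */

struct op {
    int dst;     /* destination register, or NO_REG */
    int src[3];  /* source registers; unused slots are NO_REG */
};

/* Returns 1 if executing ops[0..n-1] in parallel must give the same
   result as executing them one after another in written order. */
int bundle_ok(const struct op *ops, int n)
{
    assert(n >= 1 && n <= 3);
    for (int i = 0; i < n; i++) {
        if (ops[i].dst == NO_REG)
            continue;
        for (int j = i + 1; j < n; j++) {
            if (ops[j].dst == ops[i].dst)
                return 0;  /* both write the same register */
            for (int k = 0; k < 3; k++)
                if (ops[j].src[k] == ops[i].dst)
                    return 0;  /* later op needs the earlier result */
        }
    }
    return 1;
}
```

Encoded this way, the copy loop's `ADD 1, R5 | MOV.B R7, (R4)` passes, while the `ADD 1, R4 | MOV.B (R4), R5` decoding of the post-increment load is rejected.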

There is no auto-increment addressing in my case though, mainly because
it isn't used frequently enough to make a visible impact.
It can be done in 2 ops, and frequently the ADD op can be executed in
parallel with something else.

Eg:
while(cs<ce)
*ct++=*cs++;
As:
.L0:
MOV.B (R5), R7
ADD 1, R5 | MOV.B R7, (R4)
ADD 1, R4 | CMPQGE R5, R6
BF .L0

>>> Well, Celio's position is that RISC-V gets code density through the C
>>> extension and uop density through instruction fusion.
> ...
>> Like, it is a design which helps "very low end" implementations, at the
>> expense of other "moderately low-end" implementations, which need to
>> implement "higher end" functionality to compensate for deficiencies.
>
> Yes. The question is how much extra complexity this functionality
> costs for moderately low-end implementations.
>
> One can consider this the cost of having a wide-range architecture
> rather than an architecture designed for a narrower range. In the
> case of A64 the idea probably was also that they have A32/T32 for the
> very low end, so they were able to design A64 for a narrower range.
>

Yeah. I suspect the lower end for A64 is probably in-order superscalar
machines.

If you need a microcontroller, there is Thumb.

RV32I could likely be pretty competitive with Thumb, but for a core
which is higher end than a microcontroller, but lower end than a typical
superscalar core, it is likely to hurt.

>>> If you know that you are doing misaligned accesses, composing a
>>> possibly misaligned load from two loads and a combining instruction
>>> (or a few) looks like a simpler approach.
>>>
>>
>> It can be faked easily enough, but as can be noted, the cycle cost of
>> faking it is higher than that of native hardware support.
>
> Why should that be so? The work necessary for doing it explicitly is
> the same. At least for the load case. For unaligned stores I expect
> that a write-combining store buffer can do some things that are pretty
> far from what any ISA that allows only aligned stores has done
> explicitly yet; and current (possibly unaligned) stores are probably a
> good interface to that hardware structure.
>

Faking a misaligned load is likely to require a multi-instruction
sequence, and also likely require stalling on an interlock. Doing it in
hardware can allow the load to be done in a single clock cycle.
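The C sketch below makes the instruction-count argument concrete. It shows the "two aligned loads plus a combining step" approach, assuming a little-endian machine and 32-bit words (both assumptions for the example). The simulated memory array and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one aligned 32-bit little-endian load from a simulated
 * memory array; the caller guarantees off % 4 == 0. */
static uint32_t load32_aligned(const uint8_t *mem, size_t off)
{
    return (uint32_t)mem[off]
         | (uint32_t)mem[off + 1] << 8
         | (uint32_t)mem[off + 2] << 16
         | (uint32_t)mem[off + 3] << 24;
}

/* Faked misaligned load: two aligned loads, two shifts and an OR.
 * The combining step has to wait for both loads; on a simple in-order
 * pipeline that is the kind of dependency that shows up as interlock
 * stalls. The buffer must extend to the next aligned word. */
static uint32_t load32_misaligned(const uint8_t *mem, size_t off)
{
    size_t base = off & ~(size_t)3;           /* round down to 4 bytes */
    unsigned sh = (unsigned)(off & 3) * 8;    /* bit offset: 0, 8, 16, 24 */
    uint32_t lo = load32_aligned(mem, base);
    if (sh == 0)
        return lo;                            /* already aligned: one load */
    uint32_t hi = load32_aligned(mem, base + 4);
    return (lo >> sh) | (hi << (32 - sh));    /* funnel the two words */
}
```

A native misaligned load does the same byte selection inside the memory pipeline. That is why it can complete as a single instruction.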

It is a similar reason for why I eventually ended up re-adding an FMOV.S
instruction:
MOV.L + FLDCF, typically ends up taking 4 cycles.
FMOV.S, avoids an interlock, often 1 cycle.

One could argue for a case where FMOV.S is load-only, but I re-added it
on both the Load and Store paths.

In effect, it is analogous to a 32-bit Load/Store with a
Binary32<->Binary64 converter glued on. The store path is a little more
expensive here, because it also involves potentially zeroing or rounding
the mantissa (whereas the widening conversion is effectively glorified
bit-shuffling).
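The C sketch below illustrates the "glorified bit-shuffling" remark about the widening direction. It works on raw IEEE-754 bit patterns. It illustrates the conversion only and does not describe how the FMOV.S hardware is built. A hardware converter might flush subnormal inputs to zero instead of normalizing them; that choice is left open here. The helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

static uint32_t f32_bits(float x)  { uint32_t b; memcpy(&b, &x, sizeof b); return b; }
static uint64_t f64_bits(double x) { uint64_t b; memcpy(&b, &x, sizeof b); return b; }

/* Widen IEEE-754 binary32 bits to binary64 bits. Every binary32 value is
 * exactly representable in binary64, so no rounding is needed: normal
 * numbers just get their exponent re-biased and mantissa shifted. */
static uint64_t widen_f32_bits(uint32_t f)
{
    uint64_t sign = (uint64_t)(f >> 31) << 63;
    uint32_t exp  = (f >> 23) & 0xFFu;
    uint64_t man  = f & 0x7FFFFFu;

    if (exp == 0xFFu)                   /* Inf or NaN: all-ones exponent */
        return sign | (UINT64_C(0x7FF) << 52) | (man << 29);
    if (exp == 0) {
        if (man == 0)
            return sign;                /* signed zero */
        /* Subnormal input: normalise it. This is the one case that is
         * more than pure bit-shuffling. */
        int e = -126;
        while (!(man & 0x800000u)) {
            man <<= 1;
            e--;
        }
        man &= 0x7FFFFFu;
        return sign | ((uint64_t)(e + 1023) << 52) | (man << 29);
    }
    /* Normal: bias 127 -> 1023, 23-bit fraction -> top of 52-bit fraction. */
    return sign | ((uint64_t)(exp + (1023 - 127)) << 52) | (man << 29);
}
```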

Though, arguably, one could make the store path cheaper by using a
truncating conversion (as is done with SIMD ops), but OTOH one can
potentially leverage the same converter used for 'FSTCF'
(Double->Single), which does need a rounding conversion. This affects
timing more than it does LUT cost.
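The C sketch below, on raw bit patterns, shows the cost difference between the two store-path options. Truncation only drops bits. Rounding needs an extra compare and an increment that may carry into the exponent. Only the normal-number range is handled. Overflow, underflow and NaN handling are deliberately left out. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

static uint32_t f32_bits(float x)  { uint32_t b; memcpy(&b, &x, sizeof b); return b; }
static uint64_t f64_bits(double x) { uint64_t b; memcpy(&b, &x, sizeof b); return b; }

/* Narrow binary64 bits to binary32 by truncation: keep the top 23 fraction
 * bits and drop the low 29. Assumes the result is a normal binary32 number
 * (no overflow, underflow, Inf or NaN handling in this sketch). */
static uint32_t narrow_trunc(uint64_t d)
{
    uint32_t sign = (uint32_t)(d >> 63) << 31;
    uint32_t exp  = (uint32_t)((d >> 52) & 0x7FFu) - (1023 - 127);
    uint32_t man  = (uint32_t)(d >> 29) & 0x7FFFFFu;
    return sign | (exp << 23) | man;
}

/* Same, with round-to-nearest-even on the 29 discarded bits. A carry out
 * of the fraction correctly bumps the exponent field. */
static uint32_t narrow_round(uint64_t d)
{
    uint32_t r    = narrow_trunc(d);
    uint64_t low  = d & ((UINT64_C(1) << 29) - 1);
    uint64_t half = UINT64_C(1) << 28;
    if (low > half || (low == half && (r & 1u)))
        r += 1;
    return r;
}
```

For example, 1 + 0.75 ulp(binary32) truncates to 1.0f but rounds up to the next binary32 value.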

>> In cases where one wants unaligned access on hardware which only does
>> aligned access, this is where a keyword like "__unaligned" comes into
>> play. This allows using fast (direct) access for aligned loads/stores,
>> and a slower (faked) approach for misaligned load/store.
>
> Yes, you would need to specify the unaligned accesses in programs.
> However, given that most programs have been and are developed on
> hardware where that has not been necessary, programmers are unlikely
> to get that right, so allowing all memory accesses (exception: for
> communicating between threads) to be unaligned is the way to go and
> has won.
>

Not quite so much, IME.

Even on hardware where unaligned access is typically the default, some
compilers have managed to make relying on it brittle enough that most C
code still tends to preserve an aligned/unaligned distinction.
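The usual standard-C way to write an access that may be misaligned, without a compiler-specific keyword like __unaligned, is to go through memcpy. A minimal sketch follows. The helper names are hypothetical. Whether the memcpy compiles to a single instruction depends on the compiler and target.

```c
#include <stdint.h>
#include <string.h>

/* Copying through memcpy is defined for any address; optimising
 * compilers commonly lower it to a single load/store on targets that
 * allow misaligned access, or to a fallback sequence otherwise. */
static uint32_t load_u32_any(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void store_u32_any(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* By contrast, '*(const uint32_t *)p' with a misaligned p is undefined
 * behaviour: the compiler is entitled to assume the pointer is aligned,
 * which is one way code that "works" on one target becomes brittle. */
static int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```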

>> In this case, both RISC-V and BJX2 assume unaligned load/store
>> operations for basic types, so no issue here (and there is no direct
>> equivalent of MOV.X in RISC-V).
>
> What is MOV.X?
>

A 128-bit Load/Store Pair.

It is also used for 128-bit SIMD Load/Store, since the ISA uses a single
register space for GPRs and SIMD.

From the way it is implemented, it requires an 8-byte alignment (unlike
a pair of MOV.Q instructions), and is often slightly faster if accesses
are aligned on a 16-byte boundary. It also only works on even-numbered
registers.

It can be faster than a pair of MOV.Q instructions, and is most commonly
found in function prolog/epilog sequences and similar.
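The C model below summarizes the MOV.X constraints listed above. It illustrates only the stated rules: an even register, 8-byte alignment, and 16-byte alignment preferred. The function names, the register-file array and the error convention are all hypothetical. Nothing here describes how the hardware actually splits the access.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a 128-bit load pair into an even/odd register pair.
 * 'mem' is simulated memory and 'addr' an address within it.
 * Returns 0 on success, -1 if the access would be rejected. */
static int movx_load(uint64_t *regs, unsigned nregs, unsigned rn,
                     const uint8_t *mem, size_t addr)
{
    if (rn & 1u)
        return -1;                 /* only even-numbered registers */
    if (rn + 1 >= nregs)
        return -1;                 /* pair must fit in the register file */
    if (addr & 7u)
        return -1;                 /* at least 8-byte alignment required */
    memcpy(&regs[rn],     mem + addr,     8);  /* first 8 bytes  -> Rn   */
    memcpy(&regs[rn + 1], mem + addr + 8, 8);  /* second 8 bytes -> Rn+1 */
    return 0;
}

/* 16-byte alignment is not required, but is described as often faster. */
static int movx_fast_path(size_t addr)
{
    return (addr & 15u) == 0;
}
```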

>>>> In BJX2, I went with signed values being kept in sign-extended form, and
>>>> unsigned values kept in zero-extended form
>>>
>>> That seems like a natural way to do it. However, I am not sure if
>>> that works well in the presence of type mismatches between caller and
>>> callee in production C code. At least I imagine that's the reason why
>>> ABIs force one kind of extension, and then do the appropriate
>>> extension for the defined type at the usage site if necessary.
>>>
>>
>> As long as the function has a prototype, the call/return will coerce the
>> types into the form expected by the called function (as usual).
>>
>> If no prototype is present, it follows traditional rules:
>> Small integer types are promoted to their largest machine form;
>> Float is promoted to Double;
>> ...
>>
>> For missing prototypes, kinda ended up (mostly) going with 'long':
>> C90 says 'int', but this truncates pointers, 'long' does not;
>> Also, 'long' doesn't seem to break anything here.
>
> Pointers would be a problem for the RISC-V and AMD64 approaches, too,
> because the sign/zero-extension would trample over the high 32-bits.
> The compiler also notices the problems with pointers, because
> operations like dereferencing don't work on integer types.
>
> I am more thinking of problems like having an int variable in one
> function, and passing that to a separately compiled function that
> expects an unsigned (and similar cases). On RISC-V the callee will
> zero-extend the passed value, on AMD64 the caller will zero-extend the
> passed value, in your approach the caller will sign-extend the value
> and pass it, and the callee will assume that the value has been
> zero-extended.
>
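The C model below treats the argument register as a uint64_t to illustrate Anton's scenario. It shows a caller that sign-extends an int and a callee that assumes a zero-extended unsigned. The two disagree for negative values. A callee that re-extends from the low 32 bits gets the intended result. The sketch models the conventions as described in this thread, not any ABI document. The function names are hypothetical.

```c
#include <stdint.h>

/* Caller side under the scheme described for BJX2: signed values are
 * kept sign-extended, so -1 arrives as all ones in 64 bits. */
static uint64_t pass_int(int32_t v)
{
    return (uint64_t)(int64_t)v;
}

/* Callee declared with an 'unsigned' parameter that trusts the register
 * to be zero-extended and uses it directly, e.g. (uint64_t)u + 1. */
static uint64_t callee_trusting(uint64_t reg)
{
    return reg + 1;
}

/* Callee that re-extends at the use site, so it only depends on the
 * low 32 bits of the register. */
static uint64_t callee_reextending(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg + 1;
}
```

For non-negative values both callees agree. For -1 the trusting callee wraps to 0, while the re-extending one yields 4294967296, which is what `(uint64_t)(unsigned)-1 + 1` should give.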


Re: RISC-V vs. Aarch64

<08cad6b8-1a18-4920-8cab-75a9846c4136n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22496&group=comp.arch#22496


Newsgroups: comp.arch
Date: Mon, 27 Dec 2021 11:04:01 -0800 (PST)
In-Reply-To: <sqcvc7$e0n$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<2021Dec25.181011@mips.complang.tuwien.ac.at> <sqaivi$e79$1@dont-email.me>
<2021Dec27.091535@mips.complang.tuwien.ac.at> <sqcvc7$e0n$1@dont-email.me>
Message-ID: <08cad6b8-1a18-4920-8cab-75a9846c4136n@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)

by: MitchAlsup - Mon, 27 Dec 2021 19:04 UTC

On Monday, December 27, 2021 at 12:07:06 PM UTC-6, BGB wrote:
> On 12/27/2021 2:15 AM, Anton Ertl wrote:
> > BGB <cr8...@gmail.com> writes:
> >> On 12/25/2021 11:10 AM, Anton Ertl wrote:
> >>> BGB <cr8...@gmail.com> writes:
> >>>>> * A64 has fixed-length 32-bit instructions, but they can be more
> >>>>> complex: A64 has more addressing modes and additional instructions
> >>>>> like load pair and store pair; in particular, the A64 architects
> >>>>> seem to have had few concerns about the register read and write
> >>>>> ports needed per instruction; E.g., a store-pair instruction can
> >>>>> need four read ports, and a load pair instruction can need three
> >>>>> write ports (AFAIK).
> >>>>>
> >>>>
> >>>> These are less of an issue if one assumes a minimum width for the core.
> >>>> If the core is always at least 3-wide, this isn't an issue.
> >>>
> >>> Why? Sure, you could have one instruction port that supports up to
> >>> four reads and the others only support two, and one that supports
> >>> three writes, and the others only support one, but that would still be
> >>> 8 register reads and 5 writes per cycle, and my impression was that
> >>> having many register ports is really expensive.
> >>>
> >>
> >> If you have a multi-lane core, it already has the needed register ports.
> >> You then spread a single instruction across 2 or 3 lanes.
> >
> > Ok, but then these instructions reduce the utilization of the lanes.
> > A pre/post-increment load does not provide increased throughput over a
> > load and an add on such an implementation.
> >
> Yes, generally...
>
> One could decode a post-increment op as two ops internally.
> MOV.B (R4)+, R5
> Decoded as:
> ADD 1, R4 | MOV.B (R4), R5
>
> Though, encoding this directly in BJX2 (at the ISA level) would not be
> allowed, as it violates the sequential/parallel rule (bundles are not
> allowed when the result of executing the instructions sequentially would
> differ from them being executed in parallel).
>
>
> There is no auto-increment addressing in my case though, for the main
> reason that it isn't used frequently enough to make much visible impact.
> It can be done in 2 ops, and frequently the ADD op can be executed in
> parallel with something else.
>
> Eg:
> while(cs<ce)
> *ct++=*cs++;
> As:
> .L0:
> MOV.B (R5), R7
> ADD 1, R5 | MOV.B R7, (R4)
> ADD 1, R4 | CMPQGE R5, R6
> BF .L0
<
VEC Ri,<>
LDB R5,[Rcs+Ri]
STB R5,[Rct+Ri]
LOOP LT,#1,Ri
>
6 BJX2 instructions versus 4 for My 66000.
<
> >>> Well, Celio's position is that RISC-V gets code density through the C
> >>> extension and uop density through instruction fusion.
> > ...
> >> Like, it is a design which helps "very low end" implementations, at the
> >> expense of other "moderately low-end" implementations, which need to
> >> implement "higher end" functionality to compensate for deficiencies.
> >
> > Yes. The question is how much extra complexity this functionality
> > costs for moderately low-end implementations.
> >
> > One can consider this the cost of having a wide-range architecture
> > rather than an architecture designed for a narrower range. In the
> > case of A64 the idea probably was also that they have A32/T32 for the
> > very low end, so they were able to design A64 for a narrower range.
> >
> Yeah. I suspect the lower end for A64 is probably in-order superscalar
> machines.
>
> If you need a microcontroller, there is Thumb.
>
> RV32I could likely be pretty competitive with Thumb, but for a core
> which is higher end than a microcontroller, but lower end than a typical
> superscalar core, it is likely to hurt.
> >>> If you know that you are doing misaligned accesses, composing a
> >>> possibly misaligned load from two loads and a combining instruction
> >>> (or a few) looks like a simpler approach.
> >>>
> >>
> >> It can be faked easily enough, but as can be noted, the cycle cost of
> >> faking it is higher than that of native hardware support.
> >
> > Why should that be so? The work necessary for doing it explicitly is
> > the same. At least for the load case. For unaligned stores I expect
> > that a write-combining store buffer can do some things that are pretty
> > far from what any ISA that allows only aligned stores has done
> > explicitly yet; and current (possibly unaligned) stores are probably a
> > good interface to that hardware structure.
> >
> Faking a misaligned load is likely to require a multi-instruction
> sequence, and also likely require stalling on an interlock. Doing it in
> hardware can allow the load to be done in a single clock cycle.
>
>
> It is a similar reason for why I eventually ended up re-adding an FMOV.S
> instruction:
> MOV.L + FLDCF, typically ends up taking 4 cycles.
> FMOV.S, avoids an interlock, often 1 cycle.
>
> One could argue for a case where FMOV.S is load-only, but I re-added it
> on both the Load and Store paths.
>
> In effect, it is analogous to a 32-bit Load/Store with a
> Binary32<->Binary64 converter glued on. The store path is a little more
> expensive here, because it also involves potentially zeroing or rounding
> the mantissa (whereas the widening conversion is effectively glorified
> bit-shuffling).
>
> Though, arguably, one could make the store path cheaper by using a
> truncating conversion (as is done with SIMD ops), but OTOH one can
> potentially leverage the same converter used for 'FSTCF'
> (Double->Single), which does need a rounding conversion. This affects
> timing more than it does LUT cost.
> >> In cases where one wants unaligned access on hardware which only does
> >> aligned access, this is where a keyword like "__unaligned" comes into
> >> play. This allows using fast (direct) access for aligned loads/stores,
> >> and a slower (faked) approach for misaligned load/store.
> >
> > Yes, you would need to specify the unaligned accesses in programs.
> > However, given that most programs have been and are developed on
> > hardware where that has not been necessary, programmers are unlikely
> > to get that right, so allowing all memory accesses (exception: for
> > communicating between threads) to be unaligned is the way to go and
> > has won.
> >
> Not quite so much, IME.
>
> Even on hardware where unaligned access is typically the default, some
> compilers have managed to make relying on it brittle enough that most C
> code still tends to preserve an aligned/unaligned distinction.
> >> In this case, both RISC-V and BJX2 assume unaligned load/store
> >> operations for basic types, so no issue here (and there is no direct
> >> equivalent of MOV.X in RISC-V).
> >
> > What is MOV.X?
> >
> A 128-bit Load/Store Pair.
>
> It is also used for 128-bit SIMD Load/Store, since the ISA uses a single
> register space for GPRs and SIMD.
>
> From the way it is implemented, it requires an 8-byte alignment (unlike
> a pair of MOV.Q instructions), and is often slightly faster if accesses
> are aligned on a 16-byte boundary. It also only works on even-numbered
> registers.
>
> It can be faster than a pair of MOV.Q instructions, and is most commonly
> found in function prolog/epilog sequences and similar.
> >>>> In BJX2, I went with signed values being kept in sign-extended form, and
> >>>> unsigned values kept in zero-extended form
> >>>
> >>> That seems like a natural way to do it. However, I am not sure if
> >>> that works well in the presence of type mismatches between caller and
> >>> callee in production C code. At least I imagine that's the reason why
> >>> ABIs force one kind of extension, and then do the appropriate
> >>> extension for the defined type at the usage site if necessary.
> >>>
> >>
> >> As long as the function has a prototype, the call/return will coerce the
> >> types into the form expected by the called function (as usual).
> >>
> >> If no prototype is present, it follows traditional rules:
> >> Small integer types are promoted to their largest machine form;
> >> Float is promoted to Double;
> >> ...
> >>
> >> For missing prototypes, kinda ended up (mostly) going with 'long':
> >> C90 says 'int', but this truncates pointers, 'long' does not;
> >> Also, 'long' doesn't seem to break anything here.
> >
> > Pointers would be a problem for the RISC-V and AMD64 approaches, too,
> > because the sign/zero-extension would trample over the high 32-bits.
> > The compiler also notices the problems with pointers, because
> > operations like dereferencing don't work on integer types.
> >
> > I am more thinking of problems like having an int variable in one
> > function, and passing that to a separately compiled function that
> > expects an unsigned (and similar cases). On RISC-V the callee will
> > zero-extend the passed value, on AMD64 the caller will zero-extend the
> > passed value, in your approach the caller will sign-extend the value
> > and pass it, and the callee will assume that the value has been
> > zero-extended.
> >
> Yeah, pretty much.
> It depends a lot on what the callee does with the value, which affects
> what the results are here.
> >> C99 makes missing prototypes invalid anyways, so...
> >
> > This does not make programs with missing (or mismatched) prototypes go
> > away.
> >
> Possibly true. Pretty much all of the compilers I am aware of went over
> to something partway between C90 and C99 semantics, typically behaving
> like C90 and then generating a warning about the missing prototype.
>
>
> But they "at least stand a fighting chance" in my case.
> Most basic types will (at least) end up in the correct registers, even if
> the values don't match up exactly.
>
> Main exception is when 128-bit types get involved, which as-is:
> Pad up to the next even-numbered register (if needed);
> Pass the value as a register pair.
>
> Struct passing rules in my case are:
> sizeof(Foo) <= 8: Pass as a single register;
> sizeof(Foo) <= 16: Pass as a register pair;
> Else : Pass pointer to struct.
<
My 66000 ABI: effectively, lay out the argument list (and return value list)
in memory (on the stack), then put the first 8 doublewords in registers
(and do not create locations on the stack for argument[0..7]).
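Mitch's description can be read as a two-step mapping. First, the arguments are laid out as a run of doublewords. Then the first 8 doublewords go to registers and the rest go to the stack, with no stack slots reserved for the register part. The C sketch below is a reading of that description. Rounding every argument up to whole doublewords and ignoring extra alignment padding are simplifying assumptions. So is naming registers by a plain index. The function names are hypothetical.

```c
#include <stddef.h>

#define NREG_DWORDS 8   /* first 8 doublewords of the argument list go in registers */

typedef struct {
    int in_reg;         /* 1: argument register, 0: stack */
    size_t where;       /* register index, or byte offset into the stack area */
} dword_loc;

/* Index of the first doubleword occupied by argument i, when arguments
 * are laid out contiguously, each rounded up to whole 8-byte doublewords.
 * (Extra alignment padding for over-aligned types is ignored here.) */
static size_t first_dword(const size_t *sizes, size_t i)
{
    size_t k = 0;
    for (size_t j = 0; j < i; j++)
        k += (sizes[j] + 7) / 8;
    return k;
}

/* Where doubleword k lives. Because no stack slots are created for
 * doublewords 0..7, the stack area starts with doubleword 8. */
static dword_loc locate_dword(size_t k)
{
    dword_loc loc;
    if (k < NREG_DWORDS) {
        loc.in_reg = 1;
        loc.where = k;
    } else {
        loc.in_reg = 0;
        loc.where = (k - NREG_DWORDS) * 8;
    }
    return loc;
}
```

One consequence of this reading is that there are no special cases for structs. A 24-byte struct simply occupies three consecutive doublewords, and a large argument may straddle the last registers and the start of the stack area.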
>
> This seemed like a reasonable compromise to me. Avoids both
> excessively complex/convoluted rules, as well as being reasonably efficient.
>
> If one runs out of registers, any remaining arguments are passed on the
> stack, following the same pattern. Like AMD64 (and unlike Win64), there
> is no dedicated space for argument spills. The callee can adjust the
> stack pointer by an extra 64B if they want such a spill space though, so
> in practice it likely isn't a huge issue.
>
>
> Return is basically similar, either the called function returns the
> struct in registers, or the caller passes a pointer to the structure
> which will receive the result. Trying to call a function which returns a
> struct with no destination will instead pass a pointer to a dummy struct.
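The BJX2 struct and 128-bit passing rules quoted above can be sketched as a small C register allocator. The count of 8 argument registers is an assumption for the sketch. So is treating even slot indices as even-numbered registers. The enum and function names are hypothetical. 128-bit scalar types would use the same PASS_REG_PAIR path.

```c
#include <stddef.h>

#define NARG_REGS 8   /* assumption for the sketch: 8 argument registers */

enum pass_kind { PASS_ONE_REG, PASS_REG_PAIR, PASS_BY_POINTER };

/* Struct classification as described: <= 8 bytes in one register,
 * <= 16 bytes in a register pair, anything larger by pointer. */
static enum pass_kind classify_struct(size_t size)
{
    if (size <= 8)
        return PASS_ONE_REG;
    if (size <= 16)
        return PASS_REG_PAIR;
    return PASS_BY_POINTER;       /* the pointer itself takes one register */
}

/* Allocate argument-register slots in order. '*next' is the next free
 * slot. Pairs are padded up to an even slot. Returns the first slot used,
 * or -1 once the registers run out; from then on every remaining argument
 * goes to the stack, so '*next' is pinned at NARG_REGS. */
static int assign_slot(int *next, enum pass_kind k)
{
    int need = (k == PASS_REG_PAIR) ? 2 : 1;
    int r = *next;
    if (need == 2 && (r & 1))
        r++;                      /* pad to an even-numbered register */
    if (r + need > NARG_REGS) {
        *next = NARG_REGS;        /* remaining arguments use the stack */
        return -1;
    }
    *next = r + need;
    return r;
}
```

The padding step is the only stateful wrinkle. An 8-byte argument followed by a 16-byte struct leaves slot 1 unused, and the pair lands in slots 2 and 3.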


Re: RISC-V vs. Aarch64

<sqdkot$lc2$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22517&group=comp.arch#22517


From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Mon, 27 Dec 2021 18:12:11 -0600
Message-ID: <sqdkot$lc2$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
<sqcvc7$e0n$1@dont-email.me>
<08cad6b8-1a18-4920-8cab-75a9846c4136n@googlegroups.com>
In-Reply-To: <08cad6b8-1a18-4920-8cab-75a9846c4136n@googlegroups.com>

by: BGB - Tue, 28 Dec 2021 00:12 UTC

On 12/27/2021 1:04 PM, MitchAlsup wrote:
> On Monday, December 27, 2021 at 12:07:06 PM UTC-6, BGB wrote:
>> On 12/27/2021 2:15 AM, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> On 12/25/2021 11:10 AM, Anton Ertl wrote:
>>>>> BGB <cr8...@gmail.com> writes:
>>>>>>> * A64 has fixed-length 32-bit instructions, but they can be more
>>>>>>> complex: A64 has more addressing modes and additional instructions
>>>>>>> like load pair and store pair; in particular, the A64 architects
>>>>>>> seem to have had few concerns about the register read and write
>>>>>>> ports needed per instruction; E.g., a store-pair instruction can
>>>>>>> need four read ports, and a load pair instruction can need three
>>>>>>> write ports (AFAIK).
>>>>>>>
>>>>>>
>>>>>> These are less of an issue if one assumes a minimum width for the core.
>>>>>> If the core is always at least 3-wide, this isn't an issue.
>>>>>
>>>>> Why? Sure, you could have one instruction port that supports up to
>>>>> four reads and the others only support two, and one that supports
>>>>> three writes, and the others only support one, but that would still be
>>>>> 8 register reads and 5 writes per cycle, and my impression was that
>>>>> having many register ports is really expensive.
>>>>>
>>>>
>>>> If you have a multi-lane core, it already has the needed register ports.
>>>> You then spread a single instruction across 2 or 3 lanes.
>>>
>>> Ok, but then these instructions reduce the utilization of the lanes.
>>> A pre/post-increment load does not provide increased throughput over a
>>> load and an add on such an implementation.
>>>
>> Yes, generally...
>>
>> One could decode a post-increment op as two ops internally.
>> MOV.B (R4)+, R5
>> Decoded as:
>> ADD 1, R4 | MOV.B (R4), R5
>>
>> Though, encoding this directly in BJX2 (at the ISA level) would not be
>> allowed, as it violates the sequential/parallel rule (bundles are not
>> allowed when the result of executing the instructions sequentially would
>> differ from them being executed in parallel).
>>
>>
>> There is no auto-increment addressing in my case though, for the main
>> reason that it isn't used frequently enough to make much visible impact.
>> It can be done in 2 ops, and frequently the ADD op can be executed in
>> parallel with something else.
>>
>> Eg:
>> while(cs<ce)
>> *ct++=*cs++;
>> As:
>> .L0:
>> MOV.B (R5), R7
>> ADD 1, R5 | MOV.B R7, (R4)
>> ADD 1, R4 | CMPQGE R5, R6
>> BF .L0
> <
> VEC Ri,<>
> LDB R5,[Rcs+Ri]
> STB R5,[Rct+Ri]
> LOOP LT,#1,Ri
>>
> 6 BJX2 instructions versus 4 for My 66000.
> <

Bigger issue here is probably that the loop, as written originally, will
bottleneck at ~ 6MB/s at 50MHz:
.L0:
MOV.B (R5), R7 //3c (2c penalty)
ADD 1, R5 | MOV.B R7, (R4) //2c (1c penalty)
ADD 1, R4 | CMPQGE R5, R6 //1c
BF .L0 //2c
Or, ~ 8c per byte.

Or, no bundling:
.L0:
MOV.B (R5), R7 //2c (1c penalty)
ADD 1, R5 //1c
MOV.B R7, (R4) //1c
ADD 1, R4 //1c
CMPQGE R5, R6 //1c
BF .L0 //2c
Still 8c per byte.

50/8 => 6.25
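The quoted throughput figures follow from simple arithmetic on the per-iteration cycle costs in the comments above. Sum the costs, then divide clock rate times bytes moved by cycles. The cycle costs are the author's estimates. The C helpers below only restate the arithmetic, and their names are hypothetical.

```c
/* Cycles per iteration: the sum of per-instruction (or per-bundle)
 * costs, interlock penalties included. */
static int loop_cycles(const int *costs, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += costs[i];
    return total;
}

/* Sustained copy rate in MB/s (10^6 bytes per second) for a loop that
 * moves 'bytes_per_iter' bytes every 'cycles_per_iter' cycles at 'mhz'. */
static double copy_mb_per_s(double mhz, double bytes_per_iter,
                            double cycles_per_iter)
{
    return mhz * bytes_per_iter / cycles_per_iter;
}
```

With the bundled costs {3, 2, 1, 2} and the unbundled costs {2, 1, 1, 1, 1, 2}, both loops come to 8 cycles per byte, hence 50 * 1 / 8 = 6.25 MB/s. The same formula gives 50 / 6 ≈ 8.3 MB/s for the indexed loop and 50 * 8 / 6 ≈ 67 MB/s for the 8-byte loop.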

But, probably surprising no one, the CPU is devoid of any sort of
"cleverness" here. The compiler won't do anything either, leaving it
mostly up to the programmer.

One can mostly count themselves lucky if the compiler manages to avoid
dropping a few register spills in the middle of the loop or similar, ...

Technically, one could also write:
MOV 0, R7
.L0:
MOV.B (R5, R7), R3 //3c (2c penalty)
ADD 1, R7 | MOV.B R3, (R4, R7) //1c
JCMPQLT R6, R7, .L0 //2c (*)
ADD R6, R4 | ADD R6, R5

Which is 6c, 8.3MB/s.

But, the compiler won't infer this, and it isn't really all that much better.

*: These instructions share the same underlying mechanism as the
branches in RISC-V mode, and will exist if the RVI extension is enabled.
Note that compare-and-branch instructions may not be predicated.
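At the C level, the indexed variant amounts to rewriting the pointer loop around a single induction variable, which is the transformation the compiler does not infer. Mitch's VEC/LOOP version uses the same base+index shape. In the sketch below, the function names are hypothetical, and n is assumed to equal ce - cs.

```c
#include <stddef.h>

/* Pointer-bumping form of the original loop: two pointers advance each
 * iteration, so two ADDs plus a compare per byte. */
static void copy_ptr(char *ct, const char *cs, const char *ce)
{
    while (cs < ce)
        *ct++ = *cs++;
}

/* Indexed form: a single counter with base+index addressing; both base
 * pointers stay fixed inside the loop and can be advanced once afterwards
 * if the caller needs the final pointer values. */
static void copy_idx(char *ct, const char *cs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        ct[i] = cs[i];
}
```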

Slightly faster:
MOV R6, R7
CMPQGE 8, R7
BF .L1
.L0:
ADD -8, R7 | MOV.Q (R5), R3 //3c (2c penalty)
ADD 8, R5 | MOV.Q R3, (R4) //1c
ADD 8, R4 | CMPQGE 8, R7 //1c
BT .L0 //2c
.L1:
CMPQGT 0, R7
BF .L3
.L2:
ADD -1, R7 | MOV.B (R5), R3 //3c (2c penalty)
ADD 1, R5 | MOV.B R3, (R4) //1c
ADD 1, R4 | CMPQGT 0, R7 //1c
BF .L2 //2c
.L3:

Which should average ~ 67 MB/s...
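The same two-stage structure can be written in portable C: 8 bytes per iteration while at least 8 remain, then single bytes. The sketch below assumes non-overlapping buffers. It uses an 8-byte memcpy as a stand-in for a 64-bit MOV.Q load/store that tolerates misalignment. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes: a 64-bit chunk per iteration while at least 8 bytes
 * remain, then a byte-at-a-time tail. Buffers must not overlap. */
static void copy_q_then_b(uint8_t *ct, const uint8_t *cs, size_t n)
{
    while (n >= 8) {
        uint64_t q;
        memcpy(&q, cs, 8);   /* 64-bit load  */
        memcpy(ct, &q, 8);   /* 64-bit store */
        cs += 8;
        ct += 8;
        n -= 8;
    }
    while (n > 0) {          /* tail: at most 7 bytes */
        *ct++ = *cs++;
        n--;
    }
}
```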

Throw in another loop stage copying 32B at a time, and it could be
potentially pushed up to around 200MB/s (at least, if staying within the
L1 cache), ...

>>>>> Well, Celio's position is that RISC-V gets code density through the C
>>>>> extension and uop density through instruction fusion.
>>> ...
>>>> Like, it is a design which helps "very low end" implementations, at the
>>>> expense of other "moderately low-end" implementations, which need to
>>>> implement "higher end" functionality to compensate for deficiencies.
>>>
>>> Yes. The question is how much extra complexity this functionality
>>> costs for moderately low-end implementations.
>>>
>>> One can consider this the cost of having a wide-range architecture
>>> rather than an architecture designed for a narrower range. In the
>>> case of A64 the idea probably was also that they have A32/T32 for the
>>> very low end, so they were able to design A64 for a narrower range.
>>>
>> Yeah. I suspect the lower end for A64 is probably in-order superscalar
>> machines.
>>
>> If you need a microcontroller, there is Thumb.
>>
>> RV32I could likely be pretty competitive with Thumb, but for a core
>> which is higher end than a microcontroller, but lower end than a typical
>> superscalar core, it is likely to hurt.
>>>>> If you know that you are doing misaligned accesses, composing a
>>>>> possibly misaligned load from two loads and a combining instruction
>>>>> (or a few) looks like a simpler approach.
>>>>>
>>>>
>>>> It can be faked easily enough, but as can be noted, the cycle cost of
>>>> faking it is higher than that of native hardware support.
>>>
>>> Why should that be so? The work necessary for doing it explicitly is
>>> the same. At least for the load case. For unaligned stores I expect
>>> that a write-combining store buffer can do some things that are pretty
>>> far from what any ISA that allows only aligned stores has done
>>> explicitly yet; and current (possibly unaligned) stores are probably a
>>> good interface to that hardware structure.
>>>
>> Faking a misaligned load is likely to require a multi-instruction
>> sequence, and also likely require stalling on an interlock. Doing it in
>> hardware can allow the load to be done in a single clock cycle.
>>
>>
>> It is a similar reason for why I eventually ended up re-adding an FMOV.S
>> instruction:
>> MOV.L + FLDCF, typically ends up taking 4 cycles.
>> FMOV.S, avoids an interlock, often 1 cycle.
>>
>> One could argue for a case where FMOV.S is load-only, but I re-added it
>> on both the Load and Store paths.
>>
>> In effect, it is analogous to a 32-bit Load/Store with a
>> Binary32<->Binary64 converter glued on. The store path is a little more
>> expensive here, because it also involves potentially zeroing or rounding
>> the mantissa (whereas the widening conversion is effectively glorified
>> bit-shuffling).
>>
>> Though, arguably, one could make the store path cheaper by using a
>> truncating conversion (as is done with SIMD ops), but OTOH one can
>> potentially leverage the same converter used for 'FSTCF'
>> (Double->Single), which does need a rounding conversion. This affects
>> timing more than it does LUT cost.
>>>> In cases where one wants unaligned access on hardware which only does
>>>> aligned access, this is where a keyword like "__unaligned" comes into
>>>> play. This allows using fast (direct) access for aligned loads/stores,
>>>> and a slower (faked) approach for misaligned load/store.
>>>
>>> Yes, you would need to specify the unaligned accesses in programs.
>>> However, given that most programs have been and are developed on
>>> hardware where that has not been necessary, programmers are unlikely
>>> to get that right, so allowing all memory accesses (exception: for
>>> communicating between threads) to be unaligned is the way to go and
>>> has won.
>>>
>> Not quite so much, IME.
>>
>> Even on hardware where unaligned access is typically the default, some
>> compilers have managed to make relying on it brittle enough that most C
>> code still tends to preserve an aligned/unaligned distinction.
>>>> In this case, both RISC-V and BJX2 assume unaligned load/store
>>>> operations for basic types, so no issue here (and there is no direct
>>>> equivalent of MOV.X in RISC-V).
>>>
>>> What is MOV.X?
>>>
>> A 128-bit Load/Store Pair.
>>
>> It is also used for 128-bit SIMD Load/Store, since the ISA uses a single
>> register space for GPRs and SIMD.
>>
>> From the way it is implemented, it requires an 8-byte alignment (unlike
>> a pair of MOV.Q instructions), and is often slightly faster if accesses
>> are aligned on a 16-byte boundary. It also only works on even-numbered
>> registers.
>>
>> It can be faster than a pair of MOV.Q instructions, and is most commonly
>> found in function prolog/epilog sequences and similar.
>>>>>> In BJX2, I went with signed values being kept in sign-extended form, and
>>>>>> unsigned values kept in zero-extended form
>>>>>
>>>>> That seems like a natural way to do it. However, I am not sure if
>>>>> that works well in the presence of type mismatches between caller and
>>>>> callee in production C code. At least I imagine that's the reason why
>>>>> ABIs force one kind of extension, and then do the appropriate
>>>>> extension for the defined type at the usage site if necessary.
>>>>>
>>>>
>>>> As long as the function has a prototype, the call/return will coerce the
>>>> types into the form expected by the called function (as usual).
>>>>
>>>> If no prototype is present, it follows traditional rules:
>>>> Small integer types are promoted to their largest machine form;
>>>> Float is promoted to Double;
>>>> ...
>>>>
>>>> For missing prototypes, kinda ended up (mostly) going with 'long':
>>>> C90 says 'int', but this truncates pointers, 'long' does not;
>>>> Also, 'long' doesn't seem to break anything here.
>>>
>>> Pointers would be a problem for the RISC-V and AMD64 approaches, too,
>>> because the sign/zero-extension would trample over the high 32-bits.
>>> The compiler also notices the problems with pointers, because
>>> operations like dereferencing don't work on integer types.
>>>
>>> I am more thinking of problems like having an int variable in one
>>> function, and passing that to a separately compiled function that
>>> expects an unsigned (and similar cases). On RISC-V the callee will
>>> zero-extend the passed value, on AMD64 the caller will zero-extend the
>>> passed value, in your approach the caller will sign-extend the value
>>> and pass it, and the callee will assume that the value has been
>>> zero-extended.
>>>
>> Yeah, pretty much.
>> It depends a lot on what the callee does with the value, which affects
>> what the results are here.
>>>> C99 makes missing prototypes invalid anyways, so...
>>>
>>> This does not make programs with missing (or mismatched) prototypes go
>>> away.
>>>
>> Possibly true. Pretty much all of the compilers I am aware of went over
>> to something partway between C90 and C99 semantics, typically behaving
>> like C90 and then generating a warning about the missing prototype.
>>
>>
>> But they "at least stand a fighting chance" in my case.
>> Most basic types will (at least) end up in the correct registers, even if
>> the values don't match up exactly.
>>
>> Main exception is when 128-bit types get involved, which as-is:
>> Pad up to the next even-numbered register (if needed);
>> Pass the value as a register pair.
>>
>> Struct passing rules in my case are:
>> sizeof(Foo) <= 8: Pass as a single register;
>> sizeof(Foo) <= 16: Pass as a register pair;
>> Else : Pass pointer to struct.
> <
> My 66000 ABI: effectively, lay out the argument list (and return value list)
> in memory (on the stack), then put the first 8 doublewords in registers
> (and do not create locations on the stack for argument[0..7]).


Re: RISC-V vs. Aarch64

<sqhdga$hqs$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=22555&group=comp.arch#22555


From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Wed, 29 Dec 2021 10:32:42 -0000 (UTC)
Message-ID: <sqhdga$hqs$1@newsreader4.netcologne.de>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
<sqcvc7$e0n$1@dont-email.me>

by: Thomas Koenig - Wed, 29 Dec 2021 10:32 UTC

BGB <cr88192@gmail.com> wrote:

> Eg:
>     while(cs<ce)
>       *ct++=*cs++;
> As:
>     .L0:
>                 MOV.B (R5), R7
>     ADD 1, R5 | MOV.B R7, (R4)
>     ADD 1, R4 | CMPQGE R5, R6
>     BF .L0

Nit: The two are only identical if there is at least one
byte to be copied.
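Thomas's nit can be checked directly. The assembly tests at the bottom of the loop, so it behaves like a do-while and copies one byte even when cs == ce. Below is a small C sketch with hypothetical function names; each function returns the number of bytes copied.

```c
#include <stddef.h>

/* The C source form: tests before the first copy. */
static size_t copy_while(char *ct, const char *cs, const char *ce) {
    size_t n = 0;
    while (cs < ce) { *ct++ = *cs++; n++; }
    return n;
}

/* What the quoted loop does: the CMPQGE/BF pair tests after the copy,
   so the body always runs at least once. */
static size_t copy_do_while(char *ct, const char *cs, const char *ce) {
    size_t n = 0;
    do { *ct++ = *cs++; n++; } while (cs < ce);
    return n;
}
```

This is why a loop compiled in the bottom-test shape needs a guard in front of it to keep the zero-iteration case correct.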

Re: RISC-V vs. Aarch64

<sqi2jf$e6q$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22562&group=comp.arch#22562

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Wed, 29 Dec 2021 10:32:45 -0600
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <sqi2jf$e6q$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
<sqcvc7$e0n$1@dont-email.me> <sqhdga$hqs$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 29 Dec 2021 16:32:47 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="05b8e81815da9711d207a325fbbd403c";
logging-data="14554"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GJlgg4L0YKG08eh8frDZ7"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:jDr6PlKnUQ678dXjSHSW8+W4Kqo=
In-Reply-To: <sqhdga$hqs$1@newsreader4.netcologne.de>
Content-Language: en-US

by: BGB - Wed, 29 Dec 2021 16:32 UTC

On 12/29/2021 4:32 AM, Thomas Koenig wrote:
> BGB <cr88192@gmail.com> wrote:
>
>> Eg:
>>     while(cs<ce)
>>       *ct++=*cs++;
>> As:
>>     .L0:
>>                 MOV.B (R5), R7
>>     ADD 1, R5 | MOV.B R7, (R4)
>>     ADD 1, R4 | CMPQGE R5, R6
>>     BF .L0
>
> Nit: The two are only identical if there is at least one
> byte to be copied.

Yeah, true...

One of my other responses contained a pretty obvious screw up with
register usage, but I didn't notice until after I had posted it (and
Usenet doesn't have any edit or undo).

I guess I could note though that my compiler tends to compile while
loops like:
   if(!cond)goto .LEND;
   .L0:
   body
   if(cond)goto .L0;
   .LEND:
Rather than:
   .L0:
   if(!cond)goto .LEND;
   body
   goto .L0;
   .LEND:

As the former tends to give better performance, albeit with slightly
worse code density.
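To see why the first form tends to be faster, here is a C sketch of both layouts written with goto. It keeps a hypothetical counter of executed branches. The guarded, rotated form runs one conditional branch per iteration, plus the guard. The top-test form runs a conditional and an unconditional branch per iteration. The loop body, which sums an array, is only an example.

```c
/* Form 1: guard once, then test at the bottom of the loop. */
static int sum_rotated(const int *a, int n, int *branches) {
    int i = 0, s = 0;
    *branches = 1;                 /* the guard: if(!cond) goto end */
    if (!(i < n)) goto end;
loop:
    s += a[i]; i++;                /* body */
    ++*branches;                   /* bottom test: if(cond) goto loop */
    if (i < n) goto loop;
end:
    return s;
}

/* Form 2: test at the top, unconditional jump back. */
static int sum_top_test(const int *a, int n, int *branches) {
    int i = 0, s = 0;
    *branches = 0;
loop:
    ++*branches;                   /* top test: if(!cond) goto end */
    if (!(i < n)) goto end;
    s += a[i]; i++;                /* body */
    ++*branches;                   /* goto loop */
    goto loop;
end:
    return s;
}
```

For n = 5 the first form executes 6 branches and the second executes 11. Branch counts are not cycle counts. Still, the extra taken branch per iteration is what the first form removes, and the price is a second copy of the condition.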

...

Re: RISC-V vs. Aarch64

<sqi5is$1ui$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22564&group=comp.arch#22564

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Wed, 29 Dec 2021 09:23:41 -0800
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <sqi5is$1ui$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
<sqcvc7$e0n$1@dont-email.me> <sqhdga$hqs$1@newsreader4.netcologne.de>
<sqi2jf$e6q$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 29 Dec 2021 17:23:40 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="2e593565ba47294f220041520f666376";
logging-data="2002"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+S4YCpL7KEgjuUNbpLSGDl"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:COc/JZKFwB2zsqTC+ZOU85dswlU=
In-Reply-To: <sqi2jf$e6q$1@dont-email.me>
Content-Language: en-US

by: Ivan Godard - Wed, 29 Dec 2021 17:23 UTC

On 12/29/2021 8:32 AM, BGB wrote:
> On 12/29/2021 4:32 AM, Thomas Koenig wrote:
>> BGB <cr88192@gmail.com> wrote:
>>
>>> Eg:
>>>     while(cs<ce)
>>>       *ct++=*cs++;
>>> As:
>>>     .L0:
>>>                 MOV.B (R5), R7
>>>     ADD 1, R5 | MOV.B R7, (R4)
>>>     ADD 1, R4 | CMPQGE R5, R6
>>>     BF .L0
>>
>> Nit: The two are only identical if there is at least one
>> byte to be copied.
>
> Yeah, true...
>
> One of my other responses contained a pretty obvious screw up with
> register usage, but I didn't notice until after I had posted it (and
> Usenet doesn't have any edit or undo).
>
>
> I guess I could note though that my compiler tends to compile while
> loops like:
>    if(!cond)goto .LEND;
>    .L0:
>    body
>    if(cond)goto .L0;
>    .LEND:
> Rather than:
>    .L0:
>    if(!cond)goto .LEND;
>    body
>    goto .L0;
>    .LEND:
>
> As the former tends to give better performance, albeit with slightly
> worse code density.
>
> ...

I always used:
    goto .LTEST;
    .L0:
    body
    .LTEST:
    if(cond)goto .L0;
    .LEND:
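Here is a C rendering of this jump-to-the-test layout, as a hypothetical sketch. The condition appears once, at the bottom, and the loop is entered by an unconditional jump to it. A continue in the body would also land on the test label, and the example uses that to skip negative elements.

```c
/* Ivan's layout: enter at the test, keep a single copy of the condition. */
static int sum_nonneg_jump_to_test(const int *p, const int *e) {
    int s = 0;
    goto test;                  /* goto .LTEST */
loop:                           /* .L0 */
    {
        int v = *p++;
        if (v < 0) goto test;   /* what 'continue' compiles to */
        s += v;
    }
test:                           /* .LTEST */
    if (p < e) goto loop;       /* if(cond) goto .L0 */
    return s;                   /* .LEND */
}
```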

Re: RISC-V vs. Aarch64

<sqi87q$ac1$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22565&group=comp.arch#22565

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Wed, 29 Dec 2021 12:08:56 -0600
Organization: A noiseless patient Spider
Lines: 63
Message-ID: <sqi87q$ac1$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
<sqcvc7$e0n$1@dont-email.me> <sqhdga$hqs$1@newsreader4.netcologne.de>
<sqi2jf$e6q$1@dont-email.me> <sqi5is$1ui$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 29 Dec 2021 18:08:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="05b8e81815da9711d207a325fbbd403c";
logging-data="10625"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18MA01Ka0G72Dca7ZA3pUUQ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:bRJuJEw1YGlec964OqKeYxKfr+c=
In-Reply-To: <sqi5is$1ui$1@dont-email.me>
Content-Language: en-US

by: BGB - Wed, 29 Dec 2021 18:08 UTC

On 12/29/2021 11:23 AM, Ivan Godard wrote:
> On 12/29/2021 8:32 AM, BGB wrote:
>> On 12/29/2021 4:32 AM, Thomas Koenig wrote:
>>> BGB <cr88192@gmail.com> wrote:
>>>
>>>> Eg:
>>>>     while(cs<ce)
>>>>       *ct++=*cs++;
>>>> As:
>>>>     .L0:
>>>>                 MOV.B (R5), R7
>>>>     ADD 1, R5 | MOV.B R7, (R4)
>>>>     ADD 1, R4 | CMPQGE R5, R6
>>>>     BF .L0
>>>
>>> Nit: The two are only identical if there is at least one
>>> byte to be copied.
>>
>> Yeah, true...
>>
>> One of my other responses contained a pretty obvious screw up with
>> register usage, but I didn't notice until after I had posted it (and
>> Usenet doesn't have any edit or undo).
>>
>>
>> I guess I could note though that my compiler tends to compile while
>> loops like:
>>    if(!cond)goto .LEND;
>>    .L0:
>>    body
>>    if(cond)goto .L0;
>>    .LEND:
>> Rather than:
>>    .L0:
>>    if(!cond)goto .LEND;
>>    body
>>    goto .L0;
>>    .LEND:
>>
>> As the former tends to give better performance, albeit with slightly
>> worse code density.
>>
>> ...
>
> I always used:
>     goto .LTEST;
>     .L0:
>     body
>     .LTEST:
>     if(cond)goto .L0;
>     .LEND:

Hmm, yeah. Didn't think of this. I guess this has the advantage of
avoiding a need to duplicate the conditional.

There tends to also still be a label there, because this is where
"continue" lands.

The change is "fairly trivial". Testing it on one program, it results in
a ~0.15% reduction in overall binary size.

Program also passes the "still works" test.

Re: RISC-V vs. Aarch64

<sqiaol$7r2$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=22566&group=comp.arch#22566

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-eb03-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Wed, 29 Dec 2021 18:52:05 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sqiaol$7r2$1@newsreader4.netcologne.de>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me> <2021Dec25.181011@mips.complang.tuwien.ac.at>
<sqaivi$e79$1@dont-email.me> <2021Dec27.091535@mips.complang.tuwien.ac.at>
<sqcvc7$e0n$1@dont-email.me> <sqhdga$hqs$1@newsreader4.netcologne.de>
<sqi2jf$e6q$1@dont-email.me>
Injection-Date: Wed, 29 Dec 2021 18:52:05 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-eb03-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:eb03:0:7285:c2ff:fe6c:992d";
logging-data="8034"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Wed, 29 Dec 2021 18:52 UTC

BGB <cr88192@gmail.com> wrote:
> On 12/29/2021 4:32 AM, Thomas Koenig wrote:
>> BGB <cr88192@gmail.com> wrote:
>>
>>> Eg:
>>>     while(cs<ce)
>>>       *ct++=*cs++;
>>> As:
>>>     .L0:
>>>                 MOV.B (R5), R7
>>>     ADD 1, R5 | MOV.B R7, (R4)
>>>     ADD 1, R4 | CMPQGE R5, R6
>>>     BF .L0
>>
>> Nit: The two are only identical if there is at least one
>> byte to be copied.
>
> Yeah, true...
>
> One of my other responses contained a pretty obvious screw up with
> register usage, but I didn't notice until after I had posted it (and
> Usenet doesn't have any edit or undo).

Usenet has Cancel and Supersede. I see from your headers that
you use Thunderbird for news; it appears to have a function to
generate a Cancel.

Re: RISC-V vs. Aarch64

<sqk7nc$dh0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22606&group=comp.arch#22606

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 13:12:28 +0100
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <sqk7nc$dh0$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 30 Dec 2021 12:12:28 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6016acade3692638d873f399288fb852";
logging-data="13856"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3LYWoVeWlQ2OCE88K4spLWoVoxVOAyQQ="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:iIQP2FOYmsNKyPjC11dAMz1O18A=
In-Reply-To: <0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>
Content-Language: en-US

by: Marcus - Thu, 30 Dec 2021 12:12 UTC

On 2021-12-24, MitchAlsup wrote:
> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:

[snip]

>>
>> It's pretty obvious that a small implementation of RV64G is smaller
>> than a small implementation of A64, and adding the C extension to a
>> small implementation of RV64G (to turn it to RV64GC) is reported in
>> the talk IIRC (it's on the order of 700 transistors, so still cheap),
>> so you can get a small RV64GC cheaper than a small A64 implementation
>> and have similar code density.
> <
> Once you have constructed the register file(s), the integer, memory, and
> floating point units, the size of the fetcher and decoder is immaterial
> (noise).
> <

How about branch misprediction penalty? My assumption is that a more
complex decoder is likely to require more pipeline stages in the front
end, which is likely to hurt when you get branch mispredictions, right?

/Marcus

Re: RISC-V vs. Aarch64

<sqk8ce$qko$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22607&group=comp.arch#22607

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 13:23:42 +0100
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <sqk8ce$qko$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sq6udp$hj3$1@newsreader4.netcologne.de>
<5f3851f0-1bcf-44eb-bd44-1f280b01d4d4n@googlegroups.com>
<sq7p9m$s3o$1@dont-email.me>
<f298dcfe-49ad-4a92-8e24-78b290897b0en@googlegroups.com>
<sq8grs$qfc$1@dont-email.me>
<90c858e0-cadd-4484-8ece-3246b34a9741n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 30 Dec 2021 12:23:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6016acade3692638d873f399288fb852";
logging-data="27288"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FXMdgK+3n75tSuHU5H/JMuhQCmlqDeEE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:zrnjK8D3sHJ7RKB8Jgf57pTc43Q=
In-Reply-To: <90c858e0-cadd-4484-8ece-3246b34a9741n@googlegroups.com>
Content-Language: en-US

by: Marcus - Thu, 30 Dec 2021 12:23 UTC

On 2021-12-26, MitchAlsup wrote:
> On Saturday, December 25, 2021 at 7:34:56 PM UTC-6, Ivan Godard wrote:
>> On 12/25/2021 1:45 PM, MitchAlsup wrote:
>
>>> Dispψ means Disp field does not exist in instruction.
> <
>> Shows as "disp<Greek phi>" for my reader - ???
> <
> The other character I could have used was: ϕ
> <
>>> <
>>> A surprising amount of STs contain constants to be deposited in memory
>>> <
>>> ST #5,[SP+32]
> <
>> That's cute :-)
> <
> Brian gets credit for adding this.
>

In MRISC32 I only have support for storing zero to memory:

STW Z, [R4, #offset]

...or PC-relative address (PC +/- 4 MB):

STWPC Z, #address@pc

I added support for this in the GCC MRISC32 machine description when I
discovered that storing zero to a variable is a very common operation
(and previously it would do it as a LOAD-IMMEDIATE + STORE-REGISTER
pair).

/Marcus

Re: RISC-V vs. Aarch64

<sqkcvk$n97$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22608&group=comp.arch#22608

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 14:42:12 +0100
Organization: A noiseless patient Spider
Lines: 74
Message-ID: <sqkcvk$n97$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 30 Dec 2021 13:42:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6016acade3692638d873f399288fb852";
logging-data="23847"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+esNox7G6XzPnqW9Q28a69RbpXLn0Oabw="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.14.0
Cancel-Lock: sha1:Q/89F8huQeODjNr8cnuxKeLpMJI=
In-Reply-To: <59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
Content-Language: en-US

by: Marcus - Thu, 30 Dec 2021 13:42 UTC

On 2021-12-24, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>

<snip>

>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware.
> <
> Having done both, I can warn you that your assumption is filled with badly
> formed misconceptions. From a purity standpoint you do have a point;
> from a gate count perspective and a cycle time perspective you do not.
> <
>> Relative compares could then use compare
>> instructions:
>> CMPx Rs, Rt, Rd
>> Where:
>> (Rs==Rt) => 0;
>> (Rs> Rt) => 1;
>> (Rs< Rt) => -1.
>>
>> Though, one issue with a plain SUB is that it would not work correctly
>> for comparing integer values the same width as the machine registers (if
>> the difference is too large, the intermediate value will overflow).
> <
> Which is why one needs CMP instructions and not to rely on SUB to do 98%
> of the work.
> <
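The difference is easy to show in C. A dedicated three-way compare is always correct, while taking the sign of a register-width difference goes wrong once the difference overflows. The sketch below uses hypothetical function names.

```c
#include <stdint.h>

/* Three-way compare as in the quoted CMPx: 0 if equal, 1 if a > b, -1 if a < b. */
static int cmp3(int64_t a, int64_t b) {
    return (a > b) - (a < b);
}

/* Sign of a register-width subtraction, computed with wrapping unsigned
   arithmetic the way a plain SUB would. The conversion back to int64_t is
   implementation-defined before C23, but wraps on two's-complement targets. */
static int sub_sign(int64_t a, int64_t b) {
    int64_t d = (int64_t)((uint64_t)a - (uint64_t)b);
    return (d > 0) - (d < 0);
}
```

For example, sub_sign(INT64_MAX, -1) reports "less than", because the true difference does not fit in 64 bits. cmp3 gets it right.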

MRISC32 has the "simple compare-to-fixed-value-and-branch", i.e:

BZ reg, label // Branch if reg == zero
BNZ reg, label // Branch if reg != zero
BS reg, label // Branch if reg == -1 (all bits Set)
BNS reg, label // Branch if reg != -1
BLT reg, label // Branch if reg < zero (signed)
BGE reg, label // Branch if reg >= zero (signed)
BLE reg, label // Branch if reg <= zero (signed)
BGT reg, label // Branch if reg > zero (signed)

...and then it has a bunch of compare instructions (actually "Set if",
which means that the target register is set to -1 if the condition is
true, or cleared to zero if it is false):

SEQ Rd, Rs, Rt/imm // Set Rd if Rs == Rt/imm
SNE Rd, Rs, Rt/imm // Set Rd if Rs != Rt/imm
SLT Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (signed)
SLTU Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (unsigned)
SLE Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (signed)
SLEU Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (unsigned)

...plus floating-point compare instructions (FSEQ, FSLT, ...).

The condition (EQ, LT, ...) is explicitly encoded in the comparison
instruction in order to produce a mask (all bits set or all bits
cleared), which comes in handy for vector operations. It also works well
together with the "bitwise select" instruction (SEL) for implementing
conditional select (conditional move).
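A small C model shows why the mask form is convenient. It is a sketch with hypothetical function names. A Set-if compare yields 0 or -1, and a bitwise select driven by that mask gives a branch-free conditional move, here used to compute a minimum. The same masks would work lane-wise in vector code.

```c
#include <stdint.h>

/* S[cc]-style compare: all ones (-1) if true, 0 if false. */
static int32_t slt_mask(int32_t a, int32_t b) { return -(int32_t)(a < b); }

/* Bitwise select: bits from x where mask is 1, from y where it is 0.
   (Operand order is an assumption, not MRISC32's documented SEL form.) */
static int32_t bit_select(int32_t mask, int32_t x, int32_t y) {
    return (x & mask) | (y & ~mask);
}

/* Branch-free conditional select: min(a, b) = a < b ? a : b. */
static int32_t min_branchless(int32_t a, int32_t b) {
    return bit_select(slt_mask(a, b), a, b);
}
```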

Granted, there is some overlap between S[cc] and B[cc], and it's not
perfectly symmetric (e.g. there's only SLT/SLE, but no SGT/SGE), but
the compiler can usually sort these things out since there's ample
opportunity to transform conditions (e.g. replace SGE+BS with SLT+BNS).
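The rewrite mentioned here can be checked mechanically. This C sketch uses hypothetical names; SGE itself is not an MRISC32 instruction, per the text. It builds SGE as the bitwise complement of SLT. It then confirms, over a small range, that branching on "all bits set" after SGE takes the same branches as branching on "not all bits set" after SLT.

```c
#include <stdbool.h>
#include <stdint.h>

static int32_t slt_mask(int32_t a, int32_t b) { return -(int32_t)(a < b); }

/* The missing SGE, expressed as the complement of SLT. */
static int32_t sge_mask(int32_t a, int32_t b) { return ~slt_mask(a, b); }

/* Check that "SGE then BS" (branch if all bits set) takes the same
   branches as "SLT then BNS" (branch if not all bits set). */
static bool sge_bs_equals_slt_bns(int32_t lo, int32_t hi) {
    for (int32_t a = lo; a <= hi; a++)
        for (int32_t b = lo; b <= hi; b++) {
            bool via_sge = (sge_mask(a, b) == -1);
            bool via_slt = (slt_mask(a, b) != -1);
            if (via_sge != via_slt) return false;
        }
    return true;
}
```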

/Marcus

Re: RISC-V vs. Aarch64

<RrlzJ.130558$SR4.25229@fx43.iad>

https://www.novabbs.com/devel/article-flat.php?id=22611&group=comp.arch#22611

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.de!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx43.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me> <59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sqkcvk$n97$1@dont-email.me>
In-Reply-To: <sqkcvk$n97$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 34
Message-ID: <RrlzJ.130558$SR4.25229@fx43.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 30 Dec 2021 16:50:57 UTC
Date: Thu, 30 Dec 2021 11:50:17 -0500
X-Received-Bytes: 2339

by: EricP - Thu, 30 Dec 2021 16:50 UTC

Marcus wrote:
>
> ...and then it has a bunch of compare instructions (actually "Set if",
> which means that the target register is set to -1 if the condition is
> true, or cleared to zero if it is false):
>
> SEQ Rd, Rs, Rt/imm // Set Rd if Rs == Rt/imm
> SNE Rd, Rs, Rt/imm // Set Rd if Rs != Rt/imm
> SLT Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (signed)
> SLTU Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (unsigned)
> SLE Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (signed)
> SLEU Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (unsigned)
>
> ...plus floating-point compare instructions (FSEQ, FSLT, ...).
>
> The condition (EQ, LT, ...) is explicitly encoded in the comparison
> instruction in order to produce a mask (all bits set or all bits
> cleared), which comes in handy for vector operations. It also works well
> together with the "bitwise select" instruction (SEL) for implementing
> conditional select (conditional move).
>
> Granted, there is some overlap between S[cc] and B[cc], and it's not
> perfectly symmetric (e.g. there's only SLT/SLE, but no SGT/SGE), but
> the compiler can usually sort these things out since there's ample
> opportunity to transform conditions (e.g. replace SGE+BS with SLT+BNS).
>
> /Marcus
>

C, C++ and a bunch of languages explicitly define booleans as 0 or 1,
so this definition won't be optimal for those languages.
VAX Fortran used 0,-1 for LOGICAL, but I don't know if that
was defined by the language or implementation-dependent.
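The mismatch costs one extra instruction per conversion. The sketch below uses hypothetical helper names. Either mask & 1 or -mask turns a 0/-1 mask into 0/1, and negation turns a 0/1 value back into a mask. RISC-V's SLT, by contrast, produces 0 or 1 directly.

```c
#include <stdint.h>

/* Convert a 0/-1 compare mask to a C-style 0/1 truth value and back.
   Each direction costs one extra ALU op (AND or NEG). */
static int32_t mask_to_bool(int32_t mask) { return mask & 1; }  /* -1 -> 1, 0 -> 0 */
static int32_t bool_to_mask(int32_t b)    { return -b; }        /*  1 -> -1, 0 -> 0 */
```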

Re: RISC-V vs. Aarch64

<sqksdv$b5f$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22614&group=comp.arch#22614

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 12:05:48 -0600
Organization: A noiseless patient Spider
Lines: 213
Message-ID: <sqksdv$b5f$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>
<sqk7nc$dh0$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 30 Dec 2021 18:05:51 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="94d123a6a3772585eabf4ec6ac39afd6";
logging-data="11439"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX189FdJNGHv+ue0tHo+74Z3+"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:uZbt24GSzWy5hhkdurz69cbU2TQ=
In-Reply-To: <sqk7nc$dh0$1@dont-email.me>
Content-Language: en-US

by: BGB - Thu, 30 Dec 2021 18:05 UTC

On 12/30/2021 6:12 AM, Marcus wrote:
> On 2021-12-24, MitchAlsup wrote:
>> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>
> [snip]
>
>>>
>>> It's pretty obvious that a small implementation of RV64G is smaller
>>> than a small implementation of A64, and adding the C extension to a
>>> small implementation of RV64G (to turn it to RV64GC) is reported in
>>> the talk IIRC (it's on the order of 700 transistors, so still cheap),
>>> so you can get a small RV64GC cheaper than a small A64 implementation
>>> and have similar code density.
>> <
>> Once you have constructed the register file(s), the integer, memory, and
>> floating point units, the size of the fetcher and decoder is immaterial
>> (noise).
>> <
>
> How about branch misprediction penalty? My assumption is that a more
> complex decoder is likely to require more pipeline stages in the front
> end, which is likely to hurt when you get branch mispredictions, right?
>

I suspect this isn't likely to be too much of an issue for most "sane"
encodings.

For something like x86 or 65C816 or similar, this is likely to get a
little messier (fully variable length byte-oriented ISAs).

Something M68K-like would be intermediate (instructions were generally
16/32/48-bit with a length which can be determined from the instruction
encoded in first word). Bigger hassle would be that M68K is a Reg/Mem
ISA (like x86) with some fairly complex addressing modes.

In my case, it is possible to look at a few bits in the instruction word
and figure out the length (15:12):
  0xE, 0xF: 32-bit
  0x7, 0x9: 32-bit (XGPR)
  Else: 16-bit
Bits (11:8) may also be needed to determine bundling and similar.
  F0..F3: Scalar (F0..F3 blocks, Base Encoding)
  F4..F7: WEX (Repeat F0..F3)
  F8..FB: Scalar
  FC..FD: WEX (Repeat F8..F9)
  FE..FF: Jumbo Prefix
The Ez blocks are predicated forms:
  E0..E3: (F0..F3)?T
  E4..E7: (F0..F3)?F
  E8..E9: (F8..F9)?T
  EA..EB: WEX (F0/F2)?T
  EC..ED: (F8..F9)?F
  EE..EF: WEX (F0/F2)?F
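A C sketch of the length and bundling checks from these tables follows. It assumes w is the first 16-bit word of an instruction in BJX2 mode; RISC-V mode uses its own rules. The function names are hypothetical. The 7 and 9 blocks take their WEX flag from bit 11, as stated for those blocks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Length in bits from bits 15:12 of the first instruction word. */
static int bjx2_insn_len(uint16_t w) {
    switch ((w >> 12) & 0xF) {
    case 0xE: case 0xF:  /* Ez (predicated) and Fz blocks */
    case 0x7: case 0x9:  /* XGPR blocks */
        return 32;
    default:
        return 16;
    }
}

/* Does this 32-bit op have its WEX (bundle) flag set? Uses bits 15:8,
   or bit 11 for the 7/9 blocks. */
static bool bjx2_is_wex(uint16_t w) {
    unsigned hi = (w >> 8) & 0xFF;
    switch (hi >> 4) {
    case 0xF: return (hi >= 0xF4 && hi <= 0xF7) || hi == 0xFC || hi == 0xFD;
    case 0xE: return hi == 0xEA || hi == 0xEB || hi == 0xEE || hi == 0xEF;
    case 0x7: case 0x9: return ((w >> 11) & 1) != 0;
    default:  return false;
    }
}

/* FE..FF: jumbo prefix. */
static bool bjx2_is_jumbo(uint16_t w) {
    unsigned hi = (w >> 8) & 0xFF;
    return hi == 0xFE || hi == 0xFF;
}
```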

The 7 and 9 blocks repeat F0, and F2/F1, albeit with register fields
expanded to 6 bits. Bit[11] is interpreted as a WEX flag, and these
encodings lack support for predication (meanwhile, the Op64 encodings
allow predication but not WEX).

This could be "cleaner", but some parts of the ISA were later additions.

While RISC-V uses the low-order bits, in premise it isn't that much
different. If the CPU is set to RISC-V mode, the flags are modified to
reflect RISC-V's encodings.

Decoding the various major blocks is handled with "selection flags"
(flags are set to indicate the major blocks to handle the instruction).

Within the IF stage, the combination of Fx (32-bit) and Wx (WEX) flags
is used to determine the overall length of the bundle:
  FxA=0: 16-bit
  FxA=1:
    WxA=0: 32-bit
    WxA=1 && (SR.WXE || Jumbo):
      FxB=0: 48-bit (Reserved)
      FxB=1:
        WxB=0: 64-bit
        WxB=1:
          FxC=0: 80-bit (Reserved)
          FxC=1: 96-bit

Jumbo prefixes are mostly special here in that they may force bundle
decoding when the "WEX Enable" flag is clear.
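The nested table above reduces to a chain of early returns. Below is a C sketch with hypothetical parameter names. The text does not spell out the case where WxA is set but neither SR.WXE nor a jumbo prefix is present, so treating it as a plain 32-bit op is an assumption.

```c
#include <stdbool.h>

/* Bundle length in bits from the per-slot flags: fx* = slot holds a 32-bit
   op, wx* = that op has its WEX flag set. wex_ok stands for
   (SR.WXE || Jumbo). */
static int bundle_len_bits(bool fxa, bool wxa, bool fxb, bool wxb, bool fxc,
                           bool wex_ok) {
    if (!fxa) return 16;
    if (!wxa) return 32;
    if (!wex_ok) return 32;  /* assumption: WEX flag ignored when bundling is off */
    if (!fxb) return 48;     /* reserved */
    if (!wxb) return 64;
    if (!fxc) return 80;     /* reserved */
    return 96;
}

/* Next PC, as Fetch computes it from the bundle length. */
static unsigned long next_pc(unsigned long pc, int len_bits) {
    return pc + (unsigned long)(len_bits / 8);
}
```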

At present, Fetch works like:
  Gives 96-bits representing whatever was at PC;
  Gives an overall length for the instruction/bundle;
    Used to calculate the next cycle's PC.

The 96-bit bundle / etc. are then used as inputs to the ID1 stage.

For example, decoding in my case looks something like:
  Unpack the register fields, various immediate types, ...
    May then be modified by the presence of a jumbo prefix.
  Use big nested table lookup to find a few pieces of information:
    nmid: Major opcode (6b)
    ucix: Minor opcode / Data (6b)
    ucty: Pipeline Type (3b)
    ccty: Default Condition Code (3b)
    fmid: Major instruction form (4b)
    ity: Instruction form subtype (4b)
    bty: Data / Element Type (Ld/St) (3b)
  Do a case based on fmid and ity:
    These map instruction registers/... to the decoder ports.
    This seems to be the more expensive part.
  Additional outputs:
    Rm, Ro, Rp, Rn (4x 7-bit)
    Imm: 33-bits
    ...

An outer level decoder will run several instances of the former decoder,
in order to deal with bundled instructions, ...

So, in effect one might have, say:
  3x 32-bit decoder (BJX2 32b) (FzA/FzB/FzC)
  1x 16-bit decoder (BJX2 16b) (Bz)
  1x 32-bit decoder (RISC-V 32b) (RvA)
    Or, 3x 32-bit decoder (RISC-V 32b) (RvA/RvB/RvC, *)
  1x 16-bit decoder (RISC-V 16b) (Rz)

Based on the bundle encoding and CPU mode, it maps the instructions from
the decoders onto the CPU pipeline (Lanes 1/2/3).

*: If using a crazy hack to glue bundles and jumbo prefixes onto RISC-V
at the expense of not being able to have both RVC and bundles able to be
encoded at the same time (otherwise only a single RV decoder is used).
Hackery would be used to encode mode-changes into function pointers or
via the GOT. This breaks the RISC-V ISA spec in a few ways that
(probably) shouldn't affect most code in the wild (the actual rules are
a little weird; I suspect if code relies on the ability to have
misaligned PC with it being selectively ignored, this is likely to break).

Eg:
  if(FxA && !RVI)
    if(WxA)
      if(WxB)
        Lane3=FzA
        Lane2=FzB
        Lane1=FzC
      else
        Lane3=NOP
        Lane2=FzA
        Lane1=FzB
    else
      Lane3=NOP
      Lane2=NOP
      Lane1=FzA
  else if(!RVI)
    Lane3=NOP
    Lane2=NOP
    Lane1=Bz
  else
    Similar, but for RISC-V ops...
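The same lane assignment can be written as a C function, sketched here with hypothetical names. The last op in a bundle always lands in Lane 1, and unused lanes get NOPs. The RISC-V branch is only a placeholder, since the text does not spell it out.

```c
#include <stdbool.h>

/* Which decoder output feeds each lane. SLOT_RV stands in for the
   RISC-V path, which the text only summarizes. */
typedef enum { SLOT_NOP, SLOT_FZA, SLOT_FZB, SLOT_FZC, SLOT_BZ, SLOT_RV } Slot;
typedef struct { Slot lane1, lane2, lane3; } LaneMap;

static LaneMap map_lanes(bool fxa, bool wxa, bool wxb, bool rvi) {
    LaneMap m = { SLOT_NOP, SLOT_NOP, SLOT_NOP };
    if (rvi) {                 /* "Similar, but for RISC-V ops..." */
        m.lane1 = SLOT_RV;     /* placeholder only */
    } else if (fxa) {
        if (wxa && wxb) {      /* three-wide bundle */
            m.lane3 = SLOT_FZA; m.lane2 = SLOT_FZB; m.lane1 = SLOT_FZC;
        } else if (wxa) {      /* two-wide bundle */
            m.lane2 = SLOT_FZA; m.lane1 = SLOT_FZB;
        } else {               /* single 32-bit op */
            m.lane1 = SLOT_FZA;
        }
    } else {                   /* single 16-bit op */
        m.lane1 = SLOT_BZ;
    }
    return m;
}
```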

This is then followed by some special-case logic to spread SIMD
encodings and similar across multiple lanes, the Immediate handling for
Jumbo96 ops, ...

In BJX2, the Jumbo encodings are special, but are routed through the
32-bit decoders. They may receive an extra 26-bit field signaling a
jumbo prefix in the adjacent lane. Some additional hackery is used for
Jumbo96 encodings (uses multi-lane handling for the immediate).

The Jumbo prefix primarily modifies the decoding of the immediate
fields, or the register fields (Op64 encodings).

This whole process fits within a single clock cycle (ID1).

The next stage (ID2) is mostly related to fetching register values from
the register file and/or forwarding them from the EX stages. For
branches/PC-rel/... PC points to the end of the current bundle.

In the remaining stages (ID2 onwards), the CPU is basically 3 parallel
lanes with a 6R+3W register file.

Some units span multiple lanes, and some instructions may span multiple
lanes in the decoder.

While up-front, this seems more complicated, overall it seems cheaper
than what would be required for a superscalar decoder (which would in
effect need to use pattern-matching across the instructions and register
fields to figure out the bundle width).

One possibility could be to handle superscalar as a hardware-level
"autobundler" in the IF stage. Cost and latency seem like potential
issues though. The simplest case would be to check for possible register
collisions and, if none exist, whether both operations are "clean"
operations (such as ALU ops or similar). This hack would at least avoid
needing any significant change to the pipeline though.
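Below is a C sketch of that simplest pairing check for two adjacent decoded ops. The DecodedOp layout and the is_clean flag are hypothetical. A real check would likely also need to consider memory ops, branches, and predicate state.

```c
#include <stdbool.h>

/* Hypothetical decoded form of one op; -1 marks an unused register field. */
typedef struct { int rd, rs, rt; bool is_clean; } DecodedOp;

/* Can op b issue in the same cycle as the preceding op a?
   Assumes both ops read their registers before either writes, so only
   read-after-write and write-after-write collisions matter. */
static bool can_autobundle(DecodedOp a, DecodedOp b) {
    if (!a.is_clean || !b.is_clean) return false;                   /* e.g. ALU ops only */
    if (a.rd >= 0 && (a.rd == b.rs || a.rd == b.rt)) return false;  /* RAW */
    if (a.rd >= 0 && a.rd == b.rd) return false;                    /* WAW */
    return true;
}
```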

Initial thinking was to handle bundling more as a "load time" hack in
the ELF loader, or as an AOT process. Otherwise, the code would only be
able to run one instruction at a time.

It is debatable though if the whole "dual ISA" thing even really makes
sense. Can also note that direct linking across ISAs won't really work as
the C ABIs are very different (so, in effect, BJX2 code callable from
RISC-V mode would also need to use RISC-V's C ABI; with the extra hair
of some internal register remapping, ...).

> /Marcus

Re: RISC-V vs. Aarch64

<abf67f3c-3f7d-477a-a8fa-d739672648aen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22617&group=comp.arch#22617

X-Received: by 2002:ad4:5dc8:: with SMTP id m8mr28265648qvh.71.1640888980440;
Thu, 30 Dec 2021 10:29:40 -0800 (PST)
X-Received: by 2002:a05:6808:1283:: with SMTP id a3mr24742652oiw.110.1640888980216;
Thu, 30 Dec 2021 10:29:40 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 30 Dec 2021 10:29:40 -0800 (PST)
In-Reply-To: <sqk7nc$dh0$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:7c95:1043:36c7:2208;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:7c95:1043:36c7:2208
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>
<sqk7nc$dh0$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <abf67f3c-3f7d-477a-a8fa-d739672648aen@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 30 Dec 2021 18:29:40 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 32

by: MitchAlsup - Thu, 30 Dec 2021 18:29 UTC

On Thursday, December 30, 2021 at 6:12:31 AM UTC-6, Marcus wrote:
> On 2021-12-24, MitchAlsup wrote:
> > On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>
> [snip]
>
> >>
> >> It's pretty obvious that a small implementation of RV64G is smaller
> >> than a small implementation of A64, and adding the C extension to a
> >> small implementation of RV64G (to turn it to RV64GC) is reported in
> >> the talk IIRC (it's on the order of 700 transistors, so still cheap),
> >> so you can get a small RV64GC cheaper than a small A64 implementation
> >> and have similar code density.
> > <
> > Once you have constructed the register file(s), the integer, memory, and
> > floating point units, the size of the fetcher and decoder is immaterial
> > (noise).
> > <
>
> How about branch misprediction penalty? My assumption is that a more
> complex decoder is likely to require more pipeline stages in the front
> end, which is likely to hurt when you get branch mispredictions, right?
<
The more pipeline stages one has, the smaller the relative cost of the decoder.
<
The wider the pipeline, the higher the relative misprediction costs.
<
On the other hand, my team in 1990 figured out how to execute a predicted
branch in cycle k and be inserting instructions into the execution window
from the mispredicted direction in cycle k+1. You CAN make the repair
cost zero cycles; you still eat the cycles it takes to detect the misprediction.
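A back-of-envelope C sketch of the width point. It assumes (this is not from the thread) an ideal machine that retires `width` instructions every cycle and a fixed number of dead cycles per mispredicted branch; `mispredict_loss` is a hypothetical helper that returns the fraction of all cycles lost to mispredictions.

```c
/* Back-of-envelope model, an assumption for illustration only:
 * an ideal machine retires `width` instructions per cycle, and every
 * mispredicted branch adds `penalty_cycles` dead cycles.
 * Returns the fraction of all cycles that are lost to mispredictions. */
static double mispredict_loss(double insns, double width,
                              double mispredicts, double penalty_cycles)
{
    double busy_cycles = insns / width;                 /* useful work */
    double dead_cycles = mispredicts * penalty_cycles;  /* same for any width */
    return dead_cycles / (busy_cycles + dead_cycles);
}
```

For 1000 instructions, 10 mispredicts and a 10-cycle penalty, this gives about 9.1% at width 1, 16.7% at width 2, 28.6% at width 4 and 44.4% at width 8: the dead cycles stay fixed while the useful cycles shrink, so the relative cost grows with width.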
>
> /Marcus

Re: RISC-V vs. Aarch64

<sql2cm$3h7$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22623&group=comp.arch#22623


From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 20:47:33 +0100
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <sql2cm$3h7$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad>
Injection-Date: Thu, 30 Dec 2021 19:47:34 -0000 (UTC)
In-Reply-To: <RrlzJ.130558$SR4.25229@fx43.iad>
Content-Language: en-US

by: Marcus - Thu, 30 Dec 2021 19:47 UTC

On 2021-12-30, EricP wrote:
> Marcus wrote:
>>
>> ...and then it has a bunch of compare instructions (actually "Set if",
>> which means that the target register is set to -1 if the condition is
>> true, or cleared to zero if it is false):
>>
>> SEQ Rd, Rs, Rt/imm // Set Rd if Rs == Rt/imm
>> SNE Rd, Rs, Rt/imm // Set Rd if Rs != Rt/imm
>> SLT Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (signed)
>> SLTU Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (unsigned)
>> SLE Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (signed)
>> SLEU Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (unsigned)
>>
>> ...plus floating-point compare instructions (FSEQ, FSLT, ...).
>>
>> The condition (EQ, LT, ...) is explicitly encoded in the comparison
>> instruction in order to produce a mask (all bits set or all bits
>> cleared), which comes in handy for vector operations. It also works well
>> together with the "bitwise select" instruction (SEL) for implementing
>> conditional select (conditional move).
>>
>> Granted, there is some overlap between S[cc] and B[cc], and it's not
>> perfectly symmetric (e.g. there's only SLT/SLE, but no SGT/SGE), but
>> the compiler can usually sort these things out since there's ample
>> opportunity to transform conditions (e.g. replace SGE+BS with SLT+BNS).
>>
>> /Marcus
>>
>
> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
> so this definition won't be optimal for those languages.
> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
> was defined by the language or implementation dependant.
>

As a software developer I'm painfully aware of this. I decided not to
care too much about it, though. Really, most software that relies on
this property of C should be frowned upon. E.g. expressions like:

a = b + (c == d);

...aren't really good programming practice.

I have seen a few places where the compiler does conversions from -1 to
+1 (but those have mostly been due to missing/bad pattern matching in
the machine description).
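A C model of the 0/-1 convention, to show why the conversion is usually cheap. The names `seq_mask`, `slt_mask`, `sel`, `add_truth` and `mask_to_bool` are hypothetical stand-ins for SEQ, SLT and SEL as described above; the operand order of the real SEL is an assumption. Note that `b + (c == d)` needs no conversion at all: adding a 0/1 truth value is the same as subtracting the 0/-1 mask.

```c
#include <stdint.h>

/* Hypothetical models of mask-producing compares: -1 (all bits set) if the
 * condition holds, 0 otherwise. */
static int64_t seq_mask(int64_t a, int64_t b) { return -(int64_t)(a == b); }
static int64_t slt_mask(int64_t a, int64_t b) { return -(int64_t)(a < b); }

/* Bitwise select: bits of x where mask is 1, bits of y where mask is 0. */
static int64_t sel(int64_t mask, int64_t x, int64_t y)
{
    return (x & mask) | (y & ~mask);
}

/* a = b + (c == d) without any 0/1 conversion: adding a 0/1 truth value
 * is the same as subtracting the 0/-1 mask. */
static int64_t add_truth(int64_t b, int64_t c, int64_t d)
{
    return b - seq_mask(c, d);
}

/* Where a genuine 0/1 value is needed (e.g. a C _Bool result),
 * a single AND converts the mask. */
static int mask_to_bool(int64_t mask) { return (int)(mask & 1); }
```

For example, `sel(slt_mask(x, y), x, y)` is a branch-free min(x, y).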

/Marcus

Re: RISC-V vs. Aarch64

<sql3e7$ogs$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=22624&group=comp.arch#22624


From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 14:05:25 -0600
Organization: A noiseless patient Spider
Lines: 73
Message-ID: <sql3e7$ogs$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<0a8ff16a-53de-420e-9c82-cfc9e87f62e9n@googlegroups.com>
<sqk7nc$dh0$1@dont-email.me>
<abf67f3c-3f7d-477a-a8fa-d739672648aen@googlegroups.com>
Injection-Date: Thu, 30 Dec 2021 20:05:27 -0000 (UTC)
In-Reply-To: <abf67f3c-3f7d-477a-a8fa-d739672648aen@googlegroups.com>
Content-Language: en-US

by: BGB - Thu, 30 Dec 2021 20:05 UTC

On 12/30/2021 12:29 PM, MitchAlsup wrote:
> On Thursday, December 30, 2021 at 6:12:31 AM UTC-6, Marcus wrote:
>> On 2021-12-24, MitchAlsup wrote:
>>> On Friday, December 24, 2021 at 11:00:14 AM UTC-6, Anton Ertl wrote:
>>
>> [snip]
>>
>>>>
>>>> It's pretty obvious that a small implementation of RV64G is smaller
>>>> than a small implementation of A64, and adding the C extension to a
>>>> small implementation of RV64G (to turn it to RV64GC) is reported in
>>>> the talk IIRC (it's on the order of 700 transistors, so still cheap),
>>>> so you can get a small RV64GC cheaper than a small A64 implementation
>>>> and have similar code density.
>>> <
>>> Once you have constructed the register file(s), the integer, memory, and
>>> floating point units, the size of the fetcher and decoder is immaterial
>>> (noise).
>>> <
>>
>> How about branch misprediction penalty? My assumption is that a more
>> complex decoder is likely to require more pipeline stages in the front
>> end, which is likely to hurt when you get branch mispredictions, right?
> <
> The more pipeline stages one has, the less cost a decoder is.

Probably true:
With two "decode" stages (one for unpacking, another for register
fetch), I can have a relatively complicated unpacking process.

If one tries to unpack and fetch registers in the same stage, it is a
lot harder to make it pass timing.

Things like LUT cost are less predictable:
Breaking complex logic into multiple stages will often reduce LUT cost;
but adding extra "forwarding" stages for timing reasons will often
increase LUT cost.

> <
> The wider the pipeline, the higher the relative misprediction costs.
> <
> On the other hand, My team in 1990 figured out how to execute a predicted
> branch in cycle k and be inserting instructions into the execution window
> form the mispredicted direction in cycle k+1. You CAN make the repair
> cost zero cycles, you still eat the delay to calculation of mispredict cycles.

This is where predication is nice:
It can absorb some of the cost of "small" branches (by avoiding the
branch in the first place).

In my case, a branch takes effect ~2 cycles later:

EX1: Signal Branch
EX2: Initiate Branch (new Addr goes into PF);
EX3: First "post-branch" fetch reaches IF.

It is necessary though to invalidate everything previously in the
pipeline, so most of the cost is waiting for the old/invalidated
contents to cycle through.

The exception is the branch predictor, which redirects the fetch earlier
in the pipeline. However, since it takes effect in ID1 (rather than IF),
there is still a slight delay.

And, there is a separate PF/IF stage mostly because the Block RAM
operates on a clock edge, ...

>>
>> /Marcus

Re: RISC-V vs. Aarch64

<f83def06-e4ee-4a89-897c-d115133b0ef6n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=22626&group=comp.arch#22626


Newsgroups: comp.arch
Date: Thu, 30 Dec 2021 12:34:00 -0800 (PST)
In-Reply-To: <sql2cm$3h7$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com> <sqkcvk$n97$1@dont-email.me>
<RrlzJ.130558$SR4.25229@fx43.iad> <sql2cm$3h7$1@dont-email.me>
Message-ID: <f83def06-e4ee-4a89-897c-d115133b0ef6n@googlegroups.com>
Subject: Re: RISC-V vs. Aarch64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 30 Dec 2021 20:34:00 +0000
Lines: 62

by: MitchAlsup - Thu, 30 Dec 2021 20:34 UTC

On Thursday, December 30, 2021 at 1:47:36 PM UTC-6, Marcus wrote:
> On 2021-12-30, EricP wrote:
> > Marcus wrote:
> >>
> >> ...and then it has a bunch of compare instructions (actually "Set if",
> >> which means that the target register is set to -1 if the condition is
> >> true, or cleared to zero if it is false):
> >>
> >> SEQ Rd, Rs, Rt/imm // Set Rd if Rs == Rt/imm
> >> SNE Rd, Rs, Rt/imm // Set Rd if Rs != Rt/imm
> >> SLT Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (signed)
> >> SLTU Rd, Rs, Rt/imm // Set Rd if Rs < Rt/imm (unsigned)
> >> SLE Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (signed)
> >> SLEU Rd, Rs, Rt/imm // Set Rd if Rs <= Rt/imm (unsigned)
> >>
> >> ...plus floating-point compare instructions (FSEQ, FSLT, ...).
> >>
> >> The condition (EQ, LT, ...) is explicitly encoded in the comparison
> >> instruction in order to produce a mask (all bits set or all bits
> >> cleared), which comes in handy for vector operations. It also works well
> >> together with the "bitwise select" instruction (SEL) for implementing
> >> conditional select (conditional move).
> >>
> >> Granted, there is some overlap between S[cc] and B[cc], and it's not
> >> perfectly symmetric (e.g. there's only SLT/SLE, but no SGT/SGE), but
> >> the compiler can usually sort these things out since there's ample
> >> opportunity to transform conditions (e.g. replace SGE+BS with SLT+BNS).
> >>
> >> /Marcus
> >>
> >
> > C,C++ and a bunch of languages explicitly define booleans as 0 or 1
> > so this definition won't be optimal for those languages.
> > VAX Fortran used 0,-1 for LOGICAL but I don't know if that
> > was defined by the language or implementation dependant.
<
IIRC Pascal uses 0 and -1
<
So, when I tackled this, I used signed (0 and -1) or unsigned (0 and +1)
bit-field extracts on the results of my CMP instructions to obtain TRUE or
FALSE in whichever form the language wants. {The semi-equivalent SETGT
kind of instructions would have to double their footprint in the ISA
to cover both cases.} Having both forms available means that using
TRUE as a bit-field mask works well.
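A C sketch of that scheme, assuming a compare that returns all relations at once as a bit vector. The bit positions in `cmp_bits` are made up for illustration (the thread does not give the real encoding); `extract_u1` and `extract_s1` are hypothetical models of 1-bit unsigned and signed bit-field extracts.

```c
#include <stdint.h>

/* Made-up bit positions, for illustration only. */
enum { CMP_EQ = 0, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE };

/* Model of a compare that produces every signed relation at once. */
static uint64_t cmp_bits(int64_t a, int64_t b)
{
    return ((uint64_t)(a == b) << CMP_EQ) | ((uint64_t)(a != b) << CMP_NE)
         | ((uint64_t)(a <  b) << CMP_LT) | ((uint64_t)(a <= b) << CMP_LE)
         | ((uint64_t)(a >  b) << CMP_GT) | ((uint64_t)(a >= b) << CMP_GE);
}

/* 1-bit unsigned extract: FALSE = 0, TRUE = +1 (C/C++ style). */
static int64_t extract_u1(uint64_t v, unsigned pos)
{
    return (int64_t)((v >> pos) & 1u);
}

/* 1-bit signed extract: FALSE = 0, TRUE = -1 (all ones, usable as a mask). */
static int64_t extract_s1(uint64_t v, unsigned pos)
{
    return -(int64_t)((v >> pos) & 1u);
}
```

The signed form is what makes TRUE usable directly as an AND mask: `x & extract_s1(cmp_bits(a, b), CMP_GT)` yields x when a > b and 0 otherwise.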
> >
> As a software developer I'm painfully aware of this. I decided not to
> care too much about it, though. Really, most software that relies on
> this property of C should be frowned upon. E.g. expressions like:
>
> a = b + (c == d);
>
> ...aren't really good programming practice.
<
And ESPECIALLY bad in floating point where:
a) you should not be doing equality comparisons
b) you should be checking for NaNs
c) you need to write comparison code with 3 blocks (yes, no, not-comparable)
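Point (c) can be written in C99 with `isunordered()` from <math.h>, which is true when at least one operand is a NaN. A minimal sketch for the `<` relation; the enum and the function name `less3` are hypothetical.

```c
#include <math.h>

/* The three outcomes of one floating-point relation. */
typedef enum { REL_YES, REL_NO, REL_UNORDERED } rel3;

/* a < b with the "not comparable" case kept separate.
 * isunordered() is true when at least one operand is a NaN. */
static rel3 less3(double a, double b)
{
    if (isunordered(a, b))
        return REL_UNORDERED;
    return (a < b) ? REL_YES : REL_NO;
}
```

A plain two-way `if (a < b)` silently puts the NaN case on the "no" side; both `NAN < 1.0` and `NAN >= 1.0` are false.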
>
> I have seen a few places where the compiler does conversions from -1 to
> +1 (but those have mostly been due to missing/bad pattern matching in
> the machine description).
>
> /Marcus

Re: RISC-V vs. Aarch64

<sql73d$6es$2@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=22628&group=comp.arch#22628


From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 21:07:57 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sql73d$6es$2@newsreader4.netcologne.de>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad>
<sql2cm$3h7$1@dont-email.me>
Injection-Date: Thu, 30 Dec 2021 21:07:57 -0000 (UTC)

by: Thomas Koenig - Thu, 30 Dec 2021 21:07 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:
> On 2021-12-30, EricP wrote:

>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>> so this definition won't be optimal for those languages.
>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>> was defined by the language or implementation dependant.

It is implementation-defined.

With GNU Fortran, the choice was made to only allow 0 and 1.
Anything else, and you're likely to end up with random results
with LOGICAL variables.

One reason for that was that gcc, still a C-centric compiler, is
geared toward _Bool. Fortran followed suit, because C interoperability
more or less prescribes it.

> As a software developer I'm painfully aware of this. I decided not to
> care too much about it, though. Really, most software that relies on
> this property of C should be frowned upon. E.g. expressions like:
>
> a = b + (c == d);
>
> ...aren't really good programming practice.

Agreed.

> I have seen a few places where the compiler does conversions from -1 to
> +1 (but those have mostly been due to missing/bad pattern matching in
> the machine description).

What is the assembly for

_Bool foo (int a, int b)
{ return a > b;
}

for your architecture at a reasonable optimization level?

Re: RISC-V vs. Aarch64

<j36lq7Ff1hsU1@mid.individual.net>


https://www.novabbs.com/devel/article-flat.php?id=22629&group=comp.arch#22629


From: niklas.h...@tidorum.invalid (Niklas Holsti)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 23:14:46 +0200
Organization: Tidorum Ltd
Lines: 21
Message-ID: <j36lq7Ff1hsU1@mid.individual.net>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad>
<sql2cm$3h7$1@dont-email.me>
<f83def06-e4ee-4a89-897c-d115133b0ef6n@googlegroups.com>
In-Reply-To: <f83def06-e4ee-4a89-897c-d115133b0ef6n@googlegroups.com>
Content-Language: en-US

by: Niklas Holsti - Thu, 30 Dec 2021 21:14 UTC

On 2021-12-30 22:34, MitchAlsup wrote:
> On Thursday, December 30, 2021 at 1:47:36 PM UTC-6, Marcus wrote:
>> On 2021-12-30, EricP wrote:

[snip]

>>> C,C++ and a bunch of languages explicitly define booleans as 0 or 1
>>> so this definition won't be optimal for those languages.
>>> VAX Fortran used 0,-1 for LOGICAL but I don't know if that
>>> was defined by the language or implementation dependant.
> <
> IIRC Pascal uses 0 and -1

Pascal-the-language has a "boolean" type, with values False and True. I
believe the machine representation of those values is
implementation-defined.

Ada-the-language also has a "Boolean" type, with values False and True,
but it is defined as an enumeration type with the default representation
of such types, which means that False is represented as 0 and True as 1.

Re: RISC-V vs. Aarch64

<sql7vs$6es$3@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=22630&group=comp.arch#22630


From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Thu, 30 Dec 2021 21:23:08 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sql7vs$6es$3@newsreader4.netcologne.de>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
<sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad>
<sql2cm$3h7$1@dont-email.me>
<f83def06-e4ee-4a89-897c-d115133b0ef6n@googlegroups.com>
Injection-Date: Thu, 30 Dec 2021 21:23:08 -0000 (UTC)

by: Thomas Koenig - Thu, 30 Dec 2021 21:23 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Thursday, December 30, 2021 at 1:47:36 PM UTC-6, Marcus wrote:

>> As a software developer I'm painfully aware of this. I decided not to
>> care too much about it, though. Really, most software that relies on
>> this property of C should be frowned upon. E.g. expressions like:
>>
>> a = b + (c == d);
>>
>> ...aren't really good programming practice.
><
> And ESPECIALY bad in floating point where:
> a) you should not be doing equality comparisons

You can do equality comparisons if you know that the
results will be exact. For example,

a = 1.0 + 1.0
if (a /= 2.0) stop "Your adder is broken"

should never execute the STOP.
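The same idea in C. The first function mirrors the Fortran snippet; the second is an added contrast (not from the thread) where the operands are not exactly representable, so the equality test fails in IEEE double precision.

```c
#include <stdbool.h>

/* 1.0, 2.0 and their sum are exactly representable, so == is reliable. */
static bool exact_sum_equal(void) { return 1.0 + 1.0 == 2.0; }

/* 0.1, 0.2 and 0.3 are not exactly representable in binary; the rounded
 * double sum 0.1 + 0.2 is not the double nearest to 0.3. */
static bool inexact_sum_equal(void) { return 0.1 + 0.2 == 0.3; }
```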

> b) you should be checking for NaNs

Depends. Sometimes NaNs are not an issue, and sometimes
it's fine to let them bubble up.

> c) you need to write comparison code with 3 blocks (yes, no, not-comparable)

Or you can use Fortran and check afterwards :-)

Here's an example taken from the Fortran 2018 standard, which
checks for different sizes first, and then for overflow during
the calculation.

MODULE DOT
  ! Module for dot product of two real arrays of rank 1.
  ! The caller needs to ensure that exceptions do not cause halting.
  USE, INTRINSIC :: IEEE_EXCEPTIONS
  LOGICAL :: MATRIX_ERROR = .FALSE.
  INTERFACE OPERATOR(.dot.)
    MODULE PROCEDURE MULT
  END INTERFACE OPERATOR(.dot.)
CONTAINS
  REAL FUNCTION MULT (A, B)
    REAL, INTENT (IN) :: A(:), B(:)
    INTEGER I
    LOGICAL OVERFLOW
    IF (SIZE(A) /= SIZE(B)) THEN
      MATRIX_ERROR = .TRUE.
      RETURN
    END IF
    ! The processor ensures that IEEE_OVERFLOW is quiet.
    MULT = 0.0
    DO I = 1, SIZE (A)
      MULT = MULT + A(I)*B(I)
    END DO
    CALL IEEE_GET_FLAG (IEEE_OVERFLOW, OVERFLOW)
    IF (OVERFLOW) MATRIX_ERROR = .TRUE.
  END FUNCTION MULT
END MODULE DOT

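You can also check IEEE_INVALID to see whether an invalid operation
(for example Inf - Inf, which produces a NaN) occurred; quiet NaNs that
were already in the inputs propagate without raising it.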
