Rocksolid Light



devel / comp.arch / Re: Mixed EGU/EGO floating-point

Subject  Author
* Mixed EGU/EGO floating-point  Quadibloc
`* Re: Mixed EGU/EGO floating-point  MitchAlsup
 +* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |`- Re: Mixed EGU/EGO floating-point  Quadibloc
 +* Re: Mixed EGU/EGO floating-point  Quadibloc
 |`* Re: Mixed EGU/EGO floating-point  Anton Ertl
 | +* Re: Mixed EGU/EGO floating-point  MitchAlsup
 | |`- Re: Mixed EGU/EGO floating-point  Michael S
 | +* Re: Mixed EGU/EGO floating-point  John Levine
 | |+- Re: Mixed EGU/EGO floating-point  John Dallman
 | |+- Re: Mixed EGU/EGO floating-point  Anton Ertl
 | |+* Re: Mixed EGU/EGO floating-point  JimBrakefield
 | ||`- Re: Mixed EGU/EGO floating-point  EricP
 | |`- Re: Mixed EGU/EGO floating-point  Michael S
 | `* Re: Mixed EGU/EGO floating-point  Terje Mathisen
 |  +* Re: Mixed EGU/EGO floating-point  Ivan Godard
 |  |+* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||`- Re: Mixed EGU/EGO floating-point  Ivan Godard
 |  |+* Re: Mixed EGU/EGO floating-point  Anton Ertl
 |  ||`* Re: Mixed EGU/EGO floating-point  Thomas Koenig
 |  || +* Re: Mixed EGU/EGO floating-point  EricP
 |  || |+* Re: Mixed EGU/EGO floating-point  Thomas Koenig
 |  || ||`- Re: Mixed EGU/EGO floating-point  BGB
 |  || |+- Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  || |`* Re: Mixed EGU/EGO floating-point  Stephen Fuld
 |  || | +- Re: Mixed EGU/EGO floating-point  EricP
 |  || | +- Re: Mixed EGU/EGO floating-point  John Levine
 |  || | +* Re: Mixed EGU/EGO floating-point  Terje Mathisen
 |  || | |`- Re: Mixed EGU/EGO floating-point  Ivan Godard
 |  || | `- Re: Mixed EGU/EGO floating-point  George Neuner
 |  || +* Re: Mixed EGU/EGO floating-point  Stephen Fuld
 |  || |`- Re: Mixed EGU/EGO floating-point  EricP
 |  || +- Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  || `* Re: Mixed EGU/EGO floating-point  Anton Ertl
 |  ||  `* Re: Mixed EGU/EGO floating-point  Thomas Koenig
 |  ||   +* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||   |`* Re: Mixed EGU/EGO floating-point  Thomas Koenig
 |  ||   | +* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||   | |+* Re: Mixed EGU/EGO floating-point  Stefan Monnier
 |  ||   | ||+- Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||   | ||`* Re: Mixed EGU/EGO floating-point  Ivan Godard
 |  ||   | || `* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||   | ||  `* Re: Mixed EGU/EGO floating-point  BGB
 |  ||   | ||   `* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||   | ||    `- Re: Mixed EGU/EGO floating-point  BGB
 |  ||   | |`* Re: Mixed EGU/EGO floating-point  Anton Ertl
 |  ||   | | `- Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||   | `* Re: Mixed EGU/EGO floating-point  Anton Ertl
 |  ||   |  `- Re: Mixed EGU/EGO floating-point  Thomas Koenig
 |  ||   `* Re: Mixed EGU/EGO floating-point  Anton Ertl
 |  ||    +* Re: Mixed EGU/EGO floating-point  BGB
 |  ||    |`- Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  ||    `* Re: Mixed EGU/EGO floating-point  Thomas Koenig
 |  ||     `- Re: Mixed EGU/EGO floating-point  MitchAlsup
 |  |`- Re: Mixed EGU/EGO floating-point  Terje Mathisen
 |  `* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |   `- Re: Mixed EGU/EGO floating-point  BGB
 +* Re: Mixed EGU/EGO floating-point  JimBrakefield
 |`* Re: Mixed EGU/EGO floating-point  Quadibloc
 | +* Re: Mixed EGU/EGO floating-point  MitchAlsup
 | |+- Re: Mixed EGU/EGO floating-point  JimBrakefield
 | |`* Perfect roudning of trogonometric functions (was: Mixed EGU/EGO floating-point)  Stefan Monnier
 | | `* Re: Perfect roudning of trogonometric functions  Terje Mathisen
 | |  `* Re: Perfect roudning of trogonometric functions  Stefan Monnier
 | |   `- Re: Perfect roudning of trogonometric functions  Terje Mathisen
 | `* Re: Mixed EGU/EGO floating-point  EricP
 |  `* Re: Mixed EGU/EGO floating-point  MitchAlsup
 |   +- Re: Mixed EGU/EGO floating-point  JimBrakefield
 |   +- Re: Mixed EGU/EGO floating-point  Terje Mathisen
 |   `- Re: Mixed EGU/EGO floating-point  EricP
 `* Re: Mixed EGU/EGO floating-point  BGB
  `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   +* Re: Mixed EGU/EGO floating-point  Stefan Monnier
   |`* Re: Mixed EGU/EGO floating-point  BGB
   | `* Re: Mixed EGU/EGO floating-point  John Dallman
   |  +- Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  +* Re: Mixed EGU/EGO floating-point  Thomas Koenig
   |  |+- Re: Mixed EGU/EGO floating-point  John Dallman
   |  |+* Re: Mixed EGU/EGO floating-point  BGB
   |  ||`* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  || `* Re: Mixed EGU/EGO floating-point  BGB
   |  ||  `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  ||   +* Re: Mixed EGU/EGO floating-point  Ivan Godard
   |  ||   |`- Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  ||   `* Re: Mixed EGU/EGO floating-point  BGB
   |  ||    `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  ||     `* Re: Mixed EGU/EGO floating-point  BGB
   |  ||      `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  ||       `- Re: Mixed EGU/EGO floating-point  BGB
   |  |`* Re: Mixed EGU/EGO floating-point  Anton Ertl
   |  | `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  |  +* Re: Mixed EGU/EGO floating-point  Quadibloc
   |  |  |+- Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  |  |`* Re: Mixed EGU/EGO floating-point  Anton Ertl
   |  |  | `- Re: Mixed EGU/EGO floating-point  Quadibloc
   |  |  `* Re: Mixed EGU/EGO floating-point  Anton Ertl
   |  |   `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  |    +* Re: Mixed EGU/EGO floating-point  BGB
   |  |    |`* Re: Mixed EGU/EGO floating-point  BGB
   |  |    | `* Re: Mixed EGU/EGO floating-point  Quadibloc
   |  |    |  +- Re: Mixed EGU/EGO floating-point  BGB
   |  |    |  `* Re: Mixed EGU/EGO floating-point  MitchAlsup
   |  |    +- Re: Mixed EGU/EGO floating-point  Anton Ertl
   |  |    `- Re: Mixed EGU/EGO floating-point  Quadibloc
   |  `* Re: Mixed EGU/EGO floating-point  Quadibloc
   `* Re: Mixed EGU/EGO floating-point  BGB

Re: Mixed EGU/EGO floating-point

<2022May15.193827@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25310&group=comp.arch#25310

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 17:38:27 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 27
Distribution: world
Message-ID: <2022May15.193827@mips.complang.tuwien.ac.at>
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com> <2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com> <a4400d89-ce0d-4c76-aeb2-de2d13f40318n@googlegroups.com> <2022May13.200449@mips.complang.tuwien.ac.at> <t5p9pf$52o$1@gioia.aioe.org> <t5pb7g$6og$1@dont-email.me> <2022May15.091837@mips.complang.tuwien.ac.at> <t5r53e$cjh$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="a804ed51de3bbe163e8bef27aeacdbb8";
logging-data="15055"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/4xZWmn8PG/aoQt51nQMRN"
Cancel-Lock: sha1:9GMiXDOAE43cEVb5hs1+YzrvPN4=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 15 May 2022 17:38 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
>> No, we already have NaNs for that. He wants to be able to compute the
>> sum/product of all present data, and what he proposes works for that.
>
> s = sum(a,mask=.not. ieee_is_nan(a))
>
>works fine.

Which CPU has this instruction?

>Simply ignoring values for summation would start leading to
>"interesting" results when you want to have an average instead of
>a sum, for example.

Set up a straw man ...

>Introducing new classes of numbers in floating point is no substitute
>for careful thought on part of the programmer.

... and now beat on it.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Mixed EGU/EGO floating-point

<t5re2t$h0o$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25311&group=comp.arch#25311

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-5c6-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 17:42:53 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t5re2t$h0o$1@newsreader4.netcologne.de>
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com>
<a4400d89-ce0d-4c76-aeb2-de2d13f40318n@googlegroups.com>
<2022May13.200449@mips.complang.tuwien.ac.at> <t5p9pf$52o$1@gioia.aioe.org>
<t5pb7g$6og$1@dont-email.me> <2022May15.091837@mips.complang.tuwien.ac.at>
<t5r53e$cjh$1@newsreader4.netcologne.de>
<2022May15.193827@mips.complang.tuwien.ac.at>
Injection-Date: Sun, 15 May 2022 17:42:53 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-5c6-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:5c6:0:7285:c2ff:fe6c:992d";
logging-data="17432"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 15 May 2022 17:42 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>
>>> No, we already have NaNs for that. He wants to be able to compute the
>>> sum/product of all present data, and what he proposes works for that.
>>
>> s = sum(a,mask=.not. ieee_is_nan(a))
>>
>>works fine.
>
> Which CPU has this instruction?

Any processor which implements Fortran.

Re: Mixed EGU/EGO floating-point

<memo.20220515192825.11824P@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=25314&group=comp.arch#25314

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 19:28 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <memo.20220515192825.11824P@jgd.cix.co.uk>
References: <t5rco4$ga5$1@newsreader4.netcologne.de>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="8f769413a9fc99acd515f40d9a70ab70";
logging-data="7067"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Mj82Q8baWvd+qPCEy5txK5399zMW9Yn0="
Cancel-Lock: sha1:mX2+dwh+BDyYhWll7OLX6Dei/2I=
 by: John Dallman - Sun, 15 May 2022 18:28 UTC

In article <t5rco4$ga5$1@newsreader4.netcologne.de>,
tkoenig@netcologne.de (Thomas Koenig) wrote:

> Sounds like a broken design to me. What exactly were the different
> kinds of FP registers, and how were they different?

If I describe them, you'll be able to identify the vendor, which I'd
prefer to avoid. It's hardly fair to damn them now for a bad presentation
about 20 years ago.

John

Re: Mixed EGU/EGO floating-point

<6c1f724c-278a-43fe-8aab-efed4a1a310bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25315&group=comp.arch#25315

Newsgroups: comp.arch
X-Received: by 2002:a37:c44:0:b0:69f:81cb:1d6a with SMTP id 65-20020a370c44000000b0069f81cb1d6amr10098790qkm.494.1652639763908;
Sun, 15 May 2022 11:36:03 -0700 (PDT)
X-Received: by 2002:aca:b782:0:b0:325:7a29:352d with SMTP id
h124-20020acab782000000b003257a29352dmr11499878oif.217.1652639763672; Sun, 15
May 2022 11:36:03 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!pasdenom.info!nntpfeed.proxad.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 15 May 2022 11:36:03 -0700 (PDT)
In-Reply-To: <t5re2t$h0o$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4811:b402:98e:d212;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4811:b402:98e:d212
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com> <a4400d89-ce0d-4c76-aeb2-de2d13f40318n@googlegroups.com>
<2022May13.200449@mips.complang.tuwien.ac.at> <t5p9pf$52o$1@gioia.aioe.org>
<t5pb7g$6og$1@dont-email.me> <2022May15.091837@mips.complang.tuwien.ac.at>
<t5r53e$cjh$1@newsreader4.netcologne.de> <2022May15.193827@mips.complang.tuwien.ac.at>
<t5re2t$h0o$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6c1f724c-278a-43fe-8aab-efed4a1a310bn@googlegroups.com>
Subject: Re: Mixed EGU/EGO floating-point
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 15 May 2022 18:36:03 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sun, 15 May 2022 18:36 UTC

On Sunday, May 15, 2022 at 12:42:56 PM UTC-5, Thomas Koenig wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> > Thomas Koenig <tko...@netcologne.de> writes:
> >>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>
> >>> No, we already have NaNs for that. He wants to be able to compute the
> >>> sum/product of all present data, and what he proposes works for that.
> >>
> >> s = sum(a,mask=.not. ieee_is_nan(a))
> >>
> >>works fine.
> >
> > Which CPU has this instruction?
> Any processor which implements Fortran.
<
I have used plenty of processors that had quality implementations of FORTRAN
but did not have that as an instruction.

Re: Mixed EGU/EGO floating-point

<t5rjju$s26$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25316&group=comp.arch#25316

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 14:15:59 -0500
Organization: A noiseless patient Spider
Lines: 343
Message-ID: <t5rjju$s26$1@dont-email.me>
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com>
<t5p52v$shq$1@dont-email.me>
<30d6cdc7-ee4c-4b55-9ba8-8e7f236cd49fn@googlegroups.com>
<t5qc6a$2q7$1@dont-email.me>
<d8205d83-305d-4612-abae-314b009d00e9n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 15 May 2022 19:17:18 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="22c988749cee306dd5bbb12b51208a05";
logging-data="28742"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18YB1zXZm4FxRnvuLCmr/1L"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:6/2PszSAPzYUq/GpTf646ddIbtw=
In-Reply-To: <d8205d83-305d-4612-abae-314b009d00e9n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sun, 15 May 2022 19:15 UTC

On 5/15/2022 12:08 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 3:04:30 AM UTC-5, BGB wrote:
>> On 5/14/2022 4:29 PM, MitchAlsup wrote:
>>> On Saturday, May 14, 2022 at 3:57:06 PM UTC-5, BGB wrote:
>>>> On 5/13/2022 9:06 AM, MitchAlsup wrote:
>
>
>>> Infiniity has a defined magnitude :: bigger than any IEEE representable number
>>> NaN has a defined non-value :: not comparable to any IEEE representable
>>> number (including even itself).
>>> <
>>> I don't see why you would want Infinity to just be another NaN.
>>> <
>> If either Inf or NaN happens, it usually means "something has gone
>> wrong". Having them as distinct cases adds more special cases that need
>> to be detected and handled by hardware, without contributing much beyond
>> slightly different ways of saying "Well, the math has broken".
> <
> Needing to test for an explicit -0.0 has similar problems.

Possibly, though I had found that if one generates a -0, some software
will misbehave. It is seemingly necessary to force all 0 to be positive
zero for software compatibility reasons.

This means one has a few cases, eg:
  A or B is NaN (with Inf as a sub-case);
  A or B is Zero.

FADD/FSUB:
  A==NaN || B==NaN:
    Result is NaN
  Else, Normal Case:
    Generally, perform calculation internally as twos complement (*).
    ~ 66 bit so Int64 conversion works.

FMUL:
  A==NaN || B==NaN:
    Result is NaN
  A==Zero || B==Zero:
    Result is Zero
  Else, Normal Case:
    ~ DP: 54*54 ~> 68 (discards low-order results).

Assuming here an implementation which lacks an FDIV instruction.

*: One could argue that ones' complement would be cheaper here than twos
complement, but the difference in the results is more obvious, and one
ends up with an FPU where frequently (A+B)!=(B+A), where (A+B)==(B+A) is
a decidedly "nice to have" property.

Also, if one assumes the ability to express integer values as floating
point values, and the ability to operate them and still get integer
results, then one is going to need twos complement here.

For a 32-bit machine, it might make sense to only perform Binary32 ops
hardware, and fall back to software emulation for Binary64.

Though, this does lead to some problem cases:
  'math.h' functions;
  'atof' and similar;
  ...

Which are likely to take a pretty big performance hit if the hardware
only does Binary32.

If one implements FP compare with 'EQ' and 'GT' operators, then probably
only EQ needs to deal with NaN:
  A==NaN || B==NaN: EQ always gives False.
  Other cases: EQ is equivalent to integer EQ.

Mostly, because for a single 'GT' operator, there is no real sensible
way to handle NaN's, so it is easier to simply ignore their existence in
this case.

Where, say:
  A==B: EQ(A,B),BT
  A!=B: EQ(A,B),BF
  A> B: GT(A,B),BT
  A<=B: GT(A,B),BF
  A>=B: GT(B,A),BF
  A< B: GT(B,A),BT

Granted, one could try to argue for a different interpretation if one
assumes an ISA built around compare-and-branch instructions, or one
built on condition-code branches.

>>
>> In practice, the distinction contributes little in terms of either
>> practical use cases nor is much benefit for debugging.
>>
>> Though, I am still generally in favor of keeping NaN around.
>>>> Define "Denormal As Zero" as canonical;
>>> <
>>> This no longer saves circuitry.
>>> <
>> Denormal numbers are only really "free" if one has an FMA unit rather
>> than separate FADD/FMUL (so, eg, both FMUL and FADD can share the same
>> renormalization logic).
>>
>> Though, the cheaper option here seemingly being to not have FMA, in
>> which case it is no longer free.
> <
> Then you are not compliant with IEEE 754-2008 so why bother with the
> rest of it.

I am not assuming this to match up with IEEE-754-2008, that is not the
point.

This would likely need to be a separate spec, but would still basically
match up with IEEE 754 in terms of floating-point formats and similar.

But, it would be nice to be able to have another spec, that is like IEEE
754 but intended mostly for cheap embedded processors and
microcontrollers, but with much looser requirements and thus easier to hit.

In these cases, options thus far are typically either:
  Cheap FPU which doesn't fully implement IEEE 754
    (and probably couldn't do so cost-effectively in the first place);
  Falling back to software emulation, which is significantly slower.

But, if one wants any semblance of floating-point performance, relying
on software emulation is basically no-go.

So, the claim of this spec is not so much that it will match up with
the numerical results of a PC or similar, but that it "has an FPU
basically sufficient to do basic FPU tasks".

Contrast would be something like an MSP430 or AVR8 based
microcontroller, where trying to use floating point math is best
summarized as, "No, don't even try".

>>
>> Though, I would assume the particular interpretation of DAZ as FTZ
>> (Flush to Zero) on the results, since the interpretation where "result
>> exponent may be random garbage" results in other (generally worse)
>> issues regarding the semantics.
>>
>>
>> Also technically cheaper to implement FMUL and FADD in a way where most
>> of the low order bits which "fall off the bottom" are effectively
>> discarded from the calculation, because only a relatively limited number
>> of bits below the ULP are likely to have much effect on the rounded result.
> <
> Yes, and FMAC is a bit larger than a FMUL and an FADD. But not hideously so.

If one assumes the need to work with full-width intermediate values (as
opposed to discarding all the low order bits from the result), it can
get a bit much.

One drawback of gluing the units together, is that one would end up with
a unit that needs a higher cycle latency than either an FMUL or an FADD
(at least, when these units are implemented with cost-cutting measures).

>>
>> Though, FADD does need a mantissa large enough internally to deal with
>> integer conversion (so, say, 66 bits to deal with Binary64<->Int64
>> conversion).
>>
>> Reusing FADD for conversion makes more sense, since FADD already has
>> most of the logic needed for doing conversions, and this is cheaper than
>> repeating the logic for a dedicated module.
>>
>>
>>
>> For FMUL, given the relatively limited dynamic range of the results (for
>> normalized inputs), the renormalization step is very minimal:
>> Result is 1<=x<2, All is good (do nothing);
>> Result is 2<=x<4, Shift right by 1 bit and add 1 to exponent.
> <
> Only if you FTZ. Otherwise if you process denorms by not "inventing"
> the hidden bit, you have to scan for the hidden bit and shift the
> result accordingly.

Yes, I am assuming DAZ+FTZ semantics here.

The cost of using a general-purpose normalizer (like one would use in an
FADD), also being fairly slow and expensive.

I suspect a big part of the cost of the FADD is the normalization
logic.

>>
>> Main expensive part of FMUL being the "multiply the two mantissas
>> together" aspect.
>>>> ...
>>>>
>>>> FPU operations:
>>>> ADD/SUB/MUL
>>>> CMP, CONV
>>>>
>>>> Rounding:
>>>> Mostly Undefined (Rounding modes may be unsupported or ignored)
>>>> ADD/SUB/MUL are ULP +/- 1.5 or 2 or similar
>>> <
>>> Even the GPUs are migrating towards full IEEE 754 compliance.
> <
>> This is more likely due to GPGPU uses than due to full 754 being
>> particularly useful for graphics processing and similar.
> <
> I was told that GPUs were migrating towards full IEEE so as to reduce
> image "shimmer".

OK.

I haven't really noticed any big issues here.

IMHO, 3D graphics in games basically reached the "good enough" point
roughly 20 years ago, and most "improvement" since then hasn't really
contributed all that much to the overall experience.

The biggest "significant" improvement on this front was probably RTX,
but this still doesn't add enough to convince me to spend the money
needed to go and buy a graphics card which supports it.

Eg, still running a second hand GTX 980, basically good enough...
Before this, was running a second-hand GTX 460 for a while.

>>> <
>>>> Conversion to integer is always "truncate towards zero".
>>> <
>>> i = ICEIL( x );
>>> ...
>> There are ways to implement floor/ceil/... that don't depend on having
>> multiple rounding modes in hardware, or multiple float->int conversions.
> <
> When you look at the circuitry required, it is small. So, the best thing is
> to create direct instructions for these things. Given HW than can perform
> <
> i = TRUNK( x );
> <
> adding CEIL, FLOOR, RND; adds only a few percent more gates.


[...]
Re: Mixed EGU/EGO floating-point

<t5rm0u$kor$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25318&group=comp.arch#25318

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-5c6-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 19:58:22 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t5rm0u$kor$1@newsreader4.netcologne.de>
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com>
<a4400d89-ce0d-4c76-aeb2-de2d13f40318n@googlegroups.com>
<2022May13.200449@mips.complang.tuwien.ac.at> <t5p9pf$52o$1@gioia.aioe.org>
<t5pb7g$6og$1@dont-email.me> <2022May15.091837@mips.complang.tuwien.ac.at>
<t5r53e$cjh$1@newsreader4.netcologne.de>
<2022May15.193827@mips.complang.tuwien.ac.at>
<t5re2t$h0o$1@newsreader4.netcologne.de>
<6c1f724c-278a-43fe-8aab-efed4a1a310bn@googlegroups.com>
Injection-Date: Sun, 15 May 2022 19:58:22 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-5c6-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:5c6:0:7285:c2ff:fe6c:992d";
logging-data="21275"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 15 May 2022 19:58 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Sunday, May 15, 2022 at 12:42:56 PM UTC-5, Thomas Koenig wrote:
>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> > Thomas Koenig <tko...@netcologne.de> writes:
>> >>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> >>
>> >>> No, we already have NaNs for that. He wants to be able to compute the
>> >>> sum/product of all present data, and what he proposes works for that.
>> >>
>> >> s = sum(a,mask=.not. ieee_is_nan(a))
>> >>
>> >>works fine.
>> >
>> > Which CPU has this instruction?
>> Any processor which implements Fortran.
><
> I have used plenty of processors that had quality implementations of FORTRAN
> but did not have that as an instruction.

I assume you didn't use many Fortran 90+ compilers, judging by some
of your previous comments :-)

And my actual point: No sense in implementing something in a floating
point format that can be much better handled by code like the one
above (or the equivalent C code). As for "instruction" - well, it need
not be an actual machine instruction in an ISA, it could also be
a statement in a language.

Re: Mixed EGU/EGO floating-point

<t5rn0u$lsi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25319&group=comp.arch#25319

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 15:14:08 -0500
Organization: A noiseless patient Spider
Lines: 104
Message-ID: <t5rn0u$lsi$1@dont-email.me>
References: <t5qe8t$fh0$1@dont-email.me>
<memo.20220515153920.11824J@jgd.cix.co.uk>
<t5rco4$ga5$1@newsreader4.netcologne.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 15 May 2022 20:15:26 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="22c988749cee306dd5bbb12b51208a05";
logging-data="22418"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19u0pKKnA9BYAhEj9OBqxyu"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:7OgVr4nCtVuINawLNlJGJwHPR/4=
In-Reply-To: <t5rco4$ga5$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Sun, 15 May 2022 20:14 UTC

On 5/15/2022 12:20 PM, Thomas Koenig wrote:
> John Dallman <jgd@cix.co.uk> schrieb:
>> In article <t5qe8t$fh0$1@dont-email.me>, cr88192@gmail.com (BGB) wrote:
>>
>>> Though, FWIW, expecting bit-identical floating-point results
>>> between architectures or target machines is asking for trouble.
>>
>> It is. A manufacturer once told us we could not expect bit-identical
>> results between two different kinds of FP registers on the same machine,
>> both usable at the same time. For ISVs who use compilers, rather than
>> assembler, that's deadly.
>
> Sounds like a broken design to me. What exactly were the different
> kinds of FP registers, and how were they different?

I know of at least several major ISA's with this particular design issue
(typically between dedicated FPU registers, and doing FPU operations in
SIMD registers).

I am currently in the "everything goes in GPRs camp":
If one has 64-bit GPRs, then pretty much everything can go into them;
If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.

This is even with 'int' and friends still being 32-bit, where most older
code doesn't make particularly effective use of an ISA which has 64-bit
registers.

One could potentially argue for limiting 'int' operations to 32 bits,
ignoring the high half of the register in these cases, but this would be
"kinda lame".

One could potentially have such an ISA which natively implements
arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
a multi-instruction sequence (with manual carry propagation).

But, at this point I would probably still prefer an ISA with a single
larger register space (and a few asinine special cases) to the
traditional 3-way split between GPRs, FPU, and SIMD registers.

Though, code which works with a lot of 128-bit vectors might still push
the limits of a 32 GPR design when each vector uses 2 GPRs.

One could argue for one of:
  One has 64 GPRs, which is awkward for encoding reasons;
  One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
  One has 32 regs, with half the space being "SIMD only".

In the latter case, say:
  X0  -> R1:R0
  X2  -> R3:R2
  ...
  X30 -> R31:R30
  X1,X3,...: Only existing as 128b SIMD registers.

But, this kinda sucks, as one can no longer use any operation on any
register.

I was kinda doing it this way initially, but then later added some
encoding hacks to allow much of the rest of the ISA to have access to
these registers.

For the most part though, my C ABI prefers to stick with the low 32 GPRs.

I had experimented with another ABI design that basically widened nearly
everything to 128 bits, and would use all of the register space.

This is on hold for now though:
  Didn't get it debugged enough to really be usable;
  Adversely affects code density;
  Would also likely adversely affect performance and memory footprint.

....

Despite the underlying ISA being mostly unaffected by this design, it
would not have been binary compatible (in terms of either function calls
or in-memory data structures) with code built for the 64-bit ABI.

Even despite having R32..R63 available, there are relatively few
use-cases at present that make a particularly strong case for enabling
them in the C ABI.

Ironically, saving and using more than the "locally optimal" number of
registers for a given function will tend to actually make performance
worse, since any potential savings in terms of "fewer register spills"
is offset by the cost of saving/restoring more registers in the function
prolog and epilog (and, saving/restoring registers but then never using
them, is counter-productive).

Likewise, only a relatively small number of functions actually hit the
existing limits (likewise goes for functions which exceed 8 arguments;
they exist, but the vast majority of function calls use fewer than 8
arguments).

....

Re: Mixed EGU/EGO floating-point

<cf392aac-8931-41e8-9199-1c808ba7fa9an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25320&group=comp.arch#25320

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:2412:b0:6a0:5f8e:c050 with SMTP id d18-20020a05620a241200b006a05f8ec050mr10358209qkn.462.1652646577149;
Sun, 15 May 2022 13:29:37 -0700 (PDT)
X-Received: by 2002:a05:6870:80ca:b0:f1:8fad:d9d1 with SMTP id
r10-20020a05687080ca00b000f18fadd9d1mr2312619oab.125.1652646576897; Sun, 15
May 2022 13:29:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 15 May 2022 13:29:36 -0700 (PDT)
In-Reply-To: <t5rjju$s26$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4811:b402:98e:d212;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4811:b402:98e:d212
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com> <t5p52v$shq$1@dont-email.me>
<30d6cdc7-ee4c-4b55-9ba8-8e7f236cd49fn@googlegroups.com> <t5qc6a$2q7$1@dont-email.me>
<d8205d83-305d-4612-abae-314b009d00e9n@googlegroups.com> <t5rjju$s26$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <cf392aac-8931-41e8-9199-1c808ba7fa9an@googlegroups.com>
Subject: Re: Mixed EGU/EGO floating-point
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 15 May 2022 20:29:37 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sun, 15 May 2022 20:29 UTC

On Sunday, May 15, 2022 at 2:17:21 PM UTC-5, BGB wrote:
> On 5/15/2022 12:08 PM, MitchAlsup wrote:
> > On Sunday, May 15, 2022 at 3:04:30 AM UTC-5, BGB wrote:
> >> On 5/14/2022 4:29 PM, MitchAlsup wrote:
> >>> On Saturday, May 14, 2022 at 3:57:06 PM UTC-5, BGB wrote:
> >>>> On 5/13/2022 9:06 AM, MitchAlsup wrote:
> >
> >
> >>> Infinity has a defined magnitude :: bigger than any IEEE representable number
> >>> NaN has a defined non-value :: not comparable to any IEEE representable
> >>> number (including even itself).
> >>> <
> >>> I don't see why you would want Infinity to just be another NaN.
> >>> <
> >> If either Inf or NaN happens, it usually means "something has gone
> >> wrong". Having them as distinct cases adds more special cases that need
> >> to be detected and handled by hardware, without contributing much beyond
> >> slightly different ways of saying "Well, the math has broken".
> > <
> > Needing to test for an explicit -0.0 has similar problems.
> Possibly, though I had found that if one generates a -0, some software
> will misbehave. It is seemingly necessary to force all 0 to be positive
> zero for software compatibility reasons.
>
> This means one has a few cases, eg:
> A or B is NaN (with Inf as a sub-case);
> A or B is Zero;
>
>
> FADD/FSUB:
> A==NaN || B==NaN
> Result is NaN
> Else, Normal Case
> Generally, perform calculation internally as twos complement (*).
> ~ 66 bit so Int64 conversion works.
>
> FMUL:
> A==NaN || B==NaN
> Result is NaN
<
Sooner or later you will realize that when operands can all be NaNs, you
should choose one over the rest. My 66000 has the following rules:
3-operands:: Rs3 is chosen over Rs2 over Rs1
2-operands:: Rs2 is chosen over Rs1
1-operand:: Rs1 is chosen.
<
The 3-operand case propagates the augend of FMAC over the multiply
operands since this operand recirculates in inner product codes. So,
when SW intercepts exceptions and "does something with new NaNs"
the earliest NaN carries the originating "does something" around with it.
<
> A==Zero || B==Zero
> Result is Zero
<
A properly signed zero: result.s = operand1.s ^ operand2.s
<
> Else, Normal Case
> ~ DP: 54*54 ~> 68 (discards low-order results).
>
> Assuming here an implementation which lacks an FDIV instruction.
>
>
> *: One could argue that ones' complement would be cheaper here than twos
> complement, but the difference in the results is more obvious, and one
> ends up with an FPU where frequently (A+B)!=(B+A), where (A+B)==(B+A) is
> a decidedly "nice to have" property.
<
When one built circuits with individual transistors and packages with 1-2-4 gates
in them, 1-complement was just as fast (or slow) as 2-complement. In any CMOS
technology, 1-complement has a longer data path, due to the end-around carry,
than 2-complement. Thus, there are logic reasons to stick with 2-complement.
<
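The end-around carry can also be seen in a software rendering of ones' complement addition; the second, dependent add is exactly the extra serial step that lengthens the data path. A minimal C sketch (helper name is illustrative, not from the thread):

```c
#include <stdint.h>

/* 16-bit ones' complement addition.  The second add folds the carry
   out of the top bit back into the bottom (the end-around carry);
   in hardware this is a dependent step on the critical path that
   plain two's complement addition does not need. */
static uint16_t ones_add16(uint16_t a, uint16_t b) {
    uint32_t sum = (uint32_t)a + b;                    /* up to 17 bits */
    return (uint16_t)((sum & 0xFFFFu) + (sum >> 16));  /* end-around carry */
}
```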
>
> Also, if one assumes the ability to express integer values as floating
> point values, and the ability to operate on them and still get integer
> results, then one is going to need twos complement here.
>
>
>
> For a 32-bit machine, it might make sense to only perform Binary32 ops
> hardware, and fall back to software emulation for Binary64.
>
Note: I did not do this in Mc88100.
>
> Though, this does lead to some problem cases:
> 'math.h' functions;
> 'atof' and similar;
> ...
>
> Which are likely to take a pretty big performance hit if the hardware
> only does Binary32.
>
>
> If one implements FP compare with 'EQ' and 'GT' operators, then probably
> only EQ needs to deal with NaN:
> A==NaN || B==NaN: EQ always gives False.
> Other cases: EQ is equivalent to integer EQ.
>
>
> Mostly, because for a single 'GT' operator, there is no real sensible
> way to handle NaN's, so it is easier to simply ignore their existence in
> this case.
>
> Where, say:
> A==B: EQ(A,B),BT
> A!=B: EQ(A,B),BF
> A> B: GT(A,B),BT
> A<=B: GT(A,B),BF
> A>=B: GT(B,A),BF
> A< B: GT(B,A),BT
>
<
if( a > b )
{ // a actually > b goes here }
else
{ // a <= b || a== NaN || b == NaN goes here }
<
if( a <= b )
{ // a actually <= b goes here }
else
{ // a > b || a== NaN || b == NaN goes here }
<
And this is why then-else rearrangement is fraught with peril in optimization.
You can't flip the comparison operation and swap the clauses (like you can with integers).
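The pitfall can be checked directly: for real operands (a > b) and !(a <= b) are interchangeable, but a NaN makes them disagree. A minimal C sketch (helper name is hypothetical):

```c
#include <math.h>
#include <stdbool.h>

/* True exactly when flipping the comparison changes the outcome.
   For real operands this is always false; with a NaN operand both
   (a > b) and (a <= b) are false, so the negated flip differs. */
static bool nan_breaks_flip(double a, double b) {
    return (a > b) != !(a <= b);
}
```

An optimizer that rewrote `if (a > b) ... else ...` into `if (a <= b)` with swapped clauses would therefore send NaNs down the wrong branch.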
>
> Granted, one could try to argue for a different interpretation if one
> assumes an ISA built around compare-and-branch instructions, or one
> built on condition-code branches.
<
I simply argue that there should be enough encodings to encode all 10
IEEE 754 specified behaviors.
> >>
> >> In practice, the distinction contributes little in terms of either
> >> practical use cases nor is much benefit for debugging.
> >>
> >> Though, I am still generally in favor of keeping NaN around.
> >>>> Define "Denormal As Zero" as canonical;
> >>> <
> >>> This no longer saves circuitry.
> >>> <
> >> Denormal numbers are only really "free" if one has an FMA unit rather
> >> than separate FADD/FMUL (so, eg, both FMUL and FADD can share the same
> >> renormalization logic).
> >>
> >> Though, the cheaper option here seemingly being to not have FMA, in
> >> which case it is no longer free.
> > <
> > Then you are not compliant with IEEE 754-2008 so why bother with the
> > rest of it.
> I am not assuming this to match up with IEEE-754-2008, that is not the
> point.
>
> This would likely need to be a separate spec, but would still basically
> match up with IEEE 754 in terms of floating-point formats and similar.
<
That falls under the "why bother" moniker.
>
>
> But, it would be nice to be able to have another spec, that is like IEEE
> 754 but intended mostly for cheap embedded processors and
> microcontrollers, but with much looser requirements and thus easier to hit.
<
IEEE baggage is "NOT EXPENSIVE" in modern fab technology.
{While it might be in FPGA and those lesser technologies.}
Most of the baggage overhead is designer time, not power, area, or cost.
>
>
> In these cases, options thus far are typically either:
> Cheap FPU which doesn't fully implement IEEE 754
<
So if true IEEE 754 costs $1.00
This same but not really IEEE 754 would cost $0.98
Is it still sellable ?
<
> And probably couldn't do so cost-effectively in the first place
> Falling back to software emulation, which is significantly slower.
>
> But, if one wants any semblance of floating-point performance, relying
> on software emulation is basically no-go.
>
>
> So, the claim of this spec is not so much that it will match up with
> the numerical results of a PC or similar but "has an FPU basically
> sufficient to do basic FPU tasks".
>
You might get away with this for embedded, you won't for general purpose.
>
> Contrast would be something like an MSP430 or AVR8 based
> microcontroller, where trying to use floating point math is best
> summarized as, "No, don't even try".
> >>
> >> Though, I would assume the particular interpretation of DAZ as FTZ
> >> (Flush to Zero) on the results, since the interpretation where "result
> >> exponent may be random garbage" results in other (generally worse)
> >> issues regarding the semantics.
> >>
> >>
> >> Also technically cheaper to implement FMUL and FADD in a way where most
> >> of the low order bits which "fall off the bottom" are effectively
> >> discarded from the calculation, because only a relatively limited number
> >> of bits below the ULP are likely to have much effect on the rounded result.
> > <
> > Yes, and FMAC is a bit larger than a FMUL and an FADD. But not hideously so.
<
> If one assumes the need to work with full-width intermediate values (as
> opposed to discarding all the low order bits from the result), it can
> get a bit much.
>
>
> One drawback of gluing the units together, is that one would end up with
> a unit that needs a higher cycle latency than either an FMUL or an FADD
> (at least, when these units are implemented with cost-cutting measures).
<
Consider a FMAC machine compared to an FMUL + FADD machine.
The FMAC machine can do a y += a*b in 5 cycles whereas the FMUL + FADD
machine will take 8 (or possibly 7).
The FMAC machine will issue 1 instruction.
The FMUL + FADD machine will issue 2 instructions (4 cycles apart.)
<
This complicates a bunch of stuff that should not have complications added.
> >>
> >> Though, FADD does need a mantissa large enough internally to deal with
> >> integer conversion (so, say, 66 bits to deal with Binary64<->Int64
> >> conversion).
> >>
> >> Reusing FADD for conversion makes more sense, since FADD already has
> >> most of the logic needed for doing conversions, and this is cheaper than
> >> repeating the logic for a dedicated module.
> >>
> >>
> >>
> >> For FMUL, given the relatively limited dynamic range of the results (for
> >> normalized inputs), the renormalization step is very minimal:
> >> Result is 1<=x<2, All is good (do nothing);
> >> Result is 2<=x<4, Shift right by 1 bit and add 1 to exponent.
> > <
> > Only if you FTZ. Otherwise if you process denorms by not "inventing"
> > the hidden bit, you have to scan for the hidden bit and shift the
> > result accordingly.
> Yes, I am assuming DAZ+FTZ semantics here.
>
> The cost of using a general-purpose normalizer (like one would use in an
> FADD), also being fairly slow and expensive.
<
Only if you think 1 cycle is slow, and with leading-zero-prediction the cost
is power and area not cycles.
>
> I suspect a big part of the cost of the FADD is the
> normalization logic.
<
About 1/3rd of the area and delay of FADD is normalize.
About 1/3rd of the area and delay of FADD is prealign.
> >>
> >> Main expensive part of FMUL being the "multiply the two mantissas
> >> together" aspect.
> >>>> ...
> >>>>
> >>>> FPU operations:
> >>>> ADD/SUB/MUL
> >>>> CMP, CONV
> >>>>
> >>>> Rounding:
> >>>> Mostly Undefined (Rounding modes may be unsupported or ignored)
> >>>> ADD/SUB/MUL are ULP +/- 1.5 or 2 or similar
> >>> <
> >>> Even the GPUs are migrating towards full IEEE 754 compliance.
> > <
> >> This is more likely due to GPGPU uses than due to full 754 being
> >> particularly useful for graphics processing and similar.
> > <
> > I was told that GPUs were migrating towards full IEEE so as to reduce
> > image "shimmer".
> OK.
>
> I haven't really noticed any big issues here.
>
> IMHO, 3D graphics in games basically reached the "good enough" point
> roughly 20 years ago, and most "improvement" since then hasn't really
> contributed all that much to the overall experience.
<
I would have said 15 years ago, but I suspect violent agreement here.
Tessellation creates a lot of "opportunities" for shimmer to show up.
>
>
> The biggest "significant" improvement on this front was probably RTX,
> but this still doesn't add enough to convince me to spend the money
> needed to go and buy a graphics card which supports it.
>
>
> Eg, still running a second hand GTX 980, basically good enough...
> Before this, was running a second-hand GTX 460 for a while.
> >>> <
> >>>> Conversion to integer is always "truncate towards zero".
> >>> <
> >>> i = ICEIL( x );
> >>> ...
> >> There are ways to implement floor/ceil/... that don't depend on having
> >> multiple rounding modes in hardware, or multiple float->int conversions.
> > <
> > When you look at the circuitry required, it is small. So, the best thing is
> > to create direct instructions for these things. Given HW than can perform
> > <
> > i = TRUNC( x );
> > <
> > adding CEIL, FLOOR, RND; adds only a few percent more gates.
> Possibly.
>
> I can note that in my case, my ISA has instructions with explicit
> rounding modes. But, a low-cost FPU probably shouldn't require them.
<
Not having them puts pressure on fast context switching. Thus, if taking
an exception, having a driver deal with the exception, dumping state on the
offending task's stack, and returning control to a signal handler in the
originating task takes less than 100 cycles, SW can perform these seldom-used
instructions--but the instructions should STILL be defined in ISA.
<
And if the above sequence takes more than 1,000 cycles like x86-64 and
ARM then the overheads of SW emulation are unworthy of a "good"
implementation.
>
>
> Though, the rounding modes typically only work with the low-order bits
> and will not round if this would require a significant carry propagation.
<
Note: when the fraction is 0.1111111111111111111111{lgrs} and lgrs
tells you to increment, you increment both the fraction and the exponent
and you still have the proper result. For the directed rounding modes, you
need to consider the sign of the intermediate representation in deciding
whether to add 1 or not.
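The all-ones-fraction case is also why, in the IEEE encodings, rounding up can be done by incrementing the raw bit pattern: the carry out of the fraction lands in the exponent field and still yields the correct value. A sketch for positive finite doubles only (helper name is hypothetical, not any particular hardware's rounder):

```c
#include <stdint.h>
#include <string.h>

/* Stepping a positive finite double to the next representable value
   by incrementing its bit pattern.  When the fraction is all ones,
   the +1 ripples into the exponent, exactly the increment-both case
   described above, and the result is still correct. */
static double bits_increment(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += 1;            /* fraction overflow carries into the exponent */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```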
> >>
> Add:
> The idea for an x87-like format was, one could implement an FPU where
> the native format is non-normalized (like in x87), and then require an
> explicit re-normalization step before converting to IEEE formats.
>
> This is potentially a double-edged sword though, as it merely moves the
> cost from one place to another, and would add a lot more instructions
> when working with floating-point values (vs keeping them in the IEEE
> formats).
<
The double-edged sword is that, in addition to moving bugs around, it loses
calculation-to-calculation agreement with IEEE 754.
<
>
> In retrospect, probably not such a great idea...


Re: Mixed EGU/EGO floating-point

<b78d6cfa-cba6-40ca-832b-86af306de767n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25321&group=comp.arch#25321

 by: MitchAlsup - Sun, 15 May 2022 20:40 UTC

On Sunday, May 15, 2022 at 3:15:30 PM UTC-5, BGB wrote:
> On 5/15/2022 12:20 PM, Thomas Koenig wrote:
> > John Dallman <j...@cix.co.uk> schrieb:
> >> In article <t5qe8t$fh0$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
> >>
> >>> Though, FWIW, expecting bit-identical floating-point results
> >>> between architectures or target machines is asking for trouble.
> >>
> >> It is. A manufacturer once told us we could not expect bit-identical
> >> results between two different kinds of FP registers on the same machine,
> >> both usable at the same time. For ISVs who use compilers, rather than
> >> assembler, that's deadly.
> >
> > Sounds like a broken design to me. What exactly were the different
> > kinds of FP registers, and how were they different?
> I know of at least several major ISA's with this particular design issue
> (typically between dedicated FPU registers, and doing FPU operations in
> SIMD registers).
>
>
> I am currently in the "everything goes in GPRs camp":
> If one has 64-bit GPRs, then pretty much everything can go into them;
> If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.
<
My 66000 does all SIMD stuff as vectorized loops. No need for a
SIMD register file, either.
>
>
> This is even with 'int' and friends still being 32-bit, where most older
> code doesn't make particularly effective use of an ISA which has 64-bit
> registers.
<
# define int int64_t
>
> One could potentially argue for limiting 'int' operations to 32 bits,
> ignoring the high half of the register in these cases, but this would be
> "kinda lame".
>
> One could potentially have such an ISA which natively implements
> arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
> a multi-instruction sequence (with manual carry propagation).
>
Vectorization, instead.
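The multi-instruction sequence with manual carry propagation that BGB describes looks like this in C (hypothetical helper, 32-bit halves):

```c
#include <stdint.h>

/* 64-bit add synthesized from 32-bit halves with manual carry
   propagation -- the kind of sequence a 32-bit SIMD-like machine
   would need for 64-bit integer ops. */
static void add64(uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi,
                  uint32_t *rlo, uint32_t *rhi) {
    uint32_t lo = alo + blo;
    uint32_t carry = lo < alo;   /* unsigned wraparound detects carry-out */
    *rlo = lo;
    *rhi = ahi + bhi + carry;
}
```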
>
>
> But, at this point I would probably still prefer an ISA with a single
> larger register space (and a few asinine special cases) to the
> traditional 3-way split between GPRs, FPU, and SIMD registers.
>
>
> Though, code which works with a lot of 128-bit vectors might still push
> the limits of a 32 GPR design when each vector uses 2 GPRs.
>
> One could argue for one of:
> One has 64 GPRs, which is awkward for encoding reasons;
> One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
<
Not with VVM. PLUS: when you widen up the machine capabilities
you don't need new register resources, nor do you even need to
change the code to access these new wider data paths.
<
> One has 32 regs, with half the space being "SIMD only".
>
>
> In the latter case, say:
> X0 -> R1:R0
> X2 -> R3:R2
> ...
> X30 -> R31:R30
<
Not sure you want the HW to allow using the SP and/or FP as SIMD registers
(assuming you want SIMD registers)
<
> X1,X3,...: Only existing as 128b SIMD registers.
>
> But, this kinda sucks, as one can no longer use any operation on any
> register.
<
kinda is a serious understatement.
>
> I was kinda doing it this way initially, but then later added some
> encoding hacks to allow much of the rest of the ISA to have access to
> these registers.
>
>
>
> For the most part though, my C ABI prefers to stick with the low 32 GPRs.
>
>
> I had experimented with another ABI design that basically widened nearly
> everything to 128 bits, and would use all of the register space.
>
> This is on hold for now though:
> Didn't get it debugged enough to really be usable;
> Adversely affects code density;
> Would also likely adversely affect performance and memory footprint.
>
> ...
>
> Despite the underlying ISA being mostly unaffected by this design, it
> would not have been binary compatible (in terms of either function calls
> or in-memory data structures) with code built for the 64-bit ABI.
>
>
> Even despite having R32..R63 available, there are relatively few
> use-cases at present that make a particularly strong case for enabling
> them in the C ABI.
>
> Ironically, saving and using more than the "locally optimal" number of
> registers for a given function will tend to actually make performance
> worse, since any potential savings in terms of "fewer register spills"
> is offset by the cost of saving/restoring more registers in the function
> prolog and epilog (and, saving/restoring registers but then never using
> them, is counter-productive).
<
When the save/restore is allowed to use full cache access width (while
normal LDs and STs only use data-path widths) prologue and epilogue
sequences are faster than register spills.
>
> Likewise, only a relatively small number of functions actually hit the
> existing limits (likewise goes for functions which exceed 8 arguments;
> they exist, but the vast majority of function calls use fewer than 8
> arguments).
>
> ...

Re: Mixed EGU/EGO floating-point

<t5rqgp$edg$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25322&group=comp.arch#25322

 by: BGB - Sun, 15 May 2022 21:13 UTC

On 5/14/2022 9:27 PM, MitchAlsup wrote:
> On Saturday, May 14, 2022 at 5:17:22 PM UTC-5, Terje Mathisen wrote:
>> Anton Ertl wrote:
>>> Quadibloc <jsa...@ecn.ab.ca> writes:
>>>> And the IEEE 754
>>>> standard has been hugely popular, as illustrated by its wide adoption.
>>>>
>>>> This could be due to the market power of the x86 due to Windows, and not
>>>> to any particular merit of the standard, of course; it's hard to tell.
>>>
>>> This particular claim is easy to refute: IEEE 754 won before Windows,
>>> so it did not become popular because of Windows. Basically all new
>>> architectures after IEEE 754 (1985) and a few before (e.g., 8087
>>> (1980), 68881 (1984)) adopted IEEE 754, and in time the older
>>> architectures adopted it as well. By contrast, Windows was not
>>> really successful before Windows 3 (1990).
>> Afair ieee754 can be traced back to around 1978, i.e. the design of the
>> 8087, didn't Intel consult with Kahan around that timeframe?
>>
>> The original 1985 standard mostly blessed those already-implemented
>> versions of the prior drafts; this required grandfathering a few
>> unfortunate parts that had not been clearly defined and which had been
>> implemented in more than one way.
>>
>> I.e. what is the correct way to detect subnormal results? Before or
>> after rounding? (Imho the only sane choice is to first normalize the
>> result so that any negative (biased) exponent is locked at zero, then
>> round at that position.)
>>
>> Another unfortunate choice was in how to encode Signaling vs Quiet NaNs:
>> Which bit in the mantissa is used and what does it mean?
>> Is the bit set or reset for Quiet? I have seen it claimed that by having
>> QNaNs setting the bit, any trap handler which wants to convert an SNaN
>> to QNaN can do so by setting that bit, without having to worry about the
>> entire mantissa then ending up zero, which would be illegal since it
>> turns the value into Inf.
>>
>> The proper way to encode this, imnsho, would have been to reserve the
>> two top mantissa bits and disregard the rest for the purpose of
>> determining Inf vs NaN.
>>
>> 00 - Inf
>> 01 - None (Missing value)
>> 10 - QNaN
>> 11 - SNaN
>>
>> This way the top mantissa bit would be sticky, always indicating a NaN
>> (with the exp field 0x1111... of course), the None value would be
>> treated as zero for FADD/FSUB and 1.0 for FMUL so you could multiply or
>> add together all elements of an array and not worry about any missing
>> elements.
> <
> This may simplify software implementations, but HW implementations
> find it easy enough to detect the fraction == 0 as Infinity.

While it is not "difficult" to check whether the mantissa is 0, it isn't
entirely free either.

In my case, I used the top 4 bits of the mantissa:
0000: Inf (Assumes the rest of the bits are also zero)
Else: NaN

No real practical distinction between NaN types (not yet figured out
much use-case for QNaN vs SNaN; pretty much everything behaves like QNaN
in this case). In my case, the FPU does not throw exceptions in any case.

Though, 2 or 3 bits could be a little easier to feed into a LUT in some
cases.

Though, in my case:
(62:52)==11'h7FF: Inf/NaN
(51:48)==4'h0: Inf
Else, NaN
(62:52)==11'h000: Zero
(Rest of mantissa is typically ignored or assumed to be 0).

On the output end, Inf, NaN, and Zero flags will set a flag that forces
(47:0) to be 0, with the other bits being set based on the type of result.

For FMUL, either input being Zero sets the "force output to Zero" case.
For FADD/FSUB, this would mostly come up if the exponent would go negative.

The cost of the "force mantissa to zero" mechanism being a "reasonable
tradeoff" from a cost POV (arguable better than letting it contain
random garbage bits).

For FCMPEQ, Inf vs NaN does make a difference in that NaN needs to cause
EQ to be False, whereas Inf can compare equal to itself.
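The classification rules above can be sketched in C; the enum names are hypothetical, and standard IEEE Binary64 bit positions are assumed (exponent in bits 62:52, top mantissa bits 51:48):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

enum fpclass { FP_NORM, FP_INFTY, FP_NANV, FP_ZEROV };

/* Classification along the lines described: exponent field 0x7FF
   selects Inf/NaN, the top 4 mantissa bits distinguish the two, and
   a zero exponent is treated as zero (denormals flushed, DAZ).
   Sign is ignored here. */
static enum fpclass classify(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t exp  = (bits >> 52) & 0x7FF;
    uint64_t top4 = (bits >> 48) & 0xF;   /* mantissa bits 51:48 */
    if (exp == 0x7FF) return top4 == 0 ? FP_INFTY : FP_NANV;
    if (exp == 0)     return FP_ZEROV;
    return FP_NORM;
}
```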

> In any event,
> HW does not process calculations with infinities, just special case them
> to decide what the answer wants to be; so the 3 gates of delay checking
> for zero is insignificant.
> <
> Also note: instead of detecting infinity with fraction == 0, one can detect
> implementation infinities by ( fraction<52> | fraction<51> ) == 1. (one
> gate of delay).
>>
>> Most other operations would of course be invalid, giving a NaN result.
>> Terje
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"

Re: Mixed EGU/EGO floating-point

<380df37d-3cf9-4bcc-a599-c31ac300137cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25323&group=comp.arch#25323

 by: MitchAlsup - Sun, 15 May 2022 21:32 UTC

On Sunday, May 15, 2022 at 2:58:25 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Sunday, May 15, 2022 at 12:42:56 PM UTC-5, Thomas Koenig wrote:
> >> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >> > Thomas Koenig <tko...@netcologne.de> writes:
> >> >>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >> >>
> >> >>> No, we already have NaNs for that. He wants to be able to compute the
> >> >>> sum/product of all present data, and what he proposes works for that.
> >> >>
> >> >> s = sum(a,mask=.not. ieee_is_nan(a))
> >> >>
> >> >>works fine.
> >> >
> >> > Which CPU has this instruction?
> >> Any processor which implements Fortran.
> ><
> > I have used plenty of processors that had quality implementations of FORTRAN
> > but did not have that as an instruction.
<
> I assume you didn't use much Fortran 90+ compilers, judging by some
> of your previous comments :-)
<
Essentially all of my FORTRAN exposure after 1983 was FORTRAN programmers
and compiler writers coming to me asking what the proper instructions are for "this"
piece of FORTRAN code, and the converse: why did the FORTRAN compiler spit
out this sequence of code?
>
> And my actual point: No sense in implementing something in a floating
> point format that can be be much better handled by code like the one
> above (or the equivalent C code). As for "instruction" - well, it need
> not be an actual machine instruction in an ISA, it could also be
> a statement in a language.
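A C equivalent of the masked Fortran sum, where NaN marks a missing element and is skipped in software rather than being given its own encoding in the format (helper name is illustrative):

```c
#include <math.h>
#include <stddef.h>

/* Sum of the present elements: NaN marks a missing value and is
   simply skipped, so no special "None" bit pattern is needed. */
static double sum_present(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        if (!isnan(a[i]))
            s += a[i];
    return s;
}
```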
<
While My 66000 has instructions FMAX and FMIN and you CAN implement
FABS as FMAX(x,-x). It saves power to have its own instruction. Also note:
FMAX( x, -x ) is 1 instruction because My 66000 ISA has sign control over
operands already.
<
AND there are ways of organizing data paths such that the ABS instruction
can be performed in forwarding a result to an operand of another instruction
effectively taking 0 cycles to execute. So, a savings in power always, and
a saving in cycles of higher end machines tipped the balance to adding FABS
over leaving it out.
<
There were a couple of other instructions that "made the cut" by saving
power or cycles at one end of the implementation spectrum or the other.
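The reason FABS is so cheap in hardware is visible in the encoding: it is a single AND clearing the sign bit, with no adder or comparator involved, consistent with the power argument above. A software rendering (helper name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* FABS as pure bit manipulation: clear the sign bit and nothing
   else.  This is why a dedicated instruction can cost essentially
   zero data-path logic. */
static double fabs_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= ~(1ull << 63);     /* clear the sign bit only */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```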

Re: Mixed EGU/EGO floating-point

<jwvsfpabftk.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=25324&group=comp.arch#25324

 by: Stefan Monnier - Sun, 15 May 2022 22:20 UTC

> While My 66000 has instructions FMAX and FMIN and you CAN implement
> FABS as FMAX(x,-x). It saves power to have its own instruction.

Interesting. Do you happen to know exactly where this power saving
comes from? I guess it depends on how the instruction is implemented, of
course (i.e. does it have its own separate implementation in the
FP-ALU, or is it decoded into something equivalent to FMAX(x,-x), or
something in-between)?

Stefan

Re: Mixed EGU/EGO floating-point

https://www.novabbs.com/devel/article-flat.php?id=25325&group=comp.arch#25325
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 17:24:50 -0500
Organization: A noiseless patient Spider
Lines: 243
Message-ID: <t5rum0$bu9$1@dont-email.me>
References: <t5qe8t$fh0$1@dont-email.me>
<memo.20220515153920.11824J@jgd.cix.co.uk>
<t5rco4$ga5$1@newsreader4.netcologne.de> <t5rn0u$lsi$1@dont-email.me>
<b78d6cfa-cba6-40ca-832b-86af306de767n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 15 May 2022 22:26:08 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="266fc1a26eb0f7f6881626248ae6c1ba";
logging-data="12233"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/PGLZmjJAReBSMdebTDvIt"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:4mXbI1v21mG+2w/MW72jLMVvnHw=
In-Reply-To: <b78d6cfa-cba6-40ca-832b-86af306de767n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sun, 15 May 2022 22:24 UTC

On 5/15/2022 3:40 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 3:15:30 PM UTC-5, BGB wrote:
>> On 5/15/2022 12:20 PM, Thomas Koenig wrote:
>>> John Dallman <j...@cix.co.uk> schrieb:
>>>> In article <t5qe8t$fh0$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>>>>
>>>>> Though, FWIW, expecting bit-identical floating-point results
>>>>> between architectures or target machines is asking for trouble.
>>>>
>>>> It is. A manufacturer once told us we could not expect bit-identical
>>>> results between two different kinds of FP registers on the same machine,
>>>> both usable at the same time. For ISVs who use compilers, rather than
>>>> assembler, that's deadly.
>>>
>>> Sounds like a broken design to me. What exactly were the different
>>> kinds of FP registers, and how were they different?
>> I know of at least several major ISA's with this particular design issue
>> (typically between dedicated FPU registers, and doing FPU operations in
>> SIMD registers).
>>
>>
>> I am currently in the "everything goes in GPRs camp":
>> If one has 64-bit GPRs, then pretty much everything can go into them;
>> If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.
> <
> My 66000 does all SIMD stuff as vectorized loops. No need for a
> SIMD register file, either.

No SIMD file in my case, only GPRs.

SIMD operations exist, in one of several "flavors":
Those that operate on 64 bits, and use a single GPR;
Those that operate on 128-bits, and use a pair.

Which in turn has other effects:
Operations on 64-bit vectors can in many cases be bundled;
Operations on 128-bit vectors can't be bundled.

Mostly because the latter cases eat multiple lanes.
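The 64-bit, single-GPR SIMD flavor can be mimicked in software as a SWAR-style sketch (names hypothetical, not BGB's ISA): four 16-bit lanes packed in one 64-bit word, with carries suppressed at lane boundaries:

```c
#include <assert.h>
#include <stdint.h>

/* SWAR sketch of SIMD-in-a-GPR: add four packed 16-bit lanes held in
   one 64-bit word, keeping carries from crossing lane boundaries
   (function name is made up for illustration). */
static uint64_t padd16x4(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8000800080008000ULL;  /* top bit of each lane */
    uint64_t low = (a & ~H) + (b & ~H);        /* add low 15 bits per lane */
    return low ^ ((a ^ b) & H);                /* fix lane top bits mod 2 */
}
```

A hardware lane unit does the same thing directly: the adder's carry chain is simply cut at each 16-bit boundary.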

>>
>>
>> This is even with 'int' and friends still being 32-bit, where most older
>> code doesn't make particularly effective use of an ISA which has 64-bit
>> registers.
> <
> # define int int64_t

One could do this...

Wouldn't achieve much for most older code beyond wasting memory and
probably causing it to no longer work correctly.

>>
>> One could potentially argue for limiting 'int' operations to 32 bits,
>> ignoring the high half of the register in these cases, but this would be
>> "kinda lame".
>>
>> One could potentially have such an ISA which natively implements
>> arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
>> a multi-instruction sequence (with manual carry propagation).
>>
> Vectorization, instead.

I was thinking of a scheme where, instead of having a 64-bit ADD, one
used a 2x 32-bit ADD and then had additional instructions to update the
high word based on the preceding instruction's result in the low word.

But, yes, from an ISA design POV this would be worse than just having a
64-bit ADD instruction.

Meanwhile: IRL, I ended up going the other direction and having a
128-bit ADDX instruction (sometimes useful).
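The 2x 32-bit ADD with a carry fixup can be sketched in C: here a 64-bit add is synthesized from 32-bit halves with manual carry propagation (purely illustrative, not an ISA proposal):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a 64-bit add synthesized from 32-bit halves with manual
   carry propagation -- the multi-instruction sequence discussed above. */
static uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo    = alo + blo;          /* low-word ADD */
    uint32_t carry = lo < alo;           /* unsigned wrap => carry out */
    uint32_t hi    = ahi + bhi + carry;  /* high-word fixup */
    return ((uint64_t)hi << 32) | lo;
}
```

The same pattern is what a 128-bit ADDX does in hardware, one level up: add the low 64 bits, then fold the carry into the high 64 bits.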

>>
>>
>> But, at this point I would probably still prefer an ISA with a single
>> larger register space (and a few asinine special cases) to the
>> traditional 3-way split between GPRs, FPU, and SIMD registers.
>>
>>
>> Though, code which works with a lot of 128-bit vectors might still push
>> the limits of a 32 GPR design when each vector uses 2 GPRs.
>>
>> One could argue for one of:
>> One has 64 GPRs, which is awkward for encoding reasons;
>> One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
> <
> Not with VVM. PLUS: when you widen up the machine capabilities
> you don't need new register resources, nor do you even need to
> change the code to access these new wider data paths.
> <

Doing SIMD the way I did it seemed like the simplest/cheapest option.

No current plans to go beyond 128 bits.

>> One has 32 regs, with half the space being "SIMD only".
>>
>>
>> In the latter case, say:
>> X0 -> R1:R0
>> X2 -> R3:R2
>> ...
>> X30 -> R31:R30
> <
> Not sure you want the HW to allow using the SP and/or FP as SIMD registers
> (assuming you want SIMD registers)

A few of the registers are basically "undefined" if accessed as 128-bit:
(R1:R0), May be encoded, but effectively undefined at present;
(R15:R14), Contains SP, also undefined.

If the emulator sees these cases, it will turn the instruction into a
breakpoint.

If the CPU core encounters this, it will likely access the GPR space
"behind" these registers: the SPRs are implemented as special register
IDs that are overlaid on top of the GPR space by the decoder, and the
logic for SIMD registers would basically side-step this remapping.

In the RISC-V mode, similar remapping is also done, though a few of the
registers are mapped to different locations.

> <
>> X1,X3,...: Only existing as 128b SIMD registers.
>>
>> But, this kinda sucks, as one can no longer use any operation on any
>> register.
> <
> kinda is a serious understatement.

This is why I later reworked this into the XGPR extension...

There are some cracks in the design, but being able to (more or less)
access all of the registers in a consistent way is much preferable to
having asymmetric access to half of the register space by only certain
types of instructions.

If XGPR is not enabled, we can assume that R32..R63 do not exist (and
the low bit of the register for 128-bit operations is "Must Be Zero").

>>
>> I was kinda doing it this way initially, but then later added some
>> encoding hacks to allow much of the rest of the ISA to have access to
>> these registers.
>>
>>
>>
>> For the most part though, my C ABI prefers to stick with the low 32 GPRs.
>>
>>
>> I had experimented with another ABI design that basically widened nearly
>> everything to 128 bits, and would use all of the register space.
>>
>> This is on hold for now though:
>> Didn't get it debugged enough to really be usable;
Adversely affects code density;
Would also likely adversely affect performance and memory footprint.
>>
>> ...
>>
>> Despite the underlying ISA being mostly unaffected by this design, it
>> would have not been binary compatible (in terms of either function calls
>> or in-memory data structures) with code built for the 64-bit ABI.
>>
>>
>> Even despite having R32..R63 available, there are relatively few
>> use-cases at present that make a particularly strong case for enabling
>> them in the C ABI.
>>
>> Ironically, saving and using more than the "locally optimal" number of
>> registers for a given function will tend to actually make performance
>> worse, since any potential savings in terms of "fewer register spills"
>> is offset by the cost of saving/restoring more registers in the function
>> prolog and epilog (and, saving/restoring registers but then never using
>> them, is counter-productive).
> <
> When the save/restore is allowed to use full cache access width (while
> normal LDs and STs only use data-path widths) prologue and epilogue
> sequences are faster than register spills.

Probably true.

I was mostly using Load/Store pair (which operate 128 bits at a time).

But, a register saved/restored but not used, still costs more than not
saving/restoring the register.

But, "getting it right" isn't necessarily a given, if one assumes a
register allocator which tries to "round robin" the register allocation
in an attempt to increase usable ILP.

So, I ended up using heuristics to divide up the register space,
enabling parts of the register space based on register pressure.

So, say:
R8 ..R14: Always enabled
R24..R31: Enabled for high-pressure functions.
R4 ..R7 : Enabled for leaf functions.
R18..R23: Enabled for high-pressure leaf functions.

If I enabled R32..R63 in the main C ABI:
R40..R47: Enabled for very high register pressure.
R56..R63: Enabled for very high register pressure.
R32..R39: Enabled for very high pressure leaf functions.
R48..R55: Enabled for very high pressure leaf functions.

However, when not doing things like using 128-bit values for pretty much
everything, then there isn't really enough register pressure for the
compiler to justify enabling the higher-numbered registers.

And, if enabled, a round-robin register allocation scheme will end up
with nearly all of the registers being saved/restored.

Some settings may fine-tune the heuristics: speed optimization
will bias towards saving/restoring more registers, and size optimization
towards fewer.


Re: Mixed EGU/EGO floating-point

https://www.novabbs.com/devel/article-flat.php?id=25326&group=comp.arch#25326
X-Received: by 2002:a05:622a:54d:b0:2f3:ce29:234a with SMTP id m13-20020a05622a054d00b002f3ce29234amr12914953qtx.559.1652655248644;
Sun, 15 May 2022 15:54:08 -0700 (PDT)
X-Received: by 2002:a05:6808:16ac:b0:2f9:52e5:da90 with SMTP id
bb44-20020a05680816ac00b002f952e5da90mr12134848oib.5.1652655248456; Sun, 15
May 2022 15:54:08 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 15 May 2022 15:54:08 -0700 (PDT)
In-Reply-To: <jwvsfpabftk.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4811:b402:98e:d212;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4811:b402:98e:d212
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com> <a4400d89-ce0d-4c76-aeb2-de2d13f40318n@googlegroups.com>
<2022May13.200449@mips.complang.tuwien.ac.at> <t5p9pf$52o$1@gioia.aioe.org>
<t5pb7g$6og$1@dont-email.me> <2022May15.091837@mips.complang.tuwien.ac.at>
<t5r53e$cjh$1@newsreader4.netcologne.de> <2022May15.193827@mips.complang.tuwien.ac.at>
<t5re2t$h0o$1@newsreader4.netcologne.de> <6c1f724c-278a-43fe-8aab-efed4a1a310bn@googlegroups.com>
<t5rm0u$kor$1@newsreader4.netcologne.de> <380df37d-3cf9-4bcc-a599-c31ac300137cn@googlegroups.com>
<jwvsfpabftk.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <dccaaed4-ea0e-4a81-9c40-3c56e1d5b0can@googlegroups.com>
Subject: Re: Mixed EGU/EGO floating-point
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 15 May 2022 22:54:08 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2697
 by: MitchAlsup - Sun, 15 May 2022 22:54 UTC

On Sunday, May 15, 2022 at 5:20:12 PM UTC-5, Stefan Monnier wrote:
> > While My 66000 has instructions FMAX and FMIN and you CAN implement
> > FABS as FMAX(x,-x). It saves power to have its own instruction.
> Interesting. Do you happen to know exactly where this power saving
> comes from?
<
You do not have to perform a comparison (the cost of an integer adder), as all
you need to do is make sure 1 bit (the sign bit) is clear (0).
<
> I guess it depends on how the instruction is implemented, of
> course (i.e. does it have its own separate implementation in the
> FP-ALU, or is it decoded into something equivalent to FMAX(x,-x), or
> something in-between)?
<
In higher end machines it is performed by suppressing the HoB from asserting 1
on the forwarding path (about 1/4 of a gate of delay, as it adds a single additional
input to the n-way forwarding multiplexer, and only on the HoB).
>
>
> Stefan

Re: Mixed EGU/EGO floating-point

https://www.novabbs.com/devel/article-flat.php?id=25327&group=comp.arch#25327
X-Received: by 2002:ac8:59cd:0:b0:2f3:c08d:9ffa with SMTP id f13-20020ac859cd000000b002f3c08d9ffamr13349029qtf.564.1652656507934;
Sun, 15 May 2022 16:15:07 -0700 (PDT)
X-Received: by 2002:a4a:5894:0:b0:35e:b78c:2ca9 with SMTP id
f142-20020a4a5894000000b0035eb78c2ca9mr5243305oob.56.1652656507693; Sun, 15
May 2022 16:15:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 15 May 2022 16:15:07 -0700 (PDT)
In-Reply-To: <t5rum0$bu9$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:4811:b402:98e:d212;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:4811:b402:98e:d212
References: <t5qe8t$fh0$1@dont-email.me> <memo.20220515153920.11824J@jgd.cix.co.uk>
<t5rco4$ga5$1@newsreader4.netcologne.de> <t5rn0u$lsi$1@dont-email.me>
<b78d6cfa-cba6-40ca-832b-86af306de767n@googlegroups.com> <t5rum0$bu9$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2df555e6-2b58-4972-8fd2-87f8cace547dn@googlegroups.com>
Subject: Re: Mixed EGU/EGO floating-point
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 15 May 2022 23:15:07 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sun, 15 May 2022 23:15 UTC

On Sunday, May 15, 2022 at 5:26:11 PM UTC-5, BGB wrote:
> On 5/15/2022 3:40 PM, MitchAlsup wrote:

> >> I am currently in the "everything goes in GPRs camp":
> >> If one has 64-bit GPRs, then pretty much everything can go into them;
> >> If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.
> > <
> > My 66000 does all SIMD stuff as vectorized loops. No need for a
> > SIMD register file, either.
> No SIMD file in my case, only GPRs.
>
> SIMD operations exist, in one of several "flavors":
> Those that operate on 64 bits, and use a single GPR;
> Those that operate on 128-bits, and use a pair.
<
Yes, but I am looking both forwards and backwards at the same time.
I have SIMD of 64×2^k for some integer k determined by the implementation.
So, on lowest end machines, VVM runs 1 iteration per cycle, in middle range
implementations VVM runs 2 or 4 iterations per cycle, and in higher end
implementations, VVM runs 4-8-16 iterations per cycle.
<
Especially note: there are 0 SIMD OpCodes in the ISA.
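For contrast, the kind of code VVM is described as vectorizing is just an ordinary scalar loop: no SIMD types or intrinsics appear in the source (this is plain C, nothing implementation-specific):

```c
#include <assert.h>
#include <stddef.h>

/* An ordinary scalar loop -- the form a VVM-style machine is said to
   run at 1, 2-4, or 4-16 iterations per cycle depending on the
   implementation, with no change to the code. */
static void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The scaling lives entirely in the hardware's loop machinery, which is the point of having 0 SIMD opcodes.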
>
>
> Which in turn has other effects:
> Operations on 64-bit vectors can in many cases be bundled;
> Operations on 128-bit vectors can't be bundled.
<
This is the static look I mentioned above. You want an architecture that at the
lowest end runs 1 instruction per cycle throughout the pipeline. At the other
end you want the pipeline to run 6-8-10 instructions per cycle--ALL FROM THE
SAME instruction stream.
>
> Mostly because the latter cases eat multiple lanes.
<
As Ivan would say:: Fix It !
> >>
> >>
> >> This is even with 'int' and friends still being 32-bit, where most older
> >> code doesn't make particularly effective use of an ISA which has 64-bit
> >> registers.
> > <
> > # define int int64_t
> One could do this...
>
> Wouldn't achieve much for most older code beyond wasting memory and
> probably causing it to no longer work correctly.
> >>
> >> One could potentially argue for limiting 'int' operations to 32 bits,
> >> ignoring the high half of the register in these cases, but this would be
> >> "kinda lame".
> >>
> >> One could potentially have such an ISA which natively implements
> >> arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
> >> a multi-instruction sequence (with manual carry propagation).
> >>
> > Vectorization, instead.
> I was thinking where instead of having a 64-bit ADD, one used a 2x
> 32-bit ADD, and then had additional instructions to update the high-word
> based on the preceding instructions' result in the low-word.
<
And this is used "how often" ???
>
> But, yes, from an ISA design POV this would be worse than just having a
> 64-bit ADD instruction.
>
>
> Meanwhile: IRL, I ended up going the other direction and having a
> 128-bit ADDX instruction (sometimes useful).
> >>
> >>
> >> But, at this point I would probably still prefer an ISA with a single
> >> larger register space (and a few asinine special cases) to the
> >> traditional 3-way split between GPRs, FPU, and SIMD registers.
> >>
> >>
> >> Though, code which works with a lot of 128-bit vectors might still push
> >> the limits of a 32 GPR design when each vector uses 2 GPRs.
> >>
> >> One could argue for one of:
> >> One has 64 GPRs, which is awkward for encoding reasons;
> >> One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
> > <
> > Not with VVM. PLUS: when you widen up the machine capabilities
> > you don't need new register resources, nor do you even need to
> > change the code to access these new wider data paths.
> > <
> Doing SIMD the way I did it seemed like the simplest/cheapest option.
<
With no (== little) look to future implementations, where you have 10× the
resources you have today.........
>
> No current plans to go beyond 128 bits.
> >> One has 32 regs, with half the space being "SIMD only".
> >>
> >>
> >> In the latter case, say:
> >> X0 -> R1:R0
> >> X2 -> R3:R2
> >> ...
> >> X30 -> R31:R30
> > <
> > Not sure you want the HW to allow using the SP and/or FP as SIMD registers
> > (assuming you want SIMD registers)
<
> A few of the registers are basically "undefined" if accessed as 128-bit:
> (R1:R0), May be encoded, but effectively undefined at present;
> (R15:R14), Contains SP, also undefined.
<
Strange place to put SP and FP........
>
> If the emulator sees these cases, it will turn the instruction into a
> breakpoint.
>
> If the CPU core encounters this, it will likely access the GPR space
> "behind" these registers, as the way the SPRs is implemented is as
> special register IDs that are overlaid on top of the GPR space by the
> decoder, and the logic for SIMD registers would basically side-step this
> remapping.
>
> In the RISC-V mode, similar remapping is also done, though a few of the
> registers are mapped to different locations.
> > <
> >> X1,X3,...: Only existing as 128b SIMD registers.
> >>
> >> But, this kinda sucks, as one can no longer use any operation on any
> >> register.
> > <
> > kinda is a serious understatement.
> This is why I later reworked this into the XGPR extension...
>
> There are some cracks in the design, but being able to (more or less)
> access all of the registers in a consistent way is much preferable to
> having asymmetric access to half of the register space by only certain
> types of instructions.
>
> If XGPR is not enabled, we can assume that R32..R63 do not exist (and
> the low bit of the register for 128-bit operations is "Must Be Zero").
<
What do you do for code compiled assuming XGPR exists and you want to
run on THIS implementation ??
> >>
> >> I was kinda doing it this way initially, but then later added some
> >> encoding hacks to allow much of the rest of the ISA to have access to
> >> these registers.

> >> Ironically, saving and using more than the "locally optimal" number of
> >> registers for a given function will tend to actually make performance
> >> worse, since any potential savings in terms of "fewer register spills"
> >> is offset by the cost of saving/restoring more registers in the function
> >> prolog and epilog (and, saving/restoring registers but then never using
> >> them, is counter-productive).
> > <
> > When the save/restore is allowed to use full cache access width (while
> > normal LDs and STs only use data-path widths) prologue and epilogue
> > sequences are faster than register spills.
> Probably true.
>
> I was mostly using Load/Store pair (which operate 128 bits at a time).
>
> But, a register saved/restored but not used, still costs more than not
> saving/restoring the register.
<
Certainly, but that is why the prologues and epilogues are not canned:
the compiler has a choice on how many get saved and how many get
restored, all cleverly orchestrated with ABI rules in mind.
<
The compiler can save as many or as few registers between R16 and R29
as it desires, can save and update FP as desired, can save and restore SP
if desired (seldom) or just update and backdate SP as desired. If desired,
registers R1-R8 (arguments) can be saved, and here they get concatenated
with the stack-passed arguments for ease of varargs.
<
So, the compiler gets to choose how many and if.
>
>
> But, "getting it right" isn't necessarily a given, if one assumes a
> register allocator which tries to "round robin" the register allocation
> in an attempt to increase usable ILP.
<
Agreed, it took Brian a "while" to get My 66000 ABI fully integrated in his
LLVM port. But after he did, the code looks fabulous.
>
>
> So, ended up using heuristics to divide up the register space, and
> enabling parts of the register space based on register pressure.
>
> So, say:
> R8 ..R14: Always enabled
> R24..R31: Enabled for high-pressure functions.
> R4 ..R7 : Enabled for leaf functions.
> R18..R23: Enabled for high-pressure leaf functions.
<
Not sure what you are getting at here.
>
> If I enabled R32..R63 in the main C ABI:
> R40..R47: Enabled for very high register pressure.
> R56..R63: Enabled for very high register pressure.
> R32..R39: Enabled for very high pressure leaf functions.
> R48..R55: Enabled for very high pressure leaf functions.
>
<
With only 32 registers (1 being SP and 1 optionally being FP) I am finding
very little spill/fill code, so whatever LLVM did and whatever Brian's
My 66000 port does, I don't seem to run into the mentioned problems.
>
> However, when not doing things like using 128-bit values for pretty much
> everything, then there isn't really enough register pressure for the
> compiler to justify enabling the higher-numbered registers.
>
> And, if enabled, a round-robin register allocation scheme will end up
> with nearly all of the registers being saved/restored.
>
There are some special cases you need to be aware of: simple recursive
functions are often faster when you do NOT allocate locals into registers
and just leave them on the stack. In this corner, the trade-offs are rather
delicate.
>
> Some settings may fine-tune the heuristics, such that speed optimization
> will bias towards saving/restoring more registers, and size optimization
> towards fewer.
> >>
> >> Likewise, only a relatively small number of functions actually hit the
> >> existing limits (likewise goes for functions which exceed 8 arguments;
> >> they exist, but the vast majority of function calls use fewer than 8
> >> arguments).
> >>
> >> ...


Re: Mixed EGU/EGO floating-point

https://www.novabbs.com/devel/article-flat.php?id=25328&group=comp.arch#25328
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 18:37:22 -0700
Organization: A noiseless patient Spider
Lines: 211
Message-ID: <t5s9si$ara$1@dont-email.me>
References: <t5qe8t$fh0$1@dont-email.me>
<memo.20220515153920.11824J@jgd.cix.co.uk>
<t5rco4$ga5$1@newsreader4.netcologne.de> <t5rn0u$lsi$1@dont-email.me>
<b78d6cfa-cba6-40ca-832b-86af306de767n@googlegroups.com>
<t5rum0$bu9$1@dont-email.me>
<2df555e6-2b58-4972-8fd2-87f8cace547dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 16 May 2022 01:37:23 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9dceea0eea63448eb6fbeadc31a1ef7c";
logging-data="11114"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19NFgddS33Ks7NsHSwm/0oY"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:gLRVakKP7uCkWXGdjq8N4f3vSzE=
In-Reply-To: <2df555e6-2b58-4972-8fd2-87f8cace547dn@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Mon, 16 May 2022 01:37 UTC

On 5/15/2022 4:15 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 5:26:11 PM UTC-5, BGB wrote:
>> On 5/15/2022 3:40 PM, MitchAlsup wrote:
>
>>>> I am currently in the "everything goes in GPRs camp":
>>>> If one has 64-bit GPRs, then pretty much everything can go into them;
>>>> If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.
>>> <
>>> My 66000 does all SIMD stuff as vectorized loops. No need for a
>>> SIMD register file, either.
>> No SIMD file in my case, only GPRs.
>>
>> SIMD operations exist, in one of several "flavors":
>> Those that operate on 64 bits, and use a single GPR;
>> Those that operate on 128-bits, and use a pair.
> <
> Yes, but I am looking both forwards and backwards at the same time.
> I have SIMD of 64×2^k for some integer k determined by the implementation.
> So, on lowest end machines, VVM runs 1 iteration per cycle, in middle range
> implementations VVM runs 2 or 4 iterations per cycle, and in higher end
> implementations, VVM runs 4-8-16 iterations per cycle.
> <
> Especially note: there are 0 SIMD OpCodes in the ISA.
>>
>>
>> Which in turn has other effects:
>> Operations on 64-bit vectors can in many cases be bundled;
>> Operations on 128-bit vectors can't be bundled.
> <
> This is the static look I mentioned above. You want an architecture that at the
> lowest end runs 1 instruction per cycle throughout the pipeline. At the other
> end you want the pipeline to run 6-8-10 instructions per cycle--ALL FROM THE
> SAME instruction stream.
>>
>> Mostly because the latter cases eat multiple lanes.
> <
> As Ivan would say:: Fix It !
>>>>
>>>>
>>>> This is even with 'int' and friends still being 32-bit, where most older
>>>> code doesn't make particularly effective use of an ISA which has 64-bit
>>>> registers.
>>> <
>>> # define int int64_t
>> One could do this...
>>
>> Wouldn't achieve much for most older code beyond wasting memory and
>> probably causing it to no longer work correctly.
>>>>
>>>> One could potentially argue for limiting 'int' operations to 32 bits,
>>>> ignoring the high half of the register in these cases, but this would be
>>>> "kinda lame".
>>>>
>>>> One could potentially have such an ISA which natively implements
>>>> arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
>>>> a multi-instruction sequence (with manual carry propagation).
>>>>
>>> Vectorization, instead.
>> I was thinking where instead of having a 64-bit ADD, one used a 2x
>> 32-bit ADD, and then had additional instructions to update the high-word
>> based on the preceding instructions' result in the low-word.
> <
> And this is used "how often" ???
>>
>> But, yes, from an ISA design POV this would be worse than just having a
>> 64-bit ADD instruction.
>>
>>
>> Meanwhile: IRL, I ended up going the other direction and having a
>> 128-bit ADDX instruction (sometimes useful).
>>>>
>>>>
>>>> But, at this point I would probably still prefer an ISA with a single
>>>> larger register space (and a few asinine special cases) to the
>>>> traditional 3-way split between GPRs, FPU, and SIMD registers.
>>>>
>>>>
>>>> Though, code which works with a lot of 128-bit vectors might still push
>>>> the limits of a 32 GPR design when each vector uses 2 GPRs.
>>>>
>>>> One could argue for one of:
>>>> One has 64 GPRs, which is awkward for encoding reasons;
>>>> One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
>>> <
>>> Not with VVM. PLUS: when you widen up the machine capabilities
>>> you don't need new register resources, nor do you even need to
>>> change the code to access these new wider data paths.
>>> <
>> Doing SIMD the way I did it seemed like the simplest/cheapest option.
> <
> With no (== little) look to future implementations. where you have 10× the
> resources you have today.........
>>
>> No current plans to go beyond 128 bits.
>>>> One has 32 regs, with half the space being "SIMD only".
>>>>
>>>>
>>>> In the latter case, say:
>>>> X0 -> R1:R0
>>>> X2 -> R3:R2
>>>> ...
>>>> X30 -> R31:R30
>>> <
>>> Not sure you want the HW to allow using the SP and/or FP as SIMD registers
>>> (assuming you want SIMD registers)
> <
>> A few of the registers are basically "undefined" if accessed as 128-bit:
>> (R1:R0), May be encoded, but effectively undefined at present;
>> (R15:R14), Contains SP, also undefined.
> <
> Strange place to put SP and FP........
>>
>> If the emulator sees these cases, it will turn the instruction into a
>> breakpoint.
>>
>> If the CPU core encounters this, it will likely access the GPR space
>> "behind" these registers, as the way the SPRs is implemented is as
>> special register IDs that are overlaid on top of the GPR space by the
>> decoder, and the logic for SIMD registers would basically side-step this
>> remapping.
>>
>> In the RISC-V mode, similar remapping is also done, though a few of the
>> registers are mapped to different locations.
>>> <
>>>> X1,X3,...: Only existing as 128b SIMD registers.
>>>>
>>>> But, this kinda sucks, as one can no longer use any operation on any
>>>> register.
>>> <
>>> kinda is a serious understatement.
>> This is why I later reworked this into the XGPR extension...
>>
>> There are some cracks in the design, but being able to (more or less)
>> access all of the registers in a consistent way is much preferable to
>> having asymmetric access to half of the register space by only certain
>> types of instructions.
>>
>> If XGPR is not enabled, we can assume that R32..R63 do not exist (and
>> the low bit of the register for 128-bit operations is "Must Be Zero").
> <
> What do you do for code compiled assuming XGPR exists and you want to
> run on THIS implementation ??
>>>>
>>>> I was kinda doing it this way initially, but then later added some
>>>> encoding hacks to allow much of the rest of the ISA to have access to
>>>> these registers.
>
>>>> Ironically, saving and using more than the "locally optimal" number of
>>>> registers for a given function will tend to actually make performance
>>>> worse, since any potential savings in terms of "fewer register spills"
>>>> is offset by the cost of saving/restoring more registers in the function
>>>> prolog and epilog (and, saving/restoring registers but then never using
>>>> them, is counter-productive).
>>> <
>>> When the save/restore is allowed to use full cache access width (while
>>> normal LDs and STs only use data-path widths) prologue and epilogue
>>> sequences are faster than register spills.
>> Probably true.
>>
>> I was mostly using Load/Store pair (which operate 128 bits at a time).
>>
>> But, a register saved/restored but not used, still costs more than not
>> saving/restoring the register.
> <
> Certainly, but that is why the prologues and epilogues are not canned,
> the compiler has a choice on how many get saved and how many get
> restored--all cleverly orchestrated with ABI rules in mind.
> <
> The compiler can save as many or as few registers between R16 and R29
> as it desires, can save and update FP as desired, can save and restore SP
> if desired (seldom) or just update and backdate SP as desired. If desired,
> registers R1-R8 (arguments) can be saved--and here they get concatenated
> with the stack passed arguments for ease of varargs.
> <
> So, the compiler gets to choose how many and if.
>>
>>
>> But, "getting it right" isn't necessarily a given, if one assumes a
>> register allocator which tries to "round robin" the register allocation
>> in an attempt to increase usable ILP.
> <
> Agreed, it took Brian a "while" to get My 66000 ABI fully integrated in his
> LLVM port. But after he did, the code looks fabulous.
>>
>>
>> So, ended up using heuristics to divide up the register space, and
>> enabling parts of the register space based on register pressure.
>>
>> So, say:
>> R8 ..R14: Always enabled
>> R24..R31: Enabled for high-pressure functions.
>> R4 ..R7 : Enabled for leaf functions.
>> R18..R23: Enabled for high-pressure leaf functions.
> <
> Not sure what you are getting at here.
>>
>> If I enabled R32..R63 in the main C ABI:
>> R40..R47: Enabled for very high register pressure.
>> R56..R63: Enabled for very high register pressure.
>> R32..R39: Enabled for very high pressure leaf functions.
>> R48..R55: Enabled for very high pressure leaf functions.
>>
> <
> With only 32 registers (1 being SP and 1 optionally being FP) I am finding
> very little spill/fill code, so whatever LLVM did and whatever Brian's
> My 66000 port does, I don't seem to run into the mentioned problems.


Re: Mixed EGU/EGO floating-point

<t5sah1$ara$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25329&group=comp.arch#25329

 by: Ivan Godard - Mon, 16 May 2022 01:48 UTC

On 5/15/2022 3:20 PM, Stefan Monnier wrote:
>> While My 66000 has instructions FMAX and FMIN and you CAN implement
>> FABS as FMAX(x,-x). It saves power to have its own instruction.
>
> Interesting. Do you happen to know exactly where this power savings
> come from? I guess it depends on how the instruction is implemented, of
> course (i.e. does it have its own separate implementation in the
> FP-ALU, or is it decoded into something equivalent to FMAX(x,-x), or
> something in-between)?
>
>
> Stefan

FABS is a bitClear instruction.

Re: Mixed EGU/EGO floating-point

<337779ea-5572-4cf6-a663-d0ab990ecbbdn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25330&group=comp.arch#25330

 by: MitchAlsup - Mon, 16 May 2022 02:02 UTC

On Sunday, May 15, 2022 at 8:48:20 PM UTC-5, Ivan Godard wrote:
> On 5/15/2022 3:20 PM, Stefan Monnier wrote:
> >> While My 66000 has instructions FMAX and FMIN and you CAN implement
> >> FABS as FMAX(x,-x). It saves power to have its own instruction.
> >
> > Interesting. Do you happen to know exactly where this power savings
> > come from? I guess it depends on how the instruction is implemented, of
> > course (i.e. does it have its own separate implementation in the
> > FP-ALU, or is it decoded into something equivalent to FMAX(x,-x), or
> > something in-between)?
> >
> >
> > Stefan
<
> FABS is a bitClear instruction.
<
But you don't even need to encode which bit to clear.
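
A minimal C sketch of the bit-clear idea, assuming IEEE 754 binary64 (the helper name is made up, not My 66000 or Mill syntax):

```c
#include <stdint.h>
#include <string.h>

/* FABS as a single bit-clear: clear bit 63, the IEEE 754 binary64
   sign bit. Nothing else in the operand is examined or touched. */
static double fabs_bitclear(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);       /* type-pun without UB */
    bits &= UINT64_C(0x7fffffffffffffff); /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Unlike FMAX(x,-x), this never looks at the exponent or significand (it also works on -0.0 and leaves NaN payloads alone), which is presumably where the power saving comes from.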

Re: Mixed EGU/EGO floating-point

<6facfc97-200e-4f34-b138-bc28e8613ee2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25332&group=comp.arch#25332

 by: MitchAlsup - Mon, 16 May 2022 02:12 UTC

On Sunday, May 15, 2022 at 8:37:26 PM UTC-5, Ivan Godard wrote:
> On 5/15/2022 4:15 PM, MitchAlsup wrote:
> > On Sunday, May 15, 2022 at 5:26:11 PM UTC-5, BGB wrote:
> >> On 5/15/2022 3:40 PM, MitchAlsup wrote:

> > <
> > With only 32 registers (1 being SP and 1 optionally being FP) I am finding
> > very little spill/fill code, so whatever LLVM did and whatever Brian's
> > My 66000 port does, I don't seem to run into the mentioned problems.
<
> Idle curiosity: what do you dofor a function that has more arguments
> than registers?
<
Arguments in containers smaller than 64-bit are enlarged to 64-bit containers.
Arguments larger than a 64-bit container are enlarged to the next larger
multiple-of-64-bit container.
<
First 8 containers are passed in R1-R8 {arg[0]..arg[7]}
Remaining arguments are passed on the top of the stack {SP->arg[8]}
>
> Also: how do you do VARARGS?
<
R1-R8 are pushed onto the stack, creating a vector of memory arguments
{arg[0]..arg[k]}. This is done by using ENTER where the stop register is R8.
Since most vararg subroutines call lots of functions, this typically results in
ENTER R16,R9,sizeof( local_data_on_stack )
<
A pointer is created = &arg[0]
<
As each argument is consumed the pointer is incremented--no special type casing
on the consumption side.
<
BTW: this is how I did it on Denelcore HEP.
<
Multivalue returns::
<
The first 8 64-bit containers return in R1-R8; the rest of the containers return
on the top of the stack. EXIT can perform this register load for you, if desired.
<
The same code sequences work when safe-stack is in use and when safe-stack
is not in use. ENTER and EXIT sequencers know when to switch from regular
stack to safe-stack and back.
<
For those interested:: safe-stack is a stack where ENTER can deposit preserved
registers such that the called subroutine cannot see the values (LD and ST to
those addresses PAGEFAULT), and where EXIT can reload the saved values.
While storing the preserved registers, ENTER clears the registers so the called
subroutine cannot see the values; after the ENTER instruction, R16-R29 = 0.
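
The argument-vector scheme can be sketched in portable C (names made up): once the prologue has spilled the register arguments next to the caller's stack arguments, every argument is one 64-bit container in a contiguous vector, and the vararg consumer is nothing more than a pointer increment.

```c
#include <stdint.h>

/* Walk the unified argument vector: arg[0]..arg[n-1] are 64-bit
   containers laid out contiguously (spilled register args first,
   then the caller's stack args). No per-type special casing. */
static int64_t sum_args(const int64_t *arg, int n)
{
    const int64_t *p = arg;   /* pointer = &arg[0] */
    int64_t sum = 0;
    while (n-- > 0)
        sum += *p++;          /* consume one container, bump pointer */
    return sum;
}
```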

Re: Mixed EGU/EGO floating-point

<t5sdud$vto$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25333&group=comp.arch#25333

 by: BGB - Mon, 16 May 2022 02:45 UTC

On 5/15/2022 3:29 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 2:17:21 PM UTC-5, BGB wrote:
>> On 5/15/2022 12:08 PM, MitchAlsup wrote:
>>> On Sunday, May 15, 2022 at 3:04:30 AM UTC-5, BGB wrote:
>>>> On 5/14/2022 4:29 PM, MitchAlsup wrote:
>>>>> On Saturday, May 14, 2022 at 3:57:06 PM UTC-5, BGB wrote:
>>>>>> On 5/13/2022 9:06 AM, MitchAlsup wrote:
>>>
>>>
>>>>> Infinity has a defined magnitude :: bigger than any IEEE representable number
>>>>> NaN has a defined non-value :: not comparable to any IEEE representable
>>>>> number (including even itself).
>>>>> <
>>>>> I don't see why you would want Infinity to just be another NaN.
>>>>> <
>>>> If either Inf or NaN happens, it usually means "something has gone
>>>> wrong". Having them as distinct cases adds more special cases that need
>>>> to be detected and handled by hardware, without contributing much beyond
>>>> slightly different ways of saying "Well, the math has broken".
>>> <
>>> Needing to test for an explicit -0.0 has similar problems.
>> Possibly, though I had found that if one generates a -0, some software
>> will misbehave. It is seemingly necessary to force all 0 to be positive
>> zero for software compatibility reasons.
>>
>> This means one has a few cases, eg:
>> A or B is NaN (with Inf as a sub-case);
>> A or B is Zero;
>>
>>
>> FADD/FSUB:
>> A==NaN || B==NaN
>> Result is NaN
>> Else, Normal Case
>> Generally, perform calculation internally as twos complement (*).
>> ~ 66 bit so Int64 conversion works.
>>
>> FMUL:
>> A==NaN || B==NaN
>> Result is NaN
> <
> Sooner or later you will realize that when operands can all be NaNs, you
> should choose one over the rest. My 66000 has the following rules:
> 3-operands:: Rs3 is chosen over Rs2 over Rs1
> 2-operands:: Rs2 is chosen over Rs1
> 1-opernd:: Rs1 is chosen.
> <
> The 3-operand case propagates the augend of FMAC over the multiply
> operands since this operand recirculates in inner product codes. So,
> when SW intercepts exceptions and "does something with new NaNs"
> the earliest NaN carries the originating "does something" around with it.
> <

This probably assumes that one has NaNs that contain useful information of
some sort.

>> A==Zero || B==Zero
>> Result is Zero
> <
> A properly signed zero. result.s = operand1.s ^ operand2.s
> <

Possibly, though generating -0 (such as from the result of multiplying a
negative number with 0) seems to cause Quake's "progs.dat" VM to break...

But, it does work if one has all zeroes result in positive zeroes.
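
The hazard and a cheap fix in one C sketch (the helper name is made up): an IEEE 754 multiply of a negative number by zero yields -0.0, which is bit-different from +0.0 even though the two compare equal; adding +0.0 canonicalizes the zero under the default round-to-nearest mode.

```c
#include <math.h>

/* Force all zero results to positive zero: -0.0 + +0.0 == +0.0
   under round-to-nearest; nonzero values pass through unchanged. */
static double canonicalize_zero(double x)
{
    return x + 0.0;
}
```

Only bit-level consumers (like the progs.dat VM above) notice the difference, since -0.0 == +0.0 is true.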

>> Else, Normal Case
>> ~ DP: 54*54 ~> 68 (discards low-order results).
>>
>> Assuming here an implementation which lacks an FDIV instruction.
>>
>>
>> *: One could argue that ones' complement would be cheaper here than twos
>> complement, but the difference in the results is more obvious, and one
>> ends up with an FPU where frequently (A+B)!=(B+A), where (A+B)==(B+A) is
>> a decidedly "nice to have" property.
> <
> When one built circuits with individual transistors and packages with 1-2-4 gates
> in them, ones' complement was just as fast (or slow) as twos complement. In any CMOS
> technology, ones' complement has a longer data path because of the end-around
> carry. Thus, there are logic reasons to stick with twos complement.
> <

Mostly it is that ones' complement allows one to invert the sign using a
bitwise NOT, whereas twos complement requires either adding one (a carry
chain), or keeping track that a 1 needs to be added back in (via carry
propagation with a later ADD).
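
The two negations side by side, as a C sketch (helper names made up): ones' complement negation is a pure bitwise NOT, while twos complement negation is NOT plus an injected carry of 1, the extra carry chain in question.

```c
#include <stdint.h>

static int64_t neg_ones(int64_t x) { return ~x; }      /* = -x - 1 */
static int64_t neg_twos(int64_t x) { return ~x + 1; }  /* = -x     */
```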

>>
>> Also, if one assumes the ability to express integer values as floating
>> point values, and the ability to operate them and still get integer
>> results, then one is going to need twos complement here.
>>
>>
>>
>> For a 32-bit machine, it might make sense to only perform Binary32 ops
>> hardware, and fall back to software emulation for Binary64.
>>
> Note: I did not do this in Mc88100.

Possibly.

Both Cortex-M and MicroBlaze seem to go this route though.

This also seems to be a design option in RISC-V as well.

I went a different direction of initially only providing Binary64
operations directly, and using converter ops for Binary32 and Binary16.

Though, it is possible to fake having scalar Binary32 ops by using
2xFp32 SIMD ops and then ignoring the high half of the register.

>>
>> Though, this does lead to some problem cases:
>> 'math.h' functions;
>> 'atof' and similar;
>> ...
>>
>> Which are likely to take a pretty big performance hit if the hardware
>> only does Binary32.
>>
>>
>> If one implements FP compare with 'EQ' and 'GT' operators, then probably
>> only EQ needs to deal with NaN:
>> A==NaN || B==NaN: EQ always gives False.
>> Other cases: EQ is equivalent to integer EQ.
>>
>>
>> Mostly, because for a single 'GT' operator, there is no real sensible
>> way to handle NaN's, so it is easier to simply ignore their existence in
>> this case.
>>
>> Where, say:
>> A==B: EQ(A,B),BT
>> A!=B: EQ(A,B),BF
>> A> B: GT(A,B),BT
>> A<=B: GT(A,B),BF
>> A>=B: GT(B,A),BF
>> A< B: GT(B,A),BT
>>
> <
> if( a > b )
> { // a actually > b goes here }
> else
> { // a <= b || a== NaN || b == NaN goes here }
> <
> if( a <= b )
> { // a actually <= b goes here }
> else
> { // a > b || a== NaN || b == NaN goes here }
> <
> And this is why then-else rearrangement is fraught with peril in optimization.
> You can't flip the comparison operation and clauses (like you can with integers).

This does not work correctly with NaNs with how compare and branch
instructions work in my ISA.

Because there was no sensible way to handle it, this case was ignored.
NaNs will just sorta fall down whichever branch they would have fallen
down if they were not NaNs.

The "correct" behavior could in theory be faked by adding additional
branches by manually checking for NaN:
FCMPEQ R4, R4
BF .else
FCMPEQ R5, R5
BF .else
FCMPGT R4, R5
BT .else
... then ...
BRA .end
.else:
... else ...
.end:

This is not done, since it would slow stuff down and is "not actually
useful" in practice.
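
The flip hazard Mitch describes can be checked directly in C (the helper name is made up): with a NaN operand both (a > b) and (a <= b) are false, so rewriting "if (a > b) T else E" as "if (a <= b) E else T" sends NaNs down a different branch.

```c
#include <stdbool.h>
#include <math.h>

/* True when flipping the comparison and swapping the clauses picks
   the same branch as the original; false when a NaN is involved. */
static bool flip_is_safe(double a, double b)
{
    bool original = (a > b);    /* then-branch taken? */
    bool flipped  = !(a <= b);  /* then-branch taken after the flip */
    return original == flipped;
}
```

A self-compare (x != x), like the FCMPEQ Rn, Rn above, is the usual NaN test: it is true only for NaN.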

>>
>> Granted, one could try to argue for a different interpretation if one
>> assumes an ISA built around compare-and-branch instructions, or one
>> built on condition-code branches.
> <
> I simply argue that there should be enough encodings to encode all 10
> IEEE 754 specified behaviors.

Possibly, though it is debatable how relevant these semantics are.

>>>>
>>>> In practice, the distinction contributes little in terms of either
>>>> practical use cases nor is much benefit for debugging.
>>>>
>>>> Though, I am still generally in favor of keeping NaN around.
>>>>>> Define "Denormal As Zero" as canonical;
>>>>> <
>>>>> This no longer saves circuitry.
>>>>> <
>>>> Denormal numbers are only really "free" if one has an FMA unit rather
>>>> than separate FADD/FMUL (so, eg, both FMUL and FADD can share the same
>>>> renormalization logic).
>>>>
>>>> Though, the cheaper option here seemingly being to not have FMA, in
>>>> which case it is no longer free.
>>> <
>>> Then you are not compliant with IEEE 754-2008 so why bother with the
>>> rest of it.
>> I am not assuming this to match up with IEEE-754-2008, that is not the
>> point.
>>
>> This would likely need to be a separate spec, but would still basically
>> match up with IEEE 754 in terms of floating-point formats and similar.
> <
> That falls under the "why bother" moniker.

>>
>>
>> But, it would be nice to be able to have another spec, that is like IEEE
>> 754 but intended mostly for cheap embedded processors and
>> microcontrollers, but with much looser requirements and thus easier to hit.
> <
> IEEE baggage is "NOT EXPENSIVE" in modern fab technology.
> {While it might be in FPGA and those lesser technologies.}
> Most of the baggage overhead is designer time, not power, area, or cost.


Re: Mixed EGU/EGO floating-point

<b26cb5b2-896c-47af-9e7c-664921d2db82n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25334&group=comp.arch#25334

 by: Quadibloc - Mon, 16 May 2022 03:41 UTC

On Sunday, May 15, 2022 at 8:39:25 AM UTC-6, John Dallman wrote:

> It is. A manufacturer once told us we could not expect bit-identical
> results between two different kinds of FP registers on the same machine,
> both usable at the same time. For ISVs who use compilers, rather than
> assembler, that's deadly.

My original Concertina architecture has that characteristic for _some_
floating-point types.

For IEEE 754, though:

- when they're stored in the regular floating-point registers, they're stored
in an "internal form", similar to temporary real in the 8087, but up to 128
bits long.

- when they're stored in "short vector" registers (think of today's SIMD
vectors, from MMX onwards), that doesn't happen. But IEEE 754 rules
are _still followed_, it's just that the conversion to internal form before
calculating is done for each calculation - so the arithmetic takes a few
more cycles.

But the "simple floating" type, which can also be stored in the _integer_
registers, is therefore normally processed by an ALU that doesn't have
the ability to do rounding. But if one uses registers for variables of this
type that _can_ round, they will.

However, it isn't deadly. People writing compilers are instructed in
an applications note not to optimize by switching between sets of
registers that behave differently.

> Fortunately, it turned out that the people doing the presentation were
> confused, and the hardware was being sensible.

So there is no actual hardware of which you are aware which has this
characteristic. Or, at least, if such hardware exists, it isn't the hardware
made by the manufacturer whose salespeople presented it wrongly.

John Savard

Re: Mixed EGU/EGO floating-point

<b5d8795d-3db1-4aae-8c79-eca816e7c962n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25335&group=comp.arch#25335

 by: Quadibloc - Mon, 16 May 2022 04:08 UTC

On Sunday, May 15, 2022 at 9:41:58 PM UTC-6, Quadibloc wrote:
> On Sunday, May 15, 2022 at 8:39:25 AM UTC-6, John Dallman wrote:
>
> > It is. A manufacturer once told us we could not expect bit-identical
> > results between two different kinds of FP registers on the same machine,
> > both usable at the same time. For ISVs who use compilers, rather than
> > assembler, that's deadly.

> My original Concertina architecture has that characteristic for _some_
> floating-point types.
>
> For IEEE 754, though:
>
> - when they're stored in the regular floating-point registers, they're stored
> in an "internal form", similar to temporary real in the 8087, but up to 128
> bits long.
>
> - when they're stored in "short vector" registers (think of today's SIMD
> vectors, from MMX onwards), that doesn't happen. But IEEE 754 rules
> are _still followed_, it's just that the conversion to internal form before
> calculating is done for each calculation - so the arithmetic takes a few
> more cycles.
>
> But the "simple floating" type, which can also be stored in the _integer_
> registers, is therefore normally processed by an ALU that doesn't have
> the ability to do rounding. But if one uses registers for variables of this
> type that _can_ round, they will.
>
> However, it isn't deadly. People writing compilers are instructed in
> an applications note not to optimize by switching between sets of
> registers that behave differently.

I checked; the situation is described on the page

http://www.quadibloc.com/arch/ar010102.htm

and there _is_ a potential problem even with IEEE 754 floats.

Since they're in internal form in registers, they may have a few
extra guard bits in register calculations that don't involve memory.

I think, though, that this will only be the case for calculations
involving numbers in the denormal range.

As for the inspiration behind "simple floating", that can be seen
here:

https://longstreet.typepad.com/thesciencebookstore/2010/02/a-short-look-at-fifths-and-fives-joltin-joe-legal-sausage-and-the-structure-of-mathematics.html

The Recomp II, a computer made by the Autonetics division of
North American Aviation, was an inexpensive computer that used
a head-per-track disk as its main memory. So it basically behaved
the same as a computer with drum memory.

One of its features was that this early transistorized computer was
also the world's cheapest computer with hardware floating-point.

They cut corners on the hardware floating-point to achieve this.

Basically, their "hardware floating-point" involved slightly tweaking
its hardware integer arithmetic - instead of having specialized
floating-point circuits, they made it easy to do that by using one
integer for the exponent, and one integer for the mantissa.

As this machine had a 40-bit word length, this meant that floating-point
numbers had a range of magnitudes on the order of 2 to the power +/-
549,755,813,887... or 10 to the power +/- 165,492,990,269.

So, indeed, if one tried to write the numbers at the top of that range in
a conventional way, with lots of zeroes on the end, instead of in scientific
notation, they would stretch 2 1/2 times around the Earth, as advertised.

Basically, they took a cost-cutting compromise, if not a bug, and made it
into a feature.
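
A quick sanity check on that range, as a C sketch (assuming one 40-bit sign-magnitude word holds the exponent, so the largest exponent magnitude is 2^39 - 1):

```c
#include <stdint.h>
#include <math.h>

/* Largest exponent magnitude a 40-bit word can hold, one bit for sign. */
static int64_t recomp_exp_limit(void)
{
    return (INT64_C(1) << 39) - 1;   /* 549,755,813,887 */
}

/* The same limit expressed as a decimal exponent: n * log10(2). */
static double recomp_decimal_exponent(void)
{
    return (double)recomp_exp_limit() * log10(2.0);   /* ~1.65e11 */
}
```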

John Savard

Re: Mixed EGU/EGO floating-point

<2022May16.080036@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25336&group=comp.arch#25336

 by: Anton Ertl - Mon, 16 May 2022 06:00 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tkoenig@netcologne.de> writes:
>>>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>
>>>> No, we already have NaNs for that. He wants to be able to compute the
>>>> sum/product of all present data, and what he proposes works for that.
>>>
>>> s = sum(a,mask=.not. ieee_is_nan(a))
>>>
>>>works fine.
>>
>> Which CPU has this instruction?
>
>Any processor which implements Fortran.

I.e., none.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Mixed EGU/EGO floating-point

<2022May16.082647@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25338&group=comp.arch#25338

 copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Mon, 16 May 2022 06:26:47 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 15
Message-ID: <2022May16.082647@mips.complang.tuwien.ac.at>
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com> <2022May13.200449@mips.complang.tuwien.ac.at> <t5p9pf$52o$1@gioia.aioe.org> <t5pb7g$6og$1@dont-email.me> <2022May15.091837@mips.complang.tuwien.ac.at> <t5r53e$cjh$1@newsreader4.netcologne.de> <2022May15.193827@mips.complang.tuwien.ac.at> <t5re2t$h0o$1@newsreader4.netcologne.de> <6c1f724c-278a-43fe-8aab-efed4a1a310bn@googlegroups.com> <t5rm0u$kor$1@newsreader4.netcologne.de> <380df37d-3cf9-4bcc-a599-c31ac300137cn@googlegroups.com>
 by: Anton Ertl - Mon, 16 May 2022 06:26 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>While My 66000 has instructions FMAX and FMIN and you CAN implement
>FABS as FMAX(x,-x). It saves power to have its own instruction. Also note:
>FMAX( x, -x ) is 1 instruction because My 66000 ISA has sign control over
>operands already.

Anything wrong with x & 0x7fffffffffffffff ?

Given that you have to bear the costs of sign-magnitude, you should
also reap the benefits.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Mixed EGU/EGO floating-point

<t5t2me$rh5$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25341&group=comp.arch#25341

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Mon, 16 May 2022 03:39:27 -0500
Organization: A noiseless patient Spider
Lines: 422
Message-ID: <t5t2me$rh5$1@dont-email.me>
References: <t5qe8t$fh0$1@dont-email.me>
<memo.20220515153920.11824J@jgd.cix.co.uk>
<t5rco4$ga5$1@newsreader4.netcologne.de> <t5rn0u$lsi$1@dont-email.me>
<b78d6cfa-cba6-40ca-832b-86af306de767n@googlegroups.com>
<t5rum0$bu9$1@dont-email.me>
<2df555e6-2b58-4972-8fd2-87f8cace547dn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 16 May 2022 08:40:46 -0000 (UTC)
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
In-Reply-To: <2df555e6-2b58-4972-8fd2-87f8cace547dn@googlegroups.com>
Content-Language: en-US
 by: BGB - Mon, 16 May 2022 08:39 UTC

On 5/15/2022 6:15 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 5:26:11 PM UTC-5, BGB wrote:
>> On 5/15/2022 3:40 PM, MitchAlsup wrote:
>
>>>> I am currently in the "everything goes in GPRs camp":
>>>> If one has 64-bit GPRs, then pretty much everything can go into them;
>>>> If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.
>>> <
>>> My 66000 does all SIMD stuff as vectorized loops. No need for a
>>> SIMD register file, either.
>> No SIMD file in my case, only GPRs.
>>
>> SIMD operations exist, in one of several "flavors":
>> Those that operate on 64 bits, and use a single GPR;
>> Those that operate on 128-bits, and use a pair.
> <
> Yes, but I am looking both forwards and backwards at the same time.
> I have SIMD of 64×2^k for some integer k determined by the implementation.
> So, on lowest end machines, VVM runs 1 iteration per cycle, in middle range
> implementations VVM runs 2 or 4 iterations per cycle, and in higher end
> implementations, VVM runs 4-8-16 iterations per cycle.
> <
> Especially note: there are 0 SIMD OpCodes in the ISA.
>>
>>
>> Which in turn has other effects:
>> Operations on 64-bit vectors can in many cases be bundled;
>> Operations on 128-bit vectors can't be bundled.
> <
> This is the static look I mentioned above. You want an architecture that at the
> lowest end runs 1 instruction per cycle throughout the pipeline. At the other
> end you want the pipeline to run 6-8-10 instruction per cycle--ALL FROM THE
> SAME instruction stream.

>>
>> Mostly because the latter cases eat multiple lanes.
> <
> As Ivan would say:: Fix It !

The only way to avoid eating multiple lanes with 128-bit SIMD would be
for each lane to have a 128-bit operand path (with 128-bit register
ports, ...).

As a result of the 64-bit data widths, running 128-bit data through the
pipeline requires instructions spanning multiple lanes.

A similar case comes up with Imm56 and Imm64 encodings, since each lane
only has 33 bits for the immediate (but making it wider is not likely to
be cost-effective).

Most other parts (ALUs, shifters, ...) are 64 bits wide, with 128-bit
ALU operations implemented via trickery.

>>>>
>>>>
>>>> This is even with 'int' and friends still being 32-bit, where most older
>>>> code doesn't make particularly effective use of an ISA which has 64-bit
>>>> registers.
>>> <
>>> # define int int64_t
>> One could do this...
>>
>> Wouldn't achieve much for most older code beyond wasting memory and
>> probably causing it to no longer work correctly.
>>>>
>>>> One could potentially argue for limiting 'int' operations to 32 bits,
>>>> ignoring the high half of the register in these cases, but this would be
>>>> "kinda lame".
>>>>
>>>> One could potentially have such an ISA which natively implements
>>>> arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
>>>> a multi-instruction sequence (with manual carry propagation).
>>>>
>>> Vectorization, instead.
>> I was thinking where instead of having a 64-bit ADD, one used a 2x
>> 32-bit ADD, and then had additional instructions to update the high-word
>> based on the preceding instructions' result in the low-word.
> <
> And this is used "how often" ???

As-is, 64-bit ADD is in ~ 14th place in the instruction-use ranking,
with 32-bit ADDS.L in 8th place (the ADDS.L instruction does a 32-bit
addition and then sign-extends the result).

In terms of usage-ranking, ADD comes shortly after Load/Store ops,
Branch instructions, and loading an immediate into a register.

ADD is closely followed by SHAD (arithmetic shift) and CMPGT
(compare greater-than).

It is preceded by LDI (constant load), and MOV (Reg,Reg).

IOW: Probably wouldn't exactly be ideal to turn it into a multi-op sequence.

>>
>> But, yes, from an ISA design POV this would be worse than just having a
>> 64-bit ADD instruction.
>>
>>
>> Meanwhile: IRL, I ended up going the other direction and having a
>> 128-bit ADDX instruction (sometimes useful).
>>>>
>>>>
>>>> But, at this point I would probably still prefer an ISA with a single
>>>> larger register space (and a few asinine special cases) to the
>>>> traditional 3-way slit between GPRs, FPU, and SIMD registers.
>>>>
>>>>
>>>> Though, code which works with a lot of 128-bit vectors might still push
>>>> the limits of a 32 GPR design when each vector uses 2 GPRs.
>>>>
>>>> One could argue for one of:
>>>> One has 64 GPRs, which is awkward for encoding reasons;
>>>> One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
>>> <
>>> Not with VVM. PLUS: when you widen up the machine capabilities
>>> you don't need new register resources, nor do you even need to
>>> change the code to assess these new wider data paths.
>>> <
>> Doing SIMD the way I did it seemed like the simplest/cheapest option.
> <
> With no (== little) look to future implementations. where you have 10× the
> resources you have today.........

Only way this is gonna happen is if I get enough money to afford bigger
FPGAs.

For 10x the resources, this would put the FPGA's resources into Virtex
territory; Virtex FPGAs are *EXPENSIVE*...

If I could justify the cost, I could buy a "Nexys Video" board, which
would give me around 2x the FPGA resources of the "Nexys A7".

Getting 3x to 5x the resource budget would basically be Kintex territory
(typically $k for boards this size).

>>
>> No current plans to go beyond 128 bits.
>>>> One has 32 regs, with half the space being "SIMD only".
>>>>
>>>>
>>>> In the latter case, say:
>>>> X0 -> R1:R0
>>>> X2 -> R3:R2
>>>> ...
>>>> X30 -> R31:R30
>>> <
>>> Not sure you want to HW to allow using the SP and/or FP as SIMD registers
>>> (assuming you want SIMD registers)
> <
>> A few of the registers are basically "undefined" if accessed as 128-bit:
>> (R1:R0), May be encoded, but effectively undefined at present;
>> (R15:R14), Contains SP, also undefined.
> <
> Strange place to put SP and FP........

R15 = SP.

SP is in the same place as it was on SuperH...

Despite the ISA being beaten beyond recognition, its design did still
evolve out of SuperH, and as a result, many of the registers are still
in the same places as where they were in SuperH.

Some things did change though, like the return-value register moved from
R0 to R2, the register space was expanded from 16->32->64, ...

A lot of other stuff has changed more significantly though.

>>
>> If the emulator sees these cases, it will turn the instruction into a
>> breakpoint.
>>
>> If the CPU core encounters this, it will likely access the GPR space
>> "behind" these registers: the SPRs are implemented as special register
>> IDs that the decoder overlays on top of the GPR space, and the logic
>> for SIMD registers would basically side-step this remapping.
>>
>> In the RISC-V mode, similar remapping is also done, though a few of the
>> registers are mapped to different locations.
>>> <
>>>> X1,X3,...: Only existing as 128b SIMD registers.
>>>>
>>>> But, this kinda sucks, as one can no longer use any operation on any
>>>> register.
>>> <
>>> kinda is a serious understatement.
>> This is why I later reworked this into the XGPR extension...
>>
>> There are some cracks in the design, but being able to (more or less)
>> access all of the registers in a consistent way is much preferable to
>> having asymmetric access to half of the register space by only certain
>> types of instructions.
>>
>> If XGPR is not enabled, we can assume that R32..R63 do not exist (and
>> the low bit of the register for 128-bit operations is "Must Be Zero").
> <
> What do you do for code compiled assuming XGPR exists and you want to
> run on THIS implementation ??

It won't work.

These scenarios are not binary compatible with each other.

There is no strict requirement that code built for BJX2 with one
profile be able to run on a core built with a different profile.

