Re: Configurable rounding modes (was The value of floating-point exceptions?)

<sdsglg$pn1$1@dont-email.me>

https://www.novabbs.com/computers/article-flat.php?id=19290&group=comp.arch#19290

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Configurable rounding modes (was The value of floating-point
exceptions?)
Date: Wed, 28 Jul 2021 16:02:38 -0500
Organization: A noiseless patient Spider
Lines: 171
Message-ID: <sdsglg$pn1$1@dont-email.me>
References: <sd9a9h$ro6$1@dont-email.me>
<memo.20210721153537.10680P@jgd.cix.co.uk>
<e9c738bd-7c0e-4b3b-9385-3a0d0658b059n@googlegroups.com>
<a74c6bf2-9ad1-4969-b3cb-b650ae8ebdadn@googlegroups.com>
<sde6m7$kr1$1@dont-email.me> <sde74m$nio$1@dont-email.me>
<7cf5713e-f138-488b-9ccf-d85df84c50can@googlegroups.com>
<e7e0b9a2-7990-4ec8-9c40-a6e9a07bd306n@googlegroups.com>
<fc5a33d0-7c17-4855-8ab3-162884bd6b7bn@googlegroups.com>
<713a35af-9cce-4954-b968-1b4b754e7b1en@googlegroups.com>
<sdjghd$1i4a$1@gioia.aioe.org> <sdk2kd$lu1$1@dont-email.me>
<476a9f6b-5fa0-4606-aed6-cf31089b8c5bn@googlegroups.com>
<sdlo5n$da3$1@dont-email.me>
<1f8827fe-a288-4f2e-8e4e-40cc343febbdn@googlegroups.com>
<sdol45$69t$1@dont-email.me>
<e18db4b1-445e-4922-a417-4d48971b160bn@googlegroups.com>
<sdpiea$ub0$1@dont-email.me> <sdpk4c$kdh$1@dont-email.me>
<sdrupc$ibc$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 28 Jul 2021 21:02:40 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7c79c4a3f13d5bd91860f90c8d06b95d";
logging-data="26337"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ol2kB5FC7JWXGRaskJcXV"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:czhqZtmkhdA3+tHn7BTUA6SB9Sk=
In-Reply-To: <sdrupc$ibc$1@dont-email.me>
Content-Language: en-US

by: BGB - Wed, 28 Jul 2021 21:02 UTC

On 7/28/2021 10:57 AM, Marcus wrote:
> On 2021-07-27 20:43, BGB wrote:
>> On 7/27/2021 1:14 PM, Ivan Godard wrote:
>>> On 7/27/2021 10:32 AM, MitchAlsup wrote:
>>>> On Tuesday, July 27, 2021 at 4:54:15 AM UTC-5, Marcus wrote:
>>>>> On 2021-07-26 18:19, MitchAlsup wrote:
>>>>>> On Monday, July 26, 2021 at 2:27:54 AM UTC-5, Marcus wrote:
>>>>>>> On 2021-07-25 19:22, MitchAlsup wrote:
>>>>>>>> On Sunday, July 25, 2021 at 11:14:08 AM UTC-5, BGB wrote:
>>>>>>>>> On 7/25/2021 6:05 AM, Terje Mathisen wrote:
>>>>>>>>>> MitchAlsup wrote:
>>>>>>>>>>> Having watched this from inside:
>>>>>>>>>>> a) HW designers know a lot more about this today than in 1980
>>>>>>>>>>> b) even systems that started out as IEEE-format gradually went
>>>>>>>>>>> closer and closer to full IEEE-compliant (GPUs) until there
>>>>>>>>>>> is no
>>>>>>>>>>> useful difference in the quality of the arithmetic.
>>>>>>>>>>> c) once 754-2009 came out the overhead to do denorms went to
>>>>>>>>>>> zero, and there is no reason to avoid full speed denorms in
>>>>>>>>>>> practice.
>>>>>>>>>>> (BGB's small FPGA prototyping environment aside.)
>>>>>>>>>>
>>>>>>>>>> I agree.
>>>>>>>>>>
>>>>>>>>>>> d) HW designers have learned how to perform all of the rounding
>>>>>>>>>>> modes at no overhead compared to RNE.
>>>>>>>>>>
>>>>>>>>>> This is actually dead easy since all the other modes are
>>>>>>>>>> easier than
>>>>>>>>>> RNE: As soon as you have all four bits required for RNE (i.e.
>>>>>>>>>> sign/ulp/guard/sticky) then the remaining rounding modes only
>>>>>>>>>> need
>>>>>>>>>> various subsets of these, so you use the rounding mode to
>>>>>>>>>> route one of 5
>>>>>>>>>> or 6 possible 16-entry one-bit lookup tables into the rounding
>>>>>>>>>> circuit
>>>>>>>>>> where it becomes the input to be added into the ulp position
>>>>>>>>>> of the
>>>>>>>>>> final packed (sign/exp/mantissa) fp result.
>>>>>>>>>>
>>>>>>>>> Oddly enough, the extra cost to rounding itself is not the main
>>>>>>>>> issue
>>>>>>>>> with multiple rounding modes, but more the question of how the
>>>>>>>>> bits get
>>>>>>>>> there (if one doesn't already have an FPU status register or
>>>>>>>>> similar).
>>>>>>>>>
>>>>>>>>> Granted, could in theory put these bits in SR or similar, but,
>>>>>>>>> yeah...
>>>>>>>>>
>>>>>>>>> It would be better IMO if it were part of the instruction, but
>>>>>>>>> there
>>>>>>>>> isn't really any good / non-annoying way to encode this.
>>>>>>>> <
>>>>>>>> And this is why they are put in control/status registers.
>>>>>>>> <
>>>>>>> There are several problems with this, but the *main* problem is
>>>>>>> that the
>>>>>>> rounding mode setting becomes a super-global variable.
>>>>>> <
>>>>>> Given that they cannot be in a GPR or an FPR, Oh wise one,
>>>>>> Where would you put them ?
>>>>>> <
>>>>> At the same place that we specify the floating-point precision: in the
>>>>> instruction.
>>>>>>> If one subroutine
>>>>>>> touches the register, it affects all code in the same thread. And
>>>>>>> it's
>>>>>>> "super-global" because it crosses source code and language barriers
>>>>>> <
>>>>>> But does not cross thread or task boundaries. So its not "like
>>>>>> memory" either.
>>>>>> <
>>>>>>> (e.g. consider a program written in Go that has a Python
>>>>>>> scripting back
>>>>>>> end that calls out to a DLL that is written in C++ that changes the
>>>>>>> floating-point rounding mode...).
>>>>>>>
>>>>>>> As I have accounted for elsewhere, this is a real problem. As a SW
>>>>>>> developer I therefore prefer to work with architectures that do
>>>>>>> not have
>>>>>>> an FPU control register - /even/ if that means that I can only
>>>>>>> use RNE
>>>>>>> (because let's face it - that's the only rounding mode I'm ever
>>>>>>> going
>>>>>>> to use anyway).
>>>>>> <
>>>>>> So you discount libraries that may contain interval arithmetic or
>>>>>> properly
>>>>>> rounded transcendentals ?
>>>>> You might have misunderstood my point. As a software developer, if I
>>>>> have a choice between:
>>>>>
>>>>> 1. Rounding mode is configured in a thread global register.
>>>>> 2. Rounding mode is always RNE.
>>>>>
>>>>> ...then I pick 2, because that gives me 100% predictable results,
>>>>> whereas with 1 all bets are off (read my other examples as to why this
>>>>> is the case).
>>>>>
>>>>> *However*, if there is also the option:
>>>>>
>>>>> 3. Rounding mode is part of the instruction.
>>>>>
>>>>> ...then I pick 3.
>>>> <
>>>> Do you still pick 3 if the average instruction went from 32-bits in
>>>> size to
>>>> 37 bits in size ?
>>>
>>> Come on, no straw men please. If adding 3 bits to a dozen opcodes
>>> pushes the average of the whole ISA from 32 to 37 bits then you have
>>> bigger encoding problems than FP rounding.
>>>
>>> Five modes times 12 opcodes adds 60 models to an existing ~1500, or
>>> something like 2% of a bit to the entropy. If you can't do it in
>>> under a bit then you should change your encoding to fix the crappy
>>> entropy utilization.
>>
>> I agree.
>>
>> In my case, while this option effectively doubles the size of the
>> instructions in question, I expect that the average case impact will
>> be much smaller than this, given:
>> FPU instructions are statistically infrequent (vs memory or ALU ops);
>> One is only likely to need specialized rounding modes in obscure edge
>> cases (if at all).
>>
>> I could have done it within the existing 32-bit encoding, but was not
>> feeling inclined to do so (not common enough to be worth the cost in
>> encoding space).
>
> I agree too. Using up some 50 opcodes for FP instructions would be
> doable, but given how rare non-RNE rounding modes are, I think I'll just
> punt the problem and add a prefix mechanism at a later time (I have
> spared some encoding pages for things like that, and I'm already
> thinking about adding something like Mitch's CARRY prefix for multi-
> precision arithmetic).
>

Yeah. As noted, I did it via an Op64 prefix:
FFwZ_ZZii

Where:
w: Extend GPR bits.
ZZZ: Extend Opcode (000 maps to the 32-bit encoding space)
ii: Extend Immed (or function as a 4th Imm8/Reg6).
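
For anyone who wants to poke at the field split, here is a rough sketch. It assumes the FFwZ_ZZii pattern is read as eight hex digits of one 32-bit value, most significant digit first, and that a top byte of FF is enough to identify the prefix. The real in-memory word order may differ, and the names `Op64Prefix` and `split_op64_prefix` are made up for illustration.

```python
from typing import NamedTuple, Optional


class Op64Prefix(NamedTuple):
    w: int    # 4-bit field: extends the GPR bits
    zzz: int  # 12-bit field: extends the opcode (0 maps to the 32-bit space)
    ii: int   # 8-bit field: extends the immediate, or acts as a 4th Imm8/Reg6


def split_op64_prefix(word: int) -> Optional[Op64Prefix]:
    """Split a 32-bit word laid out as FFwZ_ZZii (assumed digit order).

    Returns None when the top byte is not FF, i.e. the word is not
    treated as a prefix under this sketch's assumptions.
    """
    if (word >> 24) & 0xFF != 0xFF:
        return None
    return Op64Prefix(
        w=(word >> 20) & 0xF,
        zzz=(word >> 8) & 0xFFF,
        ii=word & 0xFF,
    )


if __name__ == "__main__":
    print(split_op64_prefix(0xFF1ABC05))
```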

Then made it so that the existing FADD/FSUB/FMUL ops will decode into a
form where the Imm8 extension serves as a rounding mode.
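
As a sketch of what such a rounding-mode field could then select, this follows the scheme Terje described upthread: one 16-entry one-bit table per mode, indexed by sign/ulp/guard/sticky, whose output is added at the ulp position. The mode numbering and the names (`Mode`, `ROUND_TABLES`, `round_magnitude`) are assumptions for illustration; nothing above says which Imm8 values map to which modes.

```python
from enum import IntEnum


class Mode(IntEnum):  # hypothetical numbering, not taken from any ISA
    RNE = 0  # round to nearest, ties to even
    RTZ = 1  # round toward zero (truncate)
    RUP = 2  # round toward +infinity
    RDN = 3  # round toward -infinity
    RNA = 4  # round to nearest, ties away from zero


def _build_table(rule) -> int:
    """Pack a 16-entry one-bit table into an int; bit idx holds the entry."""
    table = 0
    for idx in range(16):
        sign, ulp = (idx >> 3) & 1, (idx >> 2) & 1
        guard, sticky = (idx >> 1) & 1, idx & 1
        if rule(sign, ulp, guard, sticky):
            table |= 1 << idx
    return table


ROUND_TABLES = {
    Mode.RNE: _build_table(lambda s, u, g, t: g and (t or u)),
    Mode.RTZ: _build_table(lambda s, u, g, t: 0),
    Mode.RUP: _build_table(lambda s, u, g, t: (not s) and (g or t)),
    Mode.RDN: _build_table(lambda s, u, g, t: s and (g or t)),
    Mode.RNA: _build_table(lambda s, u, g, t: g),
}


def round_increment(mode: Mode, sign: int, ulp: int, guard: int, sticky: int) -> int:
    """Look up the one-bit increment to add at the ulp position."""
    idx = (sign << 3) | (ulp << 2) | (guard << 1) | sticky
    return (ROUND_TABLES[mode] >> idx) & 1


def round_magnitude(mode: Mode, sign: int, mag: int, extra_bits: int) -> int:
    """Round a magnitude that carries extra_bits (>= 1) bits below the ulp."""
    kept = mag >> extra_bits
    guard = (mag >> (extra_bits - 1)) & 1
    sticky = int((mag & ((1 << (extra_bits - 1)) - 1)) != 0)
    return kept + round_increment(mode, sign, kept & 1, guard, sticky)
```

The mode only picks which table gets routed in; the adder at the ulp position is shared, which is why the extra modes cost next to nothing once the four bits exist.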

In terms of 3R->4R ops, one of several expansions can happen:
Rm, Ro, Rn, Rn: Default (Same as normal 3R), Imm8 is ignored.
Rm, Ro, I8, Rn: Possibility 1 (Used as Imm8)
Rm, Ro, Rp, Rn: Possibility 2 (Used as a Register)

When decoded without the Op64 prefix:
Rm, Ro, Rn, Rn: Default.
Rm, Ro, 0, Rn: Possibility 1 (Rp gets 0)
Rm, Ro, Rn, Rn: Possibility 2 (Rp gets Rn)
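
The two tables above can be folded into one small function. This is only a sketch: the `Expand` names and `expand_3r` are hypothetical, and taking the register from the low 6 bits of ii is an assumption (the "Reg6" above does not say which bits are used).

```python
from enum import Enum
from typing import Optional, Tuple


class Expand(Enum):  # hypothetical names for the three cases
    DEFAULT = 0  # same as normal 3R; ii is ignored
    IMM8 = 1     # third operand is an Imm8
    REG = 2      # third operand is a register (Rp)


def expand_3r(rm: int, ro: int, rn: int, kind: Expand,
              prefix_ii: Optional[int] = None) -> Tuple[int, int, int, int]:
    """Expand a 3R op to its 4R form (Rm, Ro, third, Rn).

    prefix_ii is the ii field of the Op64 prefix, or None when the
    instruction is decoded without a prefix.
    """
    if kind is Expand.IMM8:
        # Without the prefix the immediate slot gets 0.
        third = 0 if prefix_ii is None else prefix_ii & 0xFF
    elif kind is Expand.REG:
        # Without the prefix Rp gets Rn; low 6 bits of ii are assumed otherwise.
        third = rn if prefix_ii is None else prefix_ii & 0x3F
    else:
        third = rn
    return (rm, ro, third, rn)


if __name__ == "__main__":
    print(expand_3r(4, 5, 6, Expand.IMM8, prefix_ii=0x02))  # prefixed
    print(expand_3r(4, 5, 6, Expand.IMM8))                  # no prefix
```

Under this reading, an unprefixed FADD/FSUB/FMUL would presumably fall into the IMM8 case and see a rounding-mode field of 0.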

> That would increase the size of FP instructions that use non-default
> rounding modes from 32 bits to 64 bits (worst case), which is
> acceptable in my book.
>

Likewise.

> /Marcus
