From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Configurable rounding modes (was The value of floating-point
exceptions?)
Date: Tue, 27 Jul 2021 11:54:12 +0200
Organization: A noiseless patient Spider
Lines: 212
Message-ID: <sdol45$69t$1@dont-email.me>
References: <sd9a9h$ro6$1@dont-email.me>
<memo.20210721153537.10680P@jgd.cix.co.uk>
<e9c738bd-7c0e-4b3b-9385-3a0d0658b059n@googlegroups.com>
<a74c6bf2-9ad1-4969-b3cb-b650ae8ebdadn@googlegroups.com>
<sde6m7$kr1$1@dont-email.me> <sde74m$nio$1@dont-email.me>
<7cf5713e-f138-488b-9ccf-d85df84c50can@googlegroups.com>
<e7e0b9a2-7990-4ec8-9c40-a6e9a07bd306n@googlegroups.com>
<fc5a33d0-7c17-4855-8ab3-162884bd6b7bn@googlegroups.com>
<713a35af-9cce-4954-b968-1b4b754e7b1en@googlegroups.com>
<sdjghd$1i4a$1@gioia.aioe.org> <sdk2kd$lu1$1@dont-email.me>
<476a9f6b-5fa0-4606-aed6-cf31089b8c5bn@googlegroups.com>
<sdlo5n$da3$1@dont-email.me>
<1f8827fe-a288-4f2e-8e4e-40cc343febbdn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 27 Jul 2021 09:54:13 -0000 (UTC)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
In-Reply-To: <1f8827fe-a288-4f2e-8e4e-40cc343febbdn@googlegroups.com>
Content-Language: en-US

On 2021-07-26 18:19, MitchAlsup wrote:
> On Monday, July 26, 2021 at 2:27:54 AM UTC-5, Marcus wrote:
>> On 2021-07-25 19:22, MitchAlsup wrote:
>>> On Sunday, July 25, 2021 at 11:14:08 AM UTC-5, BGB wrote:
>>>> On 7/25/2021 6:05 AM, Terje Mathisen wrote:
>>>>> MitchAlsup wrote:
>>>>>> Having watched this from inside:
>>>>>> a) HW designers know a lot more about this today than in 1980
>>>>>> b) even systems that started out as IEEE-format gradually went
>>>>>> closer and closer to full IEEE-compliant (GPUs) until there is no
>>>>>> useful difference in the quality of the arithmetic.
>>>>>> c) once 754-2009 came out the overhead to do denorms went to
>>>>>> zero, and there is no reason to avoid full speed denorms in practice.
>>>>>> (BGB's small FPGA prototyping environment aside.)
>>>>>
>>>>> I agree.
>>>>>
>>>>>> d) HW designers have learned how to perform all of the rounding
>>>>>> modes at no overhead compared to RNE.
>>>>>
>>>>> This is actually dead easy since all the other modes are easier than
>>>>> RNE: As soon as you have all four bits required for RNE (i.e.
>>>>> sign/ulp/guard/sticky) then the remaining rounding modes only need
>>>>> various subsets of these, so you use the rounding mode to route one of 5
>>>>> or 6 possible 16-entry one-bit lookup tables into the rounding circuit
>>>>> where it becomes the input to be added into the ulp position of the
>>>>> final packed (sign/exp/mantissa) fp result.
>>>>>
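Terje's scheme can be sketched in a few lines of C. The mode names and the
sign/lsb/guard/sticky bit ordering below are my own choices, not a
description of any particular FPU:

```c
#include <assert.h>
#include <stdint.h>

enum rm { RNE, RTZ, RUP, RDN, RMM };

/* "Do we add 1 at the ulp position?" -- the logical definition per mode. */
static int round_up(enum rm m, int sign, int lsb, int guard, int sticky)
{
    switch (m) {
    case RNE: return guard && (sticky || lsb);   /* nearest, ties to even */
    case RTZ: return 0;                          /* truncate              */
    case RUP: return !sign && (guard || sticky); /* toward +infinity      */
    case RDN: return sign && (guard || sticky);  /* toward -infinity      */
    case RMM: return guard;                      /* nearest, ties away    */
    }
    return 0;
}

/* The same decision packed into a 16-entry one-bit table, as described:
   the rounding mode selects a table, the four bits index into it. */
static uint16_t make_table(enum rm m)
{
    uint16_t t = 0;
    for (int i = 0; i < 16; i++)
        t |= (uint16_t)(round_up(m, (i >> 3) & 1, (i >> 2) & 1,
                                 (i >> 1) & 1, i & 1) << i);
    return t;
}
```

The selected table's output bit is then what gets added into the ulp
position of the packed (sign/exp/mantissa) result.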
>>>> Oddly enough, the extra cost to rounding itself is not the main issue
>>>> with multiple rounding modes, but more the question of how the bits get
>>>> there (if one doesn't already have an FPU status register or similar).
>>>>
>>>> Granted, could in theory put these bits in SR or similar, but, yeah...
>>>>
>>>> It would be better IMO if it were part of the instruction, but there
>>>> isn't really any good / non-annoying way to encode this.
>>> <
>>> And this is why they are put in control/status registers.
>>> <
>> There are several problems with this, but the *main* problem is that the
>> rounding mode setting becomes a super-global variable.
> <
> Given that they cannot be in a GPR or an FPR, Oh wise one,
> Where would you put them ?
> <

At the same place that we specify the floating-point precision: in the
instruction.
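For a shipping example of exactly this, RISC-V's F extension carries a
3-bit rm field in every FP instruction (0=RNE, 1=RTZ, 2=RDN, 3=RUP,
4=RMM, 7=dynamic, i.e. defer to the fcsr register). A sketch of the
encoding; the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* RISC-V R-type FP layout: funct7[31:25] rs2[24:20] rs1[19:15]
   rm[14:12] rd[11:7] opcode[6:0] -- the rounding mode rides in rm. */
static uint32_t enc_fp(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                       uint32_t rm, uint32_t rd, uint32_t opcode)
{
    return funct7 << 25 | rs2 << 20 | rs1 << 15 |
           rm << 12 | rd << 7 | opcode;
}

/* Extract the per-instruction rounding mode again. */
static uint32_t rm_field(uint32_t insn) { return (insn >> 12) & 0x7u; }
```

Since rm=7 means "use the fcsr", RISC-V effectively offers both the
per-instruction and the global-register model at once.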

>> If one subroutine
>> touches the register, it affects all code in the same thread. And it's
>> "super-global" because it crosses source code and language barriers
> <
> But it does not cross thread or task boundaries. So it's not "like memory" either.
> <
>> (e.g. consider a program written in Go that has a Python scripting back
>> end that calls out to a DLL that is written in C++ that changes the
>> floating-point rounding mode...).
>>
>> As I have accounted for elsewhere, this is a real problem. As a SW
>> developer I therefore prefer to work with architectures that do not have
>> an FPU control register - /even/ if that means that I can only use RNE
>> (because let's face it - that's the only rounding mode I'm ever going
>> to use anyway).
> <
> So you discount libraries that may contain interval arithmetic or properly
> rounded transcendentals ?

You might have misunderstood my point. As a software developer, if I
have a choice between:

1. Rounding mode is configured in a thread global register.
2. Rounding mode is always RNE.

...then I pick 2, because that gives me 100% predictable results,
whereas with 1 all bets are off (see my other examples as to why this
is the case).

*However*, if there is also the option:

3. Rounding mode is part of the instruction.

...then I pick 3.

>>
>> Having the rounding modes as part of the instruction is *much* better
>> from a SW developer's perspective - but that blows up opcode space for
>> a feature that is never used.
> <
> And there are cases where this philosophy would blow up code space
> when they were used. Does the compiler have to generate code for
> foo() in RNE, and RTPI and RTNI, and RTZ, and RTMAG modes?

Does the compiler have to generate code for foo() in binary16, binary32,
binary64 and binary128? No. The developer selects the precision and
rounding mode(s). The developer knows what rounding mode to use for a
particular algorithm.

>>
>> Perhaps your prefix instruction paradigm could be used for this (e.g.
>> like "CARRY")?
> <
> If I wanted 'per instruction RMs' this would be how I did it. A CARRY-like
> RM-instruction-modifier could cast RM over 5-ish subsequent instructions.

Sounds reasonable.

>>
>>> < Probably the
>>>> "least awful" would probably be to use an Op64 encoding, which then uses
>>>> some of the Immed extension bits to encode a rounding mode.
>>> <
>>> The argument against having them in instructions is that this prevents
>>> someone from running the code several times with different rounding
>>> modes set to detect any sensitivity to the actually chosen rounding mode.
>>> Kahan said he uses this a lot.
>>
>> ...let me question the usefulness of that. If I were to do such
>> experiments I would just compile the algorithm with different rounding
>> modes set explicitly (e.g. as a C++ template argument or something).
> <
> So you would end up with 5 copies of FPPPP, one for each rounding mode
> (at the cost of 8K×sizeof(inst) per rounding mode = 32KB/mode = 160KB)
> ??!!?

The use case given here sounds like it has more to do with experimenting
with different settings than with actually generating production code.
As such it would be easy to just re-compile the source code with
different rounding mode settings. I occasionally do this when
experimenting with different floating-point precisions (usually single
vs. double precision), e.g. using a C #define.

>>
>>>>
>>>>
>>>> * FFw0_00ii_F0nm_5eo8 FADD Rm, Ro, Rn, Imm8
>>>> * FFw0_00ii_F0nm_5eo9 FSUB Rm, Ro, Rn, Imm8
>>>> * FFw0_00ii_F0nm_5eoA FMUL Rm, Ro, Rn, Imm8
>>>>
>>>> Where the Imm8 field encodes the rounding mode, say:
>>>> 00 = Round to Nearest.
>>>> 01 = Truncate.
>>>>
>>>> Or could go the SR route, but I don't want FPU behavior to depend on SR.
>>> <
>>> When one has multi-threading and control/status register, one simply
>>> reads the RM field and delivers it to the FU as an operand. A couple
>>> of interlock checks means you don't really have to stall the pipeline
>>> because these modes don't change all that often.
>>> <
>>
>> Again, that's the HW perspective. But from a SW perspective you *don't*
>> want RM to be part of a configuration register.
>>
>> Global variables are bad. This was well established some 50 years ago
>> (e.g. Wulf, W.A., Shaw, M., "Global Variables Considered Harmful" from
>> 1973). Global configuration registers even more so.
> <
> So Root pointers are now considered Harmful ?!?
>>
>>>>> Since the hidden bit is already hidden at this point, any rounding
>>>>> overflow of the mantissa from 0xfff.. to 0x000.. will cause the exponent
>>>>> term to be incremented, possibly all the way to Inf. In all cases, this
>>>>> is the exactly correct behaviour.
>>>>>
>>>> Yep.
>>>>
>>>> Main limiting factor though is that for bigger formats (Double or FP96),
>>>> propagating the carry that far can be an issue.
>>> <
>>> Kogge-Stone adders!
>>>>
>>>> In the vast majority of cases, the carry gets absorbed within the low 8
>>>> or 16 bits or so (or if it doesn't, leave these bits as-is).
>>>>
>>>> For narrowing conversions to Binary16 or Binary32, full width rounding
>>>> is both easier and more useful.
>>>>
>>>>
>>>>
>>>> For FADD/FSUB, the vast majority of cases where a very long stream of
>>>> 1's would have occurred can be avoided by doing the math internally in
>>>> twos complement form.
>>>>
>>>> Though, in this case, one can save a little cost by implementing the
>>>> "twos complement" as essentially ones' complement with a carry bit input
>>>> to the adder (one can't arrive at a case where both inputs are negative
>>>> with FADD).
>>> <
>>> This is a standard trick that everyone should know--I first saw it in the
>>> PDP-8's complement-and-increment instruction--but it has come in
>>> handy several times and is the way operands are negated and complemented
>>> in My 66000. The operand is conditionally complemented with a carry-in
>>> conditionally asserted. If the operand being processed is an integer, there
>>> is an adder that deals with the carry-in. If the operand is logical, there is
>>> no adder and the carry-in is ignored.
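In C the trick looks like this (a sketch of the idea, not My 66000's
actual datapath):

```c
#include <assert.h>
#include <stdint.h>

/* Conditional negation as conditional ones' complement plus a carry-in:
   when neg is set, every bit is inverted (one XOR stage) and a 1 enters
   the adder's carry input, so ~a + 1 == -a; when neg is clear the
   operand passes through unchanged and the carry-in is 0. */
static uint32_t cond_negate(uint32_t a, uint32_t neg)
{
    uint32_t mask = neg ? 0xFFFFFFFFu : 0u;
    return (a ^ mask) + neg;
}
```

A logical operation simply ignores the carry-in, which is why the
conditional complement costs nothing there.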
>>>>
>>>>
>>>> Cases can occur where the result mantissa comes up negative,
>>>> which can itself require a sign inversion. The only alternative
>>>> is to compare mantissa input values by value if the exponents are equal,
>>>> which is also fairly expensive.
>>>>
>>>> Though, potentially one could use the rounding step to "absorb" part of
>>>> the cost of the second sign inversion.
>>>>
>>>> Another possibility here could be to have an adder which produces two
>>>> outputs, namely both ((A+B)+Cin) and (~(A+B)+(!Cin)), and then using the
>>>> second output if the first came up negative.
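The identity behind that dual-output adder holds in two's complement:
~(A+B) + !Cin equals -((A+B)+Cin), so the second output is exactly the
negated sum. A quick check:

```c
#include <assert.h>
#include <stdint.h>

/* First adder output: the plain sum with carry-in. */
static uint32_t sum_pos(uint32_t a, uint32_t b, uint32_t cin)
{
    return a + b + cin;
}

/* Second output: complement of (A+B) with the carry-in inverted.
   Since ~x == -x - 1, this is -(a+b) - 1 + (1 - cin) == -(a+b+cin). */
static uint32_t sum_neg(uint32_t a, uint32_t b, uint32_t cin)
{
    return ~(a + b) + (cin ^ 1u);
}
```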
>>>>
>>>> ...
