Re: Configurable rounding modes (was The value of floating-point exceptions?)

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Configurable rounding modes (was The value of floating-point
exceptions?)
Date: Tue, 27 Jul 2021 13:28:56 -0500
Message-ID: <sdpja0$9mn$1@dont-email.me>
In-Reply-To: <sdol45$69t$1@dont-email.me>

On 7/27/2021 4:54 AM, Marcus wrote:
> On 2021-07-26 18:19, MitchAlsup wrote:
>> On Monday, July 26, 2021 at 2:27:54 AM UTC-5, Marcus wrote:
>>> On 2021-07-25 19:22, MitchAlsup wrote:
>>>> On Sunday, July 25, 2021 at 11:14:08 AM UTC-5, BGB wrote:
>>>>> On 7/25/2021 6:05 AM, Terje Mathisen wrote:
>>>>>> MitchAlsup wrote:
>>>>>>> Having watched this from inside:
>>>>>>> a) HW designers know a lot more about this today than in 1980
>>>>>>> b) even systems that started out as IEEE-format gradually went
>>>>>>> closer and closer to full IEEE-compliant (GPUs) until there is no
>>>>>>> useful difference in the quality of the arithmetic.
>>>>>>> c) once 754-2009 came out the overhead to do denorms went to
>>>>>>> zero, and there is no reason to avoid full speed denorms in
>>>>>>> practice.
>>>>>>> (BGB's small FPGA prototyping environment aside.)
>>>>>>
>>>>>> I agree.
>>>>>>
>>>>>>> d) HW designers have learned how to perform all of the rounding
>>>>>>> modes at no overhead compared to RNE.
>>>>>>
>>>>>> This is actually dead easy since all the other modes are easier than
>>>>>> RNE: As soon as you have all four bits required for RNE (i.e.
>>>>>> sign/ulp/guard/sticky) then the remaining rounding modes only need
>>>>>> various subsets of these, so you use the rounding mode to route
>>>>>> one of 5
>>>>>> or 6 possible 16-entry one-bit lookup tables into the rounding
>>>>>> circuit
>>>>>> where it becomes the input to be added into the ulp position of the
>>>>>> final packed (sign/exp/mantissa) fp result.
>>>>>>
>>>>> Oddly enough, the extra cost to rounding itself is not the main issue
>>>>> with multiple rounding modes, but more the question of how the bits
>>>>> get
>>>>> there (if one doesn't already have an FPU status register or similar).
>>>>>
>>>>> Granted, could in theory put these bits in SR or similar, but, yeah...
>>>>>
>>>>> It would be better IMO if it were part of the instruction, but there
>>>>> isn't really any good / non-annoying way to encode this.
>>>> <
>>>> And this is why they are put in control/status registers.
>>>> <
>>> There are several problems with this, but the *main* problem is that the
>>> rounding mode setting becomes a super-global variable.
>> <
>> Given that they cannot be in a GPR or an FPR, Oh wise one,
>> Where would you put them ?
>> <
>
> At the same place that we specify the floating-point precision: in the
> instruction.
>

FWIW:
SH-4 also put the FPU precision/format in FPSCR. Effectively, one had to
know the state of this register before the FPU instructions could even be
decoded correctly.

This kinda sucked, and for BJX2, I ended up going in the opposite
direction of not having any FPSCR at all.

>>>
>>> If one subroutine
>>> touches the register, it affects all code in the same thread. And it's
>>> "super-global" because it crosses source code and language barriers
>> <
>> But does not cross thread or task boundaries. So it's not "like memory"
>> either.
>> <
>>> (e.g. consider a program written in Go that has a Python scripting back
>>> end that calls out to a DLL that is written in C++ that changes the
>>> floating-point rounding mode...).
>>>
>>> As I have accounted for elsewhere, this is a real problem. As a SW
>>> developer I therefore prefer to work with architectures that do not have
>>> an FPU control register - /even/ if that means that I can only use RNE
>>> (because let's face it - that's the only rounding mode I'm ever going
>>> to use anyway).
>> <
>> So you discount libraries that may contain interval arithmetic or
>> properly
>> rounded transcendentals ?
>
> You might have misunderstood my point. As a software developer, if I
> have a choice between:
>
> 1. Rounding mode is configured in a thread global register.
> 2. Rounding mode is always RNE.
>
> ...then I pick 2, because that gives me 100% predictable results,
> whereas with 1 all bets are off (read my other examples as to why this
> is the case).
>
> *However*, if there is also the option:
>
> 3. Rounding mode is part of the instruction.
>
> ...then I pick 3.
>
>>>
>>> Having the rounding modes as part of the instruction is *much* better
>>> from a SW developer's perspective - but that blows up opcode space for
>>> a feature that is never used.
>> <
>> And there are cases where this philosophy would blow up code space
>> when they were used. Does the compiler have to generate code for
>> foo() in RNE, and RTPI and RTNI, and RTZ, and RTMAG modes?
>
> Does the compiler have to generate code for foo() in binary16, binary32,
> binary64 and binary128? No. The developer selects the precision and
> rounding mode(s). The developer knows what rounding mode to use for a
> particular algorithm.
>

Agreed.

Though, one could imagine a situation where ABI-level register handling
gets messed up (say, because the official ABI and GCC disagreed as to
whose responsibility it is to save/restore the register), and the
decoded FPU instructions effectively get turned into confetti.

>>>
>>> Perhaps your prefix instruction paradigm could be used for this (e.g.
>>> like "CARRY")?
>> <
>> If I wanted 'per instruction RMs' this would be how I did it. A
>> CARRY-like
>> RM-instruction-modifier could cast RM over 5-ish subsequent instructions.
>
> Sounds reasonable.
>
>>>
>>>> < Probably the
>>>>> "least awful" would probably be to use an Op64 encoding, which then
>>>>> uses
>>>>> some of the Immed extension bits to encode a rounding mode.
>>>> <
>>>> The argument against having them in instructions is that this prevents
>>>> someone from running the code several times with different rounding
>>>> modes set to detect any sensitivity to the actually chosen rounding
>>>> mode.
>>>> Kahan said he uses this a lot.
>>>
>>> ......let me question the usefulness of that. If I were to do such
>>> experiments I would just compile the algorithm with different rounding
>>> modes set explicitly (e.g. as a C++ template argument or something).
>> <
>> So you would end up with 5 copies of FPPPP, one for each rounding mode
>> (at the cost of 8K×sizeof(inst) per rounding mode = 32KB/mode = 165KB)
>> ??!!?
>
> The use case given here sounds like it has more to do with experimenting
> with different settings than to actually generate production code. As
> such it would be easy to just re-compile the source code with different
> rounding mode settings. I occasionally do this when experimenting with
> different floating-point precisions (usually single precision vs
> double-precision), e.g. using a C DEFINE.
>

The bigger question is what is the best way to tell the C compiler to use
a non-default rounding mode.

I am torn between either a pragma or function-scope attributes.
There is also a possible choice here between:
* __declspec(whatever)
And:
* [[whatever]]

Though, this is more aesthetic than technical, and in BGBCC they are
more or less interchangeable:
* [[dllexport]] void Foo() { ... }
Which technically works I guess...

Then again, it is sort of a similar issue to how, with the recent addition
of lambdas to BGBCC, there are now several semi-redundant notations, e.g.:
__var():type { ... }
__function[...](...):type { ... }
[...](...)->type { ... }
...

Though, there are some minor differences:
__function allows using a capture list;
Both __var and __function allow lambdas with an unbounded lifespan.

The C23 proposal's lambdas seem to be automatic-lifetime only, and one
doesn't want to use an unbounded lifespan where automatic is specified, as
this would result in a memory leak (whereas __var and __function default
to unbounded lifetime, with an option to specify automatic lifetime via
__var! or __function!, which is similar to their behavior in BS2).

The capture list is not particularly useful IMO, though it does allow
doing things like: foo = [x=expr1, y=expr2]()->int { return x+y; }
Though, due to limitations in the current implementation, these captured
values end up falling back to "variant" (during the stage where lambdas
are folded out into their own structures and functions, the types are not
yet in a form which allows type-inferring the result of an expression).

....

>>>
>>>>>
>>>>>
>>>>> * FFw0_00ii_F0nm_5eo8 FADD Rm, Ro, Rn, Imm8
>>>>> * FFw0_00ii_F0nm_5eo9 FSUB Rm, Ro, Rn, Imm8
>>>>> * FFw0_00ii_F0nm_5eoA FMUL Rm, Ro, Rn, Imm8
>>>>>
>>>>> Where the Imm8 field encodes the rounding mode, say:
>>>>> 00 = Round to Nearest.
>>>>> 01 = Truncate.
>>>>>

Side note: The above is now more or less what I decided to go and
implement, though with a few more rounding modes.

>>>>> Or could go the SR route, but I don't want FPU behavior to depend
>>>>> on SR.
>>>> <
>>>> When one has multi-threading and control/status register, one simply
>>>> reads the RM field and delivers it to the FU as an operand. A couple
>>>> of interlock checks means you don't really have to stall the pipeline
>>>> because these modes don't change all that often.
>>>> <
>>>
>>> Again, that's the HW perspective. But from a SW perspective you *don't*
>>> want RM to be part of a configuration register.
>>>
>>> Global variables are bad. This was well established some 50 years ago
>>> (e.g. Wulf, W.A., Shaw, M., "Global Variables Considered Harmful" from
>>> 1973). Global configuration registers even more so.
>> <
>> So Root pointers are now considered Harmful ?!?

I think it is more an issue of having actively changing state in global
variables, or global state with a non-local area of effect.

>>>
>>>>>> Since the hidden bit is already hidden at this point, any rounding
>>>>>> overflow of the mantissa from 0xfff.. to 0x000.. will cause the
>>>>>> exponent
>>>>>> term to be incremented, possibly all the way to Inf. In all cases,
>>>>>> this
>>>>>> is the exactly correct behaviour.
>>>>>>
>>>>> Yep.
>>>>>
>>>>> Main limiting factor though is that for bigger formats (Double or
>>>>> FP96),
>>>>> propagating the carry that far can be an issue.
>>>> <
>>>> Kogge-Stone adders!
>>>>>
>>>>> In the vast majority of cases, the carry gets absorbed within the
>>>>> low 8
>>>>> or 16 bits or so (or if it doesn't, leave these bits as-is).
>>>>>
>>>>> For narrowing conversions to Binary16 or Binary32, full width rounding
>>>>> is both easier and more useful.
>>>>>
>>>>>
>>>>>
>>>>> For FADD/FSUB, the vast majority of cases where a very long stream of
>>>>> 1's would have occurred can be avoided by doing the math internally in
>>>>> twos complement form.
>>>>>
>>>>> Though, in this case, one can save a little cost by implementing the
>>>>> "twos complement" as essentially ones' complement with a carry bit
>>>>> input
>>>>> to the adder (one can't arrive at a case where both inputs are
>>>>> negative
>>>>> with FADD).
>>>> <
>>>> This is a standard trick that everyone should know--I first saw it
>>>> in the
>>>> PDP-8 in the Complement and increment instruction--but it has come in
>>>> handy several times and is the way operands are negated and
>>>> complemented
>>>> in My 66000. The operand is conditionally complemented with a carry in
>>>> conditionally asserted. If the operand being processed is integer,
>>>> there is an adder that deals with the carry in. If the operand is
>>>> logical, there is no adder and the carry in is ignored.
>>>>>
>>>>>
>>>>> Cases can occur though where the result mantissa comes up negative
>>>>> though, which can itself require a sign inversion. The only
>>>>> alternative
>>>>> is to compare mantissa input values by value if the exponents are
>>>>> equal,
>>>>> which is also fairly expensive.
>>>>>
>>>>> Though, potentially one could use the rounding step to "absorb"
>>>>> part of
>>>>> the cost of the second sign inversion.
>>>>>
>>>>> Another possibility here could be to have an adder which produces two
>>>>> outputs, namely both ((A+B)+Cin) and (~(A+B)+(!Cin)), and then
>>>>> using the
>>>>> second output if the first came up negative.
>>>>>
>>>>> ...
>
