From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Mixed EGU/EGO floating-point
Date: Sun, 15 May 2022 14:15:59 -0500
Message-ID: <t5rjju$s26$1@dont-email.me>
References: <de735513-d303-42b6-b375-916b89ddafcan@googlegroups.com>
<2a41be98-e9c8-4159-bef7-ba7999e278ecn@googlegroups.com>
<t5p52v$shq$1@dont-email.me>
<30d6cdc7-ee4c-4b55-9ba8-8e7f236cd49fn@googlegroups.com>
<t5qc6a$2q7$1@dont-email.me>
<d8205d83-305d-4612-abae-314b009d00e9n@googlegroups.com>
In-Reply-To: <d8205d83-305d-4612-abae-314b009d00e9n@googlegroups.com>

On 5/15/2022 12:08 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 3:04:30 AM UTC-5, BGB wrote:
>> On 5/14/2022 4:29 PM, MitchAlsup wrote:
>>> On Saturday, May 14, 2022 at 3:57:06 PM UTC-5, BGB wrote:
>>>> On 5/13/2022 9:06 AM, MitchAlsup wrote:
>
>
>>> Infinity has a defined magnitude :: bigger than any IEEE representable number
>>> NaN has a defined non-value :: not comparable to any IEEE representable
>>> number (including even itself).
>>> <
>>> I don't see why you would want Infinity to just be another NaN.
>>> <
>> If either Inf or NaN happens, it usually means "something has gone
>> wrong". Having them as distinct cases adds more special cases that need
>> to be detected and handled by hardware, without contributing much beyond
>> slightly different ways of saying "Well, the math has broken".
> <
> Needing to test for an explicit -0.0 has similar problems.

Possibly, though I have found that if one generates a -0, some software
will misbehave. It is seemingly necessary to force all zero results to
positive zero for software compatibility reasons.

This means one has a few cases, e.g.:
  A or B is NaN (with Inf as a sub-case);
  A or B is Zero.

FADD/FSUB:
  A==NaN || B==NaN:
    Result is NaN
  Else, Normal Case:
    Generally, perform the calculation internally as two's complement (*),
    ~66 bits, so Int64 conversion works.

FMUL:
  A==NaN || B==NaN:
    Result is NaN
  A==Zero || B==Zero:
    Result is Zero
  Else, Normal Case:
    DP: 54*54 ~> 68 bits (discards low-order results).

Assuming here an implementation which lacks an FDIV instruction.

*: One could argue that one's complement would be cheaper here than two's
complement, but the difference in the results is more obvious, and one
ends up with an FPU where frequently (A+B)!=(B+A), whereas (A+B)==(B+A)
is a decidedly "nice to have" property.

Also, if one assumes the ability to express integer values as floating
point values, and the ability to operate on them and still get integer
results, then one is going to need two's complement here.
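
As a concrete reading of the case lists above, below is a minimal C
model of the FMUL dispatch (a sketch only: the names and bit-slicing
are mine, and the normal path is stood in for by a native multiply
rather than a real 54*54 mantissa datapath):

#include <stdint.h>
#include <string.h>

/* Model of the FMUL special cases described above: Inf is folded into
   NaN, any NaN absorbs the result, and a zero (or denormal, under DAZ)
   operand forces +0, never -0. Only the normal case multiplies. */
static uint64_t fmul_model(uint64_t a, uint64_t b)
{
    uint64_t ea = (a >> 52) & 0x7FF;    /* biased exponent of a */
    uint64_t eb = (b >> 52) & 0x7FF;    /* biased exponent of b */

    if (ea == 0x7FF || eb == 0x7FF)
        return 0x7FF8000000000000ULL;   /* Inf/NaN in -> NaN out */
    if (ea == 0 || eb == 0)
        return 0;                       /* DAZ/FTZ: force positive zero */

    /* Normal case: hardware would multiply the mantissas and keep only
       the high-order bits; modeled here with a native multiply. */
    double da, db, dr;
    uint64_t r;
    memcpy(&da, &a, sizeof da);
    memcpy(&db, &b, sizeof db);
    dr = da * db;
    memcpy(&r, &dr, sizeof r);
    return r;
}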

For a 32-bit machine, it might make sense to only perform Binary32 ops
in hardware, and fall back to software emulation for Binary64.

Though, this does lead to some problem cases:
  'math.h' functions;
  'atof' and similar;
  ...

Which are likely to take a pretty big performance hit if the hardware
only does Binary32.

If one implements FP compare with 'EQ' and 'GT' operators, then probably
only EQ needs to deal with NaN:
  A==NaN || B==NaN: EQ always gives False.
  Other cases: EQ is equivalent to integer EQ.

Mostly, because for a single 'GT' operator, there is no real sensible
way to handle NaNs, so it is easier to simply ignore their existence in
this case.

Where, say:
  A==B: EQ(A,B), BT
  A!=B: EQ(A,B), BF
  A> B: GT(A,B), BT
  A<=B: GT(A,B), BF
  A>=B: GT(B,A), BF
  A< B: GT(B,A), BT
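
For illustration, a minimal C model of this two-operator compare scheme
(my own encoding; it assumes zeros have already been forced positive as
above, so bitwise equality works, and it uses the standard
sign-magnitude-to-monotone-key trick for GT, which is not something
specified here):

#include <stdint.h>

static int fp_is_nan(uint64_t a)        /* Inf folded into NaN */
{
    return ((a >> 52) & 0x7FF) == 0x7FF;
}

/* EQ: always false if either input is NaN, else plain integer EQ. */
static int fp_eq(uint64_t a, uint64_t b)
{
    if (fp_is_nan(a) || fp_is_nan(b))
        return 0;
    return a == b;
}

/* GT: map the sign-magnitude bit pattern to a key that is monotone as
   an unsigned integer, then compare. NaNs get no special handling,
   matching the "ignore their existence" choice above. */
static uint64_t fp_key(uint64_t a)
{
    return (a >> 63) ? ~a : (a | 0x8000000000000000ULL);
}

static int fp_gt(uint64_t a, uint64_t b)
{
    return fp_key(a) > fp_key(b);
}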

Granted, one could try to argue for a different interpretation if one
assumes an ISA built around compare-and-branch instructions, or one
built on condition-code branches.

>>
>> In practice, the distinction contributes little in terms of either
>> practical use cases nor is much benefit for debugging.
>>
>> Though, I am still generally in favor of keeping NaN around.
>>>> Define "Denormal As Zero" as canonical;
>>> <
>>> This no longer saves circuitry.
>>> <
>> Denormal numbers are only really "free" if one has an FMA unit rather
>> than separate FADD/FMUL (so, eg, both FMUL and FADD can share the same
>> renormalization logic).
>>
>> Though, the cheaper option here seemingly being to not have FMA, in
>> which case it is no longer free.
> <
> Then you are not compliant with IEEE 754-2008 so why bother with the
> rest of it.

I am not assuming this will match up with IEEE 754-2008; that is not
the point.

This would likely need to be a separate spec, but would still basically
match up with IEEE 754 in terms of floating-point formats and similar.

But, it would be nice to have another spec that is like IEEE 754, but
intended mostly for cheap embedded processors and microcontrollers, with
much looser requirements that are thus easier to hit.

In these cases, the options thus far are typically either:
  A cheap FPU which doesn't fully implement IEEE 754
    (and probably couldn't do so cost-effectively in the first place);
  Falling back to software emulation, which is significantly slower.

But, if one wants any semblance of floating-point performance, relying
on software emulation is basically a no-go.

So, the claim of this spec is not so much that it will match up with
the numerical results of a PC or similar, but that the processor "has
an FPU basically sufficient to do basic FPU tasks".

Contrast would be something like an MSP430 or AVR8 based
microcontroller, where trying to use floating point math is best
summarized as, "No, don't even try".

>>
>> Though, I would assume the particular interpretation of DAZ as FTZ
>> (Flush to Zero) on the results, since the interpretation where "result
>> exponent may be random garbage" results in other (generally worse)
>> issues regarding the semantics.
>>
>>
>> Also technically cheaper to implement FMUL and FADD in a way where most
>> of the low order bits which "fall off the bottom" are effectively
>> discarded from the calculation, because only a relatively limited number
>> of bits below the ULP are likely to have much effect on the rounded result.
> <
> Yes, and FMAC is a bit larger than a FMUL and an FADD. But not hideously so.

If one assumes the need to work with full-width intermediate values (as
opposed to discarding all the low order bits from the result), it can
get a bit much.

One drawback of gluing the units together is that one would end up with
a unit that needs a higher cycle latency than either an FMUL or an FADD
(at least, when these units are implemented with cost-cutting measures).

>>
>> Though, FADD does need a mantissa large enough internally to deal with
>> integer conversion (so, say, 66 bits to deal with Binary64<->Int64
>> conversion).
>>
>> Reusing FADD for conversion makes more sense, since FADD already has
>> most of the logic needed for doing conversions, and this is cheaper than
>> repeating the logic for a dedicated module.
>>
>>
>>
>> For FMUL, given the relatively limited dynamic range of the results (for
>> normalized inputs), the renormalization step is very minimal:
>> Result is 1<=x<2, All is good (do nothing);
>> Result is 2<=x<4, Shift right by 1 bit and add 1 to exponent.
> <
> Only if you FTZ. Otherwise if you process denorms by not "inventing"
> the hidden bit, you have to scan for the hidden bit and shift the
> result accordingly.

Yes, I am assuming DAZ+FTZ semantics here.

A general-purpose normalizer (like one would use in an FADD) is also
fairly slow and expensive.

I suspect a big part of the cost of the FADD is the normalization logic.
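
For illustration, a sketch of that single-shift renormalize for
Binary32-sized mantissas (framing and names are mine): with normalized
inputs the 24x24-bit product lands in [1,4), so one conditional shift
suffices, and bits below the 24-bit result are simply truncated.

#include <stdint.h>

/* prod48: product of two 24-bit mantissas (hidden bits included),
   so the value lies in [1,4) with the binary point below bit 46. */
static uint32_t fmul_renorm(uint64_t prod48, int *exp)
{
    if (prod48 & (1ULL << 47)) {    /* product in [2,4) */
        prod48 >>= 1;               /* shift right by 1 bit... */
        (*exp)++;                   /* ...and add 1 to the exponent */
    }
    /* product now in [1,2): bits 46..23 hold the 24-bit mantissa,
       everything below falls off the bottom (truncated) */
    return (uint32_t)((prod48 >> 23) & 0xFFFFFF);
}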

>>
>> Main expensive part of FMUL being the "multiply the two mantissas
>> together" aspect.
>>>> ...
>>>>
>>>> FPU operations:
>>>>   ADD/SUB/MUL
>>>>   CMP, CONV
>>>>
>>>> Rounding:
>>>>   Mostly Undefined (Rounding modes may be unsupported or ignored)
>>>>   ADD/SUB/MUL are ULP +/- 1.5 or 2 or similar
>>> <
>>> Even the GPUs are migrating towards full IEEE 754 compliance.
> <
>> This is more likely due to GPGPU uses than due to full 754 being
>> particularly useful for graphics processing and similar.
> <
> I was told that GPUs were migrating towards full IEEE so as to reduce
> image "shimmer".

OK.

I haven't really noticed any big issues here.

IMHO, 3D graphics in games basically reached the "good enough" point
roughly 20 years ago, and most "improvement" since then hasn't really
contributed all that much to the overall experience.

The biggest "significant" improvement on this front was probably RTX,
but this still doesn't add enough to convince me to spend the money
needed to go and buy a graphics card which supports it.

E.g., still running a second-hand GTX 980, basically good enough...
Before this, I was running a second-hand GTX 460 for a while.

>>> <
>>>> Conversion to integer is always "truncate towards zero".
>>> <
>>> i = ICEIL( x );
>>> ...
>> There are ways to implement floor/ceil/... that don't depend on having
>> multiple rounding modes in hardware, or multiple float->int conversions.
> <
> When you look at the circuitry required, it is small. So, the best thing is
> to create direct instructions for these things. Given HW than can perform
> <
> i = TRUNC( x );
> <
> adding CEIL, FLOOR, RND; adds only a few percent more gates.

Possibly.

I can note that in my case, my ISA has instructions with explicit
rounding modes. But, a low-cost FPU probably shouldn't require them.

Though, the rounding modes typically only work with the low-order bits
and will not round if this would require a significant carry propagation.

>>
>> For example, it is frequently useful to implement float->int conversion
>> in a way that rounds towards negative infinity, but usually this is
>> handled by doing something like, say:
>> long floor_to_long(double x)
>> {
>>     if(x<0)
>>     {
>>         return(-(long)((-x)+0.999999999999));
>>     }
>>     return((long)x);
>> }
>>
>> Or similar...
>>>>
>>>> Could still require that, for the same inputs, the operators will still
>>>> produce the same output each time.
>>>>
>>>> Divide and Square-Root are software, and "somewhere in generally the
>>>> right area" is regarded as sufficient.
>>> <
>>> Unlikely to be accepted by the market.
>> This likely depends on what the processor can pull off effectively.
>>
>>
>> I have yet to figure out a "good" and "cheap" way to do FDIV and FSQRT
>> moderately quickly in hardware.
>>
>> So, doing it in software is still faster in my case.
>>
>>
>> And, in software, one can use versions which cut back on the number of
>> N-R stages. Say, for example, one finds that for a given calculation,
>> two N-R stages is sufficient (and we don't want to spend the cycles to
>> converge it all the way to the ULP).
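
For illustration, a software reciprocal along those lines, with the N-R
stage count as a parameter (a sketch only: the magic-constant initial
guess is an assumed choice, and the code assumes a positive, normal
input; each N-R stage roughly squares the relative error):

#include <stdint.h>
#include <string.h>

static float nr_recip(float d, int stages)
{
    uint32_t u;
    float x;
    memcpy(&u, &d, sizeof u);
    u = 0x7EF311C3u - u;            /* crude bit-trick guess at 1/d */
    memcpy(&x, &u, sizeof x);
    while (stages-- > 0)
        x = x * (2.0f - d * x);     /* N-R step: x' = x*(2 - d*x) */
    return x;
}

/* a/b to "somewhere in generally the right area": two N-R stages */
static float fdiv_approx(float a, float b)
{
    return a * nr_recip(b, 2);
}
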
>>>>
>>>>
>>>> Or, basically, the cheapest FPU possible which is still sufficient as to
>>>> be basically usable.
>>>>
>>> Will end up with the same "avid" following as FIAT here in USA.
>> Dunno.
>>>>
>>>> Possible optional additions:
>>>>   Denormalized formats, which lack a hidden bit
>>>>     More like the x87 long-double format;
>>>>   Intermediate precision formats, such as:
>>>>     Binary48 (S.E11.F36)
>>>>     Binary24 (S.E8.F15).
>>>>   If stored in a 32 or 64 bit container:
>>>>     Will use the same format as Binary32 or Binary64
>>>>     Will ignore the low order bits.
>>> <
>>> It seems to me that this is not a job of CPU architects, but a job
>>> for people who want to use quality FP implementations.
>> These options would be like faster or cheaper alternatives for the full
>> width versions.
>>
>> Though, trying to pass them off as their full-width siblings is unlikely
>> to go unnoticed.
>>
>>
>> But, for example, semantically-truncating Binary32 to 24 bits could be
>> useful for SIMD in cases where Binary16 is insufficient, but where full
>> Binary32 precision isn't needed, in cases where the truncated form could
>> be handled in fewer clock cycles.
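
As a concrete example of that truncated form, under the in-container
layout from the list above, "converting" Binary32 to Binary24 is just a
mask:

#include <stdint.h>

/* Binary24 (S.E8.F15) in a 32-bit container reuses the Binary32 layout
   (S.E8.F23) and ignores the low 8 fraction bits. */
static uint32_t to_binary24(uint32_t f32bits)
{
    return f32bits & 0xFFFFFF00u;
}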

Add:
The idea for an x87-like format was, one could implement an FPU where
the native format is non-normalized (like in x87), and then require an
explicit re-normalization step before converting to IEEE formats.

This is potentially a double-edged sword though, as it merely moves the
cost from one place to another, and would add a lot more instructions
when working with floating-point values (vs. keeping them in the IEEE
formats).

In retrospect, probably not such a great idea...
