Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Extended double precision decimal floating point
Date: Sun, 24 Apr 2022 13:14:41 -0500
Organization: A noiseless patient Spider
Lines: 281
Message-ID: <t4442l$6ua$1@dont-email.me>
References: <d1790e73-11b1-497d-abd0-a349fedf750cn@googlegroups.com>
<t3tmrd$m06$1@gioia.aioe.org> <t3trjp$t04$1@dont-email.me>
<t3u085$1165$1@gioia.aioe.org> <t3urj9$4ou$1@dont-email.me>
<aeef3da0-5f71-4cba-942e-0b99479e58f1n@googlegroups.com>
<t40o6j$oub$1@dont-email.me> <t40q7n$48b$1@newsreader4.netcologne.de>
<t41kdg$sgn$1@dont-email.me> <t41la7$3hp$1@dont-email.me>
<5d02036c-33d1-4881-b681-be99a64fcfden@googlegroups.com>
<t41pc9$2v7$1@dont-email.me> <t4335e$10ip$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 24 Apr 2022 18:14:45 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="dc17ff8e6d4cf948036495052b0a97c2";
logging-data="7114"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+RCq5OTc8AXToryZWdQIDJ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:60D50eRL3BWc/AO/7ygd/4awa74=
In-Reply-To: <t4335e$10ip$1@gioia.aioe.org>
Content-Language: en-US

On 4/24/2022 3:53 AM, Terje Mathisen wrote:
> BGB wrote:
>> On 4/23/2022 3:25 PM, MitchAlsup wrote:
>>> On Saturday, April 23, 2022 at 2:50:34 PM UTC-5, Ivan Godard wrote:
>>>> On 4/23/2022 12:35 PM, BGB wrote:
>>>>> On 4/23/2022 7:08 AM, Thomas Koenig wrote:
>>>>>> Ivan Godard <iv...@millcomputing.com> schrieb:
>>>>>>
>>>>>>> A (the?) significant use for decimal is Numeric in DBMSs and COBOL.
>>>>>>> That's 33 digits IIRC. The COBOL committee spent quite a lot of time
>>>>>>> with the IEEE-754 committee (I was an IEEE member) before the last
>>>>>>> IEEE
>>>>>>> standard, to make sure that the IEEE decimal would be big enough for
>>>>>>> real uses in commerce: numbers like the world GDP expressed in
>>>>>>> Zimbabwean Dollars to the mil.
>>>>>>
>>>>>> POWER has 16 digits with long DFP, 34 with DFP Extended (128 bits).
>>>>>>
>>>>>> World GDP was around 80,934,771,028,340 (for 2017?) so 14 digits.
>>>>>> In 2009, before one of the re-valuations, a Zimbabwean Dollar
>>>>>> was 300,000,000,000,000 to the US dollar (according to Wikipedia),
>>>>>> 15 digits.  So... it would fit, but just about.
>>>>>
>>>>>
>>>>> I suspect there are reasons why the Decimal crowd was mostly going for
>>>>> 128-bit formats...
>>>>>
>>>>> As for whether or not it makes much sense for a hardware FPU is more
>>>>> debatable.
>>>>>
>>>>> Even for something like DBMS or COBOL, probably only a small part
>>>>> of the
>>>>> total time is likely to be spent on decimal arithmetic, and
>>>>> traditionally DBMS's are mostly IO bound anyways, ...
>>> <
>>>> Commercial work is throughput oriented. When we were defining the IEEE
>>>> decimal revision we got some numbers from Oracle: IIRC, 40% of CPU
>>>> cycles were in decimal emulation routines. Terje - does that match your
>>>> memory?
>>> <
>>> This 40% would vary massively if the base CPU had decimal instructions
>>> (like IBM 360) versus if it had nothing (MIPS) versus if it had only
>>> a trifling
>>> (x86).
>>
>> With a number like this, I would assume it probably means that they
>> are doing all of the decimal stuff by working on it
>> one-digit-at-a-time in C or something...
>>
>>
>> But, yeah, the x86 BCD instructions are, in a way, so limited as to be
>> borderline useless.
>>
>> If one has instructions which can ADD/SUB BCD 16 digits at a time or
>> similar, this should be at least a little more usable.
>>
>> Nevermind if MUL/DIV, Int<->BCD conversion, ... would still be
>> potential bottlenecks.
>>
>> Granted, if one had semi-fast conversion to/from binary integer types,
>> this would make MUL/DIV less of an issue, because then one could reuse
>> the (probably significantly faster) binary integer multiply and divide
>> operations.
>
> You have seen the algorithm I invented 20+ years ago to do really fast
> conversion from binary to BCD or ASCII? Michael S have tweaked it a bit
> more since then: You get the full 32-bit unsigned to 10 ASCII digits in
> less than the time for a single DIV, while moving to 64-bit regs allows
> you do do the same for 64-bit unsigned in maybe 50% more cycles.
>

There are several algorithms I am aware of; I am not sure which one you
have in mind.

Binary to decimal can be done via multiplying by a fixed-point value
which "squeezes" groups of one or more digits above the radix point.

This approach is a little fiddly though (if not scaled or biased
correctly, it will generate garbage output).

So, say, for a pre-scaled value in R4:
MOV 0x028F5C28, R16 // (1<<32)/100
MOV 0x00000063, R17 // magic bias (99)
//above can be reused for multiple iterations
DMULU.L R4, R16, R18 //move digits above decimal point.
ADD R17, R18 //add bias
SHLD.Q R18, -32, R2 //get top 2 digits (2 digit output)
EXTU.L R18, R4 //discard bits above point (for next step)

The value needs to be prescaled to the correct range (and pre-biased),
which is another fixed-point multiply.

In particular, the full value range needs to map onto the 32-bit unit
range, i.e. scaling by ~ (2^32)/1000000000; since this is not an integer
multiple, it needs another fixed-point multiply, bias, and shift.
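As a rough C sketch of this scheme (8 digits, two at a time; the exact
bias and scaling here are one workable choice, not necessarily the
fastest):

```c
#include <stdint.h>

/* Sketch: convert an 8-digit value (0..99999999) to ASCII using
 * fixed-point reciprocal multiplication, two digits per step.
 * No runtime DIV in the loop; the /10 and %10 on the 2-digit pair
 * are by small constants and compile to multiplies. */
static void u8digits_to_ascii(uint32_t n, char out[9])
{
    /* Prescale: map n onto a 32.32 fixed-point fraction of 1e8,
     * with a +1 bias so the truncation error stays positive but
     * too small to disturb any extracted digit pair. */
    uint64_t frac = (((uint64_t)n << 32) / 100000000u) + 1;
    for (int i = 0; i < 4; i++) {
        frac *= 100;                            /* 2 digits move above the point */
        uint32_t pair = (uint32_t)(frac >> 32); /* top 2 digits */
        out[2 * i]     = (char)('0' + pair / 10);
        out[2 * i + 1] = (char)('0' + pair % 10);
        frac &= 0xFFFFFFFFu;                    /* discard the integer part */
    }
    out[8] = 0;
}
```

The bias argument: the initial division leaves an error in (0, 1], which
after four multiplications by 100 grows to at most 1e8/2^32 < 0.03 of one
output digit, so every extracted pair is exact.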

Getting from values in the 00..99 range to BCD could be done with a
lookup table.

The process could be partially unrolled if done as an ASM blob.

Could do the conversion to/from BCD between 2 and 4 digits at a time;
this mostly affects the size of the lookup tables, eg:
2-digit: ~ 154B + 100B
3-digit: ~ 3K + 1K
4-digit: ~ 40K + 10K

Not really sure if there is a faster approach.

There is also the more traditional algorithm (for binary to decimal) of
dividing and taking the modulo by powers of 10, but as noted, this is
slower.

The traditional approach is more stable though, and does not need bias
and prescale steps in order to work.

BCD to binary would be a lookup table followed by multiply-and-add.
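A minimal C sketch of that direction (table layout and group size are
illustrative; a 3- or 4-digit table would trade memory for fewer steps):

```c
#include <stdint.h>

/* Sketch: convert 16-digit packed BCD to binary, two digits at a time,
 * via a 256-entry lookup table followed by multiply-and-add. */
static uint8_t bcd_byte_tab[256]; /* bcd_byte_tab[0x42] == 42, etc. */

static void bcd_tab_init(void)
{
    for (int hi = 0; hi < 10; hi++)
        for (int lo = 0; lo < 10; lo++)
            bcd_byte_tab[(hi << 4) | lo] = (uint8_t)(hi * 10 + lo);
}

static uint64_t bcd64_to_binary(uint64_t bcd)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)  /* most-significant byte first */
        v = v * 100 + bcd_byte_tab[(bcd >> (i * 8)) & 0xFF];
    return v;
}
```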

> Moving on to 128-bit DFP you have to split the mantissa into two or
> three chunks first (div/mod 1e16 or 1e11), then you can convert these in
> parallel.
>

Possibly.

My thinking is that one would likely need to split them into groups of 9
digits.

I guess parallel conversion could work; in my case one would just need
to take care, as DMULU.L is a Lane-1 instruction.

> BTW, it was Michael who suggested that you can do reciprocal
> multiplication to emulate division by powers of ten for numbers that are
> larger than 2^64, i.e. larger than 1e19: All the way up to 27 you can do
> so by first extracting the powers of two with a simple SHRD, then you
> divide by the corresponding factor of 5.
>

Probably depends on ISA.

On BJX2, it would make sense to stick with 32-bit units for this part,
as currently my options are:
currently my options are:
DMULU.L // ~ 3 cycle (32*32->64)
DMULU.Q // 67 cycle (64*64->128)

Though, the latter can also be done in software (via decomposing the
value and using DMULU.L ops). In my testing, it is roughly break-even,
but this seems to be mostly due to function-call related overheads (less
of an issue for a specialized ASM blob). In bare ASM, doing it via
32-bit multiplies is faster.

The cost in this case is more indirect: it lies less with the called
function itself, and more with the impact that performing a function
call has on the generated code in the caller (which is apparently enough
to "eat the delta" in this case).

For divide, the hardware shift-add unit generally beats out the software
shift-subtract loops, but more specialized options (such as
multiply-by-reciprocal) are faster than the DIV instructions.

For integer divide by constant, my compiler will typically turn it into
a multiply-by-reciprocal.
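For a concrete case, this is roughly what such a transform produces for
an unsigned divide by 10 (the magic constant and shift are the standard
ones for 32-bit operands):

```c
#include <stdint.h>

/* Sketch: unsigned x/10 via multiply-by-reciprocal.
 * Multiply by ceil(2^35 / 10) = 0xCCCCCCCD and keep the high bits;
 * valid for all 32-bit x. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```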

As for how to most efficiently split up or combine a 64-bit or 128-bit
number into groups of 9 digits or similar, dunno...

I mostly think doing DIV/MOD by going through binary could be faster,
as by my current estimate, doing 16- or 32-digit long-multiply and
long-divide is likely to be "particularly slow".

Also note that BCD types in BGBCC would likely be (primarily)
runtime-call based.

I had started adding stuff for them, treating them mostly as a subset of
the SIMD vector types.

Where, type-tower:
  SmallInt128:
    int128, uint128
  SmallLong:
    int64 (long), uint64 (ulong)
  SmallInt:
    int, uint
    short, ushort
    char, uchar
    ...
  SmallFloat128:
    SmallLong
    float128
  SmallDouble:
    SmallInt
    double
  SmallFloat:
    float
    binary16 (short float)
  SmallComplexDouble:
    SmallDouble
    '_Complex double'
  SmallComplexFloat:
    SmallFloat
    '_Complex float'
  SmallQuat:
    SmallComplexFloat
    '__quatf' (quaternion)

  SmallM64: (subtypes of "__m64")
    __m64
    __vec2f, __vec4h/__vec4sf, ...
    __bcd64
    ...
  SmallM128:
    __m128
    __vec4f, __vec2d, ...
    __quatf
    __bcd128
    ...

This mostly means that as-is, they will probably not auto-convert
to/from integer types (but manual casts will be supported).

The 'SmallWhatever' predicates mostly reflect types that accept a given
type or a logical subtype, and thus (implicitly) support auto-promotion
along those paths.

The BCD types didn't really make sense as part of the main number tower,
hence being lumped with the SIMD types.

__bcd64 v;
uint64_t li;
v=(__bcd64)li; //value-convert to BCD
li=(uint64_t)v; //likewise

A raw bitwise conversion would be possible by performing casts via __m64
or __m128. In BGBCC, this can also be used to force-convert bit patterns
between floating-point and integer types while keeping the values in
registers (unlike traditional "convert via memory ops" approaches).
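In portable C, the closest equivalent of this bit-pattern conversion is
the memcpy idiom (which compilers also turn into a register move, rather
than actual memory ops):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: reinterpret a double's bit pattern as a 64-bit integer
 * without a value conversion; memcpy is the standard-conforming way. */
static uint64_t double_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof(u));
    return u;
}
```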

Note that one can't perform operations on __m64 or __m128 directly, as
the compiler treats them as opaque (sort of similar to "void *" pointers).

This can be contrasted with MSVC and friends, which treat __m64 and
__m128 like "magic opaque structs" rather than as built-in types (and
thus they don't have any sub-types or cast-conversion support).

Note that BGBCC's SIMD vector system is non-standard, and was (in some
regards) more modeled after GLSL and similar (albeit with slightly
different type-names and notation).

Though, it is rarely used directly (since this would make code
non-portable), so is usually wrapped in macros or similar.

It can also mimic a subset of GCC's vector notation as well.

Despite reusing '__m128' and similar names, it does not actually
implement the "xmmintrin.h" system (and BJX2's SIMD support is different
enough from SSE that doing so wouldn't really make much sense).

....

> Terje
