novaBBS - comp.arch - Improved routines for gcc/gfortran quadmath arithmetic

This thread is for the attention of Thomas Koenig.

A few months ago I spent significant time coding replacement routines for
__addtf3 and __multf3.
Back then I thought that they were part of the quadmath library.
The quadmath library appears to be a relatively low-profile project with
little bureaucracy involved, so I thought that integration of my replacement
would be easy.
Later on (in June) I learned that these routines are in fact part of glibc.
That reduced my enthusiasm. My impression is that the glibc project is more
bureaucratic (a good thing, generally, but not for me, as a hobbyist who wants his
work integrated) and that glibc developers have strong territorial instincts.
I could be wrong about both points, but that was my impression.
Simultaneously, I had more "real work" to do, which left little time. Often I came
home tired. And when I don't do this sort of hobby thing continuously,
night after night, I tend to lose focus and interest. In short, I almost forgot
about the whole thing. Luckily, by then both routines were already in MUCH better shape
than the originals. I mean, in terms of speed.
Today and tomorrow I happen to have time, so I remembered it and stored the routines
and tests in my public GitHub repository.
https://github.com/already5chosen/extfloat/tree/master/binary128

Thomas, please look at this code and tell me if there is a chance to integrate it.
Preferably in glibc, but I am not optimistic about it.
If not in glibc then, maybe, in more specialized places, like the matrix primitives of
gfortran.
BTW, for later I have potentially more interesting routines - multiplication of
an array by a scalar and addition of two vectors. Due to the reduction in parsing and
call overhead they are tens of percent faster than __multf3/__addtf3 called in loops.
Hopefully, I will add them to the repository later tonight or tomorrow.
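
For reference, the baseline that such array routines would be measured against looks roughly like the sketch below: plain loops in which gcc turns every `+` and `*` on `__float128` into a separate library call. The function names `vec_add_f128` and `vec_scale_f128` are made up for illustration and are not taken from the repository.

```c
#include <stddef.h>

/* dst[i] = a[i] + b[i]; on targets without binary128 hardware each
   iteration becomes one call to __addtf3. */
void vec_add_f128(__float128 *dst, const __float128 *a,
                  const __float128 *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* dst[i] = a[i] * s; each iteration becomes one call to __multf3. */
void vec_scale_f128(__float128 *dst, const __float128 *a,
                    __float128 s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * s;
}
```

A dedicated array routine could, for example, unpack the scalar operand once instead of once per element and avoid the per-element call, which is presumably where the savings described above come from.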

Michael S <already5chosen@yahoo.com> writes:
>This thread is for the attention of Thomas Koenig.
>
>A few months ago I spent significant time coding replacement routines for
>__addtf3 and __multf3.
>Back then I thought that they were part of the quadmath library.
>The quadmath library appears to be a relatively low-profile project with
>little bureaucracy involved, so I thought that integration of my replacement
>would be easy.
>Later on (in June) I learned that these routines are in fact part of glibc.

The names look like they belong to libgcc (part of gcc), not glibc,
and indeed, on a Debian 11 system

objdump -d /lib/x86_64-linux-gnu/libgcc_s.so.1|grep addtf3

shows code for __addtf3, while

objdump -d /lib/x86_64-linux-gnu/libc-2.31.so| grep addtf3

comes up blank (apparently not even a call to __addtf3 there).
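
A complementary check from the C side, as a rough sketch for gcc on x86-64: the `+` operator on `__float128` is lowered to a call to `__addtf3`, so calling the libgcc symbol directly should give the same result. The prototype below is an assumption written out by hand (there is no public header for it), and the wrapper names are invented for illustration.

```c
/* Hand-written prototype for the libgcc soft-float helper (assumed:
   x86-64 gcc, where __float128 is the 128-bit binary128 type). */
extern __float128 __addtf3(__float128 a, __float128 b);

/* Addition through the operator; gcc emits a call to __addtf3 here. */
__float128 add_via_operator(__float128 a, __float128 b)
{
    return a + b;
}

/* The same addition, calling the libgcc routine explicitly. */
__float128 add_via_libgcc(__float128 a, __float128 b)
{
    return __addtf3(a, b);
}
```

Compiling this with `gcc -O2 -S` and searching the assembly output for `__addtf3` should show the symbol referenced from both functions.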

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Improved routines for gcc/gfortran quadmath arithmetic

<0786f832-4a38-453b-aa84-1c634d25cde5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=29891&group=comp.arch#29891

by: Michael S - Thu, 29 Dec 2022 10:22 UTC

On Thursday, December 29, 2022 at 11:08:42 AM UTC+2, Anton Ertl wrote:
> The names look like they belong to libgcc (part of gcc), not glibc,
> and indeed, on a Debian 11 system
>
> objdump -d /lib/x86_64-linux-gnu/libgcc_s.so.1|grep addtf3
>
> shows code for __addtf3, while
>
> objdump -d /lib/x86_64-linux-gnu/libc-2.31.so| grep addtf3
>
> comes up blank (apparently not even a call to __addtf3 there).

You are right, I was confused by the similar library names.
It does not change my point - it's still a much higher-profile project than libquadmath.

Michael S <already5chosen@yahoo.com> schrieb:
> This thread is for the attention of Thomas Koenig.
>
> A few months ago I spent significant time coding replacement routines for
> __addtf3 and __multf3.
> Back then I thought that they were part of the quadmath library.
> The quadmath library appears to be a relatively low-profile project with
> little bureaucracy involved, so I thought that integration of my replacement
> would be easy.
> Later on (in June) I learned that these routines are in fact part of glibc.
> That reduced my enthusiasm. My impression is that the glibc project is more
> bureaucratic (a good thing, generally, but not for me, as a hobbyist who wants his
> work integrated) and that glibc developers have strong territorial instincts.

I cannot really speak to that, I hardly know the glibc people. (I
probably ran across them at a GNU Cauldron, but I don't remember talking
to them).

The current status seems to be that these functions are part of libgcc,
and that they are copied over from glibc.

> I could be wrong about both points, but that was my impression.
> Simultaneously, I had more "real work" to do that left little time.

I know that only too well (I currently have my own project, outside of
work, which keeps me quite busy).

> Often I came
> home tired. And when I don't do this sort of hobby thing continuously,
> night after night, I tend to lose focus and interest. In short, I almost forgot
> about the whole thing. Luckily, by then both routines were already in MUCH better shape
> than the originals. I mean, in terms of speed.
> Today and tomorrow I happen to have time, so I remembered it and stored the routines
> and tests in my public GitHub repository.
> https://github.com/already5chosen/extfloat/tree/master/binary128
>
> Thomas, please look at this code and tell me if there is a chance to integrate it.

I will test this for a bit. A next step would be a discussion on
the gcc mailing list. A possible scenario would be to create a
branch, and to merge that into trunk once gcc14 development starts.

The code may require some adjustment for other platforms, or for
special conditions in libgcc. For example, I am not sure if using
x86 intrinsics in libgcc would cause problems, or if using uint64_t
works, or if using __float128 is correct or it should be replaced by
TFMode.

Anyway, I'll drop you an e-mail in the near future.

> Preferably in glibc, but I am not optimistic about it.
> If not in glibc then, maybe, in more specialized places, like the matrix primitives of
> gfortran.
> BTW, for later I have potentially more interesting routines - multiplication of
> an array by a scalar and addition of two vectors. Due to the reduction in parsing and
> call overhead they are tens of percent faster than __multf3/__addtf3 called in loops.
> Hopefully, I will add them to the repository later tonight or tomorrow.

That also sounds interesting :-)

Re: Improved routines for gcc/gfortran quadmath arithmetic

<4a1b157b-7ba7-4aff-b227-3db762074c5cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=29893&group=comp.arch#29893

by: Michael S - Thu, 29 Dec 2022 11:56 UTC

On Thursday, December 29, 2022 at 12:36:48 PM UTC+2, Thomas Koenig wrote:
> The code may require some adjustment for other platforms, or for
> special conditions in libgcc. For example, I am not sure if using
> x86 intrinsics in libgcc would cause problems,

All uses of x86 intrinsic functions are guarded with #ifdef __amd64.
They are there in order to fight stupidities of the gcc compiler, most commonly
its hyperactive SLP vectorizer.
The code is 100% correct without these parts, but non-trivially slower when
compiled with gcc12.
Making them unneeded in gcc 13 or 14 sounds like a worthy goal for gcc,
but I am not optimistic. Until now the trend has been the opposite - the SLP vectorizer
only gets more aggressive.

> or if using uint64_t works,

My code certainly requires both uint64_t and __int128.
But it appears to be almost o.k. since we don't want to support all gcc platforms.
All we are interested in are platforms that support binary128 FP.
By now those are:
x86-64,
MIPS64,
aarch64,
rv64gc,
rv32gc,
s390x,
power64le,
power64
The first four are o.k.
The last two (or three?) have hardware support, so they don't need a software implementation.
That leaves the one odd case of rv32gc. I don't know what to do about it.
IMHO, the wisest course is to drop support for binary128 on this platform.
I can't imagine that it is actually used by anybody.
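
To illustrate why `unsigned __int128` is so convenient here: multiplying two 113-bit significands is built from 64x64->128-bit partial products, and with `__int128` each of them is a single expression. This is only a sketch with a made-up helper name (`mul_64x64`), not code from the repository.

```c
#include <stdint.h>

#ifndef __SIZEOF_INT128__
#error "this sketch assumes a compiler and target with __int128"
#endif

/* Full 64x64 -> 128-bit product: returns the low 64 bits and stores
   the high 64 bits through hi. */
static uint64_t mul_64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
```

On a 32-bit target such as rv32gc there is no `__int128`, so the same product would have to be assembled from four 32x32->64-bit pieces.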

> or if using __float128 is correct or it should be replaced by
> TFMode.

I don't know what TFMode means. Ideally, it should be _Float128.
I used __float128 because I wanted to compile with clang, which does not
understand _Float128. That is not a good reason, so it probably has to change.

But the real reason why I don't believe that my code can be integrated into
glibc or even into libgcc is different: I don't support rounding modes.
That is, I always round to nearest with ties rounded to even.
I suppose that it's what nearly all users want, but glibc has a different idea.
They think that the binary128 rounding mode should be the same as the current
rounding mode for binary64/binary32. They say it's required by IEEE-754.
They could even be correct about it, but it just shows that IEEE-754 is not perfect.
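
As a small illustration of the rounding rule mentioned above (round to nearest, ties to even), here is how it looks when discarding the low bits of an integer significand. The helper name and the 64-bit width are assumptions made for brevity; the real routines work on 113-bit significands.

```c
#include <stdint.h>

/* Shift x right by n bits (1 <= n <= 63), rounding to nearest and
   resolving exact ties toward the even result. */
static uint64_t shr_round_nearest_even(uint64_t x, unsigned n)
{
    uint64_t q    = x >> n;                        /* truncated result   */
    uint64_t rem  = x & ((UINT64_C(1) << n) - 1);  /* discarded low bits */
    uint64_t half = UINT64_C(1) << (n - 1);        /* exactly one half   */

    if (rem > half || (rem == half && (q & 1)))
        q += 1;                                    /* round up           */
    return q;
}
```

Supporting the directed modes would mean replacing the single condition with one that depends on the current mode and on the sign of the result.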

Support for other rounding modes can be added to my routines, but doing it in
a platform-independent manner will slow things down significantly, and doing it
in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason)
and it would still be somewhat slower than the current code.

That's why I am more hopeful for integration into gfortran and esp. into gfortran's
matrix/vector libraries.
My impression is that these places don't care about quadmath non-default rounding modes.

>
> Anyway, I'll drop you an e-mail in the near future.

Normally, I don't check this e-mail address regularly.
I can, of course, but why not keep the conversation public? It could have other benefits.


Michael S <already5chosen@yahoo.com> schrieb:
> On Thursday, December 29, 2022 at 12:36:48 PM UTC+2, Thomas Koenig wrote:
>> I will test this for a bit.

OK, I plugged your code into libgcc/soft-fp/addtf3.c and
libgcc/soft-fp/multf3.c, and, on the first try, hit an error:

In file included from /home/tkoenig/trunk-bin/gcc/include/immintrin.h:39,
from /home/tkoenig/trunk-bin/gcc/include/x86intrin.h:32,
from ../../../trunk/libgcc/soft-fp/addtf3.c:4:
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h: In function 'f128_to_u128':
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
455 | _mm_extract_epi64 (__m128i __X, const int __N)
| ^~~~~~~~~~~~~~~~~
../../../trunk/libgcc/soft-fp/addtf3.c:70:17: note: called from here
70 | uint64_t hi = _mm_extract_epi64(v, 1);
| ^~~~~~~~~~~~~~~~~~~~~~~
/home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
455 | _mm_extract_epi64 (__m128i __X, const int __N)
| ^~~~~~~~~~~~~~~~~
../../../trunk/libgcc/soft-fp/addtf3.c:69:17: note: called from here
69 | uint64_t lo = _mm_extract_epi64(v, 0);
| ^~~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [../../../trunk/libgcc/shared-object.mk:14: addtf3.o] Error 1

(I'm not quite sure what "target specific option mismatch" actually
means in this context). This kind of thing is not unexpected,
because the build environment inside gcc is different from normal
userland. So, some debugging would be needed to bring this into libgcc.

>>A next step would be a discussion on
>> the gcc mailing list. A possible scenario would be to create a
>> branch, and to merge that into trunk once gcc14 development starts.
>>
>> The code may require some adjustment for other platforms, or for
>> special conditions in libgcc. For example, I am not sure if using
>> x86 intrinsics in libgcc would cause problems,
>
> All uses of x86 intrinsic functions are guarded with #ifdef __amd64.
> They are here in order to fight stupidities of gcc compiler, most commonly
> its hyperactive SLP vectorizer.
> The code is 100% correct without these parts, but non-trivially slower when
> compiled with gcc12.
> Making them unneeded in gcc 13 or 14 sounds like a worthy goal for gcc,
> but I am not optimistic. Until now the trend has been the opposite - the SLP vectorizer
> only gets more aggressive.
>
>> or if using uint64_t works,
>
> My code certainly requires both uint64_t and __int128.
> But it appears to be almost o.k. since we don't want to support all gcc platforms.
> All we are interested in are platforms that support binary128 FP.
> By now those are:
> x86-64,
> MIPS64,
> aarch64,
> rv64gc,
> rv32gc,
> s390x,
> power64le,
> power64
> The first four are o.k.
> The last two (or three?) have hardware support, so they don't need a software implementation.
> That leaves the one odd case of rv32gc. I don't know what to do about it.
> IMHO, the wisest course is to drop support for binary128 on this platform.
> I can't imagine that it is actually used by anybody.

Looking at it through gfortran's glasses, we do not support
REAL(KIND=16) on 32-bit systems, so that would be OK.

>
>> or if using __float128 is correct or it should be replaced by
>> TFMode.
>
> I don't know what TFMode means.

TFMode is the name for 128-bit reals inside gcc (see
https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html ). If you
look at libgcc, you will find it is actually implemented as a struct
of two longs. That does not mean it absolutely has to be that way.
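
For illustration, a type with this machine mode can be declared from C through gcc's `mode` attribute. The sketch assumes gcc on a target such as x86-64 where the TF mode is available; the names `tf_sketch_t` and `tf_add` are invented here.

```c
/* A 128-bit float declared by machine mode rather than by type name. */
typedef float tf_sketch_t __attribute__((mode(TF)));

_Static_assert(sizeof(tf_sketch_t) == 16, "TF mode is expected to be 16 bytes");

/* Arithmetic on the type goes through the same soft-float routines. */
static tf_sketch_t tf_add(tf_sketch_t a, tf_sketch_t b)
{
    return a + b;
}
```

Declaring the type this way sidesteps the question of whether the compiler spells the type name `__float128` or `_Float128`.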

>Ideally, it should be _Float128.
> I used __float128 because I wanted to compile with clang, which does not
> understand _Float128. That is not a good reason, so it probably has to change.

> But the real reason why I don't believe that my code can be integrated into
> glibc or even into libgcc is different: I don't support rounding modes.
> That is, I always round to nearest with ties rounded to even.
> I suppose that it's what nearly all users want, but glibc has a different idea.
> They think that the binary128 rounding mode should be the same as the current
> rounding mode for binary64/binary32. They say it's required by IEEE-754.
> They could even be correct about it, but it just shows that IEEE-754 is not perfect.

I understand them being sticklers for accuracy (Terje? :-)

However, this does not mean that your code cannot be included. It would
be possible to have the traditional function as a fallback in the (rare)
case where people want something else - a conditional jump which would
be predicted 99.99.... % accurately in a tight loop.

> Support for other rounding modes can be added to my routines, but doing it in
> a platform-independent manner will slow things down significantly, and doing it
> in a platform-dependent manner will reduce portability (IMHO, *not* for a good reason)
> and it would still be somewhat slower than the current code.
>
> That's why I am more hopeful for integration into gfortran and esp. into gfortran's
> matrix/vector libraries.
> My impression is that these places don't care about quadmath non-default rounding modes.

Yes. Fortran, specifically, restores IEEE rounding modes to default
on procedure entry.

I would still prefer to get your code into libgcc, though, because it
would then be useful for every place where somebody uses 128-bit
in a program.

Have you ever compiled gcc? I often use the gcc compile farm at
https://cfarm.tetaneutral.net/ , it has some rather beefy machines
which cut down on bootstrap time (but it will still take a couple
of hours). You can request an account at the compile farm.

>>
>> Anyway, I'll drop you an e-mail in the near future.
>
> Normally, I don't check this e-mail address regularly.
> I can, of course, but why not keep conversation public? It could have other benefits.

We will probably have to move over to gcc@gcc.gnu.org eventually.

[...]

Re: Improved routines for gcc/gfortran quadmath arithmetic

<e12551aa-41b3-424b-ab5f-8c2af6a2163en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=29897&group=comp.arch#29897

by: Michael S - Thu, 29 Dec 2022 20:02 UTC

On Thursday, December 29, 2022 at 8:07:45 PM UTC+2, Thomas Koenig wrote:
> >> I will test this for a bit.
> OK, I plugged your code into libgcc/soft-fp/addtf3.c and
> libgcc/soft-fp/multf3.c, and, on the first try, hit an error:
>
> In file included from /home/tkoenig/trunk-bin/gcc/include/immintrin.h:39,
> from /home/tkoenig/trunk-bin/gcc/include/x86intrin.h:32,
> from ../../../trunk/libgcc/soft-fp/addtf3.c:4:
> /home/tkoenig/trunk-bin/gcc/include/smmintrin.h: In function 'f128_to_u128':
> /home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
> 455 | _mm_extract_epi64 (__m128i __X, const int __N)
> | ^~~~~~~~~~~~~~~~~
> ../../../trunk/libgcc/soft-fp/addtf3.c:70:17: note: called from here
> 70 | uint64_t hi = _mm_extract_epi64(v, 1);
> | ^~~~~~~~~~~~~~~~~~~~~~~
> /home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
> 455 | _mm_extract_epi64 (__m128i __X, const int __N)
> | ^~~~~~~~~~~~~~~~~
> ../../../trunk/libgcc/soft-fp/addtf3.c:69:17: note: called from here
> 69 | uint64_t lo = _mm_extract_epi64(v, 0);
> | ^~~~~~~~~~~~~~~~~~~~~~~
> make[3]: *** [../../../trunk/libgcc/shared-object.mk:14: addtf3.o] Error 1
>
> (I'm not quite sure what "target specific option mismatch" actually
> means in this context). This kind of thing is not unexpected,
> because the build environment inside gcc is different from normal
> userland. So, some debugging would be needed to bring this into libgcc.

That's because I never tried to compile for an AMD64 target that does not have at least SSE4 :(
I always compiled with -march=native on machines that were no older than 10-11 years.
I'll try to fix it.
In the meantime you could add -march=nehalem or -march=x86-64-v2 or -march=bdver1
or -msse4 to your set of compilation flags.
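
One possible shape of such a fix, as a hedged sketch: take the intrinsic path only when the compiler reports SSE4.1 via `__SSE4_1__`, and otherwise fall back to a plain `memcpy`. The function name `f128_to_u128` and the two `_mm_extract_epi64` calls are taken from the error log above; the rest of the body and the `u128` typedef are assumptions, not the repository's code.

```c
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE4_1__)
#include <smmintrin.h>   /* _mm_extract_epi64 needs SSE4.1 */
#endif

typedef unsigned __int128 u128;

/* Reinterpret the bits of a binary128 value as a 128-bit integer. */
static inline u128 f128_to_u128(__float128 x)
{
#if defined(__x86_64__) && defined(__SSE4_1__)
    __m128i v;
    memcpy(&v, &x, sizeof v);
    uint64_t lo = (uint64_t)_mm_extract_epi64(v, 0);
    uint64_t hi = (uint64_t)_mm_extract_epi64(v, 1);
    return ((u128)hi << 64) | lo;
#else
    /* Portable path: correct everywhere, possibly slower with gcc. */
    u128 u;
    memcpy(&u, &x, sizeof u);
    return u;
#endif
}
```

With a guard like this the file should build with the default x86-64 baseline, and the faster path would be picked up automatically under -msse4 or a suitable -march.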

> However, this does not mean that your code cannot be included. It would
> be possible to have the traditional function as a fallback in the (rare)
> case where people want something else - a conditional jump which would
> be predicted 99.99.... % accurately in a tight loop.

I don't think it is that easy.
The problem is not the branch but reading the rounding mode from where
gcc stores it.

> > Support to other rounding modes can be added to my routines, but doing it in
> > platform-independent manner will slow things down significantly and doing it
> > in platform-dependent manner will reduce portability (IMHO, *not* for a good reason)
> > and it still would be somewhat slower than current code.
> >
> > That's why I am more hopeful for integration into gfortran and esp. into gfortran's
> > matrix/vector libraries.
> > My impression is that these places don't care about quadmath non-default rounding modes.
> Yes. Fortran, specifically, restores IEEE rounding modes to default
> on procedure entry.
>
> I would still prefer to get your code into libgcc, though, because it
> would then be useful for every place where somebody uses 128-bit
> in a program.
>
> Have you ever compiled gcc?


Thomas Koenig wrote:
> Michael S <already5chosen@yahoo.com> schrieb:
>> But the real reason why I don't believe that my code can be integrated into
>> glibc or even into libgcc is different: I don't support rounding modes.
>> That is, I always round to nearest with ties rounded to even.
>> I suppose that it's what is wanted by nearly all users, but glibc has a different idea.
>> They think that binary128 rounding mode should be the same as current
>> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
>> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
>
> I understand them being sticklers for accuracy (Terje? :-)

Absolutely so!

Supporting all required rounding modes turns out to be easy however: If
you already support the default round_to_nearest_or_even (RNE), then you
already have the 4 required decision bits:

Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
but it is very easy to grab. :-) )

Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
get out the increment needed to round the intermediate result.

Supporting multiple rounding modes just means grabbing the correct
16-bit value, or you can use the sign bit to select between two 64-bit
constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
count to end up with the desired rounding bit.
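
A minimal sketch of the bitmap idea, for illustration only. It assumes four rounding modes and a sign-magnitude intermediate result; the names `rmode`, `round_table` and `round_increment` are made up here, and the table constants are derived from the usual IEEE-754 rounding rules rather than copied from any existing library:

```c
#include <assert.h>
#include <stdint.h>

enum rmode { RM_NEAREST_EVEN, RM_TOWARD_ZERO, RM_UPWARD, RM_DOWNWARD };

/* One 16-bit bitmap per rounding mode. Bit number
   sign*8 + ulp*4 + guard*2 + sticky holds the increment (0 or 1)
   that is added to the magnitude of the truncated result. */
const uint16_t round_table[4] = {
    [RM_NEAREST_EVEN] = 0xC8C8, /* guard && (sticky || ulp), either sign */
    [RM_TOWARD_ZERO]  = 0x0000, /* always truncate */
    [RM_UPWARD]       = 0x00EE, /* positive and inexact */
    [RM_DOWNWARD]     = 0xEE00, /* negative and inexact */
};

/* Returns 1 if the magnitude has to be bumped by one ulp, else 0. */
unsigned round_increment(enum rmode mode, unsigned sign, unsigned ulp,
                         unsigned guard, unsigned sticky)
{
    assert((sign | ulp | guard | sticky) <= 1u); /* each flag is 0 or 1 */
    unsigned idx = sign * 8 + ulp * 4 + guard * 2 + sticky;
    return (round_table[mode] >> idx) & 1u;
}
```

For example, `round_increment(RM_NEAREST_EVEN, 0, 1, 1, 0)` yields 1: an exact tie with an odd last bit rounds up to the even neighbour, while the same tie with `ulp` = 0 yields 0.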

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Michael S - Thu, 29 Dec 2022 21:32 UTC

On Thursday, December 29, 2022 at 10:02:25 PM UTC+2, Michael S wrote:
> On Thursday, December 29, 2022 at 8:07:45 PM UTC+2, Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> schrieb:
> > > On Thursday, December 29, 2022 at 12:36:48 PM UTC+2, Thomas Koenig wrote:
> > >> Michael S <already...@yahoo.com> schrieb:
> > >> > This thread is for attention of Thomas Koenig.
> > >> >
> > >> > Few months ago I spend significant time coding replacement routines for
> > >> > __addtf3 and __multf3.
> > >> > Back then I thought that they are part of quadmath library.
> > >> > quadmath library appears to be relatively low profile project with
> > >> > little burocracy involved so I thout that integration of my replacement
> > >> > will be easy.
> > >> > Later on (in June) I learned that these routines are in fact part of glibc.
> > >> > That reduced my enthusiasm. My impression is that glibc project is more
> > >> > bureaucratic (good thing, generally, but not for me, as a hobbyist that seeks his
> > >> > work integrated) and that glibc developers have strong territorial instincts.
> > >> I cannot really speak to that, I hardly know the glibc people. (I
> > >> probably ran across them at a GNU Cauldron, but I don't remember talking
> > >> to them).
> > >>
> > >> The current status seems to be that these functions are part of libgcc,
> > >> and that they are copied over from glibc.
> > >> > I could be wrong about both points, but that was my impression.
> > >> > Simultaneously, I had more "real work" to do that left little time.
> > >> I know that only too well (I currently have my own project, outside of
> > >> work, which keeps me quite busy).
> > >> > Often I came
> > >> > home tired. And when I don't do this sort of hobby things continuously
> > >> > night after night I tend to lose focus ans interest. In short, I almost forgot
> > >> > about whole thing. Luckily, by then both routines were already in MUCH better shape
> > >> > than originals. I mean, in terms of speed.
> > >> > Today and tomorrow I happen to have time, so recollected about it and stored routines
> > >> > and tests in my public github repository.
> > >> > https://github.com/already5chosen/extfloat/tree/master/binary128
> > >> >
> > >> > Thomas, please look at this code and tell me if there is a chance to integrate it.
> > >> I will test this for a bit.
> > OK, I plugged your code into libgcc/soft-fp/addtf3.c and
> > libgcc/soft-fp/multf3.c, and, on the first try, hit an error:
> >
> > In file included from /home/tkoenig/trunk-bin/gcc/include/immintrin.h:39,
> > from /home/tkoenig/trunk-bin/gcc/include/x86intrin.h:32,
> > from ../../../trunk/libgcc/soft-fp/addtf3.c:4:
> > /home/tkoenig/trunk-bin/gcc/include/smmintrin.h: In function 'f128_to_u128':
> > /home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
> > 455 | _mm_extract_epi64 (__m128i __X, const int __N)
> > | ^~~~~~~~~~~~~~~~~
> > ../../../trunk/libgcc/soft-fp/addtf3.c:70:17: note: called from here
> > 70 | uint64_t hi = _mm_extract_epi64(v, 1);
> > | ^~~~~~~~~~~~~~~~~~~~~~~
> > /home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
> > 455 | _mm_extract_epi64 (__m128i __X, const int __N)
> > | ^~~~~~~~~~~~~~~~~
> > ../../../trunk/libgcc/soft-fp/addtf3.c:69:17: note: called from here
> > 69 | uint64_t lo = _mm_extract_epi64(v, 0);
> > | ^~~~~~~~~~~~~~~~~~~~~~~
> > make[3]: *** [../../../trunk/libgcc/shared-object.mk:14: addtf3.o] Error 1
> >
> > (I'm not quite sure what "target specific option mismatch" actually
> > means in this context). This kind of thing is not unexpected,
> > because the build environment inside gcc is different from normal
> > userland. So, some debugging would be needed to bring this into libgcc.
> That's because I never tried to compile for an AMD64 target that does not have at least SSE4 :(
> I always compiled with -march=native on machines that were no older than 10-11 years.
> I'll try to fix it.

Fixed.

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Michael S - Thu, 29 Dec 2022 21:56 UTC

On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> schrieb:
> >> But the real reason why I don't believe that my code can be integrated into
> >> glibc or even into libgcc is different: I don't support rounding modes.
> >> That is, I always round to nearest with ties rounded to even.
> >> I suppose that it's what is wanted by nearly all users, but glibc has a different idea.
> >> They think that binary128 rounding mode should be the same as current
> >> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
> >> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
> >
> > I understand them being sticklers for accuracy (Terje? :-)
> Absolutely so!
>

I am pretty sure that in typical use cases one doesn't want the binary128
rounding mode to be prescribed by the same control word as binary32/binary64.
Generally, non-default binary64 rounding modes are for experimentation,
and binary128 is for comparing the results of that experimentation with "master"
values. For "master" values you want the "best" precision, which is achieved in
the default rounding mode.
MPFR does not tie its rounding modes to binary32/binary64, and MPFR is correct.

> Supporting all required rounding modes turns out to be easy however: If
> you already support the default round_to_nearest_or_even (RNE), then you
> already have the 4 required decision bits:
>
> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
> but it is very easy to grab. :-) )
>
> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
> get out the increment needed to round the intermediate result.
>
> Supporting multiple rounding modes just means grabbing the correct
> 16-bit value, or you can use the sign bit to select between two 64-bit
> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
> count to end up with the desired rounding bit.
>
> Terje
>

I am repeating myself for the 3rd or 4th time - the main cost is *not*
the implementation of non-default rounding modes. For that task branch
prediction will do a near-perfect job.
The main cost is reading the FP control word that contains the relevant bits
and interpreting these bits in the GPR domain. It is especially expensive if done
in a portable way, i.e. via fegetround().
It's not a lot of cycles in absolute sense, but in such primitives like fadd
and fmul every cycle counts. We want to do each of them in less than
25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Michael S <already5chosen@yahoo.com> schrieb:
> On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>> > Michael S <already...@yahoo.com> schrieb:
>> >> But the real reason why I don't believe that my code can be integrated into
>> >> glibc or even into libgcc is different: I don't support rounding modes.
>> >> That is, I always round to nearest with ties rounded to even.
>> >> I suppose that it's what is wanted by nearly all users, but glibc has a different idea.
>> >> They think that binary128 rounding mode should be the same as current
>> >> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
>> >> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
>> >
>> > I understand them being sticklers for accuracy (Terje? :-)
>> Absolutely so!
>>
>
> I am pretty sure that in typical use cases one doesn't want the binary128
> rounding mode to be prescribed by the same control word as binary32/binary64.
> Generally, non-default binary64 rounding modes are for experimentation,
> and binary128 is for comparing the results of that experimentation with "master"
> values. For "master" values you want the "best" precision, which is achieved in
> the default rounding mode.
> MPFR does not tie its rounding modes to binary32/binary64, and MPFR is correct.

I think you have to understand where people are coming from.
This was originally a soft-float routine, meant for implementing in
software the functions that the hardware of a given CPU lacks. With that
in mind, it is clear why people would want to look at the hardware
settings for its behavior.

We are slightly abusing this at the moment, but I think we can do
much better.

>
>> Supporting all required rounding modes turns out to be easy however: If
>> you already support the default round_to_nearest_or_even (RNE), then you
>> already have the 4 required decision bits:
>>
>> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
>> but it is very easy to grab. :-) )
>>
>> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
>> get out the increment needed to round the intermediate result.
>>
>> Supporting multiple rounding modes just means grabbing the correct
>> 16-bit value, or you can use the sign bit to select between two 64-bit
>> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
>> count to end up with the desired rounding bit.
>>
>> Terje
>>
>
> I am repeating myself for the 3rd or 4th time - the main cost is *not*
> the implementation of non-default rounding modes. For that task branch
> prediction will do a near-perfect job.
> The main cost is reading the FP control word that contains the relevant bits
> and interpreting these bits in the GPR domain. It is especially expensive if done
> in a portable way, i.e. via fegetround().

OK.

I think we can use Fortran semantics for an advantageous solution, at
least for gfortran.

Each procedure has a processor-dependent rounding mode on entry,
and it is restored on exit. (This actually makes a lot of sense:
somebody might have changed the rounding mode somewhere else,
and the results should not depend on this.) So there is no need
to look at global state.

A procedure which does not call IEEE_SET_ROUNDING_MODE does not change
rounding modes, so it can also use the default. This should cover the
vast majority of programs.

If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
register, but it can also do something else, like set a flag local
to the routine.

So, a strategy could be to implement three functions:

The simplest one of them would be called from the compiler if there
is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
exactly what you have already implemented.

The second one would take an additional argument and implement
all additional rounding modes. This would be called with the
local flag as additional argument. (If there turns out to be no
speed disadvantage to using this with a constant argument vs.
the first one, the two could also be rolled into one).

The third one, under the original name, actually reads the processor
status register and then tail-calls the second one to do the work.
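
The three entry points could be laid out roughly as follows. This is only a sketch under loud assumptions: the arithmetic is a toy stand-in for __addtf3 (it computes (a + b) / 4 rounded to an integer), all `toy_*` names are hypothetical, and `toy_hw_mode` is a plain variable standing in for the processor status register that real code would read:

```c
#include <assert.h>
#include <stdint.h>

enum rmode { RM_NEAREST_EVEN, RM_TOWARD_ZERO, RM_UPWARD };

/* Hypothetical stand-in for the processor status register. Real code
   would read MXCSR/FPCR or call fegetround() instead. */
enum rmode toy_hw_mode = RM_NEAREST_EVEN;

/* Function 1: fast path, round-to-nearest-even only, no mode lookup.
   The two discarded low bits play the roles of guard and sticky. */
uint64_t toy_add_rne(uint64_t a, uint64_t b)
{
    assert(a <= UINT64_MAX - b);            /* toy: no overflow handling */
    uint64_t sum = a + b;
    uint64_t q = sum >> 2;
    uint64_t guard = (sum >> 1) & 1;
    uint64_t sticky_or_ulp = (sum | q) & 1;
    return q + (guard & sticky_or_ulp);
}

/* Function 2: same operation, rounding mode passed as an argument. */
uint64_t toy_add_mode(uint64_t a, uint64_t b, enum rmode mode)
{
    assert(a <= UINT64_MAX - b);
    uint64_t sum = a + b;
    uint64_t q = sum >> 2;
    uint64_t guard = (sum >> 1) & 1, sticky = sum & 1, ulp = q & 1;
    switch (mode) {
    case RM_NEAREST_EVEN: return q + (guard & (sticky | ulp));
    case RM_UPWARD:       return q + (guard | sticky);
    case RM_TOWARD_ZERO:  return q;
    }
    return q;
}

/* Function 3: compatibility entry point. It reads the global mode and
   hands the work to function 2 (a tail call where the ABI allows it). */
uint64_t toy_add_compat(uint64_t a, uint64_t b)
{
    return toy_add_mode(a, b, toy_hw_mode);
}
```

In this layout only function 3 touches global state; a compiler that sees no rounding-mode change in scope could call function 1 directly.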

For gfortran, because we implement our library in C, we would also
need to add a flag which tells C (or the middle end) just to use
the default rounding mode, which we can then use as a flag when
building libgfortran.

Does this sound reasonable?

> It's not a lot of cycles in absolute sense, but in such primitives like fadd
> and fmul every cycle counts. We want to do each of them in less than
> 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

Absolutely. Like the old filk song "Every cycle is sacred"

Michael S <already5chosen@yahoo.com> writes:
>It's not a lot of cycles in absolute sense, but in such primitives like fadd
>and fmul every cycle counts. We want to do each of them in less than
>25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

If the rounding mode is set only through routines like fesetround(),
all such routines could be changed to either

1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
versions, and change fesetround() etc. to a version that just sets the
rounding mode. I.e., after the first fesetround() etc., you always
get the slow __addtf3, but fesetround() is relatively fast.

2) If the target rounding mode is different from the current rounding
mode: if the target is nearest-or-even, set the fast __addtf3 etc,
otherwise the slow __addtf3 etc. Don't change fesetround(). This
results in fast nearest-or-even __addtf3 etc., but relatively slow
fesetround() etc.

Note that these functions are already using some indirection thanks to
dynamic linking, so they are already changeable.
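
A rough sketch of option 2, assuming the indirection can be modelled as a function pointer. Everything here is hypothetical: `toy_half` stands in for __addtf3 (it computes x / 2 rounded to an integer), `toy_mode` stands in for the FP control register, and `toy_fesetround` is a made-up wrapper, not the real fesetround():

```c
#include <assert.h>
#include <stdint.h>

enum rmode { RM_NEAREST_EVEN, RM_TOWARD_ZERO, RM_UPWARD };

enum rmode toy_mode = RM_NEAREST_EVEN;  /* stand-in for the control register */

/* Fast version: hard-wired round-to-nearest-even. The discarded bit
   means an exact tie, which goes to the even neighbour. */
uint64_t toy_half_fast(uint64_t x)
{
    uint64_t q = x >> 1;
    return q + (x & q & 1);
}

/* Slow version: honours whatever mode is currently set. */
uint64_t toy_half_slow(uint64_t x)
{
    uint64_t q = x >> 1;
    switch (toy_mode) {
    case RM_NEAREST_EVEN: return q + (x & q & 1);
    case RM_UPWARD:       return q + (x & 1);
    case RM_TOWARD_ZERO:  return q;
    }
    return q;
}

/* The indirection that dynamic linking already provides, as a pointer. */
uint64_t (*toy_half)(uint64_t) = toy_half_fast;

/* Mode setter in the spirit of option 2: it does the extra work of
   swapping the implementation, so the arithmetic itself stays fast. */
int toy_fesetround(enum rmode target)
{
    assert(target <= RM_UPWARD);
    if (target != toy_mode) {
        toy_mode = target;
        toy_half = (target == RM_NEAREST_EVEN) ? toy_half_fast
                                               : toy_half_slow;
    }
    return 0;
}
```

Callers always go through `toy_half`; as long as nobody leaves nearest-or-even, they never pay for a mode lookup.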

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Michael S - Fri, 30 Dec 2022 10:30 UTC

On Friday, December 30, 2022 at 11:32:20 AM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
> >> Thomas Koenig wrote:
> >> > Michael S <already...@yahoo.com> schrieb:
> >> >> But the real reason why I don't believe that my code can be integrated into
> >> >> glibc or even into libgcc is different: I don't support rounding modes.
> >> >> That is, I always round to nearest with ties rounded to even.
> >> >> I suppose that it's what is wanted by nearly all users, but glibc has a different idea.
> >> >> They think that binary128 rounding mode should be the same as current
> >> >> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
> >> >> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
> >> >
> >> > I understand them being sticklers for accuracy (Terje? :-)
> >> Absolutely so!
> >>
> >
> > I am pretty sure that in typical use cases one doesn't want the binary128
> > rounding mode to be prescribed by the same control word as binary32/binary64.
> > Generally, non-default binary64 rounding modes are for experimentation,
> > and binary128 is for comparing the results of that experimentation with "master"
> > values. For "master" values you want the "best" precision, which is achieved in
> > the default rounding mode.
> > MPFR does not tie its rounding modes to binary32/binary64, and MPFR is correct.
> I think you have to understand where people are coming from.
> This was originally a soft-float routine, meant for implementing in
> software the functions that the hardware of a given CPU lacks. With that
> in mind, it is clear why people would want to look at the hardware
> settings for its behavior.
>

If this were about CPUs that come in two models, one with binary128 hardware and
the other without, like the presence or absence of an FP co-processor on CPUs
from the 80s, then I would accept this explanation.
But that's not the case. These routines are used primarily on x86-64 and ARM64,
which never had binary128 and, it seems, are not planning to add it in the
foreseeable future.

> We are slightly abusing this at the moment, but I think we can do
> much better.
> >
> >> Supporting all required rounding modes turns out to be easy however: If
> >> you already support the default round_to_nearest_or_even (RNE), then you
> >> already have the 4 required decision bits:
> >>
> >> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
> >> but it is very easy to grab. :-) )
> >>
> >> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
> >> get out the increment needed to round the intermediate result.
> >>
> >> Supporting multiple rounding modes just means grabbing the correct
> >> 16-bit value, or you can use the sign bit to select between two 64-bit
> >> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
> >> count to end up with the desired rounding bit.
> >>
> >> Terje
> >>
> >
> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
> > the implementation of non-default rounding modes. For that task branch
> > prediction will do a near-perfect job.
> > The main cost is reading the FP control word that contains the relevant bits
> > and interpreting these bits in the GPR domain. It is especially expensive if done
> > in a portable way, i.e. via fegetround().
> OK.
>
> I think we can use Fortran semantics for an advantageous solution, at
> least for gfortran.
>
> Each procedure has a processor-dependent rounding mode on entry,
> and it is restored on exit. (This actually makes a lot of sense:
> somebody might have changed the rounding mode somewhere else,
> and the results should not depend on this.) So there is no need
> to look at global state.
>
> A procedure which does not call IEEE_SET_ROUNDING_MODE does not change
> rounding modes, so it can also use the default. This should cover the
> vast majority of programs.
>

So if procedure Foo calls IEEE_SET_ROUNDING_MODE and then calls
procedure Bar, Bar is expected to operate with the default (==RNE)
rounding mode?
That sounds great for my purposes, but it also sounds like a violation of
the intentions of IEEE-754 and maybe even of the letter of the Standard.

> If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
> register, but it can also do something else, like set a flag local
> to the routine.
>
> So, a strategy could be to implement three functions:
>
> The simplest one of them would be called from the compiler if there
> is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
> exactly what you have already implemented.
>
> The second one would take an additional argument and implement
> all additional rounding modes. This would be called with the
> local flag as additional argument. (If there turns out to be no
> speed disadvantage to using this with a constant argument vs.
> the first one, the two could also be rolled into one).
>
> The third one, under the original name, actually reads the processor
> status register and then tail-calls the second one to do the work.
>
> For gfortran, because we implement our library in C, we would also
> need to add a flag which tells C (or the middle end) just to use
> the default rounding mode, which we can then use as a flag when
> building libgfortran.
>
> Does this sound reasonable?

Mostly.
The only problem that I see so far is specific to AMD64 Windows.
The 3rd routine will be slow. Well, not absolutely, but slower than
necessary.
That's because under the Windows ABI _Float128 is just a regular structure
that is returned from functions like any other structure that is
bigger than 64 bits, i.e. the caller passes a pointer to a location on the stack
and the callee stores the value there. So far so good.
The problem is that gcc's tail call elimination does not work in
this case. We can, of course, hope that they will fix it in v13 or
v14, but considering that they were not able to do it in 30+ years
I am not holding my breath in anticipation.

> > It's not a lot of cycles in absolute sense, but in such primitives like fadd
> > and fmul every cycle counts. We want to do each of them in less than
> > 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.
> Absolutely. Like the old filk song "Every cycle is sacred"

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Michael S - Fri, 30 Dec 2022 10:40 UTC

On Friday, December 30, 2022 at 12:16:41 PM UTC+2, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >It's not a lot of cycles in absolute sense, but in such primitives like fadd
> >and fmul every cycle counts. We want to do each of them in less than
> >25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.
> If the rounding mode is set only through routines like fesetround(),
> all such routines could be changed to either
>
> 1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
> versions, and change fesetround() etc. to a version that just sets the
> rounding mode. I.e., after the first fesetround() etc., you always
> get the slow __addtf3, but fesetround() is relatively fast.
>
> 2) If the target rounding mode is different from the current rounding
> mode: if the target is nearest-or-even, set the fast __addtf3 etc,
> otherwise the slow __addtf3 etc. Don't change fesetround(). This
> results in fast nearest-or-even __addtf3 etc., but relatively slow
> fesetround() etc.
>
> Note that these functions are already using some indirection thanks to
> dynamic linking, so they are already changeable.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Things like that should be discussed with the gcc and libgcc maintainers.
I have neither the ambition nor the desire to become one.

As a personal opinion, I hope that one day the infrastructure for statically
linked libraries will make a comeback. Not as the default, but as an
option people can use without jumping through hoops.

Michael S <already5chosen@yahoo.com> schrieb:
> On Friday, December 30, 2022 at 11:32:20 AM UTC+2, Thomas Koenig wrote:
>> Michael S <already...@yahoo.com> schrieb:
>> > On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
>> >> Thomas Koenig wrote:
>> >> > Michael S <already...@yahoo.com> schrieb:
>> >> >> But the real reason why I don't believe that my code can be integrated into
>> >> >> glibc or even into libgcc is different: I don't support rounding modes.
>> >> >> That is, I always round to nearest with ties rounded to even.
>> >> >> I suppose that it's what is wanted by nearly all users, but glibc has a different idea.
>> >> >> They think that binary128 rounding mode should be the same as current
>> >> >> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
>> >> >> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
>> >> >
>> >> > I understand them being sticklers for accuracy (Terje? :-)
>> >> Absolutely so!
>> >>
>> >
>> > I am pretty sure that in typical use cases one doesn't want the binary128
>> > rounding mode to be prescribed by the same control word as binary32/binary64.
>> > Generally, non-default binary64 rounding modes are for experimentation,
>> > and binary128 is for comparing the results of that experimentation with "master"
>> > values. For "master" values you want the "best" precision, which is achieved in
>> > the default rounding mode.
>> > MPFR does not tie its rounding modes to binary32/binary64, and MPFR is correct.
>> I think you have to understand where people are coming from.
>> This is originally a soft-float routine, meant for implementing
>> functions in hardware for CPUs that lack the feature. With that
>> in mind, it is clear why people would want to look at the hardware
>> settings for its behavior.
>>
>
> If it were about CPUs that come in two models, one with binary128 hardware and the other
> without it, like the presence/absence of an FP co-processor on CPUs
> from the 80s, then I would accept this explanation.
> But it's not the case. These routines are used primarily on x86-64 and ARM64
> that never had binary128 and, it seems, are not planning to add it in the
> [foreseeable] future.

Well, we are not going to be able to change the observable behavior of that
function (politically), but I don't think it matters that much.
We should just avoid calling this inefficient function wherever it is
possible to avoid it, and I don't think we need to call it from Fortran
at all. For other programming languages, I am not sure.

>> We are slightly abusing this at the moment, but I think we can do
>> much better.
>> >
>> >> Supporting all required rounding modes turns out to be easy however: If
>> >> you already support the default round_to_nearest_or_even (RNE), then you
>> >> already have the 4 required decision bits:
>> >>
>> >> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
>> >> but it is very easy to grab. :-) )
>> >>
>> >> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
>> >> get out the increment needed to round the intermediate result.
>> >>
>> >> Supporting multiple rounding modes just means grabbing the correct
>> >> 16-bit value, or you can use the sign bit to select between two 64-bit
>> >> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
>> >> count to end up with the desired rounding bit.
>> >>
>> >> Terje
>> >>
>> >
>> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
>> > implementation of non-default rounding modes. For that task branch
>> > prediction will do a near perfect job.
>> > The main cost is reading the FP control word that contains the relevant bits
>> > and interpreting these bits in the GPR domain. It is especially expensive if done
>> > in a portable way, i.e. via fegetround().
>> OK.
>>
>> I think we can use Fortran semantics for an advantageous solution, at
>> least for gfortran.
>>
>> Each procedure has a processor-dependent rounding mode on entry,
>> and it is restored on exit. (This actually makes a lot of sense:
>> somebody might have changed the rounding mode somewhere else,
>> and the results should not depend on this.) So there is no need
>> to look at global state.
>>
>> A procedure which does not call IEEE_SET_ROUNDING_MODE does not change
>> rounding modes, so it can also use the default. This should cover the
>> vast majority of programs.
>>
>
> So if procedure Foo calls IEEE_SET_ROUNDING_MODE and
> then calls procedure Bar, is Bar expected to operate with the default (==RNE)
> rounding mode?

Correct. It also makes trying to set the rounding flags in a subroutine
a no-op :-)

> That sounds great for my purposes, but it also sounds like a violation of the
> intentions of IEEE-754 and maybe even of the letter of the Standard.

Because IEEE-754 is one of those pesky standards where there is no
publicly available final committee draft, I cannot speak to that,
but I certainly think that the J3 people knew what they were doing
when they specified the Fortran interface to IEEE.

Maybe Terje can comment from the IEEE-754 side?

>> If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
>> register, but it can also do something else, like set a flag local
>> to the routine.
>>
>> So, a strategy could be to implement three functions:
>>
>> The simplest one of them would be called from the compiler if there
>> is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
>> exactly what you have already implemented.
>>
>> The second one would take an additional argument, and implements
>> all additional rounding modes. This would be called with the
>> local flag as additional argument. (If there turns out to be no
>> speed disadvantage to using this with a constant argument vs.
>> the first one, the two could also be rolled into one).
>>
>> The third one, under the original name, actually reads the processor
>> status register and then tail-calls the second one to do the work.
>>
>> For gfortran, because we implement our library in C, we would also
>> need to add a flag which tells C (or the middle end) just to use
>> the default rounding mode, which we can then use as a flag when
>> building libgfortran.
>>
>> Does this sound reasonable?
>
> Mostly.
> The only problem that I see so far is specific to AMD64 Windows.
> The 3rd routine will be slow. Well, not absolutely, but slower than
> necessary.
> That's because under the Windows ABI _Float128 is just a regular structure
> that is returned from functions like any other structure that is
> bigger than 64 bits, i.e. the caller passes a pointer to a location on the stack
> and the callee stores the value there. So far so good.
> The problem is that gcc's tail call elimination does not work in
> this case. We can, of course, hope that they will fix it in v13 or
> v14, but considering that they were not able to do it in 30+ years
> I am not holding my breath in anticipation.

If it is inefficient in Windows, then one other possibility is to
mark the second function inline, and let the compiler take care
of it, or whatever else turns out to be fastest. What I am thinking
about is mainly to reduce the call overhead to the absolute minimum.
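
Terje's bitmap idea quoted above can be sketched as follows. This is only
an illustration: the enum, the table and the function name are made up for
this sketch and are not taken from libgcc, glibc or Michael's repository.
It assumes a sign-magnitude intermediate result, so the returned value is
the increment to apply to the magnitude.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mode numbering for this sketch only. */
enum rmode { RM_NEAREST_EVEN = 0, RM_TOWARD_ZERO = 1,
             RM_UPWARD = 2, RM_DOWNWARD = 3 };

/* One 16-bit bitmap per rounding mode. Bit i holds the increment for
   i = sign*8 + ulp*4 + guard*2 + sticky. */
static const uint16_t round_bitmap[4] = {
    0xC8C8, /* nearest-even: guard && (sticky || ulp), for either sign */
    0x0000, /* toward zero: never increment the magnitude */
    0x00EE, /* upward: positive and inexact (guard || sticky) */
    0xEE00, /* downward: negative and inexact (guard || sticky) */
};

/* Returns 0 or 1: the amount to add to the truncated magnitude. */
unsigned round_increment(enum rmode mode, unsigned sign, unsigned ulp,
                         unsigned guard, unsigned sticky)
{
    assert(sign <= 1 && ulp <= 1 && guard <= 1 && sticky <= 1);
    unsigned idx = sign * 8 + ulp * 4 + guard * 2 + sticky;
    return (round_bitmap[mode] >> idx) & 1u;
}
```

With a constant RM_NEAREST_EVEN argument the table lookup folds into a
single constant, which is one way the first and second functions of the
strategy above could end up sharing code.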

Michael S <already5chosen@yahoo.com> writes:
>On Friday, December 30, 2022 at 12:16:41 PM UTC+2, Anton Ertl wrote:
>> If the rounding mode is set only through routines like fesetround(),
>> all such routines could be changed to either
>>
>> 1) Change __addtf3 etc. to the slow, but rounding-mode-conforming
>> versions, and change fesetround() etc. to a version that just sets the
>> rounding mode. I.e., after the first fesetround() etc., you always
>> get the slow __addtf3, but fesetround() is relatively fast.
>>
>> 2) If the target rounding mode is different from the current rounding
>> mode: if the target is nearest-or-even, set the fast __addtf3 etc,
>> otherwise the slow __addtf3 etc. Don't change fesetround(). This
>> results in fast nearest-or-even __addtf3 etc., but relatively slow
>> fesetround() etc.
>>
>> Note that these functions are already using some indirection thanks to
>> dynamic linking, so they are already changeable.
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>
>Things like that should be discussed with gcc and libgcc maintainer.

Worse, it needs both gcc (libgcc) and glibc maintainers, because
__addtf3 is part of libgcc, while fesetround() is part of glibc. And
gcc also needs to work with other libcs, not just glibc, so it may not
be an easy problem. I agree that it's not your problem, though.
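
A rough sketch of option 2, to make the indirection concrete. Everything
here is a stand-in chosen for illustration: a plain function pointer plays
the role of the dynamic-linking indirection, set_rounding() plays the role
of fesetround(), a global variable plays the role of the hardware rounding
mode, and halving an unsigned integer plays the role of __addtf3. None of
these names exist in libgcc or glibc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mode numbering for this sketch only. */
enum rmode { RM_NEAREST_EVEN, RM_TOWARD_ZERO, RM_UPWARD, RM_DOWNWARD };

/* Stand-in for the hardware rounding mode. */
enum rmode current_mode = RM_NEAREST_EVEN;

/* Fast version: always rounds x/2 to nearest, ties to even. */
uint64_t halve_fast(uint64_t x)
{
    uint64_t q = x >> 1;
    /* An odd x is an exact tie; round up only if q is odd. */
    return q + ((x & 1) & (q & 1));
}

/* Slow version: honours whatever mode is current. */
uint64_t halve_slow(uint64_t x)
{
    uint64_t q = x >> 1;
    switch (current_mode) {
    case RM_NEAREST_EVEN: return halve_fast(x);
    case RM_UPWARD:       return q + (x & 1);
    default:              return q; /* toward zero == downward for unsigned */
    }
}

/* The indirection: callers always go through this pointer. */
uint64_t (*halve)(uint64_t) = halve_fast;

/* Stand-in for fesetround(): the only place that pays for the switch. */
void set_rounding(enum rmode mode)
{
    assert((unsigned)mode <= RM_DOWNWARD);
    if (mode != current_mode) {
        halve = (mode == RM_NEAREST_EVEN) ? halve_fast : halve_slow;
        current_mode = mode;
    }
}
```

The arithmetic routine itself never inspects the mode on the fast path; all
the cost is moved into the (rare) mode change.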

>As my personal opinion, I hope that one day in the future statically linked
>libraries infrastructure will make a comeback.

They have been making one for some time. Languages like Go and Rust use
static linking (at least for code written in those languages),
because it's apparently too hard to maintain binary compatibility for
libraries written in those languages. I think the growth of main
memory sizes since the heyday of dynamic linking in the 1990s
also plays a role.

>Not as default, but as an
>option people can use without jumping through hoops.

Yes, it tends to be hard to statically link a C program these days.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Michael S - Fri, 30 Dec 2022 13:35 UTC

On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > On Friday, December 30, 2022 at 11:32:20 AM UTC+2, Thomas Koenig wrote:
> >> Michael S <already...@yahoo.com> schrieb:
> >> > On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
> >> >> Thomas Koenig wrote:
> >> >> > Michael S <already...@yahoo.com> schrieb:
> >> >> >> But the real reason why I don't believe that my code can be integrated into
> >> >> >> glibc or even into libgcc is different: I don't support rounding modes.
> >> >> >> That is, I always round to nearest with ties rounded to even.
> >> >> >> I suppose that it's what nearly all users want, but glibc has a different idea.
> >> >> >> They think that binary128 rounding mode should be the same as current
> >> >> >> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
> >> >> >> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
> >> >> >
> >> >> > I understand them being sticklers for accuracy (Terje? :-)
> >> >> Absolutely so!
> >> >>
> >> >
> >> > I am pretty sure that in typical use cases one doesn't want the binary128
> >> > rounding mode to be prescribed by the same control word as binary32/binary64.
> >> > Generally, binary64 non-default rounding modes are for experimentation.
> >> > And binary128 is for comparing the results of experimentation with "master"
> >> > values. And for "master" values you want the "best" precision, which is achieved in
> >> > the default rounding mode.
> >> > MPFR does not tie its rounding modes to binary32/binary64 and MPFR is correct.
> >> I think you have to understand where people are coming from.
> >> This is originally a soft-float routine, meant for implementing
> >> functions in hardware for CPUs that lack the feature. With that
> >> in mind, it is clear why people would want to look at the hardware
> >> settings for its behavior.
> >>
> >
> > If it were about CPUs that come in two models, one with binary128 hardware and the other
> > without it, like the presence/absence of an FP co-processor on CPUs
> > from the 80s, then I would accept this explanation.
> > But it's not the case. These routines are used primarily on x86-64 and ARM64
> > that never had binary128 and, it seems, are not planning to add it in the
> > [foreseeable] future.
> Well, we are not going to be able to change the observable behavior of that
> function (politically), but I don't think it matters that much.
> We should just avoid calling this inefficient function wherever it is
> possible to avoid it, and I don't think we need to call it from Fortran
> at all. For other programming languages, I am not sure.
> >> We are slightly abusing this at the moment, but I think we can do
> >> much better.
> >> >
> >> >> Supporting all required rounding modes turns out to be easy however: If
> >> >> you already support the default round_to_nearest_or_even (RNE), then you
> >> >> already have the 4 required decision bits:
> >> >>
> >> >> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
> >> >> but it is very easy to grab. :-) )
> >> >>
> >> >> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
> >> >> get out the increment needed to round the intermediate result.
> >> >>
> >> >> Supporting multiple rounding modes just means grabbing the correct
> >> >> 16-bit value, or you can use the sign bit to select between two 64-bit
> >> >> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
> >> >> count to end up with the desired rounding bit.
> >> >>
> >> >> Terje
> >> >>
> >> >
> >> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
> >> > implementation of non-default rounding modes. For that task branch
> >> > prediction will do a near perfect job.
> >> > The main cost is reading the FP control word that contains the relevant bits
> >> > and interpreting these bits in the GPR domain. It is especially expensive if done
> >> > in a portable way, i.e. via fegetround().
> >> OK.
> >>
> >> I think we can use Fortran semantics for an advantageous solution, at
> >> least for gfortran.
> >>
> >> Each procedure has a processor-dependent rounding mode on entry,
> >> and it is restored on exit. (This actually makes a lot of sense:
> >> somebody might have changed the rounding mode somewhere else,
> >> and the results should not depend on this.) So there is no need
> >> to look at global state.
> >>
> >> A procedure which does not call IEEE_SET_ROUNDING_MODE does not change
> >> rounding modes, so it can also use the default. This should cover the
> >> vast majority of programs.
> >>
> >
> > So if procedure Foo calls IEEE_SET_ROUNDING_MODE and
> > then calls procedure Bar, is Bar expected to operate with the default (==RNE)
> > rounding mode?
> Correct. It also makes trying to set the rounding flags in a subroutine
> a no-op :-)
> > That sounds great for my purposes, but it also sounds like a violation of the
> > intentions of IEEE-754 and maybe even of the letter of the Standard.
> Because IEEE-754 is one of those pesky standards where there is no
> publicly available final committee draft, I cannot speak to that,
> but I certainly think that the J3 people knew what they were doing
> when they specified the Fortran interface to IEEE.
>
> Maybe Terje can comment from the IEEE-754 side?
> >> If IEEE_SET_ROUNDING_MODE is called, it sets the processor status
> >> register, but it can also do something else, like set a flag local
> >> to the routine.
> >>
> >> So, a strategy could be to implement three functions:
> >>
> >> The simplest one of them would be called from the compiler if there
> >> is no call to IEEE_SET_ROUNDING_MODE in sight, and it would do
> >> exactly what you have already implemented.
> >>
> >> The second one would take an additional argument, and implements
> >> all additional rounding modes. This would be called with the
> >> local flag as additional argument. (If there turns out to be no
> >> speed disadvantage to using this with a constant argument vs.
> >> the first one, the two could also be rolled into one).
> >>
> >> The third one, under the original name, actually reads the processor
> >> status register and then tail-calls the second one to do the work.
> >>
> >> For gfortran, because we implement our library in C, we would also
> >> need to add a flag which tells C (or the middle end) just to use
> >> the default rounding mode, which we can then use as a flag when
> >> building libgfortran.
> >>
> >> Does this sound reasonable?
> >
> > Mostly.
> > The only problem that I see so far is specific to AMD64 Windows.
> > The 3rd routine will be slow. Well, not absolutely, but slower than
> > necessary.
> > That's because under the Windows ABI _Float128 is just a regular structure
> > that is returned from functions like any other structure that is
> > bigger than 64 bits, i.e. the caller passes a pointer to a location on the stack
> > and the callee stores the value there. So far so good.
> > The problem is that gcc's tail call elimination does not work in
> > this case. We can, of course, hope that they will fix it in v13 or
> > v14, but considering that they were not able to do it in 30+ years
> > I am not holding my breath in anticipation.
> If it is inefficient in Windows, then one other possibility is to
> mark the second function inline, and let the compiler take care
> of it, or whatever else turns out to be fastest. What I am thinking
> about is mainly to reduce the call overhead to the absolute minimum.

That's a solution, too.
The library will be bigger, but at the source code level we don't repeat ourselves.
I don't like that sort of bloat, but relative to the thousands of other kinds of bloat that
people today consider acceptable, this one is quite minor.
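
For concreteness, the three-function split discussed above might look
roughly like this. It is a sketch under loud assumptions: the operation is
a rounded right shift of an unsigned integer rather than binary128
arithmetic, the function names are invented, and a global variable stands
in for the FP control word (real code would read MXCSR/FPCR or call
fegetround()).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mode numbering for this sketch only. */
enum rmode { RM_NEAREST_EVEN, RM_TOWARD_ZERO, RM_UPWARD, RM_DOWNWARD };

/* Stand-in for the FP control word. */
enum rmode fake_fp_control = RM_NEAREST_EVEN;

/* Function 1: nearest-even only, never looks at any mode. */
uint64_t shr_round_rne(uint64_t x, unsigned n)
{
    assert(n >= 1 && n <= 63);
    uint64_t q    = x >> n;
    uint64_t rem  = x & ((UINT64_C(1) << n) - 1);
    uint64_t half = UINT64_C(1) << (n - 1);
    if (rem > half || (rem == half && (q & 1)))
        q++;
    return q;
}

/* Function 2: same operation, rounding mode passed as an argument. */
uint64_t shr_round_mode(uint64_t x, unsigned n, enum rmode mode)
{
    assert(n >= 1 && n <= 63);
    uint64_t q   = x >> n;
    uint64_t rem = x & ((UINT64_C(1) << n) - 1);
    switch (mode) {
    case RM_NEAREST_EVEN: return shr_round_rne(x, n);
    case RM_UPWARD:       return q + (rem != 0);
    default:              return q; /* toward zero == downward for unsigned */
    }
}

/* Function 3: reads the (stand-in) control word, then defers to function 2.
   Whether this call is really compiled as a tail call or inlined is up to
   the compiler and the ABI. */
uint64_t shr_round_env(uint64_t x, unsigned n)
{
    return shr_round_mode(x, n, fake_fp_control);
}
```

Only the third function pays for reading global state; a compiler that
knows the rounding mode cannot have changed would call the first one
directly.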

Michael S <already5chosen@yahoo.com> schrieb:

[...]

> On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:

>> If it is inefficient in Windows, then one other possibility is to
>> mark the second function inline, and let the compiler take care
>> of it, or whatever else turns out to be fastest. What I am thinking
>> about is mainly to reduce the call overhead to the absolute minimum.
>
> That's a solution, too.
> The library will be bigger, but at the source code level we don't repeat ourselves.
> I don't like that sort of bloat, but relative to the thousands of other kinds of bloat that
> people today consider acceptable, this one is quite minor.
>
> The best technical solution for Windows would be a change in the API of the
> compiler's support functions from pass by value to pass by reference.
> I.e. instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
> void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
> It helps with the TCE problem, but not only that.
> In my measurements such an API ends up significantly faster, at least in
> the matmul benchmark.
> But such a change means breaking compatibility with previous versions of the compiler.

I am not sure it is only that; I have no idea where this ABI is
specified. Did Microsoft publish anything about this?

However, https://godbolt.org/z/7oqYsxG3P tells me that
icc has a different calling convention from both gcc and clang,
presumably on Linux, so the mess is even bigger...

In general, ABI changes are never undertaken lightly, only when there is
a pressing need: either a mistake in the previous implementation
or a change in a standard that requires something new.

> Also, more problematically, it means that the gcc compiler people have to do
> additional work for the sake of Windows and Windows alone. My impression is
> that they hate to do it.

People are wary of working on systems they don't know well. There are,
however, some people active in gcc development on these platforms.

> Of course, the same change can be made on Linux, too, but it is a harder sell,
> both because on Linux the problem of compatibility with previous versions
> is more serious and because on Linux there is another popular compiler
> (clang) that is sometimes used together with gcc and that supports
> _Float128, and people expect interoperability.
> Also, on Linux the performance gain from changing the API is smaller.

For Linux, the ABI is prescribed in the

System V Application Binary Interface
AMD64 Architecture Processor Supplement
(With LP64 and ILP32 Programming Models)
Version 1.0

(to give it its full title, you'll find it as x86-64-psABI-1.0.pdf).
I do not think you will get anybody to change that. Hmm... seems
that Intel is actually in violation of that specification.
Interesting...

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Michael S - Fri, 30 Dec 2022 14:39 UTC

On Friday, December 30, 2022 at 4:24:12 PM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
>
> [...]
> > On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:
>
> >> If it is inefficient in Windows, then one other possibility is to
> >> mark the second function inline, and let the compiler take care
> >> of it, or whatever else turns out to be fastest. What I am thinking
> >> about is mainly to reduce the call overhead to the absolute minimum.
> >
> > That's a solution, too.
> > The library will be bigger, but at the source code level we don't repeat ourselves.
> > I don't like that sort of bloat, but relative to the thousands of other kinds of bloat that
> > people today consider acceptable, this one is quite minor.
> >
> > The best technical solution for Windows would be a change in the API of the
> > compiler's support functions from pass by value to pass by reference.
> > I.e. instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
> > void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
> > It helps with the TCE problem, but not only that.
> > In my measurements such an API ends up significantly faster, at least in
> > the matmul benchmark.
> > But such a change means breaking compatibility with previous versions of the compiler.
> I am not sure it is only that; I have no idea where this ABI is
> specified. Did Microsoft publish anything about this?
>

As far as Microsoft is concerned, __float128/_Float128 does not exist.
So, naturally, it is not in the official ABI. As I said above, from the Windows
ABI perspective those types, if implemented, are just more 16-byte structures.

> However, https://godbolt.org/z/7oqYsxG3P tells me that
> icc has a different calling convention from both gcc and clang,
> presumably on Linux, so the mess is even bigger...
>
> In general, ABI changes are never undertaken lightly, only when there is
> a pressing need: either a mistake in the previous implementation
> or a change in a standard that requires something new.
> > Also, more problematically, it means that the gcc compiler people have to do
> > additional work for the sake of Windows and Windows alone. My impression is
> > that they hate to do it.
> People are wary of working on systems they don't know well. There are,
> however, some people active in gcc development on these platforms.
> > Of course, the same change can be made on Linux, too, but it is a harder sell,
> > both because on Linux the problem of compatibility with previous versions
> > is more serious and because on Linux there is another popular compiler
> > (clang) that is sometimes used together with gcc and that supports
> > _Float128, and people expect interoperability.
> > Also, on Linux the performance gain from changing the API is smaller.
> For Linux, the ABI is prescribed in the
>
> System V Application Binary Interface
> AMD64 Architecture Processor Supplement
> (With LP64 and ILP32 Programming Models)
> Version 1.0
>
> (to give it its full title, you'll find it as x86-64-psABI-1.0.pdf).
> I do not think you will get anybody to change that. Hmm... seems
> that Intel is actually in violation of that specification.
> Interesting...

You mean that the official ABI prescribes the names and prototypes of the
compiler support routines __multf3, __addtf3 and __subtf3?

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Thomas Koenig - Fri, 30 Dec 2022 15:06 UTC

Michael S <already5chosen@yahoo.com> schrieb:
> On Friday, December 30, 2022 at 4:24:12 PM UTC+2, Thomas Koenig wrote:
>> Michael S <already...@yahoo.com> schrieb:
>>
>> [...]
>> > On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:
>>
>> >> If it is inefficient in Windows, then one other possibility is to
>> >> mark the second function inline, and let the compiler take care
>> >> of it, or whatever else turns out to be fastest. What I am thinking
>> >> about is mainly to reduce the call overhead to the absolute minimum.
>> >
>> > That's a solution, too.
>> > The library will be bigger, but at the source code level we don't repeat ourselves.
>> > I don't like that sort of bloat, but relative to the thousands of other kinds of bloat that
>> > people today consider acceptable, this one is quite minor.
>> >
>> > The best technical solution for Windows would be a change in the API of the
>> > compiler's support functions from pass by value to pass by reference.
>> > I.e. instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
>> > void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
>> > It helps with the TCE problem, but not only that.
>> > In my measurements such an API ends up significantly faster, at least in
>> > the matmul benchmark.
>> > But such a change means breaking compatibility with previous versions of the compiler.
>> I am not sure it is only that; I have no idea where this ABI is
>> specified. Did Microsoft publish anything about this?
>>
>
> As far as Microsoft is concerned, __float128/_Float128 does not exist.
> So, naturally, it is not in the official ABI. As I said above, from the Windows
> ABI perspective those types, if implemented, are just more 16-byte structures.

OK.

>
>> However, https://godbolt.org/z/7oqYsxG3P tells me that
>> icc has a different calling convention from both gcc and clang,
>> presumably on Linux, so the mess is even bigger...
>>
>> In general, ABI changes are never undertaken lightly, only when there is
>> a pressing need: either a mistake in the previous implementation
>> or a change in a standard that requires something new.
>> > Also, more problematically, it means that the gcc compiler people have to do
>> > additional work for the sake of Windows and Windows alone. My impression is
>> > that they hate to do it.
>> People are wary of working on systems they don't know well. There are,
>> however, some people active in gcc development on these platforms.
>> > Of course, the same change can be made on Linux, too, but it is a harder sell,
>> > both because on Linux the problem of compatibility with previous versions
>> > is more serious and because on Linux there is another popular compiler
>> > (clang) that is sometimes used together with gcc and that supports
>> > _Float128, and people expect interoperability.
>> > Also, on Linux the performance gain from changing the API is smaller.
>> For Linux, the ABI is prescribed in the
>>
>> System V Application Binary Interface
>> AMD64 Architecture Processor Supplement
>> (With LP64 and ILP32 Programming Models)
>> Version 1.0
>>
>> (to give it its full title, you'll find it as x86-64-psABI-1.0.pdf).
>> I do not think you will get anybody to change that. Hmm... seems
>> that Intel is actually in violation of that specification.
>> Interesting...
>
> You mean that the official ABI prescribes the names and prototypes of the
> compiler support routines __multf3, __addtf3 and __subtf3?

No, it just prescribes the argument passing conventions if
you pass 128-bit floats by value (in two halves, in SSE
registers).

However, there is no ABI-based reason why the auxiliary
multiplication functions for use by Fortran (or other value)
should use pass by value. It is also possible to use

void foo (__float128 *const restrict a, __float128 *const restrict b,
__float128 *restrict c)

or some variant if that turns out to generate better code.
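
To illustrate the shape of such an interface without depending on
__float128 support, here is a small sketch that treats the value as the
16-byte structure Michael describes. The type and function names are
hypothetical, the layout assumes a little-endian target, and negation
(flipping the sign bit) is used only because it fits in a few lines; it is
not a substitute for the real multiplication routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for binary128 storage on a little-endian target:
   'hi' holds the sign bit, the exponent and the top of the significand. */
typedef struct { uint64_t lo, hi; } f128_bits;

/* By-reference kernel: the result is written through dst, so nothing
   larger than a pointer crosses the call boundary. */
void neg128_ref(f128_bits *dst, const f128_bits *src)
{
    assert(dst != NULL && src != NULL);
    dst->lo = src->lo;
    dst->hi = src->hi ^ (UINT64_C(1) << 63);
}

/* By-value wrapper kept for callers that expect the old style. */
f128_bits neg128_val(f128_bits x)
{
    f128_bits r;
    neg128_ref(&r, &x);
    return r;
}

/* Array primitives can call the by-reference kernel element by element
   without copying operands around. */
void neg128_array(f128_bits *dst, const f128_bits *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        neg128_ref(&dst[i], &src[i]);
}
```

Whether the by-reference form is actually faster on a given platform would
have to be measured, as in the matmul numbers mentioned earlier.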

Michael S <already5chosen@yahoo.com> schrieb:
> On Thursday, December 29, 2022 at 10:02:25 PM UTC+2, Michael S wrote:
>> On Thursday, December 29, 2022 at 8:07:45 PM UTC+2, Thomas Koenig wrote:

>> > In file included from /home/tkoenig/trunk-bin/gcc/include/immintrin.h:39,
>> > from /home/tkoenig/trunk-bin/gcc/include/x86intrin.h:32,
>> > from ../../../trunk/libgcc/soft-fp/addtf3.c:4:
>> > /home/tkoenig/trunk-bin/gcc/include/smmintrin.h: In function 'f128_to_u128':
>> > /home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
>> > 455 | _mm_extract_epi64 (__m128i __X, const int __N)
>> > | ^~~~~~~~~~~~~~~~~
>> > ../../../trunk/libgcc/soft-fp/addtf3.c:70:17: note: called from here
>> > 70 | uint64_t hi = _mm_extract_epi64(v, 1);
>> > | ^~~~~~~~~~~~~~~~~~~~~~~
>> > /home/tkoenig/trunk-bin/gcc/include/smmintrin.h:455:1: error: inlining failed in call to 'always_inline' '_mm_extract_epi64': target specific option mismatch
>> > 455 | _mm_extract_epi64 (__m128i __X, const int __N)
>> > | ^~~~~~~~~~~~~~~~~
>> > ../../../trunk/libgcc/soft-fp/addtf3.c:69:17: note: called from here
>> > 69 | uint64_t lo = _mm_extract_epi64(v, 0);
>> > | ^~~~~~~~~~~~~~~~~~~~~~~
>> > make[3]: *** [../../../trunk/libgcc/shared-object.mk:14: addtf3.o] Error 1
>> >
>> > (I'm not quite sure what "target specific option mismatch" actually
>> > means in this context). This kind of thing is not unexpected,
>> > because the build environment inside gcc is different from normal
>> > userland. So, some debugging would be needed to bring this into libgcc.
>> That's because I never tried to compile for an AMD64 target that does not have at least SSE4 :(
>> I always compiled with -march=native on machines that were no more than 10-11 years old.
>> I'll try to fix it.
>
> Fixed.
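
For readers puzzled by the error above: _mm_extract_epi64 is an SSE4.1 intrinsic, and GCC declares it always_inline for an SSE4.1 target, so it cannot be inlined into code built for baseline x86-64 (SSE2 only), which is how libgcc is compiled. The sketch below shows one way to get both 64-bit halves using SSE2 alone. It is not necessarily the fix that went into the repository; the type u128_halves and both function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>   /* SSE2 only: part of the x86-64 baseline */
#endif

typedef struct { uint64_t lo, hi; } u128_halves;  /* hypothetical helper type */

#if defined(__SSE2__) && defined(__x86_64__)
/* SSE2-only replacement for two _mm_extract_epi64 calls. */
static inline u128_halves split_m128i(__m128i v)
{
    u128_halves r;
    r.lo = (uint64_t)_mm_cvtsi128_si64(v);                        /* bits 0..63   */
    r.hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v)); /* bits 64..127 */
    return r;
}
#endif

/* Portable alternative: copy the 16 bytes out.
   Assumes a little-endian layout (low half first in memory). */
static inline u128_halves split_bytes(const void *p)
{
    uint64_t w[2];
    u128_halves r;
    memcpy(w, p, sizeof w);
    r.lo = w[0];
    r.hi = w[1];
    return r;
}
```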

I plugged in the current version into current trunk, and it works. Very good.
A regression test shows the following failures introduced by the patch
(at all optimization levels):

gcc.dg/torture/float128-exact-underflow.c
gcc.dg/torture/float128-ieee-nan.c
gcc.dg/torture/float128-nan.c
gcc.dg/torture/float128-nan-floath.c
gfortran.dg/ieee/large_3.F9

These tests check various IEEE compliance aspects, including IEEE
flags for the Fortran test. I'd have to look in detail what
the causes are, but I suspect that, for C, we will have to
use something with the original functionality.

Michael S wrote:
> On Thursday, December 29, 2022 at 11:13:20 PM UTC+2, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> Michael S <already...@yahoo.com> schrieb:
>>>> But the real reason why I don't believe that my code can be integrated into
>>>> glibc or even into libgcc is different: I don't support rounding modes.
>>>> That is, I always round to nearest with ties rounded to even.
>>>> I suppose that it's what is wanted by nearly all users, but glibc has a different idea.
>>>> They think that binary128 rounding mode should be the same as current
>>>> rounding mode for binary64/binary32. They say, it's required by IEEE-754.
>>>> They could even be correct about it, but it just shows that IEEE-754 is not perfect.
>>>
>>> I understand them being sticklers for accuracy (Terje? :-)
>> Absolutely so!
>>
>
> I am pretty sure that in typical use cases one doesn't want binary128
> rounding mode to be prescribed by the same control word as binary32/binary64.
> Generally, binary64 non-default rounding modes are for experimentation.
> And binary128 is for comparison of results of experimentation with "master"
> values. And for "master" values you want "best" precision which is achieved in
> default rounding mode.
> MPFR does not tie its rounding modes to binary32/binary64 and MPFR is correct.
>
>> Supporting all required rounding modes turns out to be easy however: If
>> you already support the default round_to_nearest_or_even (RNE), then you
>> already have the 4 required decision bits:
>>
>> Sign, Ulp, Guard & Sticky (well, Sign isn't actually needed for RNE,
>> but it is very easy to grab. :-) )
>>
>> Using those 4 bits as index into a bitmap (i.e. a 16-bit constant) you
>> get out the increment needed to round the intermediate result.
>>
>> Supporting multiple rounding modes just means grabbing the correct
>> 16-bit value, or you can use the sign bit to select between two 64-bit
>> constants and then use rounding_mode*8+ulp*4+guard*2+sticky as a shift
>> count to end up with the desired rounding bit.
>>
>> Terje
>>
>
> I am repeating myself for the 3rd or 4th time - the main cost is *not*
> implementation of non-default rounding modes. For that task branch
> prediction will do a near perfect job.
> The main cost is reading the FP control word that contains the relevant bits
> and interpreting these bits in GPR domain. It is especially expensive if done
> in portable way, i.e. via fegetround().
> It's not a lot of cycles in absolute sense, but in primitives such as fadd
> and fmul every cycle counts. We want to do each of them in less than
> 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.

Ouch!

Sorry, I had absolutely no intention to suggest you should try to adapt
to whatever the current HW rounding mode is! IMHO the only sane way to
do it is to assume the user will set the rounding mode with the standard
function to do so, and at that point you save away a copy, and load the
corresponding 16-bit rounding lookup value.

If a library user then goes "behind your back" using direct asm
instructions to set the rounding mode to something else, then just
disregard that.

OK?
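
A minimal sketch of the scheme described above, under stated assumptions: the names q_set_rounding, q_round_increment and q_round_shift are hypothetical, and the four 16-bit constants were derived for this example from the index sign*8 + ulp*4 + guard*2 + sticky. The setter only caches the table; a real wrapper would presumably also forward the mode to fesetround(). Carry out of the mantissa after the increment is left to the caller.

```c
#include <assert.h>
#include <fenv.h>
#include <stdint.h>

/* One 16-bit table per mode. Bit index = sign*8 + ulp*4 + guard*2 + sticky.
   A set bit means "add one ulp to the magnitude". */
enum {
    RND_NEAREST_EVEN = 0xC8C8, /* guard && (sticky || ulp), either sign */
    RND_TOWARD_ZERO  = 0x0000, /* always truncate the magnitude         */
    RND_UPWARD       = 0x00EE, /* positive and inexact                  */
    RND_DOWNWARD     = 0xEE00  /* negative and inexact                  */
};

/* Cached copy, read by the arithmetic routines instead of the FP control word. */
static uint16_t cached_round_table = RND_NEAREST_EVEN;

/* Hypothetical setter: returns 0 on success, nonzero on an unknown mode. */
int q_set_rounding(int mode)
{
    switch (mode) {
    case FE_TONEAREST:  cached_round_table = RND_NEAREST_EVEN; break;
    case FE_TOWARDZERO: cached_round_table = RND_TOWARD_ZERO;  break;
    case FE_UPWARD:     cached_round_table = RND_UPWARD;       break;
    case FE_DOWNWARD:   cached_round_table = RND_DOWNWARD;     break;
    default:            return -1;
    }
    return 0;
}

/* Returns 0 or 1: the increment to add to the truncated magnitude. */
static inline unsigned q_round_increment(unsigned sign, unsigned ulp,
                                         unsigned guard, unsigned sticky)
{
    unsigned idx = (sign & 1u) * 8 + (ulp & 1u) * 4 + (guard & 1u) * 2 + (sticky & 1u);
    return (cached_round_table >> idx) & 1u;
}

/* Drop the low `shift` bits (1..63) of magnitude m, rounding per the cached mode. */
uint64_t q_round_shift(unsigned sign, uint64_t m, unsigned shift)
{
    assert(shift >= 1 && shift < 64);
    uint64_t kept   = m >> shift;
    unsigned guard  = (unsigned)(m >> (shift - 1)) & 1u;
    unsigned sticky = (m & ((UINT64_C(1) << (shift - 1)) - 1)) != 0;
    return kept + q_round_increment(sign, (unsigned)kept & 1u, guard, sticky);
}
```

With this layout the hot path costs one shift and one mask on a cached 16-bit value, and fegetround() is never called from the arithmetic routines.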

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: Terje Mathisen - Fri, 30 Dec 2022 20:57 UTC

Michael S wrote:
> On Friday, December 30, 2022 at 4:24:12 PM UTC+2, Thomas Koenig wrote:
>> Michael S <already...@yahoo.com> schrieb:
>>
>> [...]
>>> On Friday, December 30, 2022 at 1:15:46 PM UTC+2, Thomas Koenig wrote:
>>
>>>> If it is inefficient in Windows, then one other possibility is to
>>>> mark the second function inline, and let the compiler take care
>>>> of it, or whatever else turns out to be fastest. What I am thinking
>>>> about is mainly to reduce the call overhead to the absolute minimum.
>>>
>>> That's a solution, too.
>>> The library will be bigger, but at source code level we don't repeat ourselves.
>>> I don't like that sort of bloat, but relative to thousands of other bloats that
>>> people today consider acceptable this one is quite minor.
>>>
>>> The best technical solution for Windows would be a change in API of
>>> compiler's support functions from pass by value to pass by reference.
>>> I.e. instead of __float128 __multf3(__float128 srcx, __float128 srcy) it should be
>>> void __multf3(__float128* dst, __float128* srcx, __float128* srcy).
>>> It helps the problem with TCE, but not only that.
>>> In my measurements such API ends up significantly faster, at least in
>>> matmul benchmark.
>>> But such change means breaking compatibility with previous version of compiler.
>> I am not sure it is only that; I have no idea where this ABI is
>> specified. Did Microsoft publish anything about this?
>>
>
> As far as Microsoft is concerned, __float128/_Float128 does not exist.
> So, naturally, they are not in the official ABI. As I said above, from Windows
> ABI perspective those types, if implemented, are yet another 16-byte structures.

Microsoft do support 16-byte return values all over the place in the
several 100's of AVX SIMD intrinsics which use __m256i as both argument
and return types. Could you (ab)use those via some typedefs?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: MitchAlsup - Fri, 30 Dec 2022 22:12 UTC

On Friday, December 30, 2022 at 2:44:52 PM UTC-6, Terje Mathisen wrote:
> Michael S wrote:

> > I am repeating myself for the 3rd or 4th time - the main cost is *not*
> > implementation of non-default rounding modes. For that task branch
> > prediction will do a near perfect job.
> > The main cost is reading the FP control word that contains the relevant bits
> > and interpreting these bits in GPR domain. It is especially expensive if done
> > in portable way, i.e. via fegetround().
> > It's not a lot of cycles in absolute sense, but in primitives such as fadd
> > and fmul every cycle counts. We want to do each of them in less than
> > 25 cycles on average. On Apple silicon, hopefully, in less than 15 cycles.
> Ouch!
>
> Sorry, I had absolutely no intention to suggest you should try to adapt
> to whatever the current HW rounding mode is! IMHO the only sane way to
> do it is to assume the user will set the rounding mode with the standard
> function to do so, and at that point you save away a copy, and load the
> corresponding 16-bit rounding lookup value.
>
> If a library user then goes "behind your back" using direct asm
> instructions to set the rounding mode to something else, then just
> disregard that.
<
If you have an environment where you have a standard way to get/set
the rounding mode, and some programmer goes around that interface
with ASM to get/set the rounding mode, that programmer (ASM) should
be summarily dismissed.
>
> OK?
>
> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Improved routines for gcc/gfortran quadmath arithmetic

by: MitchAlsup - Fri, 30 Dec 2022 22:13 UTC

Don't sweat it -- it's only ones and zeros. -- P. Skelly
