novaBBS - comp.arch - Re: Approximate reciprocals

Re: Approximate reciprocals

<t1kui4$l4a$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24473&group=comp.arch#24473

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 25 Mar 2022 18:36:13 +0100
Organization: Aioe.org NNTP Server
Message-ID: <t1kui4$l4a$1@gioia.aioe.org>
References: <t1c154$j5t$1@dont-email.me>
<81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de>
<903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de>
<526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de>
<5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de>
<b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de>
<4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="21642"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Fri, 25 Mar 2022 17:36 UTC

Michael S wrote:
> But one thing is sure - [on Intel] my quad-precision sqrt() is close
> to 15 times faster than the one supplied with GCC Quad-Precision Math
> Library.

That seems almost impossible, how is it even possible to generate
something that slow?

Assuming a naive/classic SQRT NR iteration with a division might explain
it, but it seems like a stretch even so.

If your version is approximate (0.5x ulp but not exactly rounded for all
inputs?), then that might explain a factor of 2, but not 15!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<35f5c160-b3f8-4d9f-9d38-57b3010705cdn@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24476&group=comp.arch#24476

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:68c9:0:b0:67b:21e6:297c with SMTP id d192-20020a3768c9000000b0067b21e6297cmr7978286qkc.464.1648232612158;
Fri, 25 Mar 2022 11:23:32 -0700 (PDT)
X-Received: by 2002:a05:6808:8e4:b0:2ec:aea1:353a with SMTP id
d4-20020a05680808e400b002ecaea1353amr6047397oic.27.1648232611962; Fri, 25 Mar
2022 11:23:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Mar 2022 11:23:31 -0700 (PDT)
In-Reply-To: <t1kui4$l4a$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0b7:ad45:609a:53bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0b7:ad45:609a:53bb
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1kui4$l4a$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <35f5c160-b3f8-4d9f-9d38-57b3010705cdn@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Mar 2022 18:23:32 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 32

by: MitchAlsup - Fri, 25 Mar 2022 18:23 UTC

On Friday, March 25, 2022 at 12:36:10 PM UTC-5, Terje Mathisen wrote:
> Michael S wrote:
> > But one thing is sure - [on Intel] my quad-precision sqrt() is close
> > to 15 times faster than the one supplied with GCC Quad-Precision Math
> > Library.
> That seems almost impossible, how is it even possible to generate
> something that slow?
>
> Assuming a naive/classic SQRT NR iteration with a division might explain
> it, but it seems like a stretch even so.
<
Convert quad to double:: dozen cycles :: zero cycles in HW
SQRT(double() :: 2 dozen cycles .......... :: 20-24 cycles in HW
Convert double to quad:: ½ dozen cycles :: zero cycles in HW
2 Newton-Raphson iterations (4 quad FMAC delays) :: ½ in HW N-R*
<
(*) in HW one can perform Newton-Raphson iteration without having to
round/normalize except for the final result delivery.
<
>
> If your version is approximate (0.5x ulp but not exactly rounded for all
> inputs?), then that might explain a factor of 2, but not 15!
<
I only see 4×-5× slower.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<t1lcmp$1ef7$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24479&group=comp.arch#24479

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 25 Mar 2022 22:37:29 +0100
Organization: Aioe.org NNTP Server
Message-ID: <t1lcmp$1ef7$1@gioia.aioe.org>
References: <t1c154$j5t$1@dont-email.me>
<81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de>
<903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de>
<526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de>
<5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de>
<b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de>
<4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com>
<t1kui4$l4a$1@gioia.aioe.org>
<35f5c160-b3f8-4d9f-9d38-57b3010705cdn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="47591"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Fri, 25 Mar 2022 21:37 UTC

MitchAlsup wrote:
> On Friday, March 25, 2022 at 12:36:10 PM UTC-5, Terje Mathisen wrote:
>> Michael S wrote:
>>> But one thing is sure - [on Intel] my quad-precision sqrt() is close
>>> to 15 times faster than the one supplied with GCC Quad-Precision Math
>>> Library.
>> That seems almost impossible, how is it even possible to generate
>> something that slow?
>>
>> Assuming a naive/classic SQRT NR iteration with a division might explain
>> it, but it seems like a stretch even so.
> <
> Convert quad to double:: dozen cycles :: zero cycles in HW
> SQRT(double() :: 2 dozen cycles .......... :: 20-24 cycles in HW
> Convert double to quad:: ½ dozen cycles :: zero cycles in HW
> 2 Newton-Raphson iterations (4 quad FMAC delays) :: ½ in HW N-R*
> <
> (*) in HW one can perform Newton-Raphson iteration without having to
> round/normalize except for the final result delivery.
> <
>>
>> If your version is approximate (0.5x ulp but not exactly rounded for all
>> inputs?), then that might explain a factor of 2, but not 15!
> <
> I only see 4×-5× slower.

I agree.

What about a classic bit/iteration (or CORDIC?), that could eeasily take
700-1000 cycles?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<1c0e4f2e-9ecf-4467-a8cb-87613af14e65n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24480&group=comp.arch#24480

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:622a:18a4:b0:2e1:e7a5:98ba with SMTP id v36-20020a05622a18a400b002e1e7a598bamr11766340qtc.424.1648247529152;
Fri, 25 Mar 2022 15:32:09 -0700 (PDT)
X-Received: by 2002:a9d:644b:0:b0:5cd:a627:c439 with SMTP id
m11-20020a9d644b000000b005cda627c439mr5479018otl.112.1648247528916; Fri, 25
Mar 2022 15:32:08 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 25 Mar 2022 15:32:08 -0700 (PDT)
In-Reply-To: <t1lcmp$1ef7$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0b7:ad45:609a:53bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0b7:ad45:609a:53bb
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1kui4$l4a$1@gioia.aioe.org>
<35f5c160-b3f8-4d9f-9d38-57b3010705cdn@googlegroups.com> <t1lcmp$1ef7$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1c0e4f2e-9ecf-4467-a8cb-87613af14e65n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 25 Mar 2022 22:32:09 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 55

by: MitchAlsup - Fri, 25 Mar 2022 22:32 UTC

On Friday, March 25, 2022 at 4:37:32 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, March 25, 2022 at 12:36:10 PM UTC-5, Terje Mathisen wrote:
> >> Michael S wrote:
> >>> But one thing is sure - [on Intel] my quad-precision sqrt() is close
> >>> to 15 times faster than the one supplied with GCC Quad-Precision Math
> >>> Library.
> >> That seems almost impossible, how is it even possible to generate
> >> something that slow?
> >>
> >> Assuming a naive/classic SQRT NR iteration with a division might explain
> >> it, but it seems like a stretch even so.
> > <
> > Convert quad to double:: dozen cycles :: zero cycles in HW
> > SQRT(double() :: 2 dozen cycles .......... :: 20-24 cycles in HW
> > Convert double to quad:: ½ dozen cycles :: zero cycles in HW
> > 2 Newton-Raphson iterations (4 quad FMAC delays) :: ½ in HW N-R*
> > <
> > (*) in HW one can perform Newton-Raphson iteration without having to
> > round/normalize except for the final result delivery.
> > <
> >>
> >> If your version is approximate (0.5x ulp but not exactly rounded for all
> >> inputs?), then that might explain a factor of 2, but not 15!
> > <
> > I only see 4×-5× slower.
> I agree.
>
> What about a classic bit/iteration (or CORDIC?), that could eeasily take
> 700-1000 cycles?
<
Never more than bit-width(fraction)×2+5 (RCP, RSQRT)
<
The CORDIC you remember took 250-ish cycles to do (effecitvely) Payne-
Hanek argument reduction, followed by 2×fraction CORDIC--giving 300-400
SIN, COS,... Give an argument of 2.7E+709 to COS and you can easily take
700-1400 cycles in argument reduction, followed by a rather brisk 100 cycles
for the actual transcendental calculation. That is basically why we don't do
CORDIC anymore.
<
> Terje
>
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<t1mnc6$8lb$2@newsreader4.netcologne.de>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24484&group=comp.arch#24484

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sat, 26 Mar 2022 09:45:42 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t1mnc6$8lb$2@newsreader4.netcologne.de>
References: <t1c154$j5t$1@dont-email.me>
<81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de>
<903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de>
<526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de>
<5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de>
<b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de>
<4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com>
Injection-Date: Sat, 26 Mar 2022 09:45:42 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:30bd:0:7285:c2ff:fe6c:992d";
logging-data="8875"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sat, 26 Mar 2022 09:45 UTC

Michael S <already5chosen@yahoo.com> schrieb:

> For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and run it on all tree systems.
> Core i5-3450: 3.75962 seconds
> Xeon E3-1271 v3: 2.96402 seconds
> Xeon E-2176G: 2.70312 seconds
>
> So, it seems, either your Zen1 CPU runs at much higher frequency (over 4.5 GHz?) or your version of quadmath library
> is much better than the one, supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10) or the library
> likes AMD CPUs and hates Intel's.

Or something is going away in the calling sequence of the quadmath
library.

If you have Linux installed, a timing comparison would be of interest.

> Another, not very probable possibility is that your cpu_time() call is lying, but that's easily verifiable with running
> binary under 'time' utility.
> And yet another possibility is that your compiler managed to parallelize a line 'a = sqrt(a)'. Crazy suggestion, I know.
>
> But one thing is sure - [on Intel] my quad-precision sqrt()
> is close to 15 times faster than the one supplied with GCC
> Quad-Precision Math Library.

Have you considered submitting a patch?

Re: Approximate reciprocals

<t1mnh0$14p3$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24485&group=comp.arch#24485

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sat, 26 Mar 2022 10:48:20 +0100
Organization: Aioe.org NNTP Server
Message-ID: <t1mnh0$14p3$1@gioia.aioe.org>
References: <t1c154$j5t$1@dont-email.me>
<81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de>
<903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de>
<526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de>
<5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de>
<b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de>
<4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com>
<t1kui4$l4a$1@gioia.aioe.org>
<35f5c160-b3f8-4d9f-9d38-57b3010705cdn@googlegroups.com>
<t1lcmp$1ef7$1@gioia.aioe.org>
<1c0e4f2e-9ecf-4467-a8cb-87613af14e65n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="37667"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Sat, 26 Mar 2022 09:48 UTC

MitchAlsup wrote:
> On Friday, March 25, 2022 at 4:37:32 PM UTC-5, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> On Friday, March 25, 2022 at 12:36:10 PM UTC-5, Terje Mathisen wrote:
>>>> Michael S wrote:
>>>>> But one thing is sure - [on Intel] my quad-precision sqrt() is close
>>>>> to 15 times faster than the one supplied with GCC Quad-Precision Math
>>>>> Library.
>>>> That seems almost impossible, how is it even possible to generate
>>>> something that slow?
>>>>
>>>> Assuming a naive/classic SQRT NR iteration with a division might explain
>>>> it, but it seems like a stretch even so.
>>> <
>>> Convert quad to double:: dozen cycles :: zero cycles in HW
>>> SQRT(double() :: 2 dozen cycles .......... :: 20-24 cycles in HW
>>> Convert double to quad:: ½ dozen cycles :: zero cycles in HW
>>> 2 Newton-Raphson iterations (4 quad FMAC delays) :: ½ in HW N-R*
>>> <
>>> (*) in HW one can perform Newton-Raphson iteration without having to
>>> round/normalize except for the final result delivery.
>>> <
>>>>
>>>> If your version is approximate (0.5x ulp but not exactly rounded for all
>>>> inputs?), then that might explain a factor of 2, but not 15!
>>> <
>>> I only see 4×-5× slower.
>> I agree.
>>
>> What about a classic bit/iteration (or CORDIC?), that could eeasily take
>> 700-1000 cycles?
> <
> Never more than bit-width(fraction)×2+5 (RCP, RSQRT)
> <
> The CORDIC you remember took 250-ish cycles to do (effecitvely) Payne-
> Hanek argument reduction, followed by 2×fraction CORDIC--giving 300-400
> SIN, COS,... Give an argument of 2.7E+709 to COS and you can easily take
> 700-1400 cycles in argument reduction, followed by a rather brisk 100 cycles
> for the actual transcendental calculation. That is basically why we don't do
> CORDIC anymore.

This was purely in a QSQRT() context, i.e. no need for any argument
reduction, just grab the bottom exponent bit and the mantissa, pick a
starting estimate from the top bits and then start churning away,
conditionally adding bits as we go along while maintaining the current
sqrt estimate and its square, and the residue?

Going from x = y*y to x' = (y+1)*(y+1) = x + 2y + 1 which is very easy
in fixed-point, and when we end up we have done one extra (guard) bit,
at which point the residue is the sticky term, making any rounding mode
easy.

I haven't tried to actually write this algorithm, but it seems like it
could be done with LEA (to generate the 2y+1 term), a subtract and a
CMOV to pick either the new smaller residue (adding a '1' bit at the
end) or skipping it. Probably smart to do so branchless since the bits
will be unpredictable in the general case.

If we start with the HW DSQRT, do a little bit of setup work and then
run the algorithm for the next 60 iterations then we need to work with
256-bit variables for the residue and the squared bottom bit, so each
add/sub/cmp will take ~4 cycles, then the CMOVs add 3-4 more, but
definitely doable in 10 cycles/iteration which would land us at slightly
below my 700 cycle estimate.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<77c22d62-30bb-4822-9587-9208039ff3den@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24486&group=comp.arch#24486

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:622a:196:b0:2e0:705c:35b2 with SMTP id s22-20020a05622a019600b002e0705c35b2mr14667532qtw.567.1648313996957;
Sat, 26 Mar 2022 09:59:56 -0700 (PDT)
X-Received: by 2002:a9d:6c58:0:b0:5cd:8ccb:c128 with SMTP id
g24-20020a9d6c58000000b005cd8ccbc128mr6879473otq.254.1648313996704; Sat, 26
Mar 2022 09:59:56 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 09:59:56 -0700 (PDT)
In-Reply-To: <t1kui4$l4a$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:7408:496f:7430:392a;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:7408:496f:7430:392a
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1kui4$l4a$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <77c22d62-30bb-4822-9587-9208039ff3den@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 26 Mar 2022 16:59:56 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 27

by: Michael S - Sat, 26 Mar 2022 16:59 UTC

On Friday, March 25, 2022 at 8:36:10 PM UTC+3, Terje Mathisen wrote:
> Michael S wrote:
> > But one thing is sure - [on Intel] my quad-precision sqrt() is close
> > to 15 times faster than the one supplied with GCC Quad-Precision Math
> > Library.
> That seems almost impossible, how is it even possible to generate
> something that slow?
>
> Assuming a naive/classic SQRT NR iteration with a division might explain
> it, but it seems like a stretch even so.
>
> If your version is approximate (0.5x ulp but not exactly rounded for all
> inputs?), then that might explain a factor of 2, but not 15!

As far as I remember, my version is exact.
Indeed, it was 5 years ago, but looking at the code it seems that I calculate
a low estimate of result with few bits of extra precision then if estimate
is slightly lower than mid-point between two representable numbers then
and only then I calculate a square of the mid-point which tells me where
to round. So, for a common case it costs me only one or two cycles and
for the "bad" case it costs branch mispredict plus, may be 10-15 cycles.

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24487&group=comp.arch#24487

copy link Newsgroups: comp.arch

X-Received: by 2002:a0c:f70d:0:b0:441:4558:b70c with SMTP id w13-20020a0cf70d000000b004414558b70cmr14437961qvn.82.1648315379578;
Sat, 26 Mar 2022 10:22:59 -0700 (PDT)
X-Received: by 2002:aca:905:0:b0:2ee:f62a:e08e with SMTP id
5-20020aca0905000000b002eef62ae08emr8061997oij.54.1648315379349; Sat, 26 Mar
2022 10:22:59 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 10:22:59 -0700 (PDT)
In-Reply-To: <t1mnc6$8lb$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:7408:496f:7430:392a;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:7408:496f:7430:392a
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1mnc6$8lb$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 26 Mar 2022 17:22:59 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 38

by: Michael S - Sat, 26 Mar 2022 17:22 UTC

On Saturday, March 26, 2022 at 12:45:45 PM UTC+3, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and run it on all tree systems.
> > Core i5-3450: 3.75962 seconds
> > Xeon E3-1271 v3: 2.96402 seconds
> > Xeon E-2176G: 2.70312 seconds
> >
> > So, it seems, either your Zen1 CPU runs at much higher frequency (over 4.5 GHz?) or your version of quadmath library
> > is much better than the one, supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10) or the library
> > likes AMD CPUs and hates Intel's.
> Or something is going away in the calling sequence of the quadmath
> library.
>
> If you have Linux installed, a timing comparison would be of interest.

No, not at home.
Even at work, I am not sure that I have Linux with Fortran compiler installed.

Could you run you program under 'time' ?
At very least it would prove that no autopar involved.

> > Another, not very probable possibility is that your cpu_time() call is lying, but that's easily verifiable with running
> > binary under 'time' utility.
> > And yet another possibility is that your compiler managed to parallelize a line 'a = sqrt(a)'. Crazy suggestion, I know.
> >
> > But one thing is sure - [on Intel] my quad-precision sqrt()
> > is close to 15 times faster than the one supplied with GCC
> > Quad-Precision Math Library.
> Have you considered submitting a patch?

My quad-precision format is different from IEEE binary128==GNU __float128.
So, it would take some work.
And since I am not a great fun of GPL, I'd rather prefer my own work to be given away under
more anarchistic license.

But I don't mind if you or your friends take my source and adapte it to quadmath.
Can put a needed license statement in my public github repository if somebody tells
me a relevant lawyerspeak.

Re: Approximate reciprocals

<t1nome$2jv$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24488&group=comp.arch#24488

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sat, 26 Mar 2022 20:14:21 +0100
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <t1nome$2jv$1@dont-email.me>
References: <t1c154$j5t$1@dont-email.me> <t1cin7$hpc$1@gioia.aioe.org>
<t1cprk$o4b$1@dont-email.me> <t1ep43$104g$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 26 Mar 2022 19:14:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="900baf0792658c563ab04f26dacad5dc";
logging-data="2687"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/6/izcbrAITxT4HdBIASwItpWh7nkvyOs="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Cancel-Lock: sha1:q2rKdEZmhMSVK5Cb6bv2+Jc6cts=
In-Reply-To: <t1ep43$104g$1@gioia.aioe.org>
Content-Language: en-US

by: Marcus - Sat, 26 Mar 2022 19:14 UTC

On 2022-03-23, Terje Mathisen wrote:
> Marcus wrote:
>> On 2022-03-22, Terje Mathisen wrote:
>>> Marcus wrote:
>>>> What are your feelings towards including such instructions in an ISA?
>>>
>>> Stupid not to do it?
>>>
>>>>
>>>> My own feelings are mixed.
>>>>
>>>> Pros:
>>>> * Easy to implement in hardware.
>>>> * Can provide a significant speedup for certain workloads, especially
>>>> Â Â if limited accuracy is acceptable.
>>>
>>> Even when you need exact results, having that reciprocal starting
>>> point means that you will converge on that value significantly faster.
>>>
>>> I.e. starting with a ~12-bit approximation gives you float with
>>> fractional ulp accuracy after one NR stage and exact float with one
>>> more iteration, still significantly faster than an SRT divider/sqrt
>>> circuit.
>>
>> Follow-up question 1: What approximation accuracy should one aim for?
>> I've seen 8 bits (yielding SP precision in 2 NR steps), and you mention
>> 12 bits. With 13 bits, would you not get full SP precision in 1 NR step?
>> CRAY gave a full 30 bits in the first approximation (!).
>>
>> Follow-up question 2: Implementation-wise, is it better to do a raw
>> nearest-neighbor ROM lookup, or more sophisticated ROM + linear
>> interpolation? It feels like the latter would give you more bang for
>> the buck, possibly at the cost of added latency.
>
> If you aim for single-cycle SIMD lookup, then I really can't see any
> room for any form of interpolation, while accessing the same ROM table 4
> times is doable, right?

I was initially thinking about having multiple LUTs (one per
FU/"lane"), but my current design is probably simpler since it only
supports the following levels of parallelism:

- 1 x 32 bit FP (scalar), 23 fractional bits
- 2 x 16 bit FP (SIMD), 10 fractional bits
- 4 x 8 bit FP (SIMD), 3 fractional bits

So, I'd probably end up using something like 2 8-bit LUTs and 2 3-bit
LUTs, or something similar. Alternatively, if better accuracy is
required for the initial estimation, 1 12-bit LUT, 1 10-bit LUT and
2 3-bit LUTs. For a wider implementation these figures would have to
increase. But I suspect that it's highly dependent on the implementation
technology. E.g. FPGA vs ASIC would differ I suppose.

The interpolation I was thinking about would be something like a 4-bit
multiplier + an integer adder, just to buy a few extra bits without
having to expand the LUTs too much. The slope factor would be stored as
a few extra bits in the LUT.

>
> A 12 (or 13?) bit lookup giving 12-bit results, along with the required
> exp trickery looks like about 6-7 Kbits of table space.
>
> Terje
>

Re: Approximate reciprocals

<2c217414-0867-4345-b412-5e32ef99a3f5n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24489&group=comp.arch#24489

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:6214:508a:b0:440:f824:3d55 with SMTP id kk10-20020a056214508a00b00440f8243d55mr14399257qvb.26.1648323786042;
Sat, 26 Mar 2022 12:43:06 -0700 (PDT)
X-Received: by 2002:a05:6830:40c8:b0:5cb:557a:bba6 with SMTP id
h8-20020a05683040c800b005cb557abba6mr6808461otu.350.1648323785808; Sat, 26
Mar 2022 12:43:05 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 12:43:05 -0700 (PDT)
In-Reply-To: <t1ep43$104g$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:d1e8:54f0:b949:b2f6;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:d1e8:54f0:b949:b2f6
References: <t1c154$j5t$1@dont-email.me> <t1cin7$hpc$1@gioia.aioe.org>
<t1cprk$o4b$1@dont-email.me> <t1ep43$104g$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2c217414-0867-4345-b412-5e32ef99a3f5n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 26 Mar 2022 19:43:06 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 20

by: MitchAlsup - Sat, 26 Mar 2022 19:43 UTC

On Wednesday, March 23, 2022 at 4:26:30 AM UTC-5, Terje Mathisen wrote:
> If you aim for single-cycle SIMD lookup, then I really can't see any
> room for any form of interpolation, while accessing the same ROM table 4
> times is doable, right?
>
> A 12 (or 13?) bit lookup giving 12-bit results, along with the required
> exp trickery looks like about 6-7 Kbits of table space.
<
This table is too big for single cycle lookup except for custom designs.
And besides, 12-bit index to 12-bit data followed by a multiply (of some
sort) only gives you 9.5-bits of accurate result.
<
The other thing to note is to get N-R or Goldschmidt to converge with negative
error under the desired result, wastes 1.5 bits of output from the table. This
makes rounding correctly possible.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<t1nqv2$2b2$1@newsreader4.netcologne.de>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24490&group=comp.arch#24490

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sat, 26 Mar 2022 19:53:06 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t1nqv2$2b2$1@newsreader4.netcologne.de>
References: <t1c154$j5t$1@dont-email.me>
<81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de>
<903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de>
<526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de>
<5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de>
<b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de>
<4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com>
<t1mnc6$8lb$2@newsreader4.netcologne.de>
<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com>
Injection-Date: Sat, 26 Mar 2022 19:53:06 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:30bd:0:7285:c2ff:fe6c:992d";
logging-data="2402"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sat, 26 Mar 2022 19:53 UTC

Michael S <already5chosen@yahoo.com> schrieb:
> On Saturday, March 26, 2022 at 12:45:45 PM UTC+3, Thomas Koenig wrote:
>> Michael S <already...@yahoo.com> schrieb:
>> > For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and run it on all tree systems.
>> > Core i5-3450: 3.75962 seconds
>> > Xeon E3-1271 v3: 2.96402 seconds
>> > Xeon E-2176G: 2.70312 seconds
>> >
>> > So, it seems, either your Zen1 CPU runs at much higher frequency (over 4.5 GHz?) or your version of quadmath library
>> > is much better than the one, supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10) or the library
>> > likes AMD CPUs and hates Intel's.
>> Or something is going away in the calling sequence of the quadmath
>> library.
>>
>> If you have Linux installed, a timing comparison would be of interest.
>
> No, not at home.
> Even at work, I am not sure that I have Linux with Fortran compiler installed.
>
> Could you run you program under 'time' ?
> At very least it would prove that no autopar involved.

As a maintainer of gfortran, I can assure you that there is no
autoparallelization :-).

However, here it is:

$ cat a.f90
program main
implicit none
integer, parameter :: qp = selected_real_kind(30)
integer, parameter :: n = 10**6, m = 10
real :: t1, t2
character(len=20) :: c
integer :: i,k
real(kind=qp), dimension(:), allocatable :: a
allocate (a(n))
c = '10'
call random_number(a)
a = a * 1e50_qp
call cpu_time(t1)
do i=1,m
a = sqrt(a)
read (unit=c,fmt=*) k
write (*,*) a(k)
end do
call cpu_time(t2)
write (*,'(A,F12.5,A,ES15.5,A)') "Used ",t2-t1, " seconds to calculate ", &
real(n*m,qp), " square roots."
end program main
$ gfortran -O3 a.f90
$ time ./a.out
7308311667482999228141838.93337335438
2703388922719.59256619755162850493042
1644198.56547790254000409031824041555
1282.26306406988988805467367099802872
35.8087009547943513227570444215179866
5.98403717859392575839410139231541218
2.44622917540322165097869738482238220
1.56404257467730770560212117881721192
1.25061687765570623839109221766572087
1.11830983079632553071493517444861311
Used 2.35760 seconds to calculate 1.00000E+07 square roots.

real 0m2,475s
user 0m2,445s
sys 0m0,008s
$ head /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1700X Eight-Core Processor
stepping : 1
microcode : 0x8001129
cpu MHz : 2056.406
cache size : 512 KB
physical id : 0

>
>> > Another, not very probable possibility is that your cpu_time() call is lying, but that's easily verifiable with running
>> > binary under 'time' utility.
>> > And yet another possibility is that your compiler managed to parallelize a line 'a = sqrt(a)'. Crazy suggestion, I know.
>> >
>> > But one thing is sure - [on Intel] my quad-precision sqrt()
>> > is close to 15 times faster than the one supplied with GCC
>> > Quad-Precision Math Library.
>> Have you considered submitting a patch?
>
> My quad-precision format is different from IEEE binary128==GNU __float128.
> So, it would take some work.
> And since I am not a great fun of GPL, I'd rather prefer my own work to be given away under
> more anarchistic license.

Unfortunately, this is not what I (usually) do - I am mainly
concerned with the gfortran front end.

> But I don't mind if you or your friends take my source and adapte it to quadmath.
> Can put a needed license statement in my public github repository if somebody tells
> me a relevant lawyerspeak.

I can at least submit a PR. If there is a description of the
algorithm in the source, maybe somebody can take this up and run
with it.

Re: Approximate reciprocals

<c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24491&group=comp.arch#24491

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:6407:0:b0:67e:4423:7127 with SMTP id y7-20020a376407000000b0067e44237127mr11022579qkb.526.1648326643407;
Sat, 26 Mar 2022 13:30:43 -0700 (PDT)
X-Received: by 2002:aca:bb56:0:b0:2ef:6652:5581 with SMTP id
l83-20020acabb56000000b002ef66525581mr8933361oif.270.1648326643127; Sat, 26
Mar 2022 13:30:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 13:30:42 -0700 (PDT)
In-Reply-To: <t1nqv2$2b2$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:7408:496f:7430:392a;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:7408:496f:7430:392a
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1mnc6$8lb$2@newsreader4.netcologne.de>
<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com> <t1nqv2$2b2$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 26 Mar 2022 20:30:43 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 113

by: Michael S - Sat, 26 Mar 2022 20:30 UTC

On Saturday, March 26, 2022 at 10:53:09 PM UTC+3, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > On Saturday, March 26, 2022 at 12:45:45 PM UTC+3, Thomas Koenig wrote:
> >> Michael S <already...@yahoo.com> schrieb:
> >> > For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and run it on all tree systems.
> >> > Core i5-3450: 3.75962 seconds
> >> > Xeon E3-1271 v3: 2.96402 seconds
> >> > Xeon E-2176G: 2.70312 seconds
> >> >
> >> > So, it seems, either your Zen1 CPU runs at much higher frequency (over 4.5 GHz?) or your version of quadmath library
> >> > is much better than the one, supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10) or the library
> >> > likes AMD CPUs and hates Intel's.
> >> Or something is going away in the calling sequence of the quadmath
> >> library.
> >>
> >> If you have Linux installed, a timing comparison would be of interest.
> >
> > No, not at home.
> > Even at work, I am not sure that I have Linux with Fortran compiler installed.
> >
> > Could you run you program under 'time' ?
> > At very least it would prove that no autopar involved.
> As a maintainer of gfortran, I can assure you that there is no
> autoparallelization :-).
>
> However, here it is:
>
> $ cat a.f90
> program main
> implicit none
> integer, parameter :: qp = selected_real_kind(30)
> integer, parameter :: n = 10**6, m = 10
> real :: t1, t2
> character(len=20) :: c
> integer :: i,k
> real(kind=qp), dimension(:), allocatable :: a
> allocate (a(n))
> c = '10'
> call random_number(a)
> a = a * 1e50_qp
> call cpu_time(t1)
> do i=1,m
> a = sqrt(a)
> read (unit=c,fmt=*) k
> write (*,*) a(k)
> end do
> call cpu_time(t2)
> write (*,'(A,F12.5,A,ES15.5,A)') "Used ",t2-t1, " seconds to calculate ", &
> real(n*m,qp), " square roots."
> end program main
> $ gfortran -O3 a.f90
> $ time ./a.out
> 7308311667482999228141838.93337335438
> 2703388922719.59256619755162850493042
> 1644198.56547790254000409031824041555
> 1282.26306406988988805467367099802872
> 35.8087009547943513227570444215179866
> 5.98403717859392575839410139231541218
> 2.44622917540322165097869738482238220
> 1.56404257467730770560212117881721192
> 1.25061687765570623839109221766572087
> 1.11830983079632553071493517444861311
> Used 2.35760 seconds to calculate 1.00000E+07 square roots.
>
> real 0m2,475s
> user 0m2,445s
> sys 0m0,008s
> $ head /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 23
> model : 1
> model name : AMD Ryzen 7 1700X Eight-Core Processor
> stepping : 1
> microcode : 0x8001129
> cpu MHz : 2056.406
> cache size : 512 KB
> physical id : 0

1700X is listed as base frequency=3400MHz, Turbo=3800MHz and "extra turbo"=3900MHz.
Even at 3900MHz I'd expect that it wouldn't run the variant of sqrtq(), supplied with MSYS2
faster than ~3.0 sec.
So, it has to be better version, somehow.
Unless the library does something that heavily favors AMD. I can't even think what it could be.
May be, when back at work, I'd test on some AMD CPU, but it seems that all AMDs at work are
of Zen3 variety.

Could you test on Intel?

> >
> >> > Another, not very probable possibility is that your cpu_time() call is lying, but that's easily verifiable with running
> >> > binary under 'time' utility.
> >> > And yet another possibility is that your compiler managed to parallelize a line 'a = sqrt(a)'. Crazy suggestion, I know.
> >> >
> >> > But one thing is sure - [on Intel] my quad-precision sqrt()
> >> > is close to 15 times faster than the one supplied with GCC
> >> > Quad-Precision Math Library.
> >> Have you considered submitting a patch?
> >
> > My quad-precision format is different from IEEE binary128==GNU __float128.
> > So, it would take some work.
> > And since I am not a great fun of GPL, I'd rather prefer my own work to be given away under
> > more anarchistic license.
> Unfortunately, this is not what I (usually) do - I am mainly
> concerned with the gfortran front end.
> > But I don't mind if you or your friends take my source and adapte it to quadmath.
> > Can put a needed license statement in my public github repository if somebody tells
> > me a relevant lawyerspeak.
> I can at least submit a PR. If there is a description of the
> algorithm in the source, maybe somebody can take this up and run
> with it.

Description (comments) is minimal, unfortunately.
But code should be easily understood by person, skilled in the field.

Re: Approximate reciprocals

<285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24493&group=comp.arch#24493

copy link Newsgroups: comp.arch

X-Received: by 2002:ad4:5fcf:0:b0:441:5f3:65ab with SMTP id jq15-20020ad45fcf000000b0044105f365abmr14854339qvb.87.1648332882613;
Sat, 26 Mar 2022 15:14:42 -0700 (PDT)
X-Received: by 2002:a05:6830:22ea:b0:5b2:35c1:de3c with SMTP id
t10-20020a05683022ea00b005b235c1de3cmr7251553otc.282.1648332882396; Sat, 26
Mar 2022 15:14:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 15:14:42 -0700 (PDT)
In-Reply-To: <c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:7408:496f:7430:392a;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:7408:496f:7430:392a
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1mnc6$8lb$2@newsreader4.netcologne.de>
<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com> <t1nqv2$2b2$1@newsreader4.netcologne.de>
<c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 26 Mar 2022 22:14:42 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 93

by: Michael S - Sat, 26 Mar 2022 22:14 UTC

On Saturday, March 26, 2022 at 11:30:44 PM UTC+3, Michael S wrote:
> On Saturday, March 26, 2022 at 10:53:09 PM UTC+3, Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> schrieb:
> > > On Saturday, March 26, 2022 at 12:45:45 PM UTC+3, Thomas Koenig wrote:
> > >> Michael S <already...@yahoo.com> schrieb:
> > >> > For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and run it on all tree systems.
> > >> > Core i5-3450: 3.75962 seconds
> > >> > Xeon E3-1271 v3: 2.96402 seconds
> > >> > Xeon E-2176G: 2.70312 seconds
> > >> >
> > >> > So, it seems, either your Zen1 CPU runs at much higher frequency (over 4.5 GHz?) or your version of quadmath library
> > >> > is much better than the one, supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10) or the library
> > >> > likes AMD CPUs and hates Intel's.
> > >> Or something is going away in the calling sequence of the quadmath
> > >> library.
> > >>
> > >> If you have Linux installed, a timing comparison would be of interest.
> > >
> > > No, not at home.
> > > Even at work, I am not sure that I have Linux with Fortran compiler installed.
> > >
> > > Could you run you program under 'time' ?
> > > At very least it would prove that no autopar involved.
> > As a maintainer of gfortran, I can assure you that there is no
> > autoparallelization :-).
> >
> > However, here it is:
> >
> > $ cat a.f90
> > program main
> > implicit none
> > integer, parameter :: qp = selected_real_kind(30)
> > integer, parameter :: n = 10**6, m = 10
> > real :: t1, t2
> > character(len=20) :: c
> > integer :: i,k
> > real(kind=qp), dimension(:), allocatable :: a
> > allocate (a(n))
> > c = '10'
> > call random_number(a)
> > a = a * 1e50_qp
> > call cpu_time(t1)
> > do i=1,m
> > a = sqrt(a)
> > read (unit=c,fmt=*) k
> > write (*,*) a(k)
> > end do
> > call cpu_time(t2)
> > write (*,'(A,F12.5,A,ES15.5,A)') "Used ",t2-t1, " seconds to calculate ", &
> > real(n*m,qp), " square roots."
> > end program main
> > $ gfortran -O3 a.f90
> > $ time ./a.out
> > 7308311667482999228141838.93337335438
> > 2703388922719.59256619755162850493042
> > 1644198.56547790254000409031824041555
> > 1282.26306406988988805467367099802872
> > 35.8087009547943513227570444215179866
> > 5.98403717859392575839410139231541218
> > 2.44622917540322165097869738482238220
> > 1.56404257467730770560212117881721192
> > 1.25061687765570623839109221766572087
> > 1.11830983079632553071493517444861311
> > Used 2.35760 seconds to calculate 1.00000E+07 square roots.
> >
> > real 0m2,475s
> > user 0m2,445s
> > sys 0m0,008s
> > $ head /proc/cpuinfo
> > processor : 0
> > vendor_id : AuthenticAMD
> > cpu family : 23
> > model : 1
> > model name : AMD Ryzen 7 1700X Eight-Core Processor
> > stepping : 1
> > microcode : 0x8001129
> > cpu MHz : 2056.406
> > cache size : 512 KB
> > physical id : 0
> 1700X is listed as base frequency=3400MHz, Turbo=3800MHz and "extra turbo"=3900MHz.
> Even at 3900MHz I'd expect that it wouldn't run the variant of sqrtq(), supplied with MSYS2
> faster than ~3.0 sec.
> So, it has to be better version, somehow.
> Unless the library does something that heavily favors AMD. I can't even think what it could be.
> May be, when back at work, I'd test on some AMD CPU, but it seems that all AMDs at work are
> of Zen3 variety.
>
> Could you test on Intel?

On my system(s) I see another very strange timing effect:
The speed of sqrtq() strongly depends on the range of the input and does it in unexpected manner:
it is much faster (1.5x to 2x) when input is *outside* the range of double precision!

Re: Approximate reciprocals

<600fb4d3-5e5f-490f-b7b6-4301a93f656en@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24494&group=comp.arch#24494

copy link Newsgroups: comp.arch

X-Received: by 2002:a37:a558:0:b0:67b:1141:4754 with SMTP id o85-20020a37a558000000b0067b11414754mr11371102qke.328.1648333173509;
Sat, 26 Mar 2022 15:19:33 -0700 (PDT)
X-Received: by 2002:a05:6808:1b11:b0:2da:73df:2dbd with SMTP id
bx17-20020a0568081b1100b002da73df2dbdmr13100302oib.293.1648333173327; Sat, 26
Mar 2022 15:19:33 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 15:19:33 -0700 (PDT)
In-Reply-To: <285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2c99:fda9:eae1:7421;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2c99:fda9:eae1:7421
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1mnc6$8lb$2@newsreader4.netcologne.de>
<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com> <t1nqv2$2b2$1@newsreader4.netcologne.de>
<c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com> <285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <600fb4d3-5e5f-490f-b7b6-4301a93f656en@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 26 Mar 2022 22:19:33 +0000
Content-Type: text/plain; charset="UTF-8"

by: MitchAlsup - Sat, 26 Mar 2022 22:19 UTC

On Saturday, March 26, 2022 at 5:14:44 PM UTC-5, Michael S wrote:
> On Saturday, March 26, 2022 at 11:30:44 PM UTC+3, Michael S wrote:
> > On Saturday, March 26, 2022 at 10:53:09 PM UTC+3, Thomas Koenig wrote:
> > > Michael S <already...@yahoo.com> schrieb:
> > > > On Saturday, March 26, 2022 at 12:45:45 PM UTC+3, Thomas Koenig wrote:
> > > >> Michael S <already...@yahoo.com> schrieb:
> > > >> > For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and run it on all tree systems.
> > > >> > Core i5-3450: 3.75962 seconds
> > > >> > Xeon E3-1271 v3: 2.96402 seconds
> > > >> > Xeon E-2176G: 2.70312 seconds
> > > >> >
> > > >> > So, it seems, either your Zen1 CPU runs at much higher frequency (over 4.5 GHz?) or your version of quadmath library
> > > >> > is much better than the one, supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10) or the library
> > > >> > likes AMD CPUs and hates Intel's.
> > > >> Or something is going away in the calling sequence of the quadmath
> > > >> library.
> > > >>
> > > >> If you have Linux installed, a timing comparison would be of interest.
> > > >
> > > > No, not at home.
> > > > Even at work, I am not sure that I have Linux with Fortran compiler installed.
> > > >
> > > > Could you run you program under 'time' ?
> > > > At very least it would prove that no autopar involved.
> > > As a maintainer of gfortran, I can assure you that there is no
> > > autoparallelization :-).
> > >
> > > However, here it is:
> > >
> > > $ cat a.f90
> > > program main
> > > implicit none
> > > integer, parameter :: qp = selected_real_kind(30)
> > > integer, parameter :: n = 10**6, m = 10
> > > real :: t1, t2
> > > character(len=20) :: c
> > > integer :: i,k
> > > real(kind=qp), dimension(:), allocatable :: a
> > > allocate (a(n))
> > > c = '10'
> > > call random_number(a)
> > > a = a * 1e50_qp
> > > call cpu_time(t1)
> > > do i=1,m
> > > a = sqrt(a)
> > > read (unit=c,fmt=*) k
> > > write (*,*) a(k)
> > > end do
> > > call cpu_time(t2)
> > > write (*,'(A,F12.5,A,ES15.5,A)') "Used ",t2-t1, " seconds to calculate ", &
> > > real(n*m,qp), " square roots."
> > > end program main
> > > $ gfortran -O3 a.f90
> > > $ time ./a.out
> > > 7308311667482999228141838.93337335438
> > > 2703388922719.59256619755162850493042
> > > 1644198.56547790254000409031824041555
> > > 1282.26306406988988805467367099802872
> > > 35.8087009547943513227570444215179866
> > > 5.98403717859392575839410139231541218
> > > 2.44622917540322165097869738482238220
> > > 1.56404257467730770560212117881721192
> > > 1.25061687765570623839109221766572087
> > > 1.11830983079632553071493517444861311
> > > Used 2.35760 seconds to calculate 1.00000E+07 square roots.
> > >
> > > real 0m2,475s
> > > user 0m2,445s
> > > sys 0m0,008s
> > > $ head /proc/cpuinfo
> > > processor : 0
> > > vendor_id : AuthenticAMD
> > > cpu family : 23
> > > model : 1
> > > model name : AMD Ryzen 7 1700X Eight-Core Processor
> > > stepping : 1
> > > microcode : 0x8001129
> > > cpu MHz : 2056.406
> > > cache size : 512 KB
> > > physical id : 0
> > 1700X is listed as base frequency=3400MHz, Turbo=3800MHz and "extra turbo"=3900MHz.
> > Even at 3900MHz I'd expect that it wouldn't run the variant of sqrtq(), supplied with MSYS2
> > faster than ~3.0 sec.
> > So, it has to be better version, somehow.
> > Unless the library does something that heavily favors AMD. I can't even think what it could be.
> > May be, when back at work, I'd test on some AMD CPU, but it seems that all AMDs at work are
> > of Zen3 variety.
> >
> > Could you test on Intel?
<
> On my system(s) I see another very strange timing effect:
> The speed of sqrtq() strongly depends on the range of the input and does it in unexpected manner:
> it is much faster (1.5x to 2x) when input is *outside* the range of double precision!
<
Can't rely on :: CVT Q->D when Q would overflow D.
<
But the algorithm only needs the LoB of the exponent and 8-12 MsBs of the fraction to get started.
<
First 3 iterations are SP, next iteration is DP, next 2 are QP.

Re: Approximate reciprocals

<t1o3sm$977$1@newsreader4.netcologne.de>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24495&group=comp.arch#24495

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sat, 26 Mar 2022 22:25:26 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t1o3sm$977$1@newsreader4.netcologne.de>
References: <t1c154$j5t$1@dont-email.me>
<81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de>
<903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de>
<526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de>
<5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de>
<b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de>
<4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com>
<t1mnc6$8lb$2@newsreader4.netcologne.de>
<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com>
<t1nqv2$2b2$1@newsreader4.netcologne.de>
<c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com>
<285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com>
Injection-Date: Sat, 26 Mar 2022 22:25:26 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:30bd:0:7285:c2ff:fe6c:992d";
logging-data="9447"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sat, 26 Mar 2022 22:25 UTC

Michael S <already5chosen@yahoo.com> schrieb:

> On my system(s) I see another very strange timing effect:
> The speed of sqrtq() strongly depends on the range of the input and does it in unexpected manner:
> it is much faster (1.5x to 2x) when input is *outside* the range of double precision!

Source for the sqrt routine in libquadmath is here:

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libquadmath/math/sqrtq.c;h=56ea5d3243c06df2276689849b448eebacae6367;hb=refs/heads/master

Hmm.. they first do a test if it is within double precision range,
with two Newton iterations, then a test if it is within long double
range with a single Newton iteration, and if it is outside then
they pick apart the number and run two Newton iterations.

Were the numbers outside the range of double inside the range
of long double? That could explain things (and suggest
an improvement).

Re: Approximate reciprocals

<b48d1398-004f-4c6e-8c27-94c635d23c52n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24496&group=comp.arch#24496

copy link Newsgroups: comp.arch

X-Received: by 2002:ad4:5965:0:b0:440:fee0:bef2 with SMTP id eq5-20020ad45965000000b00440fee0bef2mr15186147qvb.68.1648337553119;
Sat, 26 Mar 2022 16:32:33 -0700 (PDT)
X-Received: by 2002:a05:6808:211f:b0:2da:84f6:9eed with SMTP id
r31-20020a056808211f00b002da84f69eedmr12760924oiw.239.1648337552920; Sat, 26
Mar 2022 16:32:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!1.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 26 Mar 2022 16:32:32 -0700 (PDT)
In-Reply-To: <t1o3sm$977$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.183.72; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.183.72
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com>
<t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com>
<t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com>
<t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com>
<t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com>
<t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com>
<5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1mnc6$8lb$2@newsreader4.netcologne.de>
<ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com> <t1nqv2$2b2$1@newsreader4.netcologne.de>
<c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com> <285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com>
<t1o3sm$977$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b48d1398-004f-4c6e-8c27-94c635d23c52n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 26 Mar 2022 23:32:33 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 25

by: Michael S - Sat, 26 Mar 2022 23:32 UTC

On Sunday, March 27, 2022 at 1:25:29 AM UTC+3, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > On my system(s) I see another very strange timing effect:
> > The speed of sqrtq() strongly depends on the range of the input and does it in unexpected manner:
> > it is much faster (1.5x to 2x) when input is *outside* the range of double precision!
> Source for the sqrt routine in libquadmath is here:
>
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libquadmath/math/sqrtq.c;h=56ea5d3243c06df2276689849b448eebacae6367;hb=refs/heads/master
>
> Hmm.. they first do a test if it is within double precision range,
> with two Newton iterations, then a test if it is within long double
> range with a single Newton iteration, and if it is outside then
> they pick apart the number and run two Newton iterations.
>
> Were the numbers outside the range of double inside the range
> of long double?

Yes.
I didn't test "outside long double".
If "long double" is 80-bit x87 format then "outside" would be mostly sub-normals, right?

> That could explain things (and suggest
> an improvement).

I think, the whole routine needs rewrite, rather than improvements.
The original author probably was not thinking very hard.

Re: Approximate reciprocals

<t1qf0u$oko$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24497&group=comp.arch#24497

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sun, 27 Mar 2022 21:47:42 +0200
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <t1qf0u$oko$1@dont-email.me>
References: <t1c154$j5t$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 27 Mar 2022 19:47:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="af3bc60ffa69880f5f23ca2aa97b1a02";
logging-data="25240"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX188w2U4d0W9zxQoUINi2hfATJVexdEaLdc="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.7.0
Cancel-Lock: sha1:O0edXZ49aWmYJH+srfbcDROaU2U=
In-Reply-To: <t1c154$j5t$1@dont-email.me>
Content-Language: en-US

by: Marcus - Sun, 27 Mar 2022 19:47 UTC

On 2022-03-22, Marcus wrote:
> Hello group!
>
> A class of instructions that is very tempting to include in an ISA is
> approximate floating-point reciprocals (1/x & 1/sqrt(x)), and
> possibly specialized instructions for improving precision
> (a Newton-Raphson step).
>
> Many ISAs have such instructions:
>
> CRAY:
> * 070ijx - Floating reciprocal approximation
> * 067ijk - Reciprocal iteration
>
> ARM:
> * FRECPE - Floating-point reciprocal estimate
> * FRECPS - Floating-point reciprocal step
> * FRSQRTE - Floating-point reciprocal square root estimate
> * FRSQRTS - Floating-point reciprocal square root step
>
> POWER:
> * FRES - Floating reciprocal estimate single
> * FRSQRTE - Floating reciprocal square root estimate
>
> x86:
> * RSQRTSS - Approximate reciprocal square root
>
> TI C67x:
> * RCPSP - Floating-Point reciprocal approximation
> * RSQRSP - Floating-Point reciprocal square root approximation
>
> ...and there are probably others.
>
> What are your feelings towards including such instructions in an ISA?
>
> My own feelings are mixed.
>
> Pros:
> * Easy to implement in hardware.
> * Can provide a significant speedup for certain workloads, especially
> if limited accuracy is acceptable.
>
> Cons:
> * Hard to specify exact operation.
> * Likely a source of poor portability (borderline undefined behavior).
> * The "step/iteration" instructions can usually be replaced by FMA.

As an aside...

Today I found a pretty nice looking 2nd order polynomial for
approximating 1/x in the interval [1.0, 2.0):

(2*x^2 - 9*x + 13) / 6

Maybe this is known stuff, and I'm not sure that it would be useful for
anything, but I thought I'd share it anyway.

The accuracy that you get is roughly 6-7 bits, so it's not better than a
LUT.

The coefficients are "nice" from an implementation perspective, if you
were to hard-wire the polynomial. It should be possible to implement the
dividend with a single multiplier and a few integer adders. The 3 in the
divisor is muddying the waters, though. Is there a fast way to divide by
three?

/Marcus

Re: Approximate reciprocals

<t1qkql$ui0$1@newsreader4.netcologne.de>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24498&group=comp.arch#24498

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sun, 27 Mar 2022 21:26:45 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t1qkql$ui0$1@newsreader4.netcologne.de>
References: <t1c154$j5t$1@dont-email.me> <t1qf0u$oko$1@dont-email.me>
Injection-Date: Sun, 27 Mar 2022 21:26:45 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-30bd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:30bd:0:7285:c2ff:fe6c:992d";
logging-data="31296"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Sun, 27 Mar 2022 21:26 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:

> Today I found a pretty nice looking 2nd order polynomial for
> approximating 1/x in the interval [1.0, 2.0):
>
> (2*x^2 - 9*x + 13) / 6

Interesting. Could be useful for the start of a Newton iteration
for a reciprocal, where the iteration formula would be
x_n+1 = 2*x_n - a * x_n^2 (division-free), or x_n*(2-a*x_n).

>
> Maybe this is known stuff, and I'm not sure that it would be useful for
> anything, but I thought I'd share it anyway.
>
> The accuracy that you get is roughly 6-7 bits, so it's not better than a
> LUT.
>
> The coefficients are "nice" from an implementation perspective, if you
> were to hard-wire the polynomial. It should be possible to implement the
> dividend with a single multiplier and a few integer adders. The 3 in the
> divisor is muddying the waters, though. Is there a fast way to divide by
> three?

The standard trick of dividing by three in fixed point: For
32-bit numbers, multiply the number by the magic number 2863311531
(0xAAAAAAAB) and shift right by 33 bits (in practice, use a
"multiply high" and shift right one bit). For floating point
numbers, this would have to be adjusted somewhat.

See Hacker's Delight, "Unsigned Division by Divisors >= 1".

Or just encode the constants.

Re: Approximate reciprocals

<394168eb-53ed-49c2-a349-4035c3177361n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24499&group=comp.arch#24499

copy link Newsgroups: comp.arch

X-Received: by 2002:ac8:570c:0:b0:2e1:ee0c:71c5 with SMTP id 12-20020ac8570c000000b002e1ee0c71c5mr19051724qtw.365.1648420828200;
Sun, 27 Mar 2022 15:40:28 -0700 (PDT)
X-Received: by 2002:a05:6870:e9a7:b0:de:e59a:7376 with SMTP id
r39-20020a056870e9a700b000dee59a7376mr2829166oao.194.1648420827939; Sun, 27
Mar 2022 15:40:27 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!reader02.eternal-september.org!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 27 Mar 2022 15:40:27 -0700 (PDT)
In-Reply-To: <t1qkql$ui0$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:7408:496f:7430:392a;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:7408:496f:7430:392a
References: <t1c154$j5t$1@dont-email.me> <t1qf0u$oko$1@dont-email.me> <t1qkql$ui0$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <394168eb-53ed-49c2-a349-4035c3177361n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sun, 27 Mar 2022 22:40:28 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 45

by: Michael S - Sun, 27 Mar 2022 22:40 UTC

On Monday, March 28, 2022 at 12:26:48 AM UTC+3, Thomas Koenig wrote:
> Marcus <m.de...@this.bitsnbites.eu> schrieb:
> > Today I found a pretty nice looking 2nd order polynomial for
> > approximating 1/x in the interval [1.0, 2.0):
> >
> > (2*x^2 - 9*x + 13) / 6
> Interesting. Could be useful for the start of a Newton iteration
> for a reciprocal, where the iteration formula would be
> x_n+1 = 2*x_n - a * x_n^2 (division-free), or x_n*(2-a*x_n).
> >
> > Maybe this is known stuff, and I'm not sure that it would be useful for
> > anything, but I thought I'd share it anyway.
> >
> > The accuracy that you get is roughly 6-7 bits, so it's not better than a
> > LUT.
> >
> > The coefficients are "nice" from an implementation perspective, if you
> > were to hard-wire the polynomial. It should be possible to implement the
> > dividend with a single multiplier and a few integer adders. The 3 in the
> > divisor is muddying the waters, though. Is there a fast way to divide by
> > three?
> The standard trick of dividing by three in fixed point: For
> 32-bit numbers, multiply the number by the magic number 2863311531
> (0xAAAAAAAB) and shift right by 33 bits (in practice, use a
> "multiply high" and shift right one bit). For floating point
> numbers, this would have to be adjusted somewhat.
>

And that causes the whole calculation to be of approximately the same
computational complexity as 2nd-order polynomials with arbitrary 32-bit coefficients,
potentially much better coefficients than those suggested.
The suggestion of Marcus is very naive.
Any suggestion to do sqrt or rsqrt approximation without lookup table is a loose proposition.
Even very small table at the first step, like 32 or 64 entries, helps a lot and saves many steps
down the road.
The only exception to that could be when you do very wide SIMD that has no tools for even small lookups.
Or when you don't care about speed, but that case is sort of off topic.

>
> See Hacker's Delight, "Unsigned Division by Divisors >= 1".
>
> Or just encode the constants.

Re: Approximate reciprocals

<732dbebf-adb7-45ff-998f-3d1e069cb7c7n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24500&group=comp.arch#24500

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:622a:10d:b0:2e1:db41:66d with SMTP id u13-20020a05622a010d00b002e1db41066dmr19284512qtw.670.1648422629820;
Sun, 27 Mar 2022 16:10:29 -0700 (PDT)
X-Received: by 2002:a05:6808:1152:b0:2da:c7f:66c2 with SMTP id
u18-20020a056808115200b002da0c7f66c2mr15539914oiu.253.1648422629619; Sun, 27
Mar 2022 16:10:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!3.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 27 Mar 2022 16:10:29 -0700 (PDT)
In-Reply-To: <t1qf0u$oko$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6c4b:825:6d0:525;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6c4b:825:6d0:525
References: <t1c154$j5t$1@dont-email.me> <t1qf0u$oko$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <732dbebf-adb7-45ff-998f-3d1e069cb7c7n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 27 Mar 2022 23:10:29 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 90

by: MitchAlsup - Sun, 27 Mar 2022 23:10 UTC

On Sunday, March 27, 2022 at 2:47:45 PM UTC-5, Marcus wrote:
> On 2022-03-22, Marcus wrote:
> > Hello group!
> >
> > A class of instructions that is very tempting to include in an ISA is
> > approximate floating-point reciprocals (1/x & 1/sqrt(x)), and
> > possibly specialized instructions for improving precision
> > (a Newton-Raphson step).
> >
> > Many ISAs have such instructions:
> >
> > CRAY:
> > * 070ijx - Floating reciprocal approximation
> > * 067ijk - Reciprocal iteration
> >
> > ARM:
> > * FRECPE - Floating-point reciprocal estimate
> > * FRECPS - Floating-point reciprocal step
> > * FRSQRTE - Floating-point reciprocal square root estimate
> > * FRSQRTS - Floating-point reciprocal square root step
> >
> > POWER:
> > * FRES - Floating reciprocal estimate single
> > * FRSQRTE - Floating reciprocal square root estimate
> >
> > x86:
> > * RSQRTSS - Approximate reciprocal square root
> >
> > TI C67x:
> > * RCPSP - Floating-Point reciprocal approximation
> > * RSQRSP - Floating-Point reciprocal square root approximation
> >
> > ...and there are probably others.
> >
> > What are your feelings towards including such instructions in an ISA?
> >
> > My own feelings are mixed.
> >
> > Pros:
> > * Easy to implement in hardware.
> > * Can provide a significant speedup for certain workloads, especially
> > if limited accuracy is acceptable.
> >
> > Cons:
> > * Hard to specify exact operation.
> > * Likely a source of poor portability (borderline undefined behavior).
> > * The "step/iteration" instructions can usually be replaced by FMA.
> As an aside...
>
> Today I found a pretty nice looking 2nd order polynomial for
> approximating 1/x in the interval [1.0, 2.0):
>
> (2*x^2 - 9*x + 13) / 6
<
Equivalent to::
<
.33333×x^2 - 1.5×x + 2.1666
<
The Chebychev 2nd order Coefficients on the interval [1..2) are:
<
0.32323232×x^2 -0.48484848×x + 0.66666667
<
with 6.63 bits of accuracy. BTW the 9th order cheby polynomial only has 21.89
bits of accuracy. Clearly, this is not the right way to start.
<
Splitting the interval into [1..1.5)..[1.5..2] gives 8.9 bits of accuracy
Splitting into 4 intervals gives 11.5 bits of accuracy.
Splitting into 8 intervals gives 14.25 bits of accuracy.
>
> Maybe this is known stuff, and I'm not sure that it would be useful for
> anything, but I thought I'd share it anyway.
>
> The accuracy that you get is roughly 6-7 bits, so it's not better than a
> LUT.
>
> The coefficients are "nice" from an implementation perspective, if you
> were to hard-wire the polynomial. It should be possible to implement the
> dividend with a single multiplier and a few integer adders. The 3 in the
> divisor is muddying the waters, though. Is there a fast way to divide by
> three?
>
> /Marcus

Re: Approximate reciprocals

<t1rjon$pqm$1@dont-email.me>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24501&group=comp.arch#24501

copy link Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: not_va...@comcast.net (James Van Buskirk)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 28 Mar 2022 00:14:25 -0600
Organization: A noiseless patient Spider
Lines: 5
Message-ID: <t1rjon$pqm$1@dont-email.me>
References: <t1c154$j5t$1@dont-email.me> <81bd21bb-8e02-4629-9749-d846be44ef43n@googlegroups.com> <t1d0r8$o4v$1@newsreader4.netcologne.de> <903965ad-5226-49d5-9883-57b1bc836fd7n@googlegroups.com> <t1dckv$u7$2@newsreader4.netcologne.de> <526d6018-1e28-44f7-86e6-89ccbda1f663n@googlegroups.com> <t1fmss$i30$2@newsreader4.netcologne.de> <5991ffcb-7857-49ba-9204-7201850b64a6n@googlegroups.com> <t1helc$mtc$1@newsreader4.netcologne.de> <b58e87e7-5cad-4867-835e-ea84b192b230n@googlegroups.com> <t1i106$4jp$1@newsreader4.netcologne.de> <4a14747b-b131-4619-af63-e87caa1186cen@googlegroups.com> <5c553807-0d0a-45f4-8b4e-a52480359c8cn@googlegroups.com> <t1mnc6$8lb$2@newsreader4.netcologne.de> <ae3841d6-c26f-4703-99f8-9893cb7c4701n@googlegroups.com> <t1nqv2$2b2$1@newsreader4.netcologne.de> <c98cc15b-e3bf-4636-a2fc-0232aaf544f6n@googlegroups.com> <285f1ec1-4a48-4ab7-953a-1a4a2b4f8094n@googlegroups.com> <600fb4d3-5e5f-490f-b7b6-4301a93f656en@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain;
format=flowed;
charset="UTF-8";
reply-type=original
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 28 Mar 2022 06:14:49 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a4cd4104ef5256d08d58065474b9bf26";
logging-data="26454"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+r/7iNIKeutOd4aLe0fYFzt8pmS6y6is8="
Cancel-Lock: sha1:2g5rmQQyaLGAQiAWpP8ifeuAGgU=
X-MimeOLE: Produced By Microsoft MimeOLE V16.4.3528.331
In-Reply-To: <600fb4d3-5e5f-490f-b7b6-4301a93f656en@googlegroups.com>
X-Newsreader: Microsoft Windows Live Mail 16.4.3528.331
Importance: Normal
X-Priority: 3
X-MSMail-Priority: Normal

by: James Van Buskirk - Mon, 28 Mar 2022 06:14 UTC

"MitchAlsup" wrote in message
news:600fb4d3-5e5f-490f-b7b6-4301a93f656en@googlegroups.com...

> First 3 iterations are SP, next iteration is DP, next 2 are QP.

Have you investigated the potential of using variable numbers of
terms from the series

x/(1-(1-x*D)) = x*sum([((1-x*D)**n,n=0,∞)])
x/sqrt(1-(1-x**2*D)) =
x*sum([(gamma(n+0.5)/(sqrt(pi)*gamma(n+1.0))*(1-x**2*D)**n,n=0,∞)])

to perhaps remove an iteration somewhere?

Re: Approximate reciprocals

<t1rm34$pg9$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24502&group=comp.arch#24502

copy link Newsgroups: comp.arch

by: Terje Mathisen - Mon, 28 Mar 2022 06:54 UTC

Michael S wrote:
> On Monday, March 28, 2022 at 12:26:48 AM UTC+3, Thomas Koenig wrote:
>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>> Today I found a pretty nice looking 2nd order polynomial for
>>> approximating 1/x in the interval [1.0, 2.0):
>>>
>>> (2*x^2 - 9*x + 13) / 6
>> Interesting. Could be useful for the start of a Newton iteration
>> for a reciprocal, where the iteration formula would be
>> x_n+1 = 2*x_n - a * x_n^2 (division-free), or x_n*(2-a*x_n).
>>>
>>> Maybe this is known stuff, and I'm not sure that it would be useful for
>>> anything, but I thought I'd share it anyway.
>>>
>>> The accuracy that you get is roughly 6-7 bits, so it's not better than a
>>> LUT.
>>>
>>> The coefficients are "nice" from an implementation perspective, if you
>>> were to hard-wire the polynomial. It should be possible to implement the
>>> dividend with a single multiplier and a few integer adders. The 3 in the
>>> divisor is muddying the waters, though. Is there a fast way to divide by
>>> three?
>> The standard trick of dividing by three in fixed point: For
>> 32-bit numbers, multiply the number by the magic number 2863311531
>> (0xAAAAAAAB) and shift right by 33 bits (in practice, use a
>> "multiply high" and shift right one bit). For floating point
>> numbers, this would have to be adjusted somewhat.
>>
>
> And that causes the whole calculation to be of approximately the same
> computational complexity as 2nd-order polynomials with arbitrary 32-bit coefficients,
> potentially much better coefficients than those suggested.
> The suggestion of Marcus is very naive.
> Any suggestion to do sqrt or rsqrt approximation without lookup table is a loose proposition.
> Even very small table at the first step, like 32 or 64 entries, helps a lot and saves many steps
> down the road.
> The only exception to that could be when you do very wide SIMD that has no tools for even small lookups.
> Or when you don't care about speed, but that case is sort of off topic.

If you are doing SIMD, but without a reciprocal/reciprocal square root
lookup opcode, then I would use the infamous invsqrt() trick. With the
latest Cheby-style tweaks to all the constants, this delivers 10+ bits
after a single iteration.

This is significantly better than an in-register 4-bit/16-term table lookup.

Anyway, I am sure Mitch has much better polynomials to do it all
directly. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<7029a173-963d-402b-a184-642120b5e1b8n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24503&group=comp.arch#24503

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:620a:470d:b0:67d:d8a8:68c6 with SMTP id bs13-20020a05620a470d00b0067dd8a868c6mr15256148qkb.717.1648455584416;
Mon, 28 Mar 2022 01:19:44 -0700 (PDT)
X-Received: by 2002:a4a:3f56:0:b0:324:bc64:6713 with SMTP id
x22-20020a4a3f56000000b00324bc646713mr8231236ooe.50.1648455582689; Mon, 28
Mar 2022 01:19:42 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 28 Mar 2022 01:19:42 -0700 (PDT)
In-Reply-To: <t1rm34$pg9$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <t1c154$j5t$1@dont-email.me> <t1qf0u$oko$1@dont-email.me>
<t1qkql$ui0$1@newsreader4.netcologne.de> <394168eb-53ed-49c2-a349-4035c3177361n@googlegroups.com>
<t1rm34$pg9$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7029a173-963d-402b-a184-642120b5e1b8n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Mon, 28 Mar 2022 08:19:44 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 79

by: Michael S - Mon, 28 Mar 2022 08:19 UTC

On Monday, March 28, 2022 at 9:54:31 AM UTC+3, Terje Mathisen wrote:
> Michael S wrote:
> > On Monday, March 28, 2022 at 12:26:48 AM UTC+3, Thomas Koenig wrote:
> >> Marcus <m.de...@this.bitsnbites.eu> schrieb:
> >>> Today I found a pretty nice looking 2nd order polynomial for
> >>> approximating 1/x in the interval [1.0, 2.0):
> >>>
> >>> (2*x^2 - 9*x + 13) / 6
> >> Interesting. Could be useful for the start of a Newton iteration
> >> for a reciprocal, where the iteration formula would be
> >> x_n+1 = 2*x_n - a * x_n^2 (division-free), or x_n*(2-a*x_n).
> >>>
> >>> Maybe this is known stuff, and I'm not sure that it would be useful for
> >>> anything, but I thought I'd share it anyway.
> >>>
> >>> The accuracy that you get is roughly 6-7 bits, so it's not better than a
> >>> LUT.
> >>>
> >>> The coefficients are "nice" from an implementation perspective, if you
> >>> were to hard-wire the polynomial. It should be possible to implement the
> >>> dividend with a single multiplier and a few integer adders. The 3 in the
> >>> divisor is muddying the waters, though. Is there a fast way to divide by
> >>> three?
> >> The standard trick of dividing by three in fixed point: For
> >> 32-bit numbers, multiply the number by the magic number 2863311531
> >> (0xAAAAAAAB) and shift right by 33 bits (in practice, use a
> >> "multiply high" and shift right one bit). For floating point
> >> numbers, this would have to be adjusted somewhat.
> >>
> >
> > And that causes the whole calculation to be of approximately the same
> > computational complexity as 2nd-order polynomials with arbitrary 32-bit coefficients,
> > potentially much better coefficients than those suggested.
> > The suggestion of Marcus is very naive.
> > Any suggestion to do sqrt or rsqrt approximation without lookup table is a loose proposition.
> > Even very small table at the first step, like 32 or 64 entries, helps a lot and saves many steps
> > down the road.
> > The only exception to that could be when you do very wide SIMD that has no tools for even small lookups.
> > Or when you don't care about speed, but that case is sort of off topic.
> If you are doing SIMD, but without a reciprocal/reciprocal square root
> lookup opcode, then I would use the infamous invsqrt() trick.

Magic constant and such?
I don't think it's a good idea if you seek top performance.
One thing to realize is that if you went to trouble of using rsqrt() NR steps
based on custom polynomials, with different coefficients on each step, then it
is better to keep error one-sided on all steps except possibly the last one.
Apart from of being slightly more precise, such strategy has advantage of being
simpler to do with unsigned integer math.

> With the
> latest Cheby-style tweaks to all the constants, this delivers 10+ bits
> after a single iteration.
>
> This is significantly better than an in-register 4-bit/16-term table lookup.
>
> Anyway, I am sure Mitch has much better polynomials to do it all
> directly. :-)

Mitch appears to like Chebychev polynomials.
Likely, because they are easy to find out :-)

They are very close to optimal (in sense of minimizing two-sided error, but
probably can be easily offseted to minimize one-sided error too) when the
interval of input error is already small. But when the interval is large, as it
is a case on the first step without lookup table and, may be, on the second
step, when the 1st step was of low order, then Chebychev polynomials are
measurably suboptimal vs true equiripple polys. The later can be found by Remez
exchange algorithm.
The reason for this is that Chebychev poly of order N is defined as the best
approximation for poly of order N+1. So, when term N+1 in infinite Tailor
series is much bigger than sum of terms N+2 to Infinity then Cheby is good, but
when term N+1 is not dominant it is less good.

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<t1s85t$1bm8$1@gioia.aioe.org>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24504&group=comp.arch#24504

copy link Newsgroups: comp.arch

by: Terje Mathisen - Mon, 28 Mar 2022 12:03 UTC

Michael S wrote:
> On Monday, March 28, 2022 at 9:54:31 AM UTC+3, Terje Mathisen wrote:
>> Michael S wrote:
>>> On Monday, March 28, 2022 at 12:26:48 AM UTC+3, Thomas Koenig wrote:
>>>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>>>> Today I found a pretty nice looking 2nd order polynomial for
>>>>> approximating 1/x in the interval [1.0, 2.0):
>>>>>
>>>>> (2*x^2 - 9*x + 13) / 6
>>>> Interesting. Could be useful for the start of a Newton iteration
>>>> for a reciprocal, where the iteration formula would be
>>>> x_n+1 = 2*x_n - a * x_n^2 (division-free), or x_n*(2-a*x_n).
>>>>>
>>>>> Maybe this is known stuff, and I'm not sure that it would be useful for
>>>>> anything, but I thought I'd share it anyway.
>>>>>
>>>>> The accuracy that you get is roughly 6-7 bits, so it's not better than a
>>>>> LUT.
>>>>>
>>>>> The coefficients are "nice" from an implementation perspective, if you
>>>>> were to hard-wire the polynomial. It should be possible to implement the
>>>>> dividend with a single multiplier and a few integer adders. The 3 in the
>>>>> divisor is muddying the waters, though. Is there a fast way to divide by
>>>>> three?
>>>> The standard trick of dividing by three in fixed point: For
>>>> 32-bit numbers, multiply the number by the magic number 2863311531
>>>> (0xAAAAAAAB) and shift right by 33 bits (in practice, use a
>>>> "multiply high" and shift right one bit). For floating point
>>>> numbers, this would have to be adjusted somewhat.
>>>>
>>>
>>> And that causes the whole calculation to be of approximately the same
>>> computational complexity as 2nd-order polynomials with arbitrary 32-bit coefficients,
>>> potentially much better coefficients than those suggested.
>>> The suggestion of Marcus is very naive.
>>> Any suggestion to do sqrt or rsqrt approximation without lookup table is a loose proposition.
>>> Even very small table at the first step, like 32 or 64 entries, helps a lot and saves many steps
>>> down the road.
>>> The only exception to that could be when you do very wide SIMD that has no tools for even small lookups.
>>> Or when you don't care about speed, but that case is sort of off topic.
>> If you are doing SIMD, but without a reciprocal/reciprocal square root
>> lookup opcode, then I would use the infamous invsqrt() trick.
>
> Magic constant and such?
> I don't think it's a good idea if you seek top performance.
> One thing to realize is that if you went to trouble of using rsqrt() NR steps
> based on custom polynomials, with different coefficients on each step, then it
> is better to keep error one-sided on all steps except possibly the last one.
> Apart from of being slightly more precise, such strategy has advantage of being
> simpler to do with unsigned integer math.
>
>> With the
>> latest Cheby-style tweaks to all the constants, this delivers 10+ bits
>> after a single iteration.
>>
>> This is significantly better than an in-register 4-bit/16-term table lookup.
>>
>> Anyway, I am sure Mitch has much better polynomials to do it all
>> directly. :-)
>
> Mitch appears to like Chebychev polynomials.
> Likely, because they are easy to find out :-)
>
> They are very close to optimal (in sense of minimizing two-sided error, but
> probably can be easily offseted to minimize one-sided error too) when the
> interval of input error is already small. But when the interval is large, as it
> is a case on the first step without lookup table and, may be, on the second
> step, when the 1st step was of low order, then Chebychev polynomials are
> measurably suboptimal vs true equiripple polys. The later can be found by Remez
> exchange algorithm.

OK, I might have been misusing the "Cheby" term! I've always assumed
that an optimal N-term poly will have N+1 ripples, all equally far
(typically using a relative measure) away from the true solution at
those points. Sometimes I'll additional restrictions, like wanting the
error at both ends to be identical in sign and magnitude which is a good
idea if I use range reduction and individual coefficients for each
sub-range.

> The reason for this is that Chebychev poly of order N is defined as the best
> approximation for poly of order N+1. So, when term N+1 in infinite Tailor
> series is much bigger than sum of terms N+2 to Infinity then Cheby is good, but
> when term N+1 is not dominant it is less good.

OK,
thanks!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<4bdfaba8-898f-4c1e-8ca1-234bf4d3ffc8n@googlegroups.com>

copy mid

https://www.novabbs.com/devel/article-flat.php?id=24505&group=comp.arch#24505

copy link Newsgroups: comp.arch

X-Received: by 2002:a05:622a:10d:b0:2e1:db41:66d with SMTP id u13-20020a05622a010d00b002e1db41066dmr22397611qtw.670.1648480977336;
Mon, 28 Mar 2022 08:22:57 -0700 (PDT)
X-Received: by 2002:a05:6870:1607:b0:de:984:496d with SMTP id
b7-20020a056870160700b000de0984496dmr15740620oae.253.1648480977057; Mon, 28
Mar 2022 08:22:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 28 Mar 2022 08:22:56 -0700 (PDT)
In-Reply-To: <7029a173-963d-402b-a184-642120b5e1b8n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:60:82f3:436c:9b29;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:60:82f3:436c:9b29
References: <t1c154$j5t$1@dont-email.me> <t1qf0u$oko$1@dont-email.me>
<t1qkql$ui0$1@newsreader4.netcologne.de> <394168eb-53ed-49c2-a349-4035c3177361n@googlegroups.com>
<t1rm34$pg9$1@gioia.aioe.org> <7029a173-963d-402b-a184-642120b5e1b8n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4bdfaba8-898f-4c1e-8ca1-234bf4d3ffc8n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 28 Mar 2022 15:22:57 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 98

by: MitchAlsup - Mon, 28 Mar 2022 15:22 UTC

On Monday, March 28, 2022 at 3:19:46 AM UTC-5, Michael S wrote:
> On Monday, March 28, 2022 at 9:54:31 AM UTC+3, Terje Mathisen wrote:
> > Michael S wrote:
> > > On Monday, March 28, 2022 at 12:26:48 AM UTC+3, Thomas Koenig wrote:
> > >> Marcus <m.de...@this.bitsnbites.eu> schrieb:
> > >>> Today I found a pretty nice looking 2nd order polynomial for
> > >>> approximating 1/x in the interval [1.0, 2.0):
> > >>>
> > >>> (2*x^2 - 9*x + 13) / 6
> > >> Interesting. Could be useful for the start of a Newton iteration
> > >> for a reciprocal, where the iteration formula would be
> > >> x_n+1 = 2*x_n - a * x_n^2 (division-free), or x_n*(2-a*x_n).
> > >>>
> > >>> Maybe this is known stuff, and I'm not sure that it would be useful for
> > >>> anything, but I thought I'd share it anyway.
> > >>>
> > >>> The accuracy that you get is roughly 6-7 bits, so it's not better than a
> > >>> LUT.
> > >>>
> > >>> The coefficients are "nice" from an implementation perspective, if you
> > >>> were to hard-wire the polynomial. It should be possible to implement the
> > >>> dividend with a single multiplier and a few integer adders. The 3 in the
> > >>> divisor is muddying the waters, though. Is there a fast way to divide by
> > >>> three?
> > >> The standard trick of dividing by three in fixed point: For
> > >> 32-bit numbers, multiply the number by the magic number 2863311531
> > >> (0xAAAAAAAB) and shift right by 33 bits (in practice, use a
> > >> "multiply high" and shift right one bit). For floating point
> > >> numbers, this would have to be adjusted somewhat.
> > >>
> > >
> > > And that causes the whole calculation to be of approximately the same
> > > computational complexity as 2nd-order polynomials with arbitrary 32-bit coefficients,
> > > potentially much better coefficients than those suggested.
> > > The suggestion of Marcus is very naive.
> > > Any suggestion to do sqrt or rsqrt approximation without lookup table is a loose proposition.
> > > Even very small table at the first step, like 32 or 64 entries, helps a lot and saves many steps
> > > down the road.
> > > The only exception to that could be when you do very wide SIMD that has no tools for even small lookups.
> > > Or when you don't care about speed, but that case is sort of off topic.
> > If you are doing SIMD, but without a reciprocal/reciprocal square root
> > lookup opcode, then I would use the infamous invsqrt() trick.
<
> Magic constant and such?
> I don't think it's a good idea if you seek top performance.
> One thing to realize is that if you went to trouble of using rsqrt() NR steps
> based on custom polynomials, with different coefficients on each step, then it
> is better to keep error one-sided on all steps except possibly the last one.
<
Notice that the error term for N-R iterations flip the sign every iteration.
<
> Apart from of being slightly more precise, such strategy has advantage of being
> simpler to do with unsigned integer math.
<
You need to choose the approximat such that the error is uniformly positive after
the last iteration. This allows you to choose from {r+0, r+1, and r+2} for the properly
rounded result; which is easier than choosing {r-1, r+0, r+1} for the properly rounded
result.
<
> > With the
> > latest Cheby-style tweaks to all the constants, this delivers 10+ bits
> > after a single iteration.
> >
> > This is significantly better than an in-register 4-bit/16-term table lookup.
> >
> > Anyway, I am sure Mitch has much better polynomials to do it all
> > directly. :-)
> Mitch appears to like Chebychev polynomials.
> Likely, because they are easy to find out :-)
<
Once you figure out how to calculate them, they are, indeed, easy to determine.
>
> They are very close to optimal (in sense of minimizing two-sided error, but
> probably can be easily offseted to minimize one-sided error too) when the
> interval of input error is already small. But when the interval is large, as it
> is a case on the first step without lookup table and, may be, on the second
> step, when the 1st step was of low order, then Chebychev polynomials are
> measurably suboptimal vs true equiripple polys. The later can be found by Remez
> exchange algorithm.
<
Cheby polynomials are within 1 ULP of optimal, where Remez is better, it is
a lot harder to calculate, and if you are working to the last bit of precision,
even calculating Remez requires more precision. Cheby does not have this
property.
<
> The reason for this is that Chebychev poly of order N is defined as the best
> approximation for poly of order N+1. So, when term N+1 in infinite Tailor
> series is much bigger than sum of terms N+2 to Infinity then Cheby is good, but
> when term N+1 is not dominant it is less good.
<
I calculated my SIN(), COS() Cheby polynomials by starting with a Taylor series
3 terms longer and then removed N+3, N+2, and N+1 to give me the N-th order
cheby polynomial in x^2
<
> > Terje
> >
> > --
> > - <Terje.Mathisen at tmsw.no>
> > "almost all programming can be viewed as an exercise in caching"

"Laugh while you can, monkey-boy." -- Dr. Emilio Lizardo

devel / comp.arch / Re: Approximate reciprocals

devel / comp.arch / Re: Approximate reciprocals

Subject	Author
Approximate reciprocals	Marcus
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	robf...@gmail.com
Re: Approximate reciprocals	Marcus
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Marcus
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Quadibloc
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Marcus
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	BGB
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Quadibloc
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	James Van Buskirk
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Quadibloc
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	robf...@gmail.com
Useful floating point instructions (was: Approximate reciprocals)	Thomas Koenig
Re: Useful floating point instructions	Terje Mathisen
Re: Useful floating point instructions	Stephen Fuld
Re: Useful floating point instructions	MitchAlsup
Re: Useful floating point instructions	Stephen Fuld
Re: Useful floating point instructions	MitchAlsup
Re: Useful floating point instructions	Michael S
Re: Useful floating point instructions	Stephen Fuld
Re: Useful floating point instructions	Terje Mathisen
Re: Useful floating point instructions	Terje Mathisen
Re: Useful floating point instructions	Stefan Monnier
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	George Neuner
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	George Neuner
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	John Dallman
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	George Neuner
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	EricP
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	John Dallman
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Elijah Stone
Re: Approximate reciprocals	Marcus
Re: Approximate reciprocals	Marcus