novaBBS - comp.arch - Re: Approximate reciprocals

Hello group!

A class of instructions that is very tempting to include in an ISA is
approximate floating-point reciprocals (1/x & 1/sqrt(x)), and
possibly specialized instructions for improving precision
(a Newton-Raphson step).

Many ISAs have such instructions:

CRAY:
* 070ijx - Floating reciprocal approximation
* 067ijk - Reciprocal iteration

ARM:
* FRECPE - Floating-point reciprocal estimate
* FRECPS - Floating-point reciprocal step
* FRSQRTE - Floating-point reciprocal square root estimate
* FRSQRTS - Floating-point reciprocal square root step

POWER:
* FRES - Floating reciprocal estimate single
* FRSQRTE - Floating reciprocal square root estimate

x86:
* RSQRTSS - Approximate reciprocal square root

TI C67x:
* RCPSP - Floating-Point reciprocal approximation
* RSQRSP - Floating-Point reciprocal square root approximation

...and there are probably others.

What are your feelings towards including such instructions in an ISA?

My own feelings are mixed.

Pros:
* Easy to implement in hardware.
* Can provide a significant speedup for certain workloads, especially
if limited accuracy is acceptable.

Cons:
* Hard to specify exact operation.
* Likely a source of poor portability (borderline undefined behavior).
* The "step/iteration" instructions can usually be replaced by FMA.

/Marcus

Marcus wrote:
> What are your feelings towards including such instructions in an ISA?

Stupid not to do it?

>
> My own feelings are mixed.
>
> Pros:
> * Easy to implement in hardware.
> * Can provide a significant speedup for certain workloads, especially
> if limited accuracy is acceptable.

Even when you need exact results, having that reciprocal starting point
means that you will converge on that value significantly faster.

I.e. starting with a ~12-bit approximation gives you float with
fractional ulp accuracy after one NR stage and exact float with one more
iteration, still significantly faster than an SRT divider/sqrt circuit.
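
A minimal sketch of the convergence argument above, under stated assumptions: `seed_reciprocal` is a hypothetical stand-in for a hardware estimate instruction (it simply rounds 1/x to about 12 significant bits), and Python floats are binary64, so this illustrates how the error roughly squares per Newton-Raphson step rather than exact single-precision rounding behaviour.

```python
import math

def seed_reciprocal(x: float, bits: int = 12) -> float:
    """Hypothetical estimate: 1/x rounded to roughly `bits` significant bits."""
    m, e = math.frexp(1.0 / x)            # 1/x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)

def nr_step(x: float, r: float) -> float:
    """One Newton-Raphson step for 1/x: r' = r + r*(1 - x*r)."""
    return r + r * (1.0 - x * r)

def accurate_bits(x: float, r: float) -> float:
    """Number of correct bits in r as an approximation of 1/x."""
    rel = abs(r * x - 1.0)                # relative error of r
    return float("inf") if rel == 0.0 else -math.log2(rel)

if __name__ == "__main__":
    x = 7.0
    r = seed_reciprocal(x)
    for step in range(3):
        print(step, accurate_bits(x, r))  # bit count roughly doubles per step
        r = nr_step(x, r)
```

The step is written as two multiply-add shaped operations, which is why a plain FMA can stand in for a dedicated step instruction.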

>
> Cons:
> * Hard to specify exact operation.
> * Likely a source of poor portability (borderline undefined behavior).
> * The "step/iteration" instructions can usually be replaced by FMA.

This is true, which is why only the initial lookup is crucial.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

by: robf...@gmail.com - Tue, 22 Mar 2022 14:39 UTC

On Tuesday, March 22, 2022 at 9:24:58 AM UTC-4, Terje Mathisen wrote:
> > What are your feelings towards including such instructions in an ISA?
> Stupid not to do it?

Should be included.

I have a sigmoid approximation too.

Re: Approximate reciprocals

by: MitchAlsup - Tue, 22 Mar 2022 15:17 UTC

On Tuesday, March 22, 2022 at 3:25:11 AM UTC-5, Marcus wrote:
> What are your feelings towards including such instructions in an ISA?
>
> My own feelings are mixed.
<
My feelings are::
a) screw the approximations
b) because doing them to faithful accuracy is easy
c) the right answer is only 14 cycles
d) if you build the FMAC unit correctly
See: USPTO 10,761,806

Re: Approximate reciprocals

by: Marcus - Tue, 22 Mar 2022 15:26 UTC

On 2022-03-22, Terje Mathisen wrote:
> Even when you need exact results, having that reciprocal starting point
> means that you will converge on that value significantly faster.
>
> I.e. starting with a ~12-bit approximation gives you float with
> fractional ulp accuracy after one NR stage and exact float with one more
> iteration, still significantly faster than an SRT divider/sqrt circuit.

Follow-up question 1: What approximation accuracy should one aim for?
I've seen 8 bits (yielding SP precision in 2 NR steps), and you mention
12 bits. With 13 bits, would you not get full SP precision in 1 NR step?
CRAY gave a full 30 bits in the first approximation (!).
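
The arithmetic behind this question can be sketched with an idealized model in which every Newton-Raphson step exactly doubles the number of correct bits. The function name and the model are assumptions for illustration; the model ignores rounding error inside each step, so it leaves no safety margin and says nothing about correctly rounded results.

```python
def nr_steps_needed(seed_bits: int, target_bits: int) -> int:
    """Steps needed if each step exactly doubles the correct bits (idealized)."""
    steps = 0
    bits = seed_bits
    while bits < target_bits:
        bits *= 2          # quadratic convergence: error squares, bits double
        steps += 1
    return steps

if __name__ == "__main__":
    # 24 significand bits for single precision
    for seed in (8, 12, 13, 30):
        print(seed, nr_steps_needed(seed, 24))
```

In this model an 8-bit seed needs two steps for 24 bits (8, 16, 32), while 12 or 13 bits need one; the extra bit in a 13-bit seed only buys margin against the rounding error the model leaves out.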

Follow-up question 2: Implementation-wise, is it better to do a raw
nearest-neighbor ROM lookup, or more sophisticated ROM + linear
interpolation? It feels like the latter would give you more bang for
the buck, possibly at the cost of added latency.
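
To put rough numbers on the trade-off, here is a sketch comparing the two schemes for 1/x on [1, 2). Everything in it is an assumption made for illustration (a table indexed by the top mantissa bits, entries sampled at interval midpoints for the raw lookup and at interval endpoints for the interpolated one); it does not describe any real hardware table.

```python
import math

def max_rel_error(approx, samples: int = 4096) -> float:
    """Worst relative error of approx(x) against 1/x over a grid on [1, 2)."""
    worst = 0.0
    for i in range(samples):
        x = 1.0 + i / samples
        worst = max(worst, abs(approx(x) * x - 1.0))
    return worst

def rom_lookup(index_bits: int):
    """Raw table: each entry holds 1/midpoint of its interval."""
    n = 1 << index_bits
    def approx(x: float) -> float:
        k = int((x - 1.0) * n)                 # index from top mantissa bits
        return 1.0 / (1.0 + (k + 0.5) / n)
    return approx

def rom_linear(index_bits: int):
    """Table plus linear interpolation between interval endpoints."""
    n = 1 << index_bits
    def approx(x: float) -> float:
        k = int((x - 1.0) * n)
        x0 = 1.0 + k / n
        x1 = 1.0 + (k + 1) / n
        t = (x - x0) * n                       # position inside the interval
        return (1.0 - t) / x0 + t / x1
    return approx

if __name__ == "__main__":
    for name, make in (("lookup", rom_lookup), ("linear", rom_linear)):
        err = max_rel_error(make(7))           # 128 intervals
        print(name, -math.log2(err))           # correct bits
```

With 128 intervals the raw lookup lands near 8 bits and the interpolated version near 16, so in this toy model interpolation roughly doubles the bits for the same index width, at the cost of a multiply-add after the lookup.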


Re: Approximate reciprocals

by: Marcus - Tue, 22 Mar 2022 15:56 UTC

On 2022-03-22, MitchAlsup wrote:
> <
> My feelings are::
> a) screw the approximations
> b) because doing them to faithful accuracy is easy
> c) the right answer is only 14 cycles
> d) if you build the FMAC unit correctly
> See: USPTO 10,761,806

Thanks, that makes for a good read!

While I see the benefits of fast argument reduction and a generalized
coefficient + FMA engine for implementing transcendental functions, the
reciprocal and reciprocal sqrt functions are simpler and common enough
to possibly warrant a dedicated solution?

E.g. the CRAY could do a full DP division by means of approximate
reciprocal + NR-step + multiply by numerator in just 3-4 clock cycles
(vectorized), which is pretty impressive even by today's standards.
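
The sequence described above (estimate, refine, multiply by the numerator) can be sketched as follows. This assumes the refinement factor has the form 2 - b*r, which is the form Arm documents for FRECPS; the function names are made up for illustration, the seed is passed in by the caller rather than produced by an estimate instruction, and nothing here models CRAY timing or its number format.

```python
def recip_step_factor(b: float, r: float) -> float:
    """Correction factor 2 - b*r (assumed form of a 'reciprocal step')."""
    return 2.0 - b * r

def divide_via_reciprocal(a: float, b: float, seed: float, steps: int = 1) -> float:
    """Approximate a/b as a * r, where r refines `seed` towards 1/b."""
    r = seed
    for _ in range(steps):
        r = r * recip_step_factor(b, r)   # relative error of r is squared
    return a * r                          # final multiply by the numerator

if __name__ == "__main__":
    # 0.3334 approximates 1/3 to about 12 bits
    print(divide_via_reciprocal(1.0, 3.0, 0.3334, steps=1))
    print(divide_via_reciprocal(1.0, 3.0, 0.3334, steps=2))
```

The result is accurate but not necessarily correctly rounded, since the final multiply rounds a value that is itself only an approximation of 1/b.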


Re: Approximate reciprocals

by: MitchAlsup - Tue, 22 Mar 2022 16:43 UTC

On Tuesday, March 22, 2022 at 10:26:47 AM UTC-5, Marcus wrote:
<
> Follow-up question 1: What approximation accuracy should one aim for?
<
0.510 ULP or better Double Precision.
<
> I've seen 8 bits (yielding SP precision in 2 NR steps), and you mention
> 12 bits. With 13 bits, would you not get full SP precision in 1 NR step?
> CRAY gave a full 30 bits in the first approximation (!).
<
There are a couple hundred cases where you have 26 bits of accuracy,
do a Newton-Raphson iteration, and still cannot round correctly,
so you have to do an N-R iteration after you get more than 52 bits of
precision.
>
> Follow-up question 2: Implementation-wise, is it better to do a raw
> nearest-neighbor ROM lookup, or more sophisticated ROM + linear
> interpolation? It feels like the latter would give you more bang for
> the buck, possibly at the cost of added latency.
<
I do a 7th-order polynomial.

Re: Approximate reciprocals

by: MitchAlsup - Tue, 22 Mar 2022 16:44 UTC

On Tuesday, March 22, 2022 at 10:56:06 AM UTC-5, Marcus wrote:
> Thanks, that makes for a good read!
>
> While I see the benefits of fast argument reduction and a generalized
> coefficient + FMA engine for implementing transcendental functions, the
> reciprocal and reciprocal sqrt functions are simpler and common enough
> to possibly warrant a dedicated solution?
<
You get to make those choices.
>
> E.g. the CRAY could do a full DP division by means of approximate
> reciprocal + NR-step + multiply by numerator in just 3-4 clock cycles
> (vectorized), which is pretty impressive even by today's standards.
<
It's OK if you are content with 46 bits of accuracy.

MitchAlsup <MitchAlsup@aol.com> wrote:

> My feelings are::
> a) screw the approximations
> b) because doing them to faithful accuracy is easy
> c) the right answer is only 14 cycles
> d) if you build the FMAC unit correctly
> See: USPTO 10,761,806

Impressive (as I said before). This being a patent means that
you forbid everybody to do what you claim, except outside the US
(I believe Marcus is based in Sweden) and, of course, research is
OK too. It is unclear if an FPGA-based implementation would fall
under this patent ("processor" does not seem to be defined, and as
I have learnt the hard way, you need to define things in patents).

Regarding approximate reciprocals: I would prefer a CPU which has
different widths of floating point data (16, 32, 64 and possibly 128
bits), if the smaller types are faster and more energy efficient
than the larger types. People who need a low-cost, low-accuracy
sqrt could then use the 16-bit version.

Re: Approximate reciprocals

by: MitchAlsup - Tue, 22 Mar 2022 18:54 UTC

On Tuesday, March 22, 2022 at 12:26:03 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > My feelings are::
> > a) screw the approximations
> > b) because doing them to faithful accuracy is easy
> > c) the right answer is only 14 cycles
> > d) if you build the FMAC unit correctly
> > See: USPTO 10,761,806
<
> Impressive (as I said before). This being a patent means that
> you forbid everybody to do what you claim, except outside the US
> (I believe Marcus is based in Sweden) and, of course, research is
> OK too. It is unclear if an FPGA-based implementation would fall
> under this patent ("processor" does not seem to be defined, and as
> I have learnt the hard way, you need to define things in patents).
<
I am willing to sell licenses for less money than it takes you to
assign a design engineer and have him/her read the document
thoroughly.
<
What is claimed is the necessary organization. An FPGA with
the necessary organization would infringe.
>
> Regarding approximate reciprocals: I would prefer a CPU which has
> different widths of floating point data (16, 32, 64 and possibly 128
> bits), if the smaller types are faster and more energy efficient
> than the larger types. People who need a low-cost, low-accuracy
> sqrt could then use the 16-bit version.
<
I am not arguing against that point. From GPU work, the single precision
transcendentals take 5 cycles for 1 ULP accuracy (not quite faithful),
1 cycle throughput. Due to the organization of SP trans, 16-bit takes
just as long as 32-bit <in GPUs>.
<
In order to get this working at the 128-bit level, you are going to need
FP with 128-bits of fraction; just so you can correctly calculate the
coefficients the polynomial is based upon. My estimate is that 128-bit
transcendentals will cost ~40 cycles.

MitchAlsup <MitchAlsup@aol.com> wrote:
> I am willing to sell licenses for less money than it takes you to
> assign a design engineer and have him/her read the document
> thoroughly.

I'm not going to pay for this, my profession is something
quite different :-)

> What is claimed is the necessary organization. An FPGA with
> the necessary organization would infringe.

Hm, if I had been involved in writing that patent, I would probably
have included a description of what "processor" actually means
in this context (which is why stuff I (co)write tends to be much
longer, possibly unnecessarily so; it is, however, in a field
totally unrelated to computer architecture).

But it is of course your patent and your style of writing :-)

> In order to get this working at the 128-bit level, you are going to need
> FP with 128-bits of fraction; just so you can correctly calculate the
> coefficients the polynomial is based upon. My estimate is that 128-bit
> transcendentals will cost ~40 cycles.

Certainly faster than doing the same thing in software.

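The little test program

  program main
    implicit none
    integer, parameter :: qp = selected_real_kind(30)
    integer, parameter :: n = 10**6, m = 10
    real :: t1, t2
    character(len=20) :: c
    integer :: i,k
    real(kind=qp), dimension(:), allocatable :: a
    allocate (a(n))
    c = '10'
    call random_number(a)
    do i=1,m
      a = sqrt(a)
      read (unit=c,fmt=*) k
      write (*,*) a(k)
    end do
  end program main
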
calculates 10**7 16-byte square roots in around 2.5 seconds
on a CPU with 2.2 GHz, so around 550 cycles per square root,
on a Zen 1 (I have not bothered to filter out the overhead).

Re: Approximate reciprocals

by: Michael S - Tue, 22 Mar 2022 22:35 UTC

On Tuesday, March 22, 2022 at 10:47:31 PM UTC+2, Thomas Koenig wrote:
> Certainly faster than doing the same thing in software.
>
> The little test program
>
> program main
> implicit none
> integer, parameter :: qp = selected_real_kind(30)
> integer, parameter :: n = 10**6, m = 10
> real :: t1, t2
> character(len=20) :: c
> integer :: i,k
> real(kind=qp), dimension(:), allocatable :: a
> allocate (a(n))
> c = '10'
> call random_number(a)
> do i=1,m
> a = sqrt(a)
> read (unit=c,fmt=*) k
> write (*,*) a(k)
> end do
> end program main
>
> calculates 10**7 16-byte square roots in around 2.5 seconds
> on a CPU with 2.2 GHz, so around 550 cycles per square root,
> on a Zen 1 (I have not bothered to filter out the overhead).

If I understand correctly, real(kind=qp) is the equivalent of IEEE binary128.
IIRC, my own quad-precision class calculates square roots 2-3 times faster.
And it has better precision than GNU binary128: a 128-bit mantissa instead of 113 bits.
So it seems that for binary128 I should be able to calculate sqrt a little faster still.
Of course, not in 40 clock cycles, but maybe in 105-110.
Or maybe not. My internal format is easier to parse in software than binary128,
so the more complicated parsing may bring the speed back to the same point.
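
For reference, the parsing cost mentioned above comes from the binary128 layout: 1 sign bit, a 15-bit biased exponent (bias 16383) and 112 stored fraction bits, with an implicit leading 1 for normal numbers, which gives the 113-bit significand. The following is only a minimal Python sketch of that unpacking. The name `decode_binary128` is made up for illustration and is not taken from any library or from the quad-precision class discussed here.

```python
BIAS = 16383                      # binary128 exponent bias
FRAC_BITS = 112                   # stored fraction bits
EXP_MASK = (1 << 15) - 1          # 15-bit exponent field
FRAC_MASK = (1 << FRAC_BITS) - 1

def decode_binary128(bits):
    """Split a 128-bit integer into (sign, unbiased exponent, significand).

    For finite values: value = (-1)**sign * significand * 2**(exponent - 112).
    Infinities and NaNs are returned with exponent None.
    """
    sign = (bits >> 127) & 1
    exp_field = (bits >> FRAC_BITS) & EXP_MASK
    frac = bits & FRAC_MASK
    if exp_field == EXP_MASK:          # infinity or NaN
        return sign, None, frac
    if exp_field == 0:                 # zero or subnormal: no implicit bit
        return sign, 1 - BIAS, frac
    return sign, exp_field - BIAS, frac | (1 << FRAC_BITS)

one = 0x3FFF << FRAC_BITS              # encoding of 1.0
sign, exponent, significand = decode_binary128(one)
```

A format that stores the full significand explicitly skips the special-casing of the implicit bit, which is one plausible reading of "easier to parse".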

Besides, does your test not end up calculating the square root of exactly 1 for the majority of loop iterations?
Since we're talking about software, there is a danger that this case is not representative of typical timing.
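
The concern can be checked numerically. Each pass replaces every element with its square root, so after k passes an element x has become x**(1/2**k), which approaches 1 for any positive x. A small Python sketch in double precision follows; the sample values are arbitrary and not taken from the test run.

```python
import math

def repeated_sqrt(x, count):
    """Apply sqrt `count` times, as the loop in the test program does per element."""
    for _ in range(count):
        x = math.sqrt(x)
    return x

samples = [0.001, 0.5, 0.999]              # arbitrary inputs from (0, 1)
after_ten = [repeated_sqrt(x, 10) for x in samples]
for x, y in zip(samples, after_ten):
    print(x, "->", y)
```

After ten passes every sample is within one percent of 1 without reaching it, so the later passes operate on a very narrow range of inputs.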

Re: Approximate reciprocals

by: Quadibloc - Tue, 22 Mar 2022 22:51 UTC

On Tuesday, March 22, 2022 at 7:24:58 AM UTC-6, Terje Mathisen wrote:

> Even when you need exact results, having that reciprocal starting point
> means that you will converge on that value significantly faster.

Of course, but if the only use of an approximate reciprocal is to
begin a software division routine... why not just have a divide
instruction instead? One that is fast and efficient.

John Savard

Marcus wrote:
> On 2022-03-22, Terje Mathisen wrote:
>> Marcus wrote:
>>> What are your feelings towards including such instructions in an ISA?
>>
>> Stupid not to do it?
>>
>>>
>>> My own feelings are mixed.
>>>
>>> Pros:
>>> * Easy to implement in hardware.
>>> * Can provide a significant speedup for certain workloads, especially
>>>   if limited accuracy is acceptable.
>>
>> Even when you need exact results, having that reciprocal starting
>> point means that you will converge on that value significantly faster.
>>
>> I.e. starting with a ~12-bit approximation gives you float with
>> fractional ulp accuracy after one NR stage and exact float with one
>> more iteration, still significantly faster than an SRT divider/sqrt
>> circuit.
>
> Follow-up question 1: What approximation accuracy should one aim for?
> I've seen 8 bits (yielding SP precision in 2 NR steps), and you mention
> 12 bits. With 13 bits, would you not get full SP precision in 1 NR step?
> CRAY gave a full 30 bits in the first approximation (!).
>
> Follow-up question 2: Implementation-wise, is it better to do a raw
> nearest-neighbor ROM lookup, or more sophisticated ROM + linear
> interpolation? It feels like the latter would give you more bang for
> the buck, possibly at the cost of added latency.

If you aim for single-cycle SIMD lookup, then I really can't see any
room for any form of interpolation, while accessing the same ROM table 4
times is doable, right?

A 12 (or 13?) bit lookup giving 12-bit results, along with the required
exp trickery, looks like about 6-7 KB of table space (4096 entries x 12 bits is 48 Kbits).
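
On follow-up question 1 quoted above: the reason the seed width matters is that each Newton-Raphson step for the reciprocal, y' = y*(2 - x*y), squares the relative error, so the number of correct bits roughly doubles per step. Below is a minimal Python sketch using exact rational arithmetic. The helper names and the operand 1.3 are illustrative assumptions, and `seed_reciprocal` only stands in for a ROM lookup; it does not model any particular hardware table.

```python
from fractions import Fraction
import math

def seed_reciprocal(x, bits):
    """Stand-in for a ROM lookup: 1/x truncated to `bits` fractional bits."""
    scale = 1 << bits
    return Fraction(math.floor(scale / x), scale)

def nr_step(x, y):
    """One Newton-Raphson step for the reciprocal: y' = y * (2 - x*y)."""
    return y * (2 - x * y)

def accurate_bits(x, y):
    """Correct bits, measured as -log2 of the relative error |1 - x*y|."""
    return -math.log2(abs(1 - x * y))

x = Fraction(13, 10)                       # arbitrary operand in [1, 2)
for seed_bits in (8, 12):
    y = seed_reciprocal(x, seed_bits)
    history = [accurate_bits(x, y)]
    for _ in range(2):
        y = nr_step(x, y)
        history.append(accurate_bits(x, y))
    print(seed_bits, [round(b, 1) for b in history])
```

For this operand the 12-bit seed is accurate to 12 bits and reaches 24 and then 48 bits, while the 8-bit seed needs two steps to pass 24 bits. Real hardware rounds after every operation, so the final bit or two still needs separate care.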

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig wrote:
> Certainly faster than doing the same thing in software.
>
> The little test program
>
> program main
> implicit none
> integer, parameter :: qp = selected_real_kind(30)
> integer, parameter :: n = 10**6, m = 10
> real :: t1, t2
> character(len=20) :: c
> integer :: i,k
> real(kind=qp), dimension(:), allocatable :: a
> allocate (a(n))
> c = '10'
> call random_number(a)
> do i=1,m
> a = sqrt(a)
> read (unit=c,fmt=*) k
> write (*,*) a(k)
> end do
> end program main
>
> calculates 10**7 16-byte square roots in around 2.5 seconds
> on a CPU with 2.2 GHz, so around 550 cycles per square root,
> on a Zen 1 (I have not bothered to filter out the overhead).
>
If those random inputs cover the full 128-bit range, then I'm quite
impressed, 550 cycles really isn't that bad:

I would probably implement sqrtq() by taking the input, extracting the
top 53 mantissa bits and the bottom exponent bit (plus a 1022 or 1023
bias), then take the sqrt() of that. Two NR iterations with 128-bit and
256-bit precision would get _very_ close to perfect rounding.

OTOH, it might make more sense to start with the reciprocal sqrt lookup
(assume at least 9 bits), then do the much simpler InvSqrt NR with
increasing precision, where only the last two iterations would need
extended precision, and the final one is modified to produce sqrt
instead of InvSqrt. To get perfect rounding for the final 112-bit
mantissa means that we need almost 230 bits for that last stage, making
it very hard to do with just double precision variables (needing 3
double variables), while having a 64x64->128 unsigned int multiplier
makes it all work out nicely.
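
The InvSqrt route described above can be sketched as follows. This is only an illustration under simplifying assumptions: it uses Python's exact `Fraction` arithmetic instead of precision that grows per iteration, the 9-bit seed comes from a double-precision sqrt standing in for a lookup table, and the final result is obtained by multiplying by x rather than by a modified last step. All function names are hypothetical.

```python
from fractions import Fraction
import math

def rsqrt_seed(x, bits=9):
    """Stand-in for a table lookup: 1/sqrt(x) truncated to `bits` fractional bits."""
    scale = 1 << bits
    return Fraction(math.floor(scale / math.sqrt(x)), scale)

def rsqrt_step(x, y):
    """One Newton-Raphson step for 1/sqrt(x): y' = y * (3 - x*y*y) / 2."""
    return y * (3 - x * y * y) / 2

def sqrt_via_rsqrt(x, target_bits=113):
    """Return (approximate sqrt(x), number of NR steps used) for x > 0."""
    x = Fraction(x)
    y = rsqrt_seed(x)
    tolerance = Fraction(1, 1 << target_bits)
    steps = 0
    while abs(1 - x * y * y) > tolerance:   # residual is about twice the relative error
        y = rsqrt_step(x, y)
        steps += 1
    return x * y, steps                      # sqrt(x) = x * (1/sqrt(x))

root, steps = sqrt_via_rsqrt(2)
print(steps, float(root))
```

The step needs only multiplies and a subtract, which is why it is simpler than the NR iteration for sqrt itself, which needs a division. Getting the last bit correctly rounded is a separate problem that this sketch does not address.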

A 128x128->256 MUL needs 4 64x64->128 operations plus a bunch of
carries, I'm guessing it could be done in less than 20 cycles?

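;; 128x128->256 multiply built from four 64x64->128 partial products.
;; Inputs in r10:r9 and r12:r11, result in r8:r15:r14:r13
mov rax,r9
mul r11            ; low * low
mov r13,rax
mov r14,rdx

mov rax,r10
mul r11            ; high * low
xor r15,r15
add r14,rax
adc r15,rdx

mov rax,r9
mul r12            ; low * high
xor r8,r8
add r14,rax
adc r15,rdx
adc r8,0           ; carry into the top word

mov rax,r10
mul r12            ; high * high
add r15,rax
adc r8,rdx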

With a wide OoO cpu those MULs are all independent so they can overlap,
only the carries are sequential, so in theory it could run in close to
10 cycles using a 4-cycle MUL as the building block, but 20 seems safe.
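
The same schoolbook scheme can be checked against Python's built-in big integers. The sketch below mirrors the four partial products and the add/adc carry chain of the assembly above, with 64-bit words emulated by masking. The helper names are illustrative only.

```python
MASK64 = (1 << 64) - 1

def mul64(a, b):
    """64x64 -> 128 multiply, returned as (low word, high word) like rax, rdx."""
    p = a * b
    return p & MASK64, p >> 64

def mul128(a_lo, a_hi, b_lo, b_hi):
    """128x128 -> 256 multiply; returns four 64-bit words, least significant first."""
    w0, w1 = mul64(a_lo, b_lo)            # a_lo * b_lo

    lo, hi = mul64(a_hi, b_lo)            # a_hi * b_lo, weight 2**64
    w1 += lo
    w2 = hi + (w1 >> 64)                  # add, then adc with the carry
    w1 &= MASK64

    lo, hi = mul64(a_lo, b_hi)            # a_lo * b_hi, weight 2**64
    w1 += lo
    w2 += hi + (w1 >> 64)
    w1 &= MASK64
    w3 = w2 >> 64                         # carry into the top word
    w2 &= MASK64

    lo, hi = mul64(a_hi, b_hi)            # a_hi * b_hi, weight 2**128
    w2 += lo
    w3 += hi + (w2 >> 64)
    w2 &= MASK64
    return w0, w1, w2, w3

def check(a, b):
    """Compare the four-word result with Python's exact product."""
    words = mul128(a & MASK64, a >> 64, b & MASK64, b >> 64)
    combined = sum(w << (64 * i) for i, w in enumerate(words))
    return combined == a * b

print(check((1 << 128) - 1, (1 << 128) - 1))
```

The four `mul64` calls do not depend on each other, only the carry handling is sequential, which matches the overlap argument made for an out-of-order core.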

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Quadibloc wrote:
> On Tuesday, March 22, 2022 at 7:24:58 AM UTC-6, Terje Mathisen wrote:
>
>> Even when you need exact results, having that reciprocal starting point
>> means that you will converge on that value significantly faster.
>
> Of course, but if the only use of an approximate reciprocal is to
> begin a software division routine... why not just have a divide
> instruction instead? One that is fast and efficient.

I was more worried about InvSqrt(), which can of course be used as a
building block for both FSQRT and FDIV as well.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

by: Thomas Koenig - Wed, 23 Mar 2022 17:54 UTC

Michael S <already5chosen@yahoo.com> wrote:

>> The little test program
>>
>> program main
>> implicit none
>> integer, parameter :: qp = selected_real_kind(30)
>> integer, parameter :: n = 10**6, m = 10
>> real :: t1, t2
>> character(len=20) :: c
>> integer :: i,k
>> real(kind=qp), dimension(:), allocatable :: a
>> allocate (a(n))
>> c = '10'
>> call random_number(a)
>> do i=1,m
>> a = sqrt(a)
>> read (unit=c,fmt=*) k
>> write (*,*) a(k)
>> end do
>> end program main
>>
>> calculates 10**7 16-byte square roots in around 2.5 seconds
>> on a CPU with 2.2 GHz, so around 550 cycles per square root,
>> on a Zen 1 (I have not bothered to filter out the overhead).
>
> If I understand correctly, real(kind=qp) is the equivalent of IEEE binary128.

This is what GNU Fortran uses on most platforms, using the quadmath
library.

> IIRC, my own quad-precision class calculates square roots 2-3 times faster.

That would be interesting. Could you post a benchmark?

> And it has better precision than GNU binary128: a 128-bit mantissa instead of 113 bits.
> So it seems that for binary128 I should be able to calculate sqrt a little faster still.
> Of course, not in 40 clock cycles, but maybe in 105-110.
> Or maybe not. My internal format is easier to parse in software than binary128,
> so the more complicated parsing may bring the speed back to the same point.
>
> Besides, does not your test end up calculating square root of
> exactly 1 for majority of loop iterations? Since we're talking
> about sw, there is a danger that this case is not representative
> of typical timing.

Not exactly one, but close to one.

Here's an updated program:

  program main
    implicit none
    integer, parameter :: qp = selected_real_kind(30)   ! kind with at least 30 decimal digits
    integer, parameter :: n = 10**6, m = 10
    real :: t1, t2
    character(len=20) :: c
    integer :: i,k
    real(kind=qp), dimension(:), allocatable :: a
    allocate (a(n))
    c = '10'
    call random_number(a)
    a = a * 1e50_qp            ! scale the inputs from [0,1) up to [0,1e50)
    call cpu_time(t1)
    do i=1,m
      a = sqrt(a)              ! n square roots per pass
      read (unit=c,fmt=*) k    ! k is read from the string c at run time
      write (*,*) a(k)
    end do
    call cpu_time(t2)
    write (*,'(A,F12.5,A,ES15.5,A)') "Used ",t2-t1, " seconds to calculate ", &
         real(n*m,qp), " square roots."
  end program main

whose output is

7671809140563249423302811.97887168825
2769803086965.43433961116406142440907
1664272.53987002812039255042083903983
1290.06687418522150052441522368030520
35.9175009457119872785306931352753634
5.99312113557802094243808280085584780
2.44808519777764698252724820741238982
1.56463580355865785822592147498455777
1.25085402967678760971072508986042575
1.11841585721805089654732018717047560
Used 2.37363 seconds to calculate 1.00000E+07 square roots.

so the number of cycles is similar.

I just ran the same program on a POWER with hardware
support for 128-bit floats (using to-be-released support for
-mabi=ieeelongdouble), and the output was

8625926447798841161143265.26897396334
2936992755830.16056596538977565335987
1713765.66537848475066226136266112131
1309.10872939511206616748020241410596
36.1816076120881072462540037278106935
6.01511492925015583581640205312340543
2.45257312413924894901462105871990474
1.56606932290344316961162320054757086
1.25142691472712187692445496527231813
1.11867194240631684707050674605147131
Used 0.31389 seconds to calculate 1.00000E+07 square roots

This machine also uses a 2.2 GHz clock (or claims to - hard to
be sure with a virtual machine, it is likely to lie about its CPU
frequency), so that would come to around 70 cycles including the
loop and load/store overhead.
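
The cycle estimates follow from run time times clock rate divided by the number of operations. A short Python sketch of that arithmetic, using the two timings printed above and taking the 2.2 GHz clock as given (the POWER figure in particular rests on the frequency the virtual machine reports):

```python
def cycles_per_op(seconds, clock_hz, operations):
    """Average cycles per operation, including loop and load/store overhead."""
    return seconds * clock_hz / operations

CLOCK_HZ = 2.2e9        # assumed clock for both machines
OPERATIONS = 10**7      # n * m square roots

zen1_cycles = cycles_per_op(2.37363, CLOCK_HZ, OPERATIONS)
power_cycles = cycles_per_op(0.31389, CLOCK_HZ, OPERATIONS)
print(round(zen1_cycles), round(power_cycles))
```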

Re: Approximate reciprocals

by: Michael S - Wed, 23 Mar 2022 21:24 UTC

On Wednesday, March 23, 2022 at 7:54:38 PM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> wrote:
> > IIRC, my own quad-precision class calculates square roots 2-3 times faster.
> That would be interesting. Could you post a benchmark?

The benchmark is part of the test suite for my quad-precision class.
It is called, unsurprisingly, tst_sqrt.
https://github.com/already5chosen/extfloat/tree/master/128
The code was developed on Windows and tested with both MSVC and gcc (under msys64).
It was never built or tested on Linux or on platforms other than x64, but I expect that it will take
less than an hour to get it running on aarch64-linux with gcc or clang, or on little-endian POWER Linux.
The biggest expected difficulty would be in the test infrastructure. My timing report uses __rdtsc(), which
will have to be replaced.
Big-endian POWER: I am less sure. It should work, but probably not on the first try.

The speed is reported in nominal __rdtsc() cycles, so it underestimates the number of actual CPU cycles.
Less so on desktops, more so on laptops or multicore servers.

And it is faster than I remembered.
On Core i5-3450 (Ivy Bridge, nominal freq=3100 MHz) it reports 78 cycles.
On Xeon E3-1271 v3 (Haswell, nominal freq=3600 MHz) it reports 72 cycles.
On Xeon E-2176G (Skylake, nominal freq=3700 MHz) it reports 58 cycles.

For the last machine I know that the actual frequency is 4250 MHz, so the actual number of cycles is 67.
For the other two I am not sure, but would guess that the i5 runs at 3400-3500 MHz and the Xeon E3 at its max turbo of 4000 MHz.
So the actual number of cycles is, respectively, 86-88 and 80.
Zen 1 is very similar to Haswell, so I expect ~80 cycles there, too.
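
The conversion from nominal TSC cycles to actual core cycles is a simple scaling by the ratio of actual to nominal frequency. A Python sketch of that arithmetic with the figures from this post; the actual frequencies for the Ivy Bridge and Haswell machines are the guesses stated above, not measurements:

```python
def actual_cycles(tsc_cycles, nominal_mhz, actual_mhz):
    """Scale a count of nominal-frequency TSC ticks to core clock cycles."""
    return tsc_cycles * actual_mhz / nominal_mhz

skylake = actual_cycles(58, 3700, 4250)
ivy_low = actual_cycles(78, 3100, 3400)     # guessed lower bound on actual clock
ivy_high = actual_cycles(78, 3100, 3500)    # guessed upper bound on actual clock
haswell = actual_cycles(72, 3600, 4000)     # assumes max turbo
print(round(skylake), round(ivy_low), round(ivy_high), round(haswell))
```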

> This machine also uses a 2.2 GHz clock (or claims to - hard to
> be sure with a virtual machine, it is likely to lie about its CPU
> frequency), so that would come to around 70 cycles including the
> loop and load/store overhead.

POWER with 128-bit FP hardware running at 2.2 GHz?
It seems to me that such hardware does not exist.
The last POWER CPU able to run that slowly was POWER5+, but it had no quad-precision hardware.
If I am not mistaken, the minimum clock for POWER9 is 2.8 GHz, and for POWER6/7/8 it is higher.

Re: Approximate reciprocals

by: Thomas Koenig - Thu, 24 Mar 2022 09:46 UTC

Michael S <already5chosen@yahoo.com> wrote:
> On Wednesday, March 23, 2022 at 7:54:38 PM UTC+2, Thomas Koenig wrote:
>> Michael S <already...@yahoo.com> wrote:
>> > IIRC, my own quad-precision class calculates square roots 2-3 times faster.
>> That would be interesting. Could you post a benchmark?
>
> The benchmark is part of the test suite for my quad-precision class.
> It is called, unsurprisingly, tst_sqrt.
> https://github.com/already5chosen/extfloat/tree/master/128
> The code was developed on Windows and tested with both MSVC and gcc (under msys64).
> It was never built or tested on Linux or on platforms other than x64, but I expect that it will take
> less than an hour to get it running on aarch64-linux with gcc or clang, or on little-endian POWER Linux.
> The biggest expected difficulty would be in the test infrastructure. My timing report uses __rdtsc(), which
> will have to be replaced.

Wall-clock timing is probably better because rdtsc is notoriously
unreliable.
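
A wall-clock measurement of this kind can be sketched as follows in Python. Everything here is an illustrative assumption rather than the benchmark under discussion: the helper `seconds_per_call` is made up, `Decimal.sqrt` at 34 digits merely stands in for a quad-precision square root, and the 2.2 GHz clock used for the cycle estimate has to be supplied by whoever runs it.

```python
import time
from decimal import Decimal, getcontext

def seconds_per_call(func, values, repeats=5):
    """Best-of-`repeats` wall-clock time per call of func over `values`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for v in values:
            func(v)
        best = min(best, time.perf_counter() - start)
    return best / len(values)

getcontext().prec = 34                      # 34 decimal digits, roughly binary128 precision
values = [Decimal(i) / 7 for i in range(1, 2001)]
per_call = seconds_per_call(Decimal.sqrt, values)
estimated_cycles = per_call * 2.2e9         # assumes a 2.2 GHz clock
print(f"{per_call * 1e9:.0f} ns per call, about {estimated_cycles:.0f} cycles")
```

Taking the best of several runs reduces the influence of scheduling noise, but the conversion to cycles is only as good as the assumed clock frequency.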

> Big-endian POWER: I am less sure. It should work, but probably not on the first try.
>
> The speed is reported in nominal __rdtsc() cycles, so it underestimates the number of actual CPU cycles.
> Less so on desktops, more so on laptops or multicore servers.
>
> And it is faster than I remembered.
> On Core i5-3450 (Ivy Bridge, nominal freq=3100 MHz) it reports 78 cycles.
> On Xeon E3-1271 v3 (Haswell, nominal freq=3600 MHz) it reports 72 cycles.
> On Xeon E-2176G (Skylake, nominal freq=3700 MHz) it reports 58 cycles.
>
> For the last machine I know that the actual frequency is 4250 MHz, so the actual number of cycles is 67.
> For the other two I am not sure, but would guess that the i5 runs at 3400-3500 MHz and the Xeon E3 at its max turbo of 4000 MHz.
> So the actual number of cycles is, respectively, 86-88 and 80.
> Zen 1 is very similar to Haswell, so I expect ~80 cycles there, too.
>
>> This machine also uses a 2.2 GHz clock (or claims to - hard to
>> be sure with a virtual machine, it is likely to lie about its CPU
>> frequency), so that would come to around 70 cycles including the
>> loop and load/store overhead.
>
> POWER with 128-bit FP hardware running at 2.2 GHz?
> It seems to me, such HW does not exist.

You can always tune down the frequency...

$ head /proc/cpuinfo
processor : 0
cpu : POWER9 (architected), altivec supported
clock : 2200.000000MHz
revision : 2.2 (pvr 004e 1202)

processor : 1
cpu : POWER9 (architected), altivec supported
clock : 2200.000000MHz
revision : 2.2 (pvr 004e 1202)

> The last POWER CPU that was able to run that so slowly was POWER5+, but it had no quad-precision HW.
> If I am not mistaken, for POWER9 a minimal clock is 2.8 GHz, for POWER6/7/8 it's higher.

or, like I said, it may be lying about its core frequency
because it is running under a hypervisor.

Re: Approximate reciprocals

by: Michael S - Thu, 24 Mar 2022 14:54 UTC

On Thursday, March 24, 2022 at 11:46:23 AM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> wrote:
> > On Wednesday, March 23, 2022 at 7:54:38 PM UTC+2, Thomas Koenig wrote:
> >> Michael S <already...@yahoo.com> wrote:
> >> >> The little test program
> >> >>
> >> >> program main
> >> >> implicit none
> >> >> integer, parameter :: qp = selected_real_kind(30)
> >> >> integer, parameter :: n = 10**6, m = 10
> >> >> real :: t1, t2
> >> >> character(len=20) :: c
> >> >> integer :: i,k
> >> >> real(kind=qp), dimension(:), allocatable :: a
> >> >> allocate (a(n))
> >> >> c = '10'
> >> >> call random_number(a)
> >> >> do i=1,m
> >> >> a = sqrt(a)
> >> >> read (unit=c,fmt=*) k
> >> >> write (*,*) a(k)
> >> >> end do
> >> >> end program main
> >> >>
> >> >> calculates 10**7 16-byte square roots in around 2.5 seconds
> >> >> on a CPU with 2.2 GHz, so around 550 cycles per square root,
> >> >> on a Zen 1 (I have not bothered to filter out the overhead).
> >> >
> >> > If I understand correctly, real(kind=qp) is equivalent of IEEE binary128.
> >> This is what GNU Fortran uses on most platforms, using the quadmath
> >> library.
> >> > IIRC, my own quad-precision class calculates square roots 2-3 times faster.
> >> That would be interesting. Could you post a benchmarks?
> >
> > The benchmark is a part of test suite for my quad-precision class.
> > It is call, unsurprisingly, tst_sqrt.
> > https://github.com/already5chosen/extfloat/tree/master/128
> > The code was developed on Windows and tested both with MSVC and gcc (under msys64).
> > It was never built or tested on Linux or on platforms other than x64, but I expect that it will take
> > less than hour to make it running on aarch64-linux/gcc or clang or on little-endian POWER-Linux.
> > The biggest expected difficulty would be in test infrastructure. My timing report uses __rdtsc() that
> > will have to be replaced.
> Wall-clock timing is probably better because rdtsc is notoriously
> unreliable.

If all you want to know is wall clock, then [on relatively modern x86] rdtsc is very reliable.
At least when running on physical machines, as I did in all my measurements.
The problem with it is not reliability but portability to non-x86 architectures.
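
A portable stand-in for __rdtsc() in a timing harness could look roughly like the sketch below. It assumes a C11 library that provides timespec_get(); the names wall_seconds and ns_per_call are made up for illustration and are not part of the extfloat test suite.

```c
#include <time.h>

/* Hypothetical helper: current wall-clock time in seconds.
 * Uses C11 timespec_get(), so it does not depend on x86 intrinsics. */
double wall_seconds(void)
{
    struct timespec ts;
    if (timespec_get(&ts, TIME_UTC) != TIME_UTC)
        return -1.0; /* clock not available */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: average wall-clock nanoseconds per call of fn. */
double ns_per_call(void (*fn)(void), long reps)
{
    double t0 = wall_seconds();
    for (long i = 0; i < reps; ++i)
        fn();
    double t1 = wall_seconds();
    return (t1 - t0) * 1e9 / (double)reps;
}
```

This reports time, not cycles; turning it into cycles still needs a core frequency that one trusts, which is the hard part on laptops and virtual machines.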

Re: Approximate reciprocals

by: Thomas Koenig - Thu, 24 Mar 2022 14:59 UTC

Michael S <already5chosen@yahoo.com> wrote:
> On Thursday, March 24, 2022 at 11:46:23 AM UTC+2, Thomas Koenig wrote:
>> Wall-clock timing is probably better because rdtsc is notoriously
>
> If all you want to know is wall clock then [on relative modern x86] rdtsc is very reliable.

So, what is the wall-clock timing for calculating 10^7 square roots in
128-bit precision, and on what sort of system?

Re: Approximate reciprocals

by: Michael S - Thu, 24 Mar 2022 15:31 UTC

On Thursday, March 24, 2022 at 4:59:22 PM UTC+2, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> wrote:
> > On Thursday, March 24, 2022 at 11:46:23 AM UTC+2, Thomas Koenig wrote:
> >> Wall-clock timing is probably better because rdtsc is notoriously
> >
> > If all you want to know is wall clock then [on relative modern x86] rdtsc is very reliable.
> So, what is the wall-clock timing for calculating 10^7 square roots in
> 128 bit precision, on what sort of system?

The needed numbers are in a post above.
Core i5-3450:    78/3100e6*1e7 = 0.252 s
Xeon E3-1271 v3: 72/3600e6*1e7 = 0.200 s
Xeon E-2176G:    58/3700e6*1e7 = 0.157 s
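
The conversion used in these three lines, and its inverse, can be written out as a small sketch. The function names are illustrative only. The frequency passed in has to be the rate at which the counter ticks, which is why the nominal frequency is used with the __rdtsc() cycle counts.

```c
/* Seconds needed for n_ops operations that take `cycles` counter ticks each,
 * with the counter ticking at freq_hz. */
double seconds_for(double cycles, double freq_hz, double n_ops)
{
    return cycles / freq_hz * n_ops;
}

/* Inverse: ticks per operation from a measured run time. */
double cycles_per_op(double seconds, double freq_hz, double n_ops)
{
    return seconds * freq_hz / n_ops;
}
```

For example, seconds_for(72, 3600e6, 1e7) reproduces the 0.200 s above, and cycles_per_op(2.5, 2.2e9, 1e7) gives the 550 cycles quoted earlier in the thread for the Fortran program on the 2.2 GHz machine.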

Re: Approximate reciprocals

by: BGB - Thu, 24 Mar 2022 20:11 UTC

On 3/22/2022 11:44 AM, MitchAlsup wrote:
> On Tuesday, March 22, 2022 at 10:56:06 AM UTC-5, Marcus wrote:
>> On 2022-03-22, MitchAlsup wrote:
>>> On Tuesday, March 22, 2022 at 3:25:11 AM UTC-5, Marcus wrote:
>>>> Hello group!
>>>>
>>>> A class of instructions that is very tempting to include in an ISA is
>>>> approximate floating-point reciprocals (1/x & 1/sqrt(x)), and
>>>> possibly specialized instructions for improving precision
>>>> (a Newton-Raphson step).
>>>>
>>>> Many ISAs have such instructions:
>>>>
>>>> CRAY:
>>>> * 070ijx - Floating reciprocal approximation
>>>> * 067ijk - Reciprocal iteration
>>>>
>>>> ARM:
>>>> * FRECPE - Floating-point reciprocal estimate
>>>> * FRECPS - Floating-point reciprocal step
>>>> * FRSQRTE - Floating-point reciprocal square root estimate
>>>> * FRSQRTS - Floating-point reciprocal square root step
>>>>
>>>> POWER:
>>>> * FRES - Floating reciprocal estimate single
>>>> * FRSQRTE - Floating reciprocal square root estimate
>>>>
>>>> x86:
>>>> * RSQRTSS - Approximate reciprocal square root
>>>>
>>>> TI C67x:
>>>> * RCPSP - Floating-Point reciprocal approximation
>>>> * RSQRSP - Floating-Point reciprocal square root approximation
>>>>
>>>> ...and there are probably others.
>>>>
>>>> What are your feelings towards including such instructions in an ISA?
>>>>
>>>> My own feelings are mixed.
>>> <
>>> My feelings are::
>>> a) screw the approximations
>>> b) because doing them to faithful accuracy is easy
>>> c) the right answer is only 14 cycles
>>> d) if you build the FMAC unit correctly
>>> See: USPTO 10,761,806
>> Thanks, that makes for a good read!
>>
>> While I see the benefits of fast argument reduction and a generalized
>> coefficient + FMA engine for implementing transcendental functions, the
>> reciprocal and reciprocal sqrt functions are simpler and common enough
>> to possibly warrant a dedicated solution?
> <
> You get to make those choices.
>>
>> E.g. the CRAY could do a full DP division by means of approximate
>> reciprocal + NR-step + multiply by nominator in just 3-4 clock cycles
>> (vectorized), which is pretty impressive even by today's standards.
> <
> Its OK if you are content with 46-bits of accuracy.
> <

A lot likely depends on the approximate reciprocal.

Say, for FRCP:
MagicConstant-FpBits

Accuracy can be improved slightly by using a lookup table for the
high-order bits, which gives a bias to add to the high-order bits of the
result.
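
As a rough illustration of the MagicConstant-FpBits idea for binary32 (without the lookup-table refinement), something like the sketch below would do. The constant 0x7EF311C7 and the name approx_rcp are my assumptions, not values from the post, and only positive normal inputs are considered.

```c
#include <stdint.h>
#include <string.h>

/* Assumed magic constant for binary32; not taken from the post. */
#define RCP_MAGIC 0x7EF311C7u

/* Rough 1/x for positive, normal x: one integer subtraction on the raw bits. */
float approx_rcp(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined type pun */
    bits = RCP_MAGIC - bits;          /* MagicConstant - FpBits */
    float y;
    memcpy(&y, &bits, sizeof y);
    return y;
}
```

With this constant the result lands within a few percent of 1/x for ordinary inputs; zero, infinities, NaNs and denormals are not handled.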

One still needs a few N-R stages though, so I ended up skipping these as
they didn't buy much over doing it with integer ops. Hardware could
potentially do special cases though, like turning Zero into NaN or similar.

The N-R step can be mapped to FMAC, though this case is helped if one has
either multiple versions (with built-in sign negation) or the ability
to negate values cheaply.

Eg:
A*B+C
A*B-C
-A*B+C
-A*B-C
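
To show where the negated forms come in, here is a sketch of one N-R step for the reciprocal, written so that each statement has a multiply-add shape. The name rcp_nr_step is hypothetical; whether a compiler actually fuses these into FMA instructions depends on the target and flags.

```c
/* One Newton-Raphson step for y ~ 1/x, as two multiply-add shapes. */
float rcp_nr_step(float x, float y)
{
    float e = -x * y + 1.0f;   /* -A*B+C : residual 1 - x*y   */
    return y * e + y;          /*  A*B+C : y + y*(1 - x*y)    */
}
```

The relative error is roughly squared by each step: starting from y = 0.2 for x = 4 (20% off), one step gives about 0.24 (4% off) and a second about 0.2496 (0.16% off).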

For graphics-processing and similar (with FP16), one can potentially
skip the N-R step.

So, say:
V2=V0/V1
Becomes:
V2=V0*(MagicBias_Div-V1)

Usually works because people are not overly fussy about exact values
when it comes to pixels.

Similarly for SQRT:
(Val>>1)+MagicBias_Sqrt
Or, InvSqrt:
MagicBias_InvSqrt-(Val>>1)
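
For binary32 these two forms could be sketched as below. The bias values are assumptions on my part (the post gives none): 127<<22 is the untuned choice for SQRT, and 0x5F3759DF is the widely circulated constant for InvSqrt. Only positive normal inputs are considered.

```c
#include <stdint.h>
#include <string.h>

/* Assumed binary32 biases; the post does not give concrete values. */
#define SQRT_BIAS    0x1FC00000u  /* 127 << 22, untuned */
#define INVSQRT_BIAS 0x5F3759DFu  /* widely circulated constant */

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* (Val>>1)+MagicBias_Sqrt */
float approx_sqrt(float x)    { return u2f((f2u(x) >> 1) + SQRT_BIAS); }

/* MagicBias_InvSqrt-(Val>>1) */
float approx_invsqrt(float x) { return u2f(INVSQRT_BIAS - (f2u(x) >> 1)); }
```

With the untuned bias, approx_sqrt(4.0f) is exactly 2.0f while approx_sqrt(2.0f) comes out as 1.5f, about 6% high; a tuned bias would spread that error more evenly.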

....

Likewise, for FP16 one could possibly get along OK using a single-stage
approximation as the final result in many cases.

>>>>
>>>> Pros:
>>>> * Easy to implement in hardware.
>>>> * Can provide a significant speedup for certain workloads, especially
>>>> if limited accuracy is acceptable.
>>>>
>>>> Cons:
>>>> * Hard to specify exact operation.
>>>> * Likely a source of poor portability (borderline undefined behavior).
>>>> * The "step/iteration" instructions can usually be replaced by FMA.
>>>>
>>>> /Marcus

The Itanium had only an FP reciprocal instruction and no divide; division had
to be done in software. See https://www.cl.cam.ac.uk/~jrh13/papers/hol00.pdf

On Tue, 22 Mar 2022, Marcus wrote:

> Hello group!
>
> A class of instructions that is very tempting to include in an ISA is
> approximate floating-point reciprocals (1/x & 1/sqrt(x)), and
> possibly specialized instructions for improving precision
> (a Newton-Raphson step).
>
> Many ISAs have such instructions:
>
> CRAY:
> * 070ijx - Floating reciprocal approximation
> * 067ijk - Reciprocal iteration
>
> ARM:
> * FRECPE - Floating-point reciprocal estimate
> * FRECPS - Floating-point reciprocal step
> * FRSQRTE - Floating-point reciprocal square root estimate
> * FRSQRTS - Floating-point reciprocal square root step
>
> POWER:
> * FRES - Floating reciprocal estimate single
> * FRSQRTE - Floating reciprocal square root estimate
>
> x86:
> * RSQRTSS - Approximate reciprocal square root
>
> TI C67x:
> * RCPSP - Floating-Point reciprocal approximation
> * RSQRSP - Floating-Point reciprocal square root approximation
>
> ...and there are probably others.
>
> What are your feelings towards including such instructions in an ISA?
>
> My own feelings are mixed.
>
> Pros:
> * Easy to implement in hardware.
> * Can provide a significant speedup for certain workloads, especially
> if limited accuracy is acceptable.
>
> Cons:
> * Hard to specify exact operation.
> * Likely a source of poor portability (borderline undefined behavior).
> * The "step/iteration" instructions can usually be replaced by FMA.
>
> /Marcus
>

Re: Approximate reciprocals

by: Michael S - Fri, 25 Mar 2022 14:39 UTC

On Thursday, March 24, 2022 at 5:31:28 PM UTC+2, Michael S wrote:
> On Thursday, March 24, 2022 at 4:59:22 PM UTC+2, Thomas Koenig wrote:
> > Michael S <already...@yahoo.com> wrote:
> > > On Thursday, March 24, 2022 at 11:46:23 AM UTC+2, Thomas Koenig wrote:
> > >> Wall-clock timing is probably better because rdtsc is notoriously
> > >
> > > If all you want to know is wall clock then [on relative modern x86] rdtsc is very reliable.
> > So, what is the wall-clock timing for calculating 10^7 square roots in
> > 128 bit precision, on what sort of system?
> The needed numbers are in a post above.
> Core i5-3450: 78/3100e6 *1e7 = 0.252s
> Xeon E3-1271 v3: 72/3600e6*1e7= 0.200s
> Xeon E-2176G: 58/3700e6*1e7 = 0.157s

For reference, I compiled your program (MSYS2, gfortran 11.2.0, -O2) and ran it on all three systems.
Core i5-3450: 3.75962 seconds
Xeon E3-1271 v3: 2.96402 seconds
Xeon E-2176G: 2.70312 seconds

So, it seems, either your Zen1 CPU runs at a much higher frequency (over 4.5 GHz?), or your version of the quadmath library
is much better than the one supplied with MSYS2 (mingw-w64-x86_64-gcc-libgfortran 11.2.0-10), or the library
likes AMD CPUs and hates Intel's.
Another, not very probable, possibility is that your cpu_time() call is lying, but that is easily verified by running
the binary under the 'time' utility.
And yet another possibility is that your compiler managed to parallelize the line 'a = sqrt(a)'. Crazy suggestion, I know.

But one thing is sure - [on Intel] my quad-precision sqrt() is close to 15 times faster than the one supplied with the GCC Quad-Precision Math Library.
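
The arithmetic behind the "15 times" and "over 4.5 GHz" remarks can be sketched as follows. The function names are illustrative, and required_hz rests on the assumption that both machines need the same number of cycles per square root, which is exactly what is in doubt here.

```c
/* How many times faster the second timing is than the first. */
double speedup(double slow_seconds, double fast_seconds)
{
    return slow_seconds / fast_seconds;
}

/* Clock a second machine would need to finish the same work in
 * target_seconds, assuming equal cycles per operation. */
double required_hz(double ref_seconds, double ref_hz, double target_seconds)
{
    return ref_seconds * ref_hz / target_seconds;
}
```

speedup(2.96402, 0.200) is about 14.8 for the Xeon E3, and required_hz(2.70312, 4250e6, 2.37363) comes to roughly 4.8 GHz when the Xeon E-2176G at its actual 4250 MHz is taken as the reference for the 2.37363 s run.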

Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Thomas Koenig
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	robf...@gmail.com
Useful floating point instructions (was: Approximate reciprocals)	Thomas Koenig
Re: Useful floating point instructions	Terje Mathisen
Re: Useful floating point instructions	Stephen Fuld
Re: Useful floating point instructions	MitchAlsup
Re: Useful floating point instructions	Stephen Fuld
Re: Useful floating point instructions	MitchAlsup
Re: Useful floating point instructions	Michael S
Re: Useful floating point instructions	Stephen Fuld
Re: Useful floating point instructions	Terje Mathisen
Re: Useful floating point instructions	Terje Mathisen
Re: Useful floating point instructions	Stefan Monnier
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	George Neuner
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	George Neuner
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	John Dallman
Re: Approximate reciprocals	MitchAlsup
Re: Approximate reciprocals	George Neuner
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	EricP
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	Anton Ertl
Re: Approximate reciprocals	John Dallman
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Michael S
Re: Approximate reciprocals	Terje Mathisen
Re: Approximate reciprocals	Elijah Stone
Re: Approximate reciprocals	Marcus
Re: Approximate reciprocals	Marcus