Re: Approximate reciprocals

<t2na67$9ue$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=24636&group=comp.arch#24636

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Thu, 7 Apr 2022 18:23:03 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t2na67$9ue$1@newsreader4.netcologne.de>
References: <t1c154$j5t$1@dont-email.me> <t1qf0u$oko$1@dont-email.me>
<t1qkql$ui0$1@newsreader4.netcologne.de>
<394168eb-53ed-49c2-a349-4035c3177361n@googlegroups.com>
<t1rm34$pg9$1@gioia.aioe.org>
<7029a173-963d-402b-a184-642120b5e1b8n@googlegroups.com>
<4bdfaba8-898f-4c1e-8ca1-234bf4d3ffc8n@googlegroups.com>
<t1vd17$5bj$1@newsreader4.netcologne.de>
<1d99080f-3c84-4a44-b2cf-271c2f3f7e90n@googlegroups.com>
<t1vkm4$ar2$1@newsreader4.netcologne.de>
<dc571956-dddd-469a-8b8e-30017e37d5bbn@googlegroups.com>
<t20qrb$4lp$1@newsreader4.netcologne.de>
<1b5bd111-40f0-41e7-9025-787e49f0fd02n@googlegroups.com>
<t22705$2jl$1@newsreader4.netcologne.de>
<c8c6ba2b-1314-48b7-8732-c7df882f0f3en@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 7 Apr 2022 18:23:03 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-ec41-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:ec41:0:7285:c2ff:fe6c:992d";
logging-data="10190"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)

by: Thomas Koenig - Thu, 7 Apr 2022 18:23 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Wednesday, March 30, 2022 at 1:19:52 PM UTC-5, Thomas Koenig wrote:

>> Could you give that Chebyshev formula for 1/x?
><
> p(x) = 0.32323232×x^2 -0.48484848×x + 0.66666667
><
> on the interval 1.0..2.0
><
> Which is compared to 1000 randomized points against 1/x
><
> The highest error encountered has 6.63 bits of precision.
><---------------------------------
> But the x I put into the polynomial was the distance from the mid-point of the interval = 1.5
><---------------------------------
> Making the formula::
><
> p(x) = 0.32323232×(x-1.5)^2 -0.48484848×(x-1.5) + 0.66666667
><
> x in the interval {1.0..2.0} polynomial argument in the range {-0.5..+0.5}
><
><
><
> It is so easy to get lost in eXcel spreadsheet equations.
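
For reference, the quoted quadratic can be checked with a few lines of
Python. This is only a minimal sketch: it assumes an evenly spaced grid of
1001 points rather than the 1000 randomized points mentioned above, it takes
"bits of precision" to mean -log2 of the largest absolute error, and the
names p and worst_case_bits are illustrative, not from the post.

```python
import math

def p(x):
    """Quadratic approximation to 1/x on [1.0, 2.0], centred on 1.5."""
    t = x - 1.5  # polynomial argument in [-0.5, +0.5]
    return 0.32323232 * t * t - 0.48484848 * t + 0.66666667

def worst_case_bits(samples=1001):
    """Return -log2 of the largest absolute error over an even grid."""
    xs = (1.0 + i / (samples - 1) for i in range(samples))
    worst = max(abs(1.0 / x - p(x)) for x in xs)
    return -math.log2(worst)

print(f"{worst_case_bits():.2f} bits")
```

On this grid the largest error is at the left end of the interval, x = 1.0,
and the result agrees with the 6.63 bits quoted above.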

Which is one reason why I seriously don't like Excel and avoid
its use whenever I can. Just today, I was saved from putting a
bad value into a research report by having checked with MathCad
beforehand :-)

Wasn't there some economic theory based on faulty spreadsheets?
I remember such a story a few years ago...

Re: Approximate reciprocals

<t2nf4n$1rgf$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24637&group=comp.arch#24637

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Thu, 7 Apr 2022 21:47:33 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t2nf4n$1rgf$1@gioia.aioe.org>
References: <10f7aa7f-00db-4ade-9e2e-e71602654f49n@googlegroups.com>
<e3cd8ed7-de1d-40ee-a21d-798cd2d3a3b6n@googlegroups.com>
<051bdc59-4b63-4a31-b898-fe9b700dbfc5n@googlegroups.com>
<t2cp1n$6ji$1@newsreader4.netcologne.de>
<1196d0e2-98bd-4fb0-a98f-4c1662e75f0en@googlegroups.com>
<c65c0f4b-e939-43ea-ab44-c09af20ee4fbn@googlegroups.com>
<vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com>
<2022Apr5.181651@mips.complang.tuwien.ac.at>
<kojr4h96ootmmqrm6hdkbijgce4dfuu36s@4ax.com>
<2022Apr7.104701@mips.complang.tuwien.ac.at>
<1398a4bd-bd48-4e60-ab45-383e1bcc0750n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="60943"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11.1
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Thu, 7 Apr 2022 19:47 UTC

Michael S wrote:
> On Thursday, April 7, 2022 at 12:03:32 PM UTC+3, Anton Ertl wrote:
>> We were actually talking about the program that worked nicely on Linux,
>> and failed on WSL.
>>
>
> More I think about it, less I see a justification for WSL control word defaults.

I agree 100%, the WSL setup is simply broken.

>
>> But yes, the library may not be doing what it should do in a program
>> that actively sets the precision to some other values than long
>> double.
>>
>> I can imagine some measures that would deal with the problem in the
>> usual case: E.g., on first invocation the function for setting the
>> precision control changes the vector of the quad-math functions to
>> versions that set the precision control for their own needs (and save
>> and restore the x87 control word). However, that would not work for
>> programs that don't use these functions, but set the control word
>> directly (in assembly language), so unless the library is documented
>> as requiring the function calls for changing the precision control,
>> the library would still be deficient.
>>
>
> In this particular case, the only winning strategy is to refuse to play.
> I.e. quadmath library should be and could be coded without any use of 80-bit
> FP and with very minimalist use of 64-bit FP. On modern 64-bit x86 (and,
> I suppose, on modern ARM and POWER) it's not only the most robust way,
> precision-wise, but also the fastest.

I'm guessing that in real world use there might actually be significant
instances of numbers which happen to fit inside the 80-bit format, but
not in double. In that particular case using 80-bit is obviously a big
win, but this will break down pretty quickly as soon as you do a number
of FMULQs or a single FDIVQ or FSQRTQ.
>
> But I can imagine other cases where judicious use of 80-bit FP is really
> beneficial.

Right.

The FSQRTQ example is an obvious case, since using 80-bit gives you a
64-bit mantissa, which after one NR iteration becomes ~128 which will be
correctly rounded in approximately 65535 out of 65536 cases, right?

Doing a full precision back-multiplication, by splitting the 128-bit
mantissa into two parts and squaring it (4 uint64_t 64x64->128 muls)
gives you the residual and tells you if the initial result needs to be
adjusted by 1 ulp.
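
The back-multiplication can be sketched in Python, with arbitrary-precision
ints standing in for uint64_t registers. This is an illustration under
stated assumptions, not a tuned implementation: the helper names (mul64,
square128, residual) are hypothetical, and real code would use a hardware
64x64->128 multiply and would exploit the fact that the two cross products
are equal when squaring.

```python
MASK64 = (1 << 64) - 1

def mul64(a, b):
    """Model of one 64x64->128 multiply."""
    assert 0 <= a <= MASK64 and 0 <= b <= MASK64
    return a * b

def square128(m):
    """Square a 128-bit mantissa using four 64x64->128 partial products."""
    assert 0 <= m < (1 << 128)
    hi, lo = m >> 64, m & MASK64
    ll = mul64(lo, lo)
    lh = mul64(lo, hi)
    hl = mul64(hi, lo)
    hh = mul64(hi, hi)
    return ll + ((lh + hl) << 64) + (hh << 128)

def residual(x, root):
    """x - root**2; the sign says on which side of sqrt(x) the root lies."""
    return x - square128(root)
```

A negative residual means the candidate root is too large, a positive one
that it is too small (or exact to within the truncation).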

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<036a5606-c344-474a-95a1-149a07da1fe5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24639&group=comp.arch#24639

X-Received: by 2002:ae9:ed8a:0:b0:67d:6a0d:2a82 with SMTP id c132-20020ae9ed8a000000b0067d6a0d2a82mr11176034qkg.561.1649365131649;
Thu, 07 Apr 2022 13:58:51 -0700 (PDT)
X-Received: by 2002:a05:6870:e9a7:b0:de:e59a:7376 with SMTP id
r39-20020a056870e9a700b000dee59a7376mr7895417oao.194.1649365131182; Thu, 07
Apr 2022 13:58:51 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 7 Apr 2022 13:58:50 -0700 (PDT)
In-Reply-To: <t2nf4n$1rgf$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.183.72; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.183.72
References: <10f7aa7f-00db-4ade-9e2e-e71602654f49n@googlegroups.com>
<e3cd8ed7-de1d-40ee-a21d-798cd2d3a3b6n@googlegroups.com> <051bdc59-4b63-4a31-b898-fe9b700dbfc5n@googlegroups.com>
<t2cp1n$6ji$1@newsreader4.netcologne.de> <1196d0e2-98bd-4fb0-a98f-4c1662e75f0en@googlegroups.com>
<c65c0f4b-e939-43ea-ab44-c09af20ee4fbn@googlegroups.com> <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com>
<2022Apr5.181651@mips.complang.tuwien.ac.at> <kojr4h96ootmmqrm6hdkbijgce4dfuu36s@4ax.com>
<2022Apr7.104701@mips.complang.tuwien.ac.at> <1398a4bd-bd48-4e60-ab45-383e1bcc0750n@googlegroups.com>
<t2nf4n$1rgf$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <036a5606-c344-474a-95a1-149a07da1fe5n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Thu, 07 Apr 2022 20:58:51 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 84

by: Michael S - Thu, 7 Apr 2022 20:58 UTC

On Thursday, April 7, 2022 at 10:47:42 PM UTC+3, Terje Mathisen wrote:
> Michael S wrote:
> > On Thursday, April 7, 2022 at 12:03:32 PM UTC+3, Anton Ertl wrote:
> >> We were actually talking about the program that worked nicely on Linux,
> >> and failed on WSL.
> >>
> >
> > More I think about it, less I see a justification for WSL control word defaults.
> I agree 100%, the WSL setup is simply broken.
> >
> >> But yes, the library may not be doing what it should do in a program
> >> that actively sets the precision to some other values than long
> >> double.
> >>
> >> I can imagine some measures that would deal with the problem in the
> >> usual case: E.g., on first invocation the function for setting the
> >> precision control changes the vector of the quad-math functions to
> >> versions that set the precision control for their own needs (and save
> >> and restore the x87 control word). However, that would not work for
> >> programs that don't use these functions, but set the control word
> >> directly (in assembly language), so unless the library is documented
> >> as requiring the function calls for changing the precision control,
> >> the library would still be deficient.
> >>
> >
> > In this particular case, the only winning strategy is to refuse to play.
> > I.e. quadmath library should be and could be coded without any use of 80-bit
> > FP and with very minimalist use of 64-bit FP. On modern 64-bit x86 (and,
> > I suppose, on modern ARM and POWER) it's not only the most robust way,
> > precision-wise, but also the fastest.
> I'm guessing that in real world use there might actually be significant
> instances of numbers which happen to fit inside the 80-bit format, but
> not in double. In that particular case using 80-bit is obviously a big
> win, but this will break down pretty quickly as soon as you do a number
> of FMULQs or a single FDIVQ or FSQRTQ.

Finding out that the result fits in 80 bits and converting back and forth will take more
time than just doing the work. Note that in the common cases of FADD/FSUB/FMUL
the work itself, outside of call overhead, parsing inputs and formatting output, is on the
order of 10 clocks, at worst 15 clocks on less modern cores.

> >
> > But I can imagine other cases where judicious use of 80-bit FP is really
> > beneficial.
> Right.
>
> The FSQRTQ example is an obvious case, since using 80-bit gives you a
> 64-bit mantissa, which after one NR iteration becomes ~128 which will be
> correctly rounded in approximately 65535 out of 65536 cases, right?
>
> Doing a full precision back-multiplication, by splitting the 128-bit
> mantissa into two parts and squaring it (4 uint64_t 64x64->128 muls)
> gives you the residual and tells you if the initial result needs to be
> adjusted by 1 ulp.

I don't think so.
Doing rsqrt on x87 with 64-bit precision is not faster than doing
it with something like ~63.7 bits on the integer side. More likely it is not just not faster
but much slower, especially once we take the conversions into account.
And then, for the algorithm that I have in mind, 63.7 or 64 bits are not much better
than, say, 61 bits. In both cases, on the next step you calculate a result with a precision
of more than 113 bits, but less than the 228 bits required for correct rounding (or is it 227?
I don't remember). So, after that step we examine the closeness of the result to the mid-point
between representable numbers, and only when the result is close do we take an additional step,
which consists of squaring the midpoint, calculating only the relevant bits, and
then choosing the direction of the rounding according to bit[115] of the square.
With such an algorithm the only difference between a 64-bit estimate and a 61-bit estimate
is the probability of doing the additional step. For 61 bits the probability is low and for 64 bits it is
extremely low. Such a difference is unlikely to cause a detectable speed difference in real-world
usage scenarios or even in benchmarks, except when they are intentionally crafted to
concentrate on the corner cases.
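
The estimate / refine / midpoint-check structure can be sketched in Python on
plain integers. This is not the 113-bit quad algorithm itself, only an
illustration under loud assumptions: a 53-bit double stands in for the
61- to 64-bit starting estimate, exact integer Newton steps replace the
fixed-precision refinement, and the midpoint test is applied every time
rather than only when the result is close to a midpoint. The function names
are hypothetical.

```python
import math

def floor_sqrt(n):
    """floor(sqrt(n)) for 1 <= n < 2**1023, seeded by a double estimate."""
    r = int(math.sqrt(n))        # low-precision starting estimate, >= 1
    r = (r + n // r) // 2        # one Newton step: now r >= floor(sqrt(n))
    while True:
        y = (r + n // r) // 2    # further steps decrease monotonically
        if y >= r:
            return r
        r = y

def round_sqrt(n):
    """Integer nearest to sqrt(n), decided by squaring the midpoint."""
    if n == 0:
        return 0
    r = floor_sqrt(n)
    # Midpoint is r + 1/2 and (r + 1/2)**2 = r*r + r + 1/4. Since n is an
    # integer, n lies above the squared midpoint exactly when n - r*r > r.
    if n - r * r > r:
        r += 1
    return r
```

Because the squared midpoint is never an integer, an exact tie cannot occur
here; the comparison alone fixes the rounding direction.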

When I was talking about judicious use I didn't mean the quadmath library at all,
but things like solving non-linear equations with double-precision inputs and
outputs, where you sometimes want better precision on intermediates in order
to simplify algorithms and improve stability.
Maybe something similar can be encountered in matrix factorization when det(A) is
small, but here I am less sure.
In fact, my comment was just an expression of a general feeling. It wasn't deeply thought through.

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>

https://www.novabbs.com/devel/article-flat.php?id=24641&group=comp.arch#24641

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Thu, 07 Apr 2022 19:46:41 -0400
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="45267da43c0d7a116d4568be17b69859";
logging-data="24826"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19J+LhMtW+0Dg5Csf7hnQ4//wLGGp4r3FY="
User-Agent: ForteAgent/8.00.32.1272
Cancel-Lock: sha1:pFITSqKi8UHP8Uvafv5FJvK+j2I=

by: George Neuner - Thu, 7 Apr 2022 23:46 UTC

On Wed, 6 Apr 2022 22:01 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

>In article <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com>,
>gneuner2@comcast.net (George Neuner) wrote:
>
>> Blame should fall on the library - if code needs control flags set
>> in some particular way, it should make sure they are set correctly.
>> Relying on system defaults because they happen to line up with
>> expectations is just lazy.
>
>Changing those flags takes time - on the microsecond scale, because it
>often requires the pipeline to empty before the change - so you don't
>want to be doing it at every entry to a quad-precision library, where
>you'd hope that some operations would be timed on the nanosecond scale.
>It's probably better to document what's needed, so that the application
>can control things.
>
>> Also the Windows default x87 setting IS 64-bit (full width)
>> precision.
>
>Default in what circumstances? In the Microsoft C/C++ run-time
>environment, the default has been ordinary double precision, with a
>53-bit mantissa, since 1996 to my certain knowledge. The hardware default
>for the x87 registers is long double, with a 64-bit mantissa.

Since Pentium 4 the compiler has by default used SIMD registers for
floating point. The x87 is not normally used.

But we were discussing use of the x87 by the quadmath library. The
library does/did not explicitly set x87 precision, and Michael found
the default setting under WSL to be wrong.

I was a bit sloppy in my wording, but in context I was addressing
Michael's assertion:

"Somehow, under WSL, x87 control word is set to 53-bit
precision (default Windows settings)."

WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
double). Anton gave a more detailed response further down.

My real point was that library code should not be relying on defaults
for changeable settings. If quadmath explicitly set the needed
precision, the odd default running under WSL would not have mattered.

And yes, I understand both Michael's and Anton's aversion to
unnecessary work ... particularly because changing precision on the
x87 can be very slow. But since the setting can be READ as well as
written, there is no need to change it if it is correct when the
library goes to use it.
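
The read-before-write idea can be modelled in Python. This is a sketch only:
the FPU is simulated by a small class (FakeX87, hypothetical) whose read_cw
and write_cw methods stand in for the x87 FNSTCW and FLDCW instructions, and
with_extended_precision is an illustrative name, not a quadmath function.
The precision-control field is bits 8-9 of the control word, with 11b
selecting the 64-bit significand.

```python
PC_MASK = 0x0300      # precision-control field, bits 8-9
PC_EXTENDED = 0x0300  # 11b: 64-bit significand

class FakeX87:
    """Stand-in for the FPU that counts control-word writes."""
    def __init__(self, cw):
        self.cw = cw
        self.writes = 0

    def read_cw(self):          # models FNSTCW (cheap)
        return self.cw

    def write_cw(self, cw):     # models FLDCW (potentially slow)
        self.cw = cw
        self.writes += 1

def with_extended_precision(fpu, work):
    """Run work() with 64-bit precision, writing the CW only if needed."""
    saved = fpu.read_cw()
    need_switch = (saved & PC_MASK) != PC_EXTENDED
    if need_switch:
        fpu.write_cw(saved | PC_EXTENDED)
    try:
        return work()
    finally:
        if need_switch:
            fpu.write_cw(saved)  # restore the caller's setting
```

When the caller already runs with full precision, the cost is a single read
and a compare; the two writes happen only in the mismatched case.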

George

Re: Approximate reciprocals

<t2ont4$1uh6$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24642&group=comp.arch#24642

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 8 Apr 2022 09:23:14 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t2ont4$1uh6$1@gioia.aioe.org>
References: <10f7aa7f-00db-4ade-9e2e-e71602654f49n@googlegroups.com>
<e3cd8ed7-de1d-40ee-a21d-798cd2d3a3b6n@googlegroups.com>
<051bdc59-4b63-4a31-b898-fe9b700dbfc5n@googlegroups.com>
<t2cp1n$6ji$1@newsreader4.netcologne.de>
<1196d0e2-98bd-4fb0-a98f-4c1662e75f0en@googlegroups.com>
<c65c0f4b-e939-43ea-ab44-c09af20ee4fbn@googlegroups.com>
<vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com>
<2022Apr5.181651@mips.complang.tuwien.ac.at>
<kojr4h96ootmmqrm6hdkbijgce4dfuu36s@4ax.com>
<2022Apr7.104701@mips.complang.tuwien.ac.at>
<1398a4bd-bd48-4e60-ab45-383e1bcc0750n@googlegroups.com>
<t2nf4n$1rgf$1@gioia.aioe.org>
<036a5606-c344-474a-95a1-149a07da1fe5n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="64038"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11.1
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Fri, 8 Apr 2022 07:23 UTC

Michael S wrote:
> On Thursday, April 7, 2022 at 10:47:42 PM UTC+3, Terje Mathisen wrote:
>> Michael S wrote:
>>> On Thursday, April 7, 2022 at 12:03:32 PM UTC+3, Anton Ertl wrote:
>>>> We were actually talking about the program that worked nicely on Linux,
>>>> and failed on WSL.
>>>>
>>>
>>> More I think about it, less I see a justification for WSL control word defaults.
>> I agree 100%, the WSL setup is simply broken.
>>>
>>>> But yes, the library may not be doing what it should do in a program
>>>> that actively sets the precision to some other values than long
>>>> double.
>>>>
>>>> I can imagine some measures that would deal with the problem in the
>>>> usual case: E.g., on first invocation the function for setting the
>>>> precision control changes the vector of the quad-math functions to
>>>> versions that set the precision control for their own needs (and save
>>>> and restore the x87 control word). However, that would not work for
>>>> programs that don't use these functions, but set the control word
>>>> directly (in assembly language), so unless the library is documented
>>>> as requiring the function calls for changing the precision control,
>>>> the library would still be deficient.
>>>>
>>>
>>> In this particular case, the only winning strategy is to refuse to play.
>>> I.e. quadmath library should be and could be coded without any use of 80-bit
>>> FP and with very minimalist use of 64-bit FP. On modern 64-bit x86 (and,
>>> I suppose, on modern ARM and POWER) it's not only the most robust way,
>>> precision-wise, but also the fastest.
>> I'm guessing that in real world use there might actually be significant
>> instances of numbers which happen to fit inside the 80-bit format, but
>> not in double. In that particular case using 80-bit is obviously a big
>> win, but this will break down pretty quickly as soon as you do a number
>> of FMULQs or a single FDIVQ or FSQRTQ.
>
> Finding out that the result fits in 80 bits and converting back and forth will take more
> time than just doing the work. Note that in the common cases of FADD/FSUB/FMUL
> the work itself, outside of call overhead, parsing inputs and formatting output, is on the
> order of 10 clocks, at worst 15 clocks on less modern cores.
>
>>>
>>> But I can imagine other cases where judicious use of 80-bit FP is really
>>> beneficial.
>> Right.
>>
>> The FSQRTQ example is an obvious case, since using 80-bit gives you a
>> 64-bit mantissa, which after one NR iteration becomes ~128 which will be
>> correctly rounded in approximately 65535 out of 65536 cases, right?
>>
>> Doing a full precision back-multiplication, by splitting the 128-bit
>> mantissa into two parts and squaring it (4 uint64_t 64x64->128 muls)
>> gives you the residual and tells you if the initial result needs to be
>> adjusted by 1 ulp.
>
> I don't think so.
> Doing rsqrt on x87 with 64-bit precision is not faster than doing
> it with something like ~63.7 bits on the integer side. More likely it is not just not faster
> but much slower, especially once we take the conversions into account.
> And then, for the algorithm that I have in mind, 63.7 or 64 bits are not much better
> than, say, 61 bits. In both cases, on the next step you calculate a result with a precision
> of more than 113 bits, but less than the 228 bits required for correct rounding (or is it 227?

Quad uses 1:15:112, so 113 including the hidden bit? I would like to
have 230 bits to feel perfectly safe, but if it has been proven to work
with one or two less, that's fine.

> I don't remember). So, after that step we examine the closeness of the result to the mid-point
> between representable numbers, and only when the result is close do we take an additional step,
> which consists of squaring the midpoint, calculating only the relevant bits, and
> then choosing the direction of the rounding according to bit[115] of the square.

That is a good idea! We don't need the leading 128 bits of the 256-bit
multiplication result (assuming 64-bit ints here), but that only saves
one 64x64->128 MUL (plus a few ADD/ADC).
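
A sketch of the saving, in Python with ints standing in for 64-bit
registers. It assumes, as stated above, that only the low 128 bits of the
256-bit square are wanted; low128_of_square is an illustrative name. In the
general case three partial products remain (lo*lo, lo*hi, hi*lo); because
this is a square, the two cross products coincide.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def low128_of_square(m):
    """Low 128 bits of m*m for a 128-bit m, skipping the hi*hi product."""
    assert 0 <= m <= MASK128
    hi, lo = m >> 64, m & MASK64
    ll = lo * lo        # 64x64->128
    cross = hi * lo     # 64x64->128, appears twice in the square
    # hi*hi would be shifted left by 128 bits, so it cannot affect the
    # low half and is never computed.
    return (ll + ((2 * cross) << 64)) & MASK128
```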

> With such an algorithm the only difference between a 64-bit estimate and a 61-bit estimate
> is the probability of doing the additional step. For 61 bits the probability is low and for 64 bits it is
> extremely low. Such a difference is unlikely to cause a detectable speed difference in real-world
> usage scenarios or even in benchmarks, except when they are intentionally crafted to
> concentrate on the corner cases.

Yeah, I agree.

>
> When I was talking about judicious use I didn't mean the quadmath library at all,
> but things like solving non-linear equations with double-precision inputs and
> outputs, where you sometimes want better precision on intermediates in order
> to simplify algorithms and improve stability.
> Maybe something similar can be encountered in matrix factorization when det(A) is
> small, but here I am less sure.

OK.

> In fact, my comment was just an expression of a general feeling. It wasn't deeply thought through.
:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24643&group=comp.arch#24643

X-Received: by 2002:a37:54a:0:b0:69a:f10c:f533 with SMTP id 71-20020a37054a000000b0069af10cf533mr2397847qkf.525.1649421053420;
Fri, 08 Apr 2022 05:30:53 -0700 (PDT)
X-Received: by 2002:a05:6808:1a21:b0:2f9:c3b2:843b with SMTP id
bk33-20020a0568081a2100b002f9c3b2843bmr2341499oib.7.1649421053100; Fri, 08
Apr 2022 05:30:53 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 8 Apr 2022 05:30:52 -0700 (PDT)
In-Reply-To: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.183.72; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.183.72
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk>
<16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 08 Apr 2022 12:30:53 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 82

by: Michael S - Fri, 8 Apr 2022 12:30 UTC

On Friday, April 8, 2022 at 2:46:46 AM UTC+3, George Neuner wrote:
> On Wed, 6 Apr 2022 22:01 +0100 (BST), j...@cix.co.uk (John Dallman)
> wrote:
> >In article <vofo4hh9npgd0vaef...@4ax.com>,
> >gneu...@comcast.net (George Neuner) wrote:
> >
> >> Blame should fall on the library - if code needs control flags set
> >> in some particular way, it should make sure they are set correctly.
> >> Relying on system defaults because they happen to line up with
> >> expectations is just lazy.
> >
> >Changing those flags takes time - on the microsecond scale, because it
> >often requires the pipeline to empty before the change - so you don't
> >want to be doing it at every entry to a quad-precision library, where
> >you'd hope that some operations would be timed on the nanosecond scale.
> >It's probably better to document what's needed, so that the application
> >can control things.
> >
> >> Also the Windows default x87 setting IS 64-bit (full width)
> >> precision.
> >
> >Default in what circumstances? In the Microsoft C/C++ run-time
> >environment, the default has been ordinary double precision, with a
> >53-bit mantissa, since 1996 to my certain knowledge. The hardware default
> >for the x87 registers is long double, with a 64-bit mantissa.
> Since Pentium 4 the compiler has by default used SIMD registers for
> floating point. The x87 is not normally used.

I wonder what you mean by that.
For starters, Windows does not have a default user-mode compiler.
But the closest thing to a "default compiler" is Microsoft's own Visual C++ with no special options.
32-bit Visual C++ with no special options most definitely uses x87 registers for FP math.
On any CPU, Pentium 4 or not.
If the programmer wants SSE2, he has to request it from the compiler explicitly.

Now, 64-bit Visual C++ is different. This compiler uses SSE by default and can be told to use
AVX instead. But it can't be told to use x87.
Also, x86-64 Visual C++ is 3-4 years younger than Pentium 4 and it didn't become the default
of Visual Studio until very recently.
Or maybe the default is still 32-bit? I don't have the latest version of VS to check.

>
> But we were discussing use of the x87 by the quadmath library. The
> library does/did not explicitly set x87 precision, and Michael found
> the default setting under WSL to be wrong.

Google helped me find out that it was reported as a bug 5 years ago.
It seems somebody at MS didn't agree with that classification.
https://github.com/microsoft/WSL/issues/1748

>
> I was a bit sloppy in my wording, but in context I was addressing
> Michael's assertion:
>
> "Somehow, under WSL, x87 control word is set to 53-bit
> precision (default Windows settings)."
>
> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
> double). Anton gave a more detailed response further down.

It's true for GNU and clang tools for Windows.
For Microsoft tools it's not the case.
Microsoft's tools default to control word = 0x02F7.
That holds both in 32-bit mode, where it makes very good sense
because at the introduction of WinNT back in 1993 Microsoft's major goal was maximal
compatibility of behavior between IA-32, MIPS and Alpha, and in 64-bit mode,
where it does not make a lot of sense, since the compiler does not use x87 anyway, so
its only users are asm programmers and libraries that most likely use it precisely because of
its higher precision.
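
To see what such a control word value selects, the precision-control and
rounding-control fields can be decoded with a short Python sketch. The field
positions (PC in bits 8-9, RC in bits 10-11) follow the x87 control word
layout; decode_cw and the table names are illustrative. The value 0x02F7 is
taken as quoted above, and 0x037F is the value the FPU loads on FINIT.

```python
PC_NAMES = {
    0b00: "24-bit (single)",
    0b01: "reserved",
    0b10: "53-bit (double)",
    0b11: "64-bit (extended)",
}
RC_NAMES = {
    0b00: "nearest",
    0b01: "down",
    0b10: "up",
    0b11: "toward zero",
}

def decode_cw(cw):
    """Extract the precision and rounding fields of an x87 control word."""
    return {
        "precision": PC_NAMES[(cw >> 8) & 0b11],
        "rounding": RC_NAMES[(cw >> 10) & 0b11],
    }

for cw in (0x02F7, 0x037F):
    print(f"{cw:#06x}: {decode_cw(cw)}")
```

Only the two fields relevant to this discussion are decoded; the exception
mask bits in the low byte are left out.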

>
>
> My real point was that library code should not be relying on defaults
> for changeable settings. If quadmath explicitly set the needed
> precision, the odd default running under WSL would not have mattered.
>
> And yes, I understand both Michael's and Anton's aversion to
> unnecessary work ... particularly because changing precision on the
> x87 can be very slow. But since the setting can be READ as well as
> written, there is no need to change it if it is correct when the
> library goes to use it.
>
> George

Re: Approximate reciprocals

<IGX3K.547929$7F2.230033@fx12.iad>

https://www.novabbs.com/devel/article-flat.php?id=24644&group=comp.arch#24644

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.roellig-ltd.de!open-news-network.org!peer01.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx12.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk> <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>
In-Reply-To: <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 90
Message-ID: <IGX3K.547929$7F2.230033@fx12.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 08 Apr 2022 14:31:04 UTC
Date: Fri, 08 Apr 2022 10:30:56 -0400
X-Received-Bytes: 5358

by: EricP - Fri, 8 Apr 2022 14:30 UTC

Michael S wrote:
> On Friday, April 8, 2022 at 2:46:46 AM UTC+3, George Neuner wrote:
>> On Wed, 6 Apr 2022 22:01 +0100 (BST), j...@cix.co.uk (John Dallman)
>> wrote:
>>> In article <vofo4hh9npgd0vaef...@4ax.com>,
>>> gneu...@comcast.net (George Neuner) wrote:
>>>
>>>> Blame should fall on the library - if code needs control flags set
>>>> in some particular way, it should make sure they are set correctly.
>>>> Relying on system defaults because they happen to line up with
>>>> expectations is just lazy.
>>> Changing those flags takes time - on the microsecond scale, because it
>>> often requires the pipeline to empty before the change - so you don't
>>> want to be doing it at every entry to a quad-precision library, where
>>> you'd hope that some operations would be timed on the nanosecond scale.
>>> It's probably better to document what's needed, so that the application
>>> can control things.
>>>
>>>> Also the Windows default x87 setting IS 64-bit (full width)
>>>> precision.
>>> Default in what circumstances? In the Microsoft C/C++ run-time
>>> environment, the default has been ordinary double precision, with a
>>> 53-bit mantissa, since 1996 to my certain knowledge. The hardware default
>>> for the x87 registers is long double, with a 64-bit mantissa.
>> Since Pentium 4 the compiler has by default used SIMD registers for
>> floating point. The x87 is not normally used.
>
> I wonder what you mean by that.
> For starter, Windows does not have default user-mode compiler.
> But the closest thing to "default compiler" is Microsoft's own Visual C++ with no special options.
> 32-bit Visual C++ with no special options most definitely uses x87 registers for FP math.
> On any CPU, Pentium 4 or not.
> If programmer want SSE2 then he has to tell it specifically to the compiler.
>
> Now, 64-bit Visual C++ is different. This compiler uses SSE by default and can be told to use
> AVX instead. But it can't be told to use x87.
> Also, x86-64 Visual C++ is 3-4 years younger than Pentium 4, and it didn't become the default
> in Visual Studio until very recently.
> Or maybe the default is still 32-bit? I don't have the latest version of VS to check.
>
>> But we were discussing use of the x87 by the quadmath library. The
>> library does/did not explicitly set x87 precision, and Michael found
>> the default setting under WSL to be wrong.
>
> Google helped me to find out that it was reported as a bug 5 years ago.
> It seems, somebody at MS didn't agree with such definition.
> https://github.com/microsoft/WSL/issues/1748
>
>> I was a bit sloppy in my wording, but in context I was addressing
>> Michael's assertion:
>>
>> "Somehow, under WSL, x87 control word is set to 53-bit
>> precision (default Windows settings)."
>>
>> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
>> double). Anton gave a more detailed response further down.
>
> It's true for Gnu and clang tools for Windows.
> For Microsoft tools that is not the case.
> Microsoft's tools default to control word = 0x027F.
> It's true both in 32-bit mode, where it makes a very good sense
> because at introduction of WinNT back in 1993 Microsoft's major goal was maximal
> compatibility of behavior between IA-32, MIPS and Alpha, and in 64-bit mode,
> where it does not make a lot of sense, since the compiler does not use x87 anyway, so
> its only users are asm programmers and libraries that most likely use it precisely for its
> higher precision.
>
>>
>> My real point was that library code should not be relying on defaults
>> for changeable settings. If quadmath explicitly set the needed
>> precision, the odd default running under WSL would not have mattered.
>>
>> And yes, I understand both Michael's and Anton's aversion to
>> unnecessary work ... particularly because changing precision on the
>> x87 can be very slow. But since the setting can be READ as well as
>> written, there is no need to change it if it is correct when the
>> library goes to use it.
>>
>> George

This comment seems to think it is due to how fork is implemented:
In WSL1 the FP control word is maintained across a fork, in Linux it is not.
(as with all Internet opinions, that should be taken with a grain of salt).

https://github.com/microsoft/WSL/issues/830#issuecomment-472279984

Later comments indicate it is fixed in WSL2
but then others contradict that.

Re: Approximate reciprocals

<2022Apr8.165147@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24645&group=comp.arch#24645

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 08 Apr 2022 14:51:47 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 15
Message-ID: <2022Apr8.165147@mips.complang.tuwien.ac.at>
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk> <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="793179b5b9bb48d5ea2a3da4f3b63bd8";
logging-data="17251"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18LOhRr65hrWO+faynP69Mr"
Cancel-Lock: sha1:gKyoxuVNrTWBe38Lao3+VwzlW1U=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Fri, 8 Apr 2022 14:51 UTC

Michael S <already5chosen@yahoo.com> writes:
> Google helped me to find out that it was reported as a bug 5 years ago.
>It seems, somebody at MS didn't agree with such definition.
>https://github.com/microsoft/WSL/issues/1748

Following this points to
<https://github.com/microsoft/WSL/issues/830>. And if you read down
that issue, you find that this also affects fork(), and fixing this
"requires some fairly substantial changes to the Windows kernel
itself." This issue also shows that this bug hits lots of people.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

In article <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>,
gneuner2@comcast.net (George Neuner) wrote:

> Since Pentium 4 the compiler has by default used SIMD registers for
> floating point. The x87 is not normally used.

Which compiler? Microsoft Visual Studio is the nearest thing to a default
compiler for Windows. Its x86-64 version has used SSE2 for all floating
point since it first appeared, but that was several years after Pentium 4.
The x86 version became able to compile for SSE2 in VS.2005, and adopted
it as the default in VS.2012, but some versions have used x87 as well as
SSE2 instructions in the same compiles.

> But we were discussing use of the x87 by the quadmath library. The
> library does/did not explicitly set x87 precision, and Michael found
> the default setting under WSL to be wrong.

It is wrong, for accurate emulation of a Linux environment.

> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
> double). Anton gave a more detailed response further down.

Which environment is this in? That is not true for the Visual Studio
compilers.

> And yes, I understand both Michael's and Anton's aversion to
> unnecessary work ... particularly because changing precision on the
> x87 can be very slow. But since the setting can be READ as well as
> written, there is no need to change it if it is correct when the
> library goes to use it.

If you do it that way, you need to read the flags on every entry, and set
them if they need to change. And if you're going to avoid leaking this
setting to other code that may be upset by it, you need to save the old
controls and restore them. It all adds overhead.

John
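
The read-then-set-then-restore sequence discussed above can be sketched in C. This is only an illustration under stated assumptions: the precision-control field occupies bits 8-9 of the x87 control word, the inline assembly assumes GCC or Clang on x86, and `lib_enter`/`lib_leave` are hypothetical names, not part of quadmath or any real library.

```c
#include <stdint.h>

/* x87 control word, precision-control (PC) field: bits 8-9. */
#define X87_PC_MASK   0x0300u
#define X87_PC_SINGLE 0x0000u   /* 24-bit significand */
#define X87_PC_DOUBLE 0x0200u   /* 53-bit significand */
#define X87_PC_EXT    0x0300u   /* 64-bit significand */

/* Significand width selected by a control word; 0 for the reserved encoding. */
int x87_precision_bits(uint16_t cw)
{
    switch (cw & X87_PC_MASK) {
    case X87_PC_SINGLE: return 24;
    case X87_PC_DOUBLE: return 53;
    case X87_PC_EXT:    return 64;
    default:            return 0;
    }
}

/* Same control word with only the PC field replaced. */
uint16_t x87_with_precision(uint16_t cw, uint16_t pc)
{
    return (uint16_t)((cw & ~X87_PC_MASK) | (pc & X87_PC_MASK));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
uint16_t x87_read_cw(void)
{
    uint16_t cw;
    __asm__ __volatile__("fnstcw %0" : "=m"(cw));
    return cw;
}

void x87_write_cw(uint16_t cw)
{
    __asm__ __volatile__("fldcw %0" : : "m"(cw));
}

/* Hypothetical library entry: returns the caller's control word and
   writes a new one only when the precision is not already 64-bit. */
uint16_t lib_enter(void)
{
    uint16_t old = x87_read_cw();
    if ((old & X87_PC_MASK) != X87_PC_EXT)
        x87_write_cw(x87_with_precision(old, X87_PC_EXT));
    return old;
}

/* Hypothetical library exit: restores the caller's setting if it was changed. */
void lib_leave(uint16_t old)
{
    if ((old & X87_PC_MASK) != X87_PC_EXT)
        x87_write_cw(old);
}
#endif
```

Even in the cheap case where nothing has to change, every entry still pays for the read and the compare, which is the overhead John is pointing at.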

Re: Approximate reciprocals

<c0933ca3-804e-48cb-b05c-c7419f7ea117n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24647&group=comp.arch#24647

X-Received: by 2002:a05:6214:2466:b0:441:2daa:4ab1 with SMTP id im6-20020a056214246600b004412daa4ab1mr16968765qvb.12.1649432964211;
Fri, 08 Apr 2022 08:49:24 -0700 (PDT)
X-Received: by 2002:a05:6808:55:b0:2ec:a4ae:fdde with SMTP id
v21-20020a056808005500b002eca4aefddemr142070oic.106.1649432963950; Fri, 08
Apr 2022 08:49:23 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 8 Apr 2022 08:49:23 -0700 (PDT)
In-Reply-To: <t2ont4$1uh6$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:d0ac:cb8:a3e2:9131;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:d0ac:cb8:a3e2:9131
References: <10f7aa7f-00db-4ade-9e2e-e71602654f49n@googlegroups.com>
<e3cd8ed7-de1d-40ee-a21d-798cd2d3a3b6n@googlegroups.com> <051bdc59-4b63-4a31-b898-fe9b700dbfc5n@googlegroups.com>
<t2cp1n$6ji$1@newsreader4.netcologne.de> <1196d0e2-98bd-4fb0-a98f-4c1662e75f0en@googlegroups.com>
<c65c0f4b-e939-43ea-ab44-c09af20ee4fbn@googlegroups.com> <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com>
<2022Apr5.181651@mips.complang.tuwien.ac.at> <kojr4h96ootmmqrm6hdkbijgce4dfuu36s@4ax.com>
<2022Apr7.104701@mips.complang.tuwien.ac.at> <1398a4bd-bd48-4e60-ab45-383e1bcc0750n@googlegroups.com>
<t2nf4n$1rgf$1@gioia.aioe.org> <036a5606-c344-474a-95a1-149a07da1fe5n@googlegroups.com>
<t2ont4$1uh6$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c0933ca3-804e-48cb-b05c-c7419f7ea117n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 08 Apr 2022 15:49:24 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 144

by: MitchAlsup - Fri, 8 Apr 2022 15:49 UTC

On Friday, April 8, 2022 at 2:23:19 AM UTC-5, Terje Mathisen wrote:
> Michael S wrote:
> > On Thursday, April 7, 2022 at 10:47:42 PM UTC+3, Terje Mathisen wrote:
> >> Michael S wrote:
> >>> On Thursday, April 7, 2022 at 12:03:32 PM UTC+3, Anton Ertl wrote:
> >>>> We were actually talking about the program that worked nicely on Linux,
> >>>> and failed on WSL.
> >>>>
> >>>
> >>> More I think about it, less I see a justification for WSL control word defaults.
> >> I agree 100%, the WSL setup is simply broken.
> >>>
> >>>> But yes, the library may not be doing what it should do in a program
> >>>> that actively sets the precision to some other values than long
> >>>> double.
> >>>>
> >>>> I can imagine some measures that would deal with the problem in the
> >>>> usual case: E.g., on first invocation the function for setting the
> >>>> precision control changes the vector of the quad-math functions to
> >>>> versions that set the precision control for their own needs (and save
> >>>> and restore the x87 control word). However, that would not work for
> >>>> programs that don't use these functions, but set the control word
> >>>> directly (in assembly language), so unless the library is documented
> >>>> as requiring the function calls for changing the precision control,
> >>>> the library would still be deficient.
> >>>>
> >>>
> >>> In this particular case, the only winning strategy is to refuse to play.
> >>> I.e. quadmath library should be and could be coded without any use of 80-bit
> >>> FP and with very minimalist use of 64-bit FP. On modern 64-bit x86 (and,
> >>> I suppose, on modern ARM and POWER) it's not only the most robust way,
> >>> precision-wise, but also the fastest.
> >> I'm guessing that in real world use there might actually be significant
> >> instances of numbers which happens to fit inside the 80-bit format, but
> >> not in double. In that particular case using 80-bit is obviously a big
> >> win, but this will break down pretty quickly as soon as you do a number
> >> of FMULQs or a single FDIVQ or FSQRTQ.
> >
> > Finding out that the result fits in 80 bits and converting back and forth will take more
> > time than just doing the work. Note that in the common cases of FADD/FSUB/FMUL
> > the work itself, outside of call overhead, parsing inputs and formatting output, is on the
> > order of 10 clocks, at worst 15 clocks on less modern cores.
> >
> >>>
> >>> But I can imagine other cases where judicious use of 80-bit FP is really
> >>> beneficial.
> >> Right.
> >>
> >> The FSQRTQ example is an obvious case, since using 80-bit gives you a
> >> 64-bit mantissa, which after one NR iteration becomes ~128 which will be
> >> correctly rounded in approximately 65535 out of 65536 cases, right?
> >>
> >> Doing a full precision back-multiplication, by splitting the 128-bit
> >> mantissa into two parts and squaring it (4 uint64_t 64x64->128 muls)
> >> give you the residual and tells you if the initial result needs to be
> >> adjusted by 1 ulp.
> >
> > I don't think so.
> > Doing rsqrt on x87 with 64-bit precision is not faster than doing
> > it with something like ~63.7 bits on the integer side. More likely, not just not faster,
> > but much slower, esp. when we take into account conversions.
> > And then, for the algorithm that I have in mind, 63.7 or 64 bits are not much better
> > than, say, 61 bit. In both cases on the next step you calculate a result with precision
> > of more than 113 bits, but less than 228 bits required for correct rounding (or is it 227?
>
> Quad uses 1:15:112, so 113 including the hidden bit? I would like to
> have 230 bits to feel perfectly safe, but if it has been proven to work
> with one or two less, that's fine.
<
2×n+3 for (DIV, SQRT, RCP, RSQRT) -> 113×2+3 = 229
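
Taking the 2×n+3 rule above at face value, the arithmetic for a few significand widths can be written out as a tiny C helper. The function name is made up for illustration, and n counts the hidden bit.

```c
/* Result bits needed before rounding, per the 2n+3 rule quoted above,
   where n is the significand width including the hidden bit. */
unsigned bits_for_correct_rounding(unsigned n)
{
    return 2u * n + 3u;
}
```

For quad (n = 113) this gives the 229 bits stated above; the same rule gives 109 for a 53-bit significand and 51 for a 24-bit one.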
<
> > I don't remember). So, after that step we examine the closeness of the result to the mid-point
> > between representable numbers and only when result is close we do additional step
> > which consists of doing the square of the midpoint, calculating only relevant bits and
> > then choosing direction of the rounding according to bit[115] of the square.
> That is a good idea! We don't need the leading 128 bits of the 256-bit
> multiplication result (assuming 64-bit ints here), but that only saves
> one 64x64->128 MUL (plus a few ADD/ADC).
> > With such an algorithm the only difference between a 64-bit estimate and a 61-bit estimate
> > is the probability of doing the additional step. For 61b probability is low and for 64b it is
> > extremely low. Such difference is unlikely to cause detectable speed difference in real-world
> > usage scenario or even in benchmarks, except when they are intentionally crafted to
> > concentrate on the corner cases.
> Yeah, I agree.
> >
> > When I was talking about judicious use I didn't mean quadmath library at all,
> > but things like solving non-linear equations with double-precision inputs and
> > outputs where you sometimes want better precision on intermediate in order
> > to simplify algorithms and improve stability.
> > Maybe something similar can be encountered in matrix factorization when det(A) is
> > small, but here I am less sure.
> OK.
> > In fact, my comment was just an expression of general feeling. It wasn't deeply thought.
> :-)
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
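
The full back-multiplication mentioned in the quoted text (a 128-bit value split into two 64-bit halves, four 64x64->128 multiplies) might look roughly like the sketch below. It assumes the `unsigned __int128` extension of GCC and Clang; the type and function names are invented for the example and are not taken from any quad-precision library.

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* 256-bit result, least significant 64-bit limb first. */
typedef struct { uint64_t w[4]; } u256;

/* Full 128x128 -> 256-bit product built from four 64x64 -> 128 multiplies. */
u256 mul128x128(uint64_t a_hi, uint64_t a_lo, uint64_t b_hi, uint64_t b_lo)
{
    u128 ll = (u128)a_lo * b_lo;
    u128 lh = (u128)a_lo * b_hi;
    u128 hl = (u128)a_hi * b_lo;
    u128 hh = (u128)a_hi * b_hi;
    u256 r;

    r.w[0] = (uint64_t)ll;
    /* Middle column: high half of ll plus the low halves of both cross terms. */
    u128 mid = (ll >> 64) + (uint64_t)lh + (uint64_t)hl;
    r.w[1] = (uint64_t)mid;
    /* Upper columns: hh plus the high halves of the cross terms plus the carry. */
    u128 top = hh + (lh >> 64) + (hl >> 64) + (mid >> 64);
    r.w[2] = (uint64_t)top;
    r.w[3] = (uint64_t)(top >> 64);
    return r;
}
```

When only the low limbs are of interest, as suggested in the quoted exchange, the `hh` multiply can be dropped; for a pure squaring the two cross terms are equal, so one of them can be reused.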

Re: Approximate reciprocals

<bb7b294d-e73b-435e-93f9-c0a05b62572en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24649&group=comp.arch#24649

X-Received: by 2002:a05:6214:daa:b0:441:7161:de4b with SMTP id h10-20020a0562140daa00b004417161de4bmr16702846qvh.48.1649433502078;
Fri, 08 Apr 2022 08:58:22 -0700 (PDT)
X-Received: by 2002:a05:6870:45a4:b0:dd:b08e:fa49 with SMTP id
y36-20020a05687045a400b000ddb08efa49mr9185309oao.270.1649433501851; Fri, 08
Apr 2022 08:58:21 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 8 Apr 2022 08:58:21 -0700 (PDT)
In-Reply-To: <2022Apr8.165147@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.182.229; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.182.229
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk>
<16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>
<2022Apr8.165147@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bb7b294d-e73b-435e-93f9-c0a05b62572en@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 08 Apr 2022 15:58:22 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 22

by: Michael S - Fri, 8 Apr 2022 15:58 UTC

On Friday, April 8, 2022 at 6:03:53 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> > Google helped me to find out that it was reported as a bug 5 years ago.
> >It seems, somebody at MS didn't agree with such definition.
> >https://github.com/microsoft/WSL/issues/1748
> Following this points to
> <https://github.com/microsoft/WSL/issues/830>. And if you read down
> that issue, you find that this also affects fork(), and fixing this
> "requires some fairly substantial changes to the Windows kernel
> itself." This issue also shows that this bug hits lots of people.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thank you.
It seems, it all would have been much simpler if not for the old belief of kernel devs
that saving FPU context is expensive so they are obliged to be tricky about it.
I think that saving FPU context is less expensive than their clever tricks for at least
12-13 years (approximately since Nehalem on Intel and slightly longer on AMD),
but it's hard to change an established belief.

Re: Approximate reciprocals

<2022Apr8.183131@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24650&group=comp.arch#24650

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 08 Apr 2022 16:31:31 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 20
Message-ID: <2022Apr8.183131@mips.complang.tuwien.ac.at>
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk> <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com> <IGX3K.547929$7F2.230033@fx12.iad>
Injection-Info: reader02.eternal-september.org; posting-host="793179b5b9bb48d5ea2a3da4f3b63bd8";
logging-data="16902"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18jj73aGTlnPLifnSwdTipU"
Cancel-Lock: sha1:YxDE1m25ijPT1V+JgFk5eXmRptc=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Fri, 8 Apr 2022 16:31 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>This comment seems to think it is due to how fork is implemented:
>In WSL1 the FP control word is maintained across a fork, in Linux it is not.

The other way round. WSL resets the FP control word on fork(), Linux
propagates it.

>https://github.com/microsoft/WSL/issues/830#issuecomment-472279984
>
>Later comments indicate it is fixed in WSL2
>but then others contradict that.

[citation needed] Given that WSL2 runs the Linux kernel in a VM, it is
plausible that WSL2 fixes this. I have not seen any reports that WSL2
does not fix this, only additional requests to also fix this in WSL.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Approximate reciprocals

<kRZ3K.352378$f2a5.257016@fx48.iad>

https://www.novabbs.com/devel/article-flat.php?id=24651&group=comp.arch#24651

Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx48.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk> <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com> <IGX3K.547929$7F2.230033@fx12.iad> <2022Apr8.183131@mips.complang.tuwien.ac.at>
In-Reply-To: <2022Apr8.183131@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 29
Message-ID: <kRZ3K.352378$f2a5.257016@fx48.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 08 Apr 2022 16:58:56 UTC
Date: Fri, 08 Apr 2022 12:58:35 -0400
X-Received-Bytes: 2068

by: EricP - Fri, 8 Apr 2022 16:58 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> This comment seems to think it is due to how fork is implemented:
>> In WSL1 the FP control word is maintained across a fork, in Linux it is not.
>
> The other way round. WSL resets the FP control word on fork(), Linux
> propagates it.
>
>> https://github.com/microsoft/WSL/issues/830#issuecomment-472279984

"The symptom indicates the x87 control word is not maintained as part
of the linux process context, it's only in the NT process context."

I read that to mean on fork Linux overwrites FPCW,
WSL does not overwrite and retains the original value.

>>
>> Later comments indicate it is fixed in WSL2
>> but then others contradict that.
>
> [citation needed] Given that WSL2 runs the Linux kernel in a VM, it is
> plausible that WSL2 fixes this. I have not seen any reports that WSL2
> does not fix this, only additional requests to also fix this in WSL.
>
> - anton

The comment that it is fixed in WSL2 is below the above link on 3-Jul.
But below that on 25-Sep it says "people are still hitting this".

Re: Approximate reciprocals

<2022Apr8.192057@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24652&group=comp.arch#24652

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 08 Apr 2022 17:20:57 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 53
Message-ID: <2022Apr8.192057@mips.complang.tuwien.ac.at>
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com> <memo.20220406220122.22520M@jgd.cix.co.uk> <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com> <IGX3K.547929$7F2.230033@fx12.iad> <2022Apr8.183131@mips.complang.tuwien.ac.at> <kRZ3K.352378$f2a5.257016@fx48.iad>
Injection-Info: reader02.eternal-september.org; posting-host="793179b5b9bb48d5ea2a3da4f3b63bd8";
logging-data="10750"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19aAFrdiK2sSgV/NVZOBCp1"
Cancel-Lock: sha1:wUDsBfR9vszT5BN0/dP0/yjrX1U=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Fri, 8 Apr 2022 17:20 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> This comment seems to think it is due to how fork is implemented:
>>> In WSL1 the FP control word is maintained across a fork, in Linux it is not.
>>
>> The other way round. WSL resets the FP control word on fork(), Linux
>> propagates it.
>>
>>> https://github.com/microsoft/WSL/issues/830#issuecomment-472279984
>
>"The symptom indicates the x87 control word is not maintained as part
>of the linux process context, it's only in the NT process context."

luoqi-git showed what actually happens in
<https://github.com/microsoft/WSL/issues/830#issuecomment-472268384>:

I.e., Linux leaves the fcw as-is on fork and thread creation, WSL
changes it.

Not sure if his/her statement you cited was mixed up, or was written
in a way that reading it makes it easy to mix up, but if you read
his/her other statements, it's clear what happens and that he/she
thinks that Linux is POSIX-conformant and WSL is not.

>The comment that it is fixed in WSL2 is below the above link on 3-Jul.
>But below that on 25-Sep it says "people are still hitting this".

That refers to people still using WSL, not WSL2. See also
<https://github.com/microsoft/WSL/issues/830#issuecomment-837200238>
by the same poster:

|A user just hit this again today. So apparently people still are using
|WSL1.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Approximate reciprocals

<t2psrv$1br9$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24653&group=comp.arch#24653

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Fri, 8 Apr 2022 19:54:07 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t2psrv$1br9$1@gioia.aioe.org>
References: <vofo4hh9npgd0vaefo51khtndt80g440if@4ax.com>
<memo.20220406220122.22520M@jgd.cix.co.uk>
<16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<e2639f3e-8a7a-4a55-996d-3f7830363aaen@googlegroups.com>
<2022Apr8.165147@mips.complang.tuwien.ac.at>
<bb7b294d-e73b-435e-93f9-c0a05b62572en@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="44905"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11.1
X-Notice: Filtered by postfilter v. 0.9.2

by: Terje Mathisen - Fri, 8 Apr 2022 17:54 UTC

Michael S wrote:
> On Friday, April 8, 2022 at 6:03:53 PM UTC+3, Anton Ertl wrote:
>> Michael S <already...@yahoo.com> writes:
>>> Google helped me to find out that it was reported as a bug 5 years ago.
>>> It seems, somebody at MS didn't agree with such definition.
>>> https://github.com/microsoft/WSL/issues/1748
>> Following this points to
>> <https://github.com/microsoft/WSL/issues/830>. And if you read down
>> that issue, you find that this also affects fork(), and fixing this
>> "requires some fairly substantial changes to the Windows kernel
>> itself." This issue also shows that this bug hits lots of people.
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>
> Thank you.
> It seems, it all would have been much simpler if not for the old belief of kernel devs
> that saving FPU context is expensive so they are obliged to be tricky about it.
> I think that saving FPU context is less expensive than their clever tricks for at least
> 12-13 years (approximately since Nehalem on Intel and slightly longer on AMD),
> but it's hard to change an established belief.

Changing and restoring the FPU context was a key part of the FDIV sw
workaround back in 1994/95:

All compilers at the time were modified to either refuse to work on a
broken x87 part, or to include the workaround which I wrote a large part
of, but Tim Coe and Peter Tang did the heavy mathematical lifting to
prove that it would always work:

The code looked like this (pseudo-code):

double fdiv_fix(double a, double b) /* a / b */
{
    uint32_t mant = *((uint32_t *)&b + 1);  // Pick top 32-bit word of the divisor (little-endian)
    mant = (mant >> 10) & 1023;             // Top 10 mantissa bits
    fpu_load(a);                            // Put both operands on the FPU stack
    fpu_load(b);
    // Look up the mantissa in a 128-byte/1024-bit table
    if (bug_table[mant >> 3] & (1 << (mant & 7))) { // Only 5 one bits in the table
        old_fpu = get_fpu_env();
        set_fpu_env(LONG_DOUBLE);           // Force 80-bit mode
        fmul(a80, 15/16.0);                 // Scale both operands by 15/16, this is exact
        fmul(b80, 15/16.0);
        set_fpu_env(old_fpu);               // Restore to typically 64-bit mode
    }
    return fdiv();                          // Quotient of the two stack operands
}

I.e. if and only if the top 10 mantissa bits happened to hit one of
those 5 (out of 1024) patterns which made it possible for the SRT
divider to hit one of the 5 missing entries, would we run the workaround
code, otherwise the only overhead was the integer code to extract those
10 bits and check them in the 128-byte table, so typically less than 10
cycles extra for an operation that normally took 40.
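
The integer-side check described above (extract the top 10 mantissa bits, test one bit in a 128-byte table) can be written as portable C. This is a sketch that assumes `double` is IEEE 754 binary64; the helper names are invented here, and the table contents are left to the caller since the five actual at-risk patterns are not listed in this post.

```c
#include <stdint.h>
#include <string.h>

/* Top 10 bits of the 52-bit stored fraction of an IEEE 754 binary64 value. */
unsigned top10_mantissa_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);     /* well-defined way to read the bit pattern */
    return (unsigned)((bits >> 42) & 1023u);
}

/* 1024-bit table packed into 128 bytes, one bit per 10-bit pattern. */
void table_mark(uint8_t table[128], unsigned pattern)
{
    table[pattern >> 3] |= (uint8_t)(1u << (pattern & 7u));
}

/* Returns 1 if the pattern's bit is set in the table, 0 otherwise. */
int table_hit(const uint8_t table[128], unsigned pattern)
{
    return (table[pattern >> 3] >> (pattern & 7u)) & 1;
}
```

A caller would mark the at-risk patterns once at startup and then run `table_hit(table, top10_mantissa_bits(b))` on the divisor before each division.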

If we did get a hit then we used the beautiful trick (I did not come up
with it) of extending precision to 80-bit and then scaling both operands by
15/16, both of those FMULs were guaranteed to be exact since we now had
more exponent and mantissa bits available.
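
One way to see why the scaling is exact on the mantissa side: 15/16 is 15 times a power of two, so only the multiplication by 15 can lengthen the significand. The sketch below treats a double's significand as a 53-bit integer and counts the bits of the product. It is an illustration with made-up helper names, not the proof by Tim Coe and Peter Tang.

```c
#include <stdint.h>

/* Number of significant bits in v (0 for v == 0). */
unsigned bit_length(uint64_t v)
{
    unsigned n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Upper bound on the bits needed for a 53-bit integer significand m
   after multiplying by 15; the division by 16 only adjusts the exponent. */
unsigned scaled_significand_bits(uint64_t m)
{
    return bit_length(m * 15u);
}
```

The largest 53-bit significand needs 57 bits after scaling, which fits in the 64-bit significand of the 80-bit format but not in the 53 bits of a double.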

Set the mode back to the original and simply drop down into the final
FDIV opcode which would always give exactly the same result as a working
fpu, while using about 80 cycles, so that even for code which did almost
nothing but FDIV the program would not be noticeably slower.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>

https://www.novabbs.com/devel/article-flat.php?id=24656&group=comp.arch#24656

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: gneun...@comcast.net (George Neuner)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sat, 09 Apr 2022 14:27:04 -0400
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Injection-Info: reader02.eternal-september.org; posting-host="d16edef8f813cd730dedb0a710c6c35d";
logging-data="17184"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+1+ZbK9QvIUsHrcRG5cUB5EDf2WDdrF5c="
User-Agent: ForteAgent/8.00.32.1272
Cancel-Lock: sha1:coEZPKHGK0SejTR3EW+PCU2hVew=

by: George Neuner - Sat, 9 Apr 2022 18:27 UTC

On Fri, 8 Apr 2022 16:04 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

>In article <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>,
>gneuner2@comcast.net (George Neuner) wrote:
>
>> Since Pentium 4 the compiler has by default used SIMD registers for
>> floating point. The x87 is not normally used.
>
>Which compiler? Microsoft Visual Studio is the nearest thing to a default
>compiler for Windows. Its x86-64 version has used SSE2 for all floating
>point since it first appeared, but that was several years after Pentium 4.
>The x86 version became able to compile for SSE2 in VS.2005, and adopted
>it as the default in VS.2012, but some versions have used x87 as well as
>SSE2 instructions in the same compiles.

32-bit VisualC(++) v6.0 and 64-bit VisualC(++) v4.0 used SSE2 by
default. Both of these were released in Fall 2000 concurrent with
introduction of the Pentium 4.

Prior to v6, the 32-bit compiler used the x87 by default but v5 could
be back-patched to both use SSE2 and generate SSE2 intrinsics.

>> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
>> double). Anton gave a more detailed response further down.
>
>Which environment is this in? That is not true for the Visual Studio
>compilers.

Windows creates processes with the x87 flags set to 64-bit precision.
You can see this if you write your program in assembler or step
through the program startup BEFORE the C runtime mucks with things.

At least through v5, the 32-bit compiler maintained the process
default. At some point the runtime (MSVC.DLL) began changing to
53-bit precision before running user code.

I don't recall when that happened ... I didn't write much 64-bit code
for Windows until ~2010, and the image processing applications I was
working on benefited from integer SIMD but were not much affected by
FPU precision or which FPU was used.

>> And yes, I understand both Michael's and Anton's aversion to
>> unnecessary work ... particularly because changing precision on the
>> x87 can be very slow. But since the setting can be READ as well as
>> written, there is no need to change it if it is correct when the
>> library goes to use it.
>
>If you do it that way, you need to read the flags on every entry, and set
>them if they need to change. And if you're going to avoid leaking this
>setting to other code that may be upset by it, you need to save the old
>controls and restore them. It all adds overhead.

Yes it does.

Changing precision on the x87 does not affect the speed of most FPU
instructions. The problem with doing it is that the FPU pipeline has
to be emptied before the change can happen, so the changeover may be
slow (possibly in both directions).

>John
George

Re: Approximate reciprocals

<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24657&group=comp.arch#24657

X-Received: by 2002:a0c:8e0b:0:b0:435:1779:7b22 with SMTP id v11-20020a0c8e0b000000b0043517797b22mr21501784qvb.63.1649535597362;
Sat, 09 Apr 2022 13:19:57 -0700 (PDT)
X-Received: by 2002:a05:6870:1607:b0:de:984:496d with SMTP id
b7-20020a056870160700b000de0984496dmr11306457oae.253.1649535597136; Sat, 09
Apr 2022 13:19:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Apr 2022 13:19:56 -0700 (PDT)
In-Reply-To: <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:bda9:15b1:7bf:e89;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:bda9:15b1:7bf:e89
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 09 Apr 2022 20:19:57 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 61

by: MitchAlsup - Sat, 9 Apr 2022 20:19 UTC

On Saturday, April 9, 2022 at 1:27:09 PM UTC-5, George Neuner wrote:
> On Fri, 8 Apr 2022 16:04 +0100 (BST), j...@cix.co.uk (John Dallman)
> wrote:
> >In article <16su4hdjofh949len...@4ax.com>,
> >gneu...@comcast.net (George Neuner) wrote:
> >
> >> Since Pentium 4 the compiler has by default used SIMD registers for
> >> floating point. The x87 is not normally used.
> >
> >Which compiler? Microsoft Visual Studio is the nearest thing to a default
> >compiler for Windows. Its x86-64 version has used SSE2 for all floating
> >point since it first appeared, but that was several years after Pentium 4.
> >The x86 version became able to compile for SSE2 in VS.2005, and adopted
> >it as the default in VS.2012, but some versions have used x87 as well as
> >SSE2 instructions in the same compiles.
> 32-bit VisualC(++) v6.0. and 64-bit VisualC(++) v4.0 used SSE2 by
> default. Both of these were released in Fall 2000 concurrent with
> introduction of the Pentium 4.
>
> Prior to v6, the 32-bit compiler used the x87 by default but v5 could
> be back-patched to both use SSE2 and generate SSE2 intrinsics.
> >> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
> >> double). Anton gave a more detailed response further down.
> >
> >Which environment is this in? That is not true for the Visual Studio
> >compilers.
> Windows creates processes with the x87 flags set to 64-bit precision.
> You can see this if you write your program in assembler or step
> through the program startup BEFORE the C runtime mucks with things.
>
> At least through v5, the 32-bit compiler maintained the process
> default. At some point the runtime (MSVC.DLL) began changing to
> 53-bit precision before running user code.
>
> I don't recall when that happened ... I didn't write much 64-bit code
> for Windows until ~2010, and the image processing applications I was
> working on benefited from integer SIMD but were not much affected by
> FPU precision or which FPU was used.
> >> And yes, I understand both Michael's and Anton's aversion to
> >> unnecessary work ... particularly because changing precision on the
> >> x87 can be very slow. But since the setting can be READ as well as
> >> written, there is no need to change it if it is correct when the
> >> library goes to use it.
> >
> >If you do it that way, you need to read the flags on every entry, and set
> >them if they need to change. And if you're going to avoid leaking this
> >setting to other code that may be upset by it, you need to save the old
> >controls and restore them. It all adds overhead.
> Yes it does.
>
> Changing precision on the x87 does not affect the speed of most FPU
> instructions. The problem with doing it is that the FPU pipeline has
> to be emptied before the change can happen, so the changeover may be
> slow (possibly in both directions).

<
Bad wording:: it is possible to design a processor that does not HAVE to drain
pipeline to set the mode--Intel has not done so.
<
>
> >John
> George

Re: Approximate reciprocals

<d39ef73f-a572-4d30-a714-ad51957047d6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24658&group=comp.arch#24658

Newsgroups: comp.arch

X-Received: by 2002:a05:620a:6cc:b0:69b:dd1b:3235 with SMTP id 12-20020a05620a06cc00b0069bdd1b3235mr5813010qky.374.1649537627735;
Sat, 09 Apr 2022 13:53:47 -0700 (PDT)
X-Received: by 2002:a4a:ad46:0:b0:324:498e:4fe1 with SMTP id
s6-20020a4aad46000000b00324498e4fe1mr7991825oon.89.1649537627491; Sat, 09 Apr
2022 13:53:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Apr 2022 13:53:47 -0700 (PDT)
In-Reply-To: <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.182.229; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.182.229
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d39ef73f-a572-4d30-a714-ad51957047d6n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 09 Apr 2022 20:53:47 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 74

by: Michael S - Sat, 9 Apr 2022 20:53 UTC

On Saturday, April 9, 2022 at 11:19:58 PM UTC+3, MitchAlsup wrote:
> On Saturday, April 9, 2022 at 1:27:09 PM UTC-5, George Neuner wrote:
> > On Fri, 8 Apr 2022 16:04 +0100 (BST), j...@cix.co.uk (John Dallman)
> > wrote:
> > >In article <16su4hdjofh949len...@4ax.com>,
> > >gneu...@comcast.net (George Neuner) wrote:
> > >
> > >> Since Pentium 4 the compiler has by default used SIMD registers for
> > >> floating point. The x87 is not normally used.
> > >
> > >Which compiler? Microsoft Visual Studio is the nearest thing to a default
> > >compiler for Windows. Its x86-64 version has used SSE2 for all floating
> > >point since it first appeared, but that was several years after Pentium 4.
> > >The x86 version became able to compile for SSE2 in VS.2005, and adopted
> > >it as the default in VS.2012, but some versions have used x87 as well as
> > >SSE2 instructions in the same compiles.
> > 32-bit VisualC(++) v6.0. and 64-bit VisualC(++) v4.0 used SSE2 by
> > default. Both of these were released in Fall 2000 concurrent with
> > introduction of the Pentium 4.
> >
> > Prior to v6, the 32-bit compiler used the x87 by default but v5 could
> > be back-patched to both use SSE2 and generate SSE2 intrinsics.
> > >> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
> > >> double). Anton gave a more detailed response further down.
> > >
> > >Which environment is this in? That is not true for the Visual Studio
> > >compilers.
> > Windows creates processes with the x87 flags set to 64-bit precision.
> > You can see this if you write your program in assembler or step
> > through the program startup BEFORE the C runtime mucks with things.
> >
> > At least through v5, the 32-bit compiler maintained the process
> > default. At some point the runtime (MSVC.DLL) began changing to
> > 53-bit precision before running user code.
> >
> > I don't recall when that happened ... I didn't write much 64-bit code
> > for Windows until ~2010, and the image processing applications I was
> > working on benefited from integer SIMD but were not much affected by
> > FPU precision or which FPU was used.
> > >> And yes, I understand both Michael's and Anton's aversion to
> > >> unnecessary work ... particularly because changing precision on the
> > >> x87 can be very slow. But since the setting can be READ as well as
> > >> written, there is no need to change it if it is correct when the
> > >> library goes to use it.
> > >
> > >If you do it that way, you need to read the flags on every entry, and set
> > >them if they need to change. And if you're going to avoid leaking this
> > >setting to other code that may be upset by it, you need to save the old
> > >controls and restore them. It all adds overhead.
> > Yes it does.
> >
> > Changing precision on the x87 does not affect the speed of most FPU
> > instructions. The problem with doing it is that the FPU pipeline has
> > to be emptied before the change can happen, so the changeover may be
> > slow (possibly in both directions).
>
> <
> Bad wording:: it is possible to design a processor that does not HAVE to drain
> pipeline to set the mode--Intel has not done so.

As far as I remember, Intel actually did so, at least as long as the change
was only in the rounding control bits. I don't remember whether changes to
the precision control bits got the same facelift.
The motivation for the change was speeding up double-to-integer conversion,
which had started to become a bottleneck for one or more important customers.
Of course, a few years later they introduced SSE4.1, which includes ROUNDSD
and so performs double-to-integer conversions just fine without messing with
rounding modes, but I think (although I never tested it) that fast handling
of changes to the x87 rounding mode still persists in modern Intel processors
even though it is no longer important.

> <
> >
> > >John
> > George

In article <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>,
gneuner2@comcast.net (George Neuner) wrote:

> 32-bit VisualC(++) v6.0. and 64-bit VisualC(++) v4.0 used SSE2 by
> default. Both of these were released in Fall 2000 concurrent with
> introduction of the Pentium 4.

You are talking about a quite different history of Visual C++ from me. We
appear to be posting across realities.

In my history, Visual C++ 6 was released in 1998,
<https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B#vc6>, and there
wasn't a Visual C++ compiler for 64-bit x86 until Visual C++ 8 in
November 2005.
<https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B#32-bit_and_64-bit_versions>
Visual C++ 4 was from 1995, far too early for 64-bit support.
There were no separate versions for 64-bit x86, with version numbers
duplicating older version numbers.

In my history the Athlon 64 wasn't released until September 2003
<https://en.wikipedia.org/wiki/Athlon_64>, so a Visual C++ targeting
64-bit x86 in 2000 did not happen.

John

Re: Approximate reciprocals

<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24660&group=comp.arch#24660

Newsgroups: comp.arch

X-Received: by 2002:ac8:4457:0:b0:2ed:f4a:8d9 with SMTP id m23-20020ac84457000000b002ed0f4a08d9mr4747747qtn.396.1649571569343;
Sat, 09 Apr 2022 23:19:29 -0700 (PDT)
X-Received: by 2002:a05:6808:2018:b0:2ec:c22b:15b8 with SMTP id
q24-20020a056808201800b002ecc22b15b8mr2909599oiw.136.1649571569097; Sat, 09
Apr 2022 23:19:29 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Apr 2022 23:19:28 -0700 (PDT)
In-Reply-To: <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:6947:3c86:73e1:a64e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:6947:3c86:73e1:a64e
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 10 Apr 2022 06:19:29 +0000
Content-Type: text/plain; charset="UTF-8"

by: Quadibloc - Sun, 10 Apr 2022 06:19 UTC

On Saturday, April 9, 2022 at 2:19:58 PM UTC-6, MitchAlsup wrote:
> On Saturday, April 9, 2022 at 1:27:09 PM UTC-5, George Neuner wrote:

> > Changing precision on the x87 does not affect the speed of most FPU
> > instructions. The problem with doing it is that the FPU pipeline has
> > to be emptied before the change can happen, so the changeover may be
> > slow (possibly in both directions).

> Bad wording:: it is possible to design a processor that does not HAVE to drain
> pipeline to set the mode--Intel has not done so.

I can't really agree with your criticism; he is talking about the real world, not
hypotheticals.

But I must admit that I am disappointed to hear this.

You have heard the crazy way *I* would have designed a processor. With separate
pipelines for single precision and double precision and quad precision and one-and-a-half
precision.

But if one were to put a 60-bit floating-point number down the double precision pipeline,
no, one would not have to drain it to change the mode. (The double precision pipeline
would actually be designed for 72-bit floats, which would also use it. 80 bit temporary
reals would go down the double precision pipeline, unless there was one for 96-bit
floats.)

So mixing precisions would make your programs go *faster*, because it would utilize
the other pipelines that otherwise would not be used. The very opposite of Intel!

John Savard

Re: Approximate reciprocals

<40ca14ca-1267-406f-8b34-bea83414ca73n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24661&group=comp.arch#24661

Newsgroups: comp.arch

X-Received: by 2002:a05:620a:470d:b0:67d:d8a8:68c6 with SMTP id bs13-20020a05620a470d00b0067dd8a868c6mr18293446qkb.717.1649571791391;
Sat, 09 Apr 2022 23:23:11 -0700 (PDT)
X-Received: by 2002:a9d:6e89:0:b0:5b2:4c01:2210 with SMTP id
a9-20020a9d6e89000000b005b24c012210mr9366960otr.85.1649571791166; Sat, 09 Apr
2022 23:23:11 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 9 Apr 2022 23:23:10 -0700 (PDT)
In-Reply-To: <memo.20220409225233.22520a@jgd.cix.co.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:6947:3c86:73e1:a64e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:6947:3c86:73e1:a64e
References: <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <memo.20220409225233.22520a@jgd.cix.co.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <40ca14ca-1267-406f-8b34-bea83414ca73n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Sun, 10 Apr 2022 06:23:11 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 30

by: Quadibloc - Sun, 10 Apr 2022 06:23 UTC

On Saturday, April 9, 2022 at 3:52:38 PM UTC-6, John Dallman wrote:
> In article <tmg35h5jeb295i594...@4ax.com>,
> gneu...@comcast.net (George Neuner) wrote:
>
> > 32-bit VisualC(++) v6.0. and 64-bit VisualC(++) v4.0 used SSE2 by
> > default. Both of these were released in Fall 2000 concurrent with
> > introduction of the Pentium 4.
> You are talking about a quite different history of Visual C++ from me. We
> appear to be posting across realities.
>
> In my history, Visual C++ 6 was released in 1998,
> <https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B#vc6>, and there
> wasn't a Visual C++ compiler for 64-bit x86 until Visual C++ 8 in
> November 2005.
> <https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B#32-bit_and_64-bit_versions>
> Visual C++ 4 was from 1995, far too early for 64-bit support.
> There were no separate versions for 64-bit x86, with version numbers
> duplicating older version numbers.
>
> In my history the Athlon 64 wasn't released until September 2003
> <https://en.wikipedia.org/wiki/Athlon_64>, so a Visual C++ targeting
> 64-bit x86 in 2000 did not happen.

Just because 32-bit Visual C++ 4.0 was from 1995 doesn't mean that Microsoft
couldn't have released 64-bit Visual C++ 4.0 later. However, I think the actual explanation,
given that the two compilers mentioned were released at the same time, is that he
meant to type "64-bit VisualC(++) v6.0" to match the 32-bit one, and it was a typo.

Much simpler than posting across timelines.

John Savard

Re: Approximate reciprocals

<2022Apr10.103214@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24662&group=comp.arch#24662

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sun, 10 Apr 2022 08:32:14 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 49
Message-ID: <2022Apr10.103214@mips.complang.tuwien.ac.at>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="475276b09171678866ab4d3ac4f25c13";
logging-data="25801"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+UbMoihXlPzfJMFvLIZJP3"
Cancel-Lock: sha1:eWTgZ23PlsPfXeMLnaCMtjUFOGc=
X-newsreader: xrn 10.00-beta-3

by: Anton Ertl - Sun, 10 Apr 2022 08:32 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>You have heard the crazy way *I* would have designed a processor. With separate
>pipelines for single precision and double precision and quad precision and one-and-a-half
>precision.
>
>But if one were to put a 60-bit floating-point number down the double precision pipeline,
>no, one would not have to drain it to change the mode. (The double precision pipeline
>would actually be designed for 72-bit floats, which would also use it. 80 bit temporary
>reals would go down the double precision pipeline, unless there was one for 96-bit
>floats.)
>
>So mixing precisions would make your programs go *faster*, because it would utilize
>the other pipelines that otherwise would not be used. The very opposite of Intel!

In the last weeks I have noticed unusually many people posting
over-long lines (i.e., longer than 80 chars, and ideally you should
limit the lines to 70-72 chars to leave room for quoting). Is there
some new attack on Usenet conventions coming out from Google?

Anyway, my guess for the reason for slow precision-setting is that
Intel and AMD microarchitects want the precision setting to be known
to the decoder, so it can deliver the precision as part of the uop.
This requires that when setting the precision, decoding of subsequent
instructions starts from scratch. An alternative would be to deliver
the precision as another input in the OoO engine, but that would
require additional resources in the OoO engine, and apparently they
thought that spending these resources elsewhere would buy more
performance.

Concerning your separate-pipelines idea, in that setup it's even more
advantageous to know the precision early in instruction processing:
You can have separate queues/ports for the different precisions and
steer the instructions to these ports early, instead of having common
FP queues, and steering the instructions to the right pipelines only
when all the data (including precision) is in; ok, you could also have
two stages of queues, but that introduces additional complication,
area, and probably latency.

Also, even if switching is fast, how frequent is code with mixed
precision? E.g., in DGEMM you only use double precision operations,
while in SGEMM you only use single-precision operations.

Bottom line: There's a reason why Intel and AMD are designing their
CPUs the way they are.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Approximate reciprocals

<fc924a92-b61b-40bf-bc72-5ce991830447n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24663&group=comp.arch#24663

Newsgroups: comp.arch

X-Received: by 2002:a05:6214:621:b0:432:5e0d:cb64 with SMTP id a1-20020a056214062100b004325e0dcb64mr22825554qvx.65.1649583645482;
Sun, 10 Apr 2022 02:40:45 -0700 (PDT)
X-Received: by 2002:a05:6870:1697:b0:e2:a341:a2e with SMTP id
j23-20020a056870169700b000e2a3410a2emr3011104oae.69.1649583645251; Sun, 10
Apr 2022 02:40:45 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 10 Apr 2022 02:40:45 -0700 (PDT)
In-Reply-To: <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fc924a92-b61b-40bf-bc72-5ce991830447n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sun, 10 Apr 2022 09:40:45 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 140

by: Michael S - Sun, 10 Apr 2022 09:40 UTC

On Saturday, April 9, 2022 at 9:27:09 PM UTC+3, George Neuner wrote:
> On Fri, 8 Apr 2022 16:04 +0100 (BST), j...@cix.co.uk (John Dallman)
> wrote:
> >In article <16su4hdjofh949len...@4ax.com>,
> >gneu...@comcast.net (George Neuner) wrote:
> >
> >> Since Pentium 4 the compiler has by default used SIMD registers for
> >> floating point. The x87 is not normally used.
> >
> >Which compiler? Microsoft Visual Studio is the nearest thing to a default
> >compiler for Windows. Its x86-64 version has used SSE2 for all floating
> >point since it first appeared, but that was several years after Pentium 4.
> >The x86 version became able to compile for SSE2 in VS.2005, and adopted
> >it as the default in VS.2012, but some versions have used x87 as well as
> >SSE2 instructions in the same compiles.
> 32-bit VisualC(++) v6.0. and 64-bit VisualC(++) v4.0 used SSE2 by
> default. Both of these were released in Fall 2000 concurrent with
> introduction of the Pentium 4.
>
> Prior to v6, the 32-bit compiler used the x87 by default but v5 could
> be back-patched to both use SSE2 and generate SSE2 intrinsics.

I don't have VS5 or VS6 installed.
[O.T.]
I'd like to have VS5 and it is even moderately important for the maintenance
of one of the old projects, but I don't know where to find it.
[O.T.]
I do have VS2005 (i.e. VS8 if we map it to the old numbering scheme).
I tried to compile the following program with default Win32 Release settings:

void GN_test(double* x) {
    x[2] = x[0] + x[1];
}

That's what I got:
; Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.50727.762

TITLE d:\michael\George_Neuner_test\gn_test.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB OLDNAMES

PUBLIC _GN_test
EXTRN __fltused:DWORD
; Function compile flags: /Ogtpy
; COMDAT _GN_test
_TEXT SEGMENT
_GN_test PROC ; COMDAT
; _x$ = eax
; File d:\michael\george_neuner_test\gn_test.c
; Line 2
fld QWORD PTR [eax+8]
fadd QWORD PTR [eax]
fstp QWORD PTR [eax+16]
; Line 3
ret 0
_GN_test ENDP
END

As you see, it's 100% x87, no traces of SSE2.

Then I tried the same on VS2017 and result was different.
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.16.27032.1

TITLE c:\users\michael_s\work\kuku\george_neuner_test\gn_test\gn_test\gn_test.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB OLDNAMES

EXTRN @__security_check_cookie@4:PROC
PUBLIC _GN_test
PUBLIC __xmm@40000000000000003ff0000000000000
EXTRN __fltused:DWORD
; COMDAT __xmm@40000000000000003ff0000000000000
CONST SEGMENT
__xmm@40000000000000003ff0000000000000 DB 00H, 00H, 00H, 00H, 00H, 00H, 0f0H
DB '?', 00H, 00H, 00H, 00H, 00H, 00H, 00H, '@'
CONST ENDS
; Function compile flags: /Ogtp
; COMDAT _GN_test
_TEXT SEGMENT
_GN_test PROC ; COMDAT
; _x$ = ecx
; File c:\users\michael_s\work\kuku\george_neuner_test\gn_test\gn_test\gn_test.c
; Line 2
movsd xmm0, QWORD PTR [ecx+8]
addsd xmm0, QWORD PTR [ecx]
movsd QWORD PTR [ecx+16], xmm0
; Line 3
ret 0
_GN_test ENDP
_TEXT ENDS
END

Here we see SSE2 code.
I don't have other versions of Visual Studio installed right now, so can't be sure when exactly
defaults were changed to SSE2, but it happened at least 3 versions later than you suggested.

> >> WSL aside, on /Windows/ x87 precision defaults to 64-bit (long
> >> double). Anton gave a more detailed response further down.
> >
> >Which environment is this in? That is not true for the Visual Studio
> >compilers.
> Windows creates processes with the x87 flags set to 64-bit precision.
> You can see this if you write your program in assembler or step
> through the program startup BEFORE the C runtime mucks with things.
>
> At least through v5, the 32-bit compiler maintained the process
> default. At some point the runtime (MSVC.DLL) began changing to
> 53-bit precision before running user code.
>
> I don't recall when that happened ... I didn't write much 64-bit code
> for Windows until ~2010, and the image processing applications I was
> working on benefited from integer SIMD but were not much affected by
> FPU precision or which FPU was used.
> >> And yes, I understand both Michael's and Anton's aversion to
> >> unnecessary work ... particularly because changing precision on the
> >> x87 can be very slow. But since the setting can be READ as well as
> >> written, there is no need to change it if it is correct when the
> >> library goes to use it.
> >
> >If you do it that way, you need to read the flags on every entry, and set
> >them if they need to change. And if you're going to avoid leaking this
> >setting to other code that may be upset by it, you need to save the old
> >controls and restore them. It all adds overhead.
> Yes it does.
>
> Changing precision on the x87 does not affect the speed of most FPU
> instructions. The problem with doing it is that the FPU pipeline has
> to be emptied before the change can happen, so the changeover may be
> slow (possibly in both directions).
>
> >John
> George

Re: Approximate reciprocals

<STA4K.349142$Gojc.88544@fx99.iad>

https://www.novabbs.com/devel/article-flat.php?id=24664&group=comp.arch#24664

Newsgroups: comp.arch

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx99.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
In-Reply-To: <2022Apr10.103214@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 85
Message-ID: <STA4K.349142$Gojc.88544@fx99.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 10 Apr 2022 13:24:02 UTC
Date: Sun, 10 Apr 2022 09:23:18 -0400
X-Received-Bytes: 5280

by: EricP - Sun, 10 Apr 2022 13:23 UTC

Anton Ertl wrote:
> Quadibloc <jsavard@ecn.ab.ca> writes:
>> You have heard the crazy way *I* would have designed a processor. With separate
>> pipelines for single precision and double precision and quad precision and one-and-a-half
>> precision.
>>
>> But if one were to put a 60-bit floating-point number down the double precision pipeline,
>> no, one would not have to drain it to change the mode. (The double precision pipeline
>> would actually be designed for 72-bit floats, which would also use it. 80 bit temporary
>> reals would go down the double precision pipeline, unless there was one for 96-bit
>> floats.)
>>
>> So mixing precisions would make your programs go *faster*, because it would utilize
>> the other pipelines that otherwise would not be used. The very opposite of Intel!
>
> In the last weeks I have noticed unusually many people posting
> over-long lines (i.e., longer than 80 chars, and ideally you should
> limit the lines to 70-72 chars to leave room for quoting). Is there
> some new attack on Usenet conventions coming out from Google?
>
> Anyway, my guess for the reason for slow precision-setting is that
> Intel and AMD microarchitects want the precision setting to be known
> to the decoder, so it can deliver the precision as part of the uop.
> This requires that when setting the precision, decoding of subsequent
> instructions starts from scratch. An alternative would be to deliver
> the precision as another input in the OoO engine, but that would
> require additional resources in the OoO engine, and apparently they
> thought that spending these resources elsewhere would buy more
> performance.
>
> Concerning your separate-pipelines idea, in that setup it's even more
> advantageous to know the precision early in instruction processing:
> You can have separate queues/ports for the different precisions and
> steer the instructions to these ports early, instead of having common
> FP queues, and steering the instructions to the right pipelines only
> when all the data (including precision) is in; ok, you could also have
> two stages of queues, but that introduces additional complication,
> area, and probably latency.
>
> Also, even if switching is fast, how frequent is code with mixed
> precision? E.g., in DGEMM you only use double precision operations,
> while in SGEMM you only use single-precision operations.
>
> Bottom line: There's a reason why Intel and AMD are designing their
> CPUs the way they are.

The x87 FP Control Word has flags to control
Infinity, Rounding, Precision, and Exception masking.

If you are changing some bits and not others then you have to
store the current value, mask in your changes, and load it with FLDCW.
That store can be either the synchronous FSTCW or the asynchronous FNSTCW
(remember that the x87 is a separate co-processor with long-running
transcendentals, and FSTCW is actually an assembler macro
instruction which emits FWAIT, FNSTCW).
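
[Editorial sketch] The "store, mask in your changes, load" step above amounts to a read-modify-write on one field of a 16-bit word. A minimal illustration follows, operating on a plain integer rather than on the real register. The bit positions are the architectural x87 ones; the helper names `cw_with_precision` and `cw_with_rounding` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* x87 control word fields (architectural bit positions). */
#define CW_EXC_MASKS 0x003Fu                /* bits 0-5: exception masks    */
#define CW_PC_SHIFT  8                      /* bits 8-9: precision control  */
#define CW_PC_MASK   (0x3u << CW_PC_SHIFT)
#define CW_RC_SHIFT  10                     /* bits 10-11: rounding control */
#define CW_RC_MASK   (0x3u << CW_RC_SHIFT)

enum cw_precision { PC_SINGLE = 0, PC_DOUBLE = 2, PC_EXTENDED = 3 };
enum cw_rounding  { RC_NEAREST = 0, RC_DOWN = 1, RC_UP = 2, RC_TRUNCATE = 3 };

/* Hypothetical helper: return cw with only the precision field replaced.
 * Real code would obtain cw via FNSTCW and load the result via FLDCW. */
uint16_t cw_with_precision(uint16_t cw, enum cw_precision pc)
{
    assert(pc == PC_SINGLE || pc == PC_DOUBLE || pc == PC_EXTENDED);
    return (uint16_t)((cw & ~CW_PC_MASK) | ((unsigned)pc << CW_PC_SHIFT));
}

/* Hypothetical helper: return cw with only the rounding field replaced. */
uint16_t cw_with_rounding(uint16_t cw, enum cw_rounding rc)
{
    assert(rc >= RC_NEAREST && rc <= RC_TRUNCATE);
    return (uint16_t)((cw & ~CW_RC_MASK) | ((unsigned)rc << CW_RC_SHIFT));
}
```

For instance, starting from the FINIT default 0x037F, `cw_with_precision(0x037F, PC_DOUBLE)` yields 0x027F, leaving the exception masks and rounding mode untouched.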

If any of the exception mask flags change then we may need to
synchronize with current and pending exceptions status.
So this ties changes in the control word into current
and future state in the status register.

Then there is the issue that the new CW is coming from the
data path so to merge the CW bits into the uOp bits in the decoder
implies some kind of front end delay between when the FLDCW decodes
and when the new value propagates back to decode.

An alternatively design merges the FPCW flags into the uOp in the
FPU itself, but then we have to deal with the FP instructions are
launching out-of-order and we have to make sure the right set
of flags goes to the right uOp.
For this I would have a small set physical FPCW registers,
4 should be sufficient, and a renamer for the one logical CW register.
This makes the CW bits a uOp data dependency like other FP operands
and it would require its own wake-up matrix and forwarding bus,
but much simpler than the normal operand support logic.
The then-current (future) CW bits merge into the uOp when
it is launched for execution.

With this a write to the FPCW would only stall as long as it took
the new CW value to arrive at its CW physical register or appear
on its forwarding bus, so ideally allowing back-to-back execution.
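
The renaming scheme can be sketched as a toy software model. This is only
an illustration under assumptions: the class and method names are
invented, a reference count stands in for whatever a real design would do
at retirement, and the wake-up matrix is reduced to "launch returns None
until the value has arrived".

```python
class CwRenamer:
    """Toy model: one logical FPCW renamed onto a few physical registers."""

    def __init__(self, initial_cw, num_phys=4):
        self.value = [None] * num_phys   # None = value not produced yet
        self.value[0] = initial_cw
        self.refs = [0] * num_phys       # in-flight uOps reading each register
        self.current = 0                 # register holding the newest logical CW
        self.free = list(range(1, num_phys))

    def _maybe_free(self, tag):
        # A register is reusable once it is superseded, written, and unread.
        if (tag != self.current and self.refs[tag] == 0
                and self.value[tag] is not None and tag not in self.free):
            self.free.append(tag)

    def rename_fldcw(self):
        """FLDCW at rename: allocate a destination, or None to stall."""
        if not self.free:
            return None
        tag = self.free.pop(0)
        self.value[tag] = None
        old, self.current = self.current, tag
        self._maybe_free(old)
        return tag

    def rename_fp_uop(self):
        """FP uOp at rename: capture the tag of the CW it depends on."""
        self.refs[self.current] += 1
        return self.current

    def write_back(self, tag, cw):
        """The FLDCW data arrives at its physical register."""
        self.value[tag] = cw
        self._maybe_free(tag)

    def launch(self, tag):
        """FP uOp at launch: merge in its CW bits, or None if not ready."""
        cw = self.value[tag]
        if cw is None:
            return None
        self.refs[tag] -= 1
        self._maybe_free(tag)
        return cw


r = CwRenamer(0x037F)
older = r.rename_fp_uop()      # depends on the old CW
ld = r.rename_fldcw()          # new CW, value still in flight
younger = r.rename_fp_uop()    # depends on the new CW
```

In this model the younger uOp cannot launch until `write_back(ld, ...)` has
happened, while the older uOp may launch at any time and still sees the old
control word, regardless of launch order.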

Re: Approximate reciprocals

<Y5C4K.174018$ZmJ7.153689@fx06.iad>

https://www.novabbs.com/devel/article-flat.php?id=24665&group=comp.arch#24665

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!ecngs!feeder2.ecngs.de!178.20.174.213.MISMATCH!feeder1.feed.usenet.farm!feed.usenet.farm!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx06.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <STA4K.349142$Gojc.88544@fx99.iad>
In-Reply-To: <STA4K.349142$Gojc.88544@fx99.iad>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 106
Message-ID: <Y5C4K.174018$ZmJ7.153689@fx06.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 10 Apr 2022 14:47:20 UTC
Date: Sun, 10 Apr 2022 10:47:13 -0400
X-Received-Bytes: 6038

by: EricP - Sun, 10 Apr 2022 14:47 UTC

EricP wrote:
> Anton Ertl wrote:
>> Quadibloc <jsavard@ecn.ab.ca> writes:
>>> You have heard the crazy way *I* would have designed a processor.
>>> With separate
>>> pipelines for single precision and double precision and quad
>>> precision and one-and-a-half
>>> precision.
>>>
>>> But if one were to put a 60-bit floating-point number down the double
>>> precision pipeline,
>>> no, one would not have to drain it to change the mode. (The double
>>> precision pipeline
>>> would actually be designed for 72-bit floats, which would also use
>>> it. 80 bit temporary
>>> reals would go down the double precision pipeline, unless there was
>>> one for 96-bit
>>> floats.)
>>>
>>> So mixing precisions would make your programs go *faster*, because it
>>> would utilize
>>> the other pipelines that otherwise would not be used. The very
>>> opposite of Intel!
>>
>> In the last weeks I have noticed unusually many people posting
>> over-long lines (i.e., longer than 80 chars, and ideally you should
>> limit the lines to 70-72 chars to leave room for quoting). Is there
>> some new attack on Usenet conventions coming out from Google?
>>
>> Anyway, my guess for the reason for slow precision-setting is that
>> Intel and AMD microarchitects want the precision setting to be known
>> to the decoder, so it can deliver the precision as part of the uop.
>> This requires that when setting the precision, decoding of subsequent
>> instructions starts from scratch. An alternative would be to deliver
>> the precision as another input in the OoO engine, but that would
>> require additional resources in the OoO engine, and apparently they
>> thought that spending these resources elsewhere would buy more
>> performance.
>>
>> Concerning your separate-pipelines idea, in that setup it's even more
>> advantageous to know the precision early in instruction processing:
>> You can have separate queues/ports for the different precisions and
>> steer the instructions to these ports early, instead of having common
>> FP queues, and steering the instructions to the right pipelines only
>> when all the data (including precision) is in; ok, you could also have
>> two stages of queues, but that introduces additional complication,
>> area, and probably latency.
>>
>> Also, even if switching is fast, how frequent is code with mixed
>> precision? E.g., in DGEMM you only use double precision operations,
>> while in SGEMM you only use single-precision operations.
>>
>> Bottom line: There's a reason why Intel and AMD are designing their
>> CPUs the way they are.
>
> The x87 FP Control Word has flags to control
> Infinity, Rounding, Precision, and Exception masking.
>
> If you are changing some bits and not others then you have to
> store the current value, mask in your changes, and load it with FLDCW.
> That store can either be synchronous FSTCW or asynchronous FNSTCW
> (remember x87 is a separate co-processor with long running
> transcendentals, and the FSTCW is actually an assembler macro
> instruction which emits FWAIT, FNSTCW).
>
> If any of the exception mask flags change then we may need to
> synchronize with current and pending exception status.
> So this ties changes in the control word into current
> and future state in the status register.
>
> Then there is the issue that the new CW is coming from the
> data path, so merging the CW bits into the uOp bits in the decoder
> implies some kind of front-end delay between when the FLDCW decodes
> and when the new value propagates back to decode.
>
> An alternative design merges the FPCW flags into the uOp in the
> FPU itself, but then we have to deal with FP instructions
> launching out-of-order and we have to make sure the right set
> of flags goes to the right uOp.
> For this I would have a small set of physical FPCW registers,
> 4 should be sufficient, and a renamer for the one logical CW register.
> This makes the CW bits a uOp data dependency like other FP operands
> and it would require its own wake-up matrix and forwarding bus,
> but much simpler than the normal operand support logic.
> The then-current (future) CW bits merge into the uOp when
> it is launched for execution.
>
> With this a write to the FPCW would only stall as long as it took
> the new CW value to arrive at its CW physical register or appear
> on its forwarding bus, so ideally allowing back-to-back execution.

This would also need something to handle the FP Status Word
which contains the sticky exception flags.
The implied OR of the uOp status with the current status
creates a serial dependency that we'd need to break up.

It is not the same as the integer status flags because those
flags overwrite the prior values so a rename can handle them.

Whereas each FP exception bit has to OR with its prior state
and then merge with that uOp's exception mask bits to
decide whether to throw an exception.
So a simple renamer for the FPSW wouldn't suffice.
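
A small sketch of why the sticky flags form a serial chain. It assumes the
x87 layout in which the six exception flags (status word) and their masks
(control word) both occupy bits 0-5; the function names and the in-order
loop are illustrative only and say nothing about how hardware would break
the chain up.

```python
EXC_BITS = 0x3F   # IE, DE, ZE, OE, UE, PE in bits 0-5
PE = 0x20         # precision (inexact) flag, bit 5


def retire(status, raised, mask):
    """OR one uOp's raised flags into the sticky status, then apply its mask."""
    new_status = status | (raised & EXC_BITS)
    unmasked = (new_status & ~mask & EXC_BITS) != 0
    return new_status, unmasked


def run(uops, status=0):
    """Process (raised, mask) pairs in program order."""
    results = []
    for raised, mask in uops:
        status, unmasked = retire(status, raised, mask)
        results.append(unmasked)
    return status, results


# uOp 1 raises PE while it is masked; uOp 3 raises nothing itself but
# runs with PE unmasked, so its outcome depends on uOp 1's flag.
final_status, results = run([(PE, 0x3F), (0, 0x3F), (0, 0x1F)])
print(hex(final_status), results)
```

With overwrite semantics, as for the integer flags, the third uOp would
only need the most recent producer; here it needs the OR of everything
before it.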
