Re: Approximate reciprocals

<acc5d2ac-54b0-40f2-8eb8-f47877b68cb2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24666&group=comp.arch#24666

X-Received: by 2002:a37:54a:0:b0:69a:f10c:f533 with SMTP id 71-20020a37054a000000b0069af10cf533mr8532865qkf.525.1649602411326;
Sun, 10 Apr 2022 07:53:31 -0700 (PDT)
X-Received: by 2002:a05:6808:1a21:b0:2f9:c3b2:843b with SMTP id
bk33-20020a0568081a2100b002f9c3b2843bmr3316428oib.7.1649602411028; Sun, 10
Apr 2022 07:53:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!3.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 10 Apr 2022 07:53:30 -0700 (PDT)
In-Reply-To: <STA4K.349142$Gojc.88544@fx99.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<STA4K.349142$Gojc.88544@fx99.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <acc5d2ac-54b0-40f2-8eb8-f47877b68cb2n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Sun, 10 Apr 2022 14:53:31 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 105
 by: Michael S - Sun, 10 Apr 2022 14:53 UTC

On Sunday, April 10, 2022 at 4:24:07 PM UTC+3, EricP wrote:
> Anton Ertl wrote:
> > Quadibloc <jsa...@ecn.ab.ca> writes:
> >> You have heard the crazy way *I* would have designed a processor. With separate
> >> pipelines for single precision and double precision and quad precision and one-and-a-half
> >> precision.
> >>
> >> But if one were to put a 60-bit floating-point number down the double precision pipeline,
> >> no, one would not have to drain it to change the mode. (The double precision pipeline
> >> would actually be designed for 72-bit floats, which would also use it. 80 bit temporary
> >> reals would go down the double precision pipeline, unless there was one for 96-bit
> >> floats.)
> >>
> >> So mixing precisions would make your programs go *faster*, because it would utilize
> >> the other pipelines that otherwise would not be used. The very opposite of Intel!
> >
> > In the last weeks I have noticed unusually many people posting
> > over-long lines (i.e., longer than 80 chars, and ideally you should
> > limit the lines to 70-72 chars to leave room for quoting). Is there
> > some new attack on Usenet conventions coming out from Google?
> >
> > Anyway, my guess for the reason for slow precision-setting is that
> > Intel and AMD microarchitects want the precision setting to be known
> > to the decoder, so it can deliver the precision as part of the uop.
> > This requires that when setting the precision, decoding of subsequent
> > instructions starts from scratch. An alternative would be to deliver
> > the precision as another input in the OoO engine, but that would
> > require additional resources in the OoO engine, and apparently they
> > thought that spending these resources elsewhere would buy more
> > performance.
> >
> > Concerning your separate-pipelines idea, in that setup it's even more
> > advantageous to know the precision early in instruction processing:
> > You can have separate queues/ports for the different precisions and
> > steer the instructions to these ports early, instead of having common
> > FP queues, and steering the instructions to the right pipelines only
> > when all the data (including precision) is in; ok, you could also have
> > two stages of queues, but that introduces additional complication,
> > area, and probably latency.
> >
> > Also, even if switching is fast, how frequent is code with mixed
> > precision? E.g., in DGEMM you only use double precision operations,
> > while in SGEMM you only use single-precision operations.
> >
> > Bottom line: There's a reason why Intel and AMD are designing their
> > CPUs the way they are.
> The x87 FP Control Word has flags to control
> Infinity, Rounding, Precision, and Exception masking.
>
> If you are changing some bits and not others then you have to
> store the current value, mask in your changes, and load it with FLDCW.
> That store can either be synchronous FSTCW or asynchronous FNSTCW
> (remember x87 is a separate co-processor with long running
> transcendentals, and the FSTCW is actually an assembler macro
> instruction which emits FWAIT, FNSTCW).
>
> If any of the exception mask flags change then we may need to
> synchronize with current and pending exceptions status.
> So this ties changes in the control word into current
> and future state in the status register.
>
> Then there is the issue that the new CW is coming from the
> data path so to merge the CW bits into the uOp bits in the decoder
> implies some kind of front end delay between when the FLDCW decodes
> and when the new value propagates back to decode.
>
> An alternative design merges the FPCW flags into the uOp in the
> FPU itself, but then we have to deal with the fact that FP instructions
> are launching out-of-order and we have to make sure the right set
> of flags goes to the right uOp.
> For this I would have a small set of physical FPCW registers,
> 4 should be sufficient, and a renamer for the one logical CW register.
> This makes the CW bits a uOp data dependency like other FP operands
> and it would require its own wake-up matrix and forwarding bus,
> but much simpler than the normal operand support logic.
> The then current (future) CW bits merge into the uOp when
> it is launched for execution.
>
> With this a write to the FPCW would only stall as long as it took
> the new CW value to arrive at its CW physical register or appear
> on its forwarding bus, so ideally allowing back-to-back execution.

According to my understanding, what you described as the alternative design
is exactly what Intel did starting from Yonah and up to Nehalem (of course,
excluding Bonnell).
In Sandy Bridge and later, it seems that the limit of 4 renaming registers is
gone. Probably, by now the x87 Control Word is stored in one of the big PRFs.
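
To make the renaming idea concrete, here is a minimal Python sketch of a
renamer with a small pool of physical control-word registers. The class
CwRenamer, its method names, and the pool size of 4 are illustrative
assumptions; this is a toy model, not a description of any real Intel design.

```python
class CwRenamer:
    """Toy model: one logical x87 control word mapped onto a few physical copies."""

    def __init__(self, num_physical=4, initial_cw=0x037F):
        self.values = [None] * num_physical   # physical CW registers
        self.values[0] = initial_cw
        self.current = 0                      # mapping seen by newly renamed uops
        self.free = list(range(1, num_physical))
        self.in_flight = []                   # replaced mappings, oldest first

    def fldcw(self, new_cw):
        """Rename a control-word write; return False if the front end must stall."""
        if not self.free:
            return False
        self.in_flight.append(self.current)   # older uops may still read it
        self.current = self.free.pop(0)
        self.values[self.current] = new_cw
        return True

    def issue_fp_op(self):
        """An FP uop captures the tag of the control word in effect at rename."""
        return self.current

    def retire_oldest_fldcw(self):
        """When an FLDCW retires, the mapping it replaced can be recycled."""
        self.free.append(self.in_flight.pop(0))


r = CwRenamer()
tag_a = r.issue_fp_op()          # uses the initial control word (0x037F)
assert r.fldcw(0x0F7F)           # RC=11 (round toward zero), renamed without a stall
tag_b = r.issue_fp_op()
print(hex(r.values[tag_a]), hex(r.values[tag_b]))
```

With four physical registers, up to three FLDCWs can be in flight before the
next one has to wait; a larger pool only moves that limit.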

AMD Bulldozer and Zen are similar to Sandy Bridge derivatives.

AMD K7/K8/K10 do not rename the Control Word, but somehow FLDCW is slow
only when the new value differs from the old one.

Pentium 4 went in the opposite direction. The x87 Control Word is not renamed,
but the value is predicted to be the same as before the previous FLDCW. When
the prediction fails, everything is flushed and the CPU goes through a very
slow replay. Agner says 143 clocks, but these things are hard to measure.
However, in the common scenario [of the late 90s and early 00s] software
temporarily changes the Control Word and then restores the previous value, so
this primitive prediction works well.

"Small" x86 cores (AMD Bobcat & Jaguar, Intel Silvermont and Goldmont) do not
rename the x87 Control Word and don't try to be smart about it in any way.
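
The Pentium 4 scheme above can be sketched in a few lines of Python. This is
only a model of the prediction rule as I described it (each FLDCW is predicted
to load the value that was in effect before the previous FLDCW); the function
name, the initial predictor state, and the sample control-word values are
assumptions made for illustration.

```python
def count_mispredicts(initial_cw, loads):
    """Count FLDCWs whose value differs from the one in effect before the previous FLDCW."""
    current = initial_cw
    predicted = initial_cw      # assumed prediction before any FLDCW has been seen
    misses = 0
    for new_cw in loads:
        if new_cw != predicted:
            misses += 1         # in hardware: flush and slow replay
        predicted = current     # value in effect before this FLDCW
        current = new_cw
    return misses


DEFAULT, TRUNC = 0x037F, 0x0F7F
# Typical float-to-int conversion pattern: switch to truncation, then restore.
toggling = [TRUNC, DEFAULT] * 4
print(count_mispredicts(DEFAULT, toggling))
```

In this model the change-and-restore pattern misses only on the very first
FLDCW, while code that cycles through three or more distinct values misses
every time.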

Re: Approximate reciprocals

<2022Apr10.173246@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24667&group=comp.arch#24667

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sun, 10 Apr 2022 15:32:46 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 35
Message-ID: <2022Apr10.173246@mips.complang.tuwien.ac.at>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <STA4K.349142$Gojc.88544@fx99.iad> <Y5C4K.174018$ZmJ7.153689@fx06.iad>
Injection-Info: reader02.eternal-september.org; posting-host="475276b09171678866ab4d3ac4f25c13";
logging-data="12798"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/br6b5WecOUr711Bzs1XDq"
Cancel-Lock: sha1:uCMuKCZzRG2E9lKC06oqzLhz2s4=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 10 Apr 2022 15:32 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>The implied OR of the uOp status with the current status
>creates a serial dependency that we'd need to break up.

It's not really serial. OR is associative, so you can flatten the
evaluation sequence into an arbitrarily flat tree.
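
A small sketch of that point, in Python for brevity: because OR is
associative, folding the per-uop flag words one after another gives the same
result as combining them pairwise in a tree. The function names and the sample
flag values are made up for illustration.

```python
from functools import reduce
from operator import or_


def serial_or(flags):
    """Accumulate sticky flags one instruction at a time (depth N)."""
    return reduce(or_, flags, 0)


def tree_or(flags):
    """Combine the same flags pairwise (depth about log2 N)."""
    if not flags:
        return 0
    level = list(flags)
    while len(level) > 1:
        level = [reduce(or_, level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]


# x87 status-word exception bits: IE=0x01, DE=0x02, ZE=0x04, OE=0x08, UE=0x10, PE=0x20
per_uop = [0x20, 0x00, 0x08, 0x20, 0x01]
assert serial_or(per_uop) == tree_or(per_uop) == 0x29
```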

>It is not the same as the integer status flags because those
>flags overwrite the prior values so a rename can handle them.
>
>Whereas each FP exception bit has to OR with its prior state
>and then merge with that uOp's exception mask bits to
>decide whether to throw an exception.
>So a simple renamer for the FPSW wouldn't suffice.

That's an interesting deviation from the usual OoO problems.

One thought is to maintain these bits in the in-order retirement unit.
If multiple FP instructions are retired in a cycle, it's relatively
easy to OR all their FP-exception bits and then or them with the bits
up to now. This is also where control-flow exceptions (traps) are
triggered; and if there are branch instructions that directly branch
on these bits, this is where the predictions are checked.

The only problem with this approach is an instruction that reads the
FPSW; such an instruction would have to wait until it is retired (and
therefore until all earlier instructions are retired) to produce its
result for the OoO engine. If the result is only used for a
branch, that's not so bad, though: the branch prediction is also only
checked when the branch is retired.
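
As a rough sketch of this retirement-side approach (Python, with a
hypothetical RetireUnit class and a made-up uop encoding): the sticky bits
live only in the in-order retirement stage, FP uops OR their flags in as they
retire, and a status-word read gets its value only at its own retirement.

```python
class RetireUnit:
    """Toy in-order retirement stage that owns the sticky FP flag bits."""

    def __init__(self):
        self.fpsw_flags = 0
        self.retired = 0

    def retire_cycle(self, group):
        """Retire a group of uops (given in program order) in one cycle.

        Each entry is (kind, payload): ("fp", flags) or ("read_sw", None).
        Returns the values produced by any read_sw uops in the group.
        """
        results = []
        for kind, payload in group:
            if kind == "fp":
                self.fpsw_flags |= payload
            elif kind == "read_sw":
                # Only now, with all older uops retired, is the value known.
                results.append(self.fpsw_flags)
            self.retired += 1
        return results


ru = RetireUnit()
ru.retire_cycle([("fp", 0x20), ("fp", 0x00)])
print(ru.retire_cycle([("fp", 0x08), ("read_sw", None), ("fp", 0x01)]))
```

The read in the second group sees 0x28 (the flags of everything older than
it), not the 0x01 raised by the younger uop retired in the same cycle.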

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Approximate reciprocals

<t0E4K.150546$8V_7.141296@fx04.iad>

https://www.novabbs.com/devel/article-flat.php?id=24668&group=comp.arch#24668

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx04.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <STA4K.349142$Gojc.88544@fx99.iad> <Y5C4K.174018$ZmJ7.153689@fx06.iad> <2022Apr10.173246@mips.complang.tuwien.ac.at>
In-Reply-To: <2022Apr10.173246@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 69
Message-ID: <t0E4K.150546$8V_7.141296@fx04.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 10 Apr 2022 16:58:01 UTC
Date: Sun, 10 Apr 2022 12:56:24 -0400
X-Received-Bytes: 4122
 by: EricP - Sun, 10 Apr 2022 16:56 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> The implied OR of the uOp status with the current status
>> creates a serial dependency that we'd need to break up.
>
> It's not really serial. OR is associative, so you can flatten the
> evaluation sequence into an arbitrarily flat tree.

The serialization comes in when FP instruction #1 has an
FP overflow but its uOp exception mask has overflow disabled,
and FP instruction #2 has no overflow but enables overflow exceptions.
So overflow from #1 is thrown at #2.

Ultimately there is no way to avoid this imposition of
program order on the update of the FP status word,
though there are opportunities to optimize this.
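
To illustrate the ordering problem, here is a toy Python model. It assumes
x87-like behavior in which a sticky flag that is already set is reported as
soon as an instruction runs with that exception unmasked; the function name
and the (raised, unmasked) encoding are invented for this sketch.

```python
OE = 0x08  # overflow flag bit in the x87 status word


def run_in_order(uops):
    """Each uop is (raised_flags, unmasked_bits). Return index of the first trap, or None."""
    sticky = 0
    for i, (raised, unmasked) in enumerate(uops):
        sticky |= raised
        if sticky & unmasked:
            return i
    return None


program = [(OE, 0x00),   # #1 overflows, overflow exception masked
           (0x00, OE)]   # #2 does not overflow, overflow exception unmasked
print(run_in_order(program))                  # program order
print(run_in_order(list(reversed(program))))  # what a naive out-of-order merge would see
```

In program order the trap is reported at index 1 (instruction #2); with the
two status updates merged in the opposite order no trap is seen at all, which
is why the merge has to respect program order.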

>> It is not the same as the integer status flags because those
>> flags overwrite the prior values so a rename can handle them.
>>
>> Whereas each FP exception bit has to OR with its prior state
>> and then merge with that uOp's exception mask bits to
>> decide whether to throw an exception.
>> So a simple renamer for the FPSW wouldn't suffice.
>
> That's an interesting deviation from the usual OoO problems.
>
> One thought is to maintain these bits in the in-order retirement unit.
> If multiple FP instructions are retired in a cycle, it's relatively
> easy to OR all their FP-exception bits and then or them with the bits
> up to now. This is also where control-flow exceptions (traps) are
> triggered; and if there are branch instructions that directly branch
> on these bits, this is where the predictions are checked.

The x87 FPSW also contains the 4 FP condition code bits
in addition to the pending exception status bits.
The condition bits are updated by overwrite however
and so amenable to renaming.

But ultimately they must merge with the overall FPSW in proper order.

>
> The only problem with this approach is an instruction that reads the
> FPSW; such an instruction would have to wait until it is retired (and
> therefore until all earlier instructions are retired) to produce its
> result for the OoO engine. If the result is only used for a
> branch, that's not so bad, though: the branch prediction is also only
> checked when the branch is retired.
>
> - anton

Yes, doing this all at Retire is beating-it-with-a-blunt-object.
It works but it's not very subtle as any FPSW register reads or writes
could have to drain the pipeline. Also delaying branch mispredict
detection until Retire wouldn't go over well.

A separate circular buffer, to which entries are added
in program order, to which FP uOps output their status updates,
and which drains at its own rate into a future copy of the
FPSW register, might do the trick.

Kind of like a Load-Store Queue, this is an FP Status Queue.

This makes updates of the FPSW synchronous and in-order
but independent of the central instruction retire.
Branch mispredict rollback simply resets the queue head pointer
and essentially trims all mispredicted status updates.
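
A minimal Python sketch of such an FP Status Queue follows. The class and
method names are hypothetical, and which end is called "head" is a naming
choice: here new entries are allocated at tail and drained from head, so a
rollback moves tail back.

```python
class FpStatusQueue:
    """Toy circular buffer of per-uop FP status, allocated in program order."""

    def __init__(self, size=8):
        self.slots = [None] * size   # None = status not written yet
        self.size = size
        self.head = 0                # oldest undrained entry
        self.tail = 0                # next entry to allocate
        self.count = 0
        self.fpsw = 0                # sticky flags merged so far

    def allocate(self):
        """Reserve a slot at dispatch; returns the slot index."""
        assert self.count < self.size, "queue full: dispatch must stall"
        idx = self.tail
        self.slots[idx] = None
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return idx

    def write(self, idx, flags):
        """A uop reports its status when it completes, in any order."""
        self.slots[idx] = flags

    def drain(self):
        """Merge completed entries into the FPSW, oldest first, stopping at a gap."""
        while self.count and self.slots[self.head] is not None:
            self.fpsw |= self.slots[self.head]
            self.head = (self.head + 1) % self.size
            self.count -= 1

    def rollback(self, idx):
        """Discard allocated slot idx and everything younger (branch mispredict)."""
        self.count = (idx - self.head) % self.size
        self.tail = idx


q = FpStatusQueue()
a, b, c = q.allocate(), q.allocate(), q.allocate()
q.write(b, 0x08)      # younger uop finishes first
q.drain()             # nothing merges yet: a is still outstanding
q.write(a, 0x20)
q.rollback(c)         # c was on a mispredicted path
q.drain()
print(hex(q.fpsw))
```

Draining stops at the first slot whose uop has not completed, so the merge
stays in program order even though the writes arrive out of order.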

Re: Approximate reciprocals

<2022Apr10.193504@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24669&group=comp.arch#24669

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Sun, 10 Apr 2022 17:35:04 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 64
Message-ID: <2022Apr10.193504@mips.complang.tuwien.ac.at>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <STA4K.349142$Gojc.88544@fx99.iad> <Y5C4K.174018$ZmJ7.153689@fx06.iad> <2022Apr10.173246@mips.complang.tuwien.ac.at> <t0E4K.150546$8V_7.141296@fx04.iad>
Injection-Info: reader02.eternal-september.org; posting-host="475276b09171678866ab4d3ac4f25c13";
logging-data="8198"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+E10AUYeRg4mZn3FxeQGJD"
Cancel-Lock: sha1:oW17XGBubg3NPi1dLkGooFPrEyQ=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 10 Apr 2022 17:35 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>> The implied OR of the uOp status with the current status
>>> creates a serial dependency that we'd need to break up.
>>
>> It's not really serial. OR is associative, so you can flatten the
>> evaluation sequence into an arbitrarily flat tree.
>
>The serialization comes in when FP instruction #1 has an
>FP overflow but its uOp exception mask has overflow disabled,
>and FP instruction #2 has no overflow but enables overflow exceptions.
>So overflow from #1 is thrown at #2.

You need to change the control word to change the exception mask, and
that may be a slow operation anyway. But ok, let's assume it is not
if you don't change the rounding mode. AFAIK what the masking bits do
is that they are ANDed with the actual exception bits before they are
ORed into the status word. No throwing going on, and no trapping
(control-flow exception), either. The naming of FP exceptions is
unfortunate; calling them flags would be more in line with computer
architecture terminology.

>> One thought is to maintain these bits in the in-order retirement unit.
>> If multiple FP instructions are retired in a cycle, it's relatively
>> easy to OR all their FP-exception bits and then or them with the bits
>> up to now. This is also where control-flow exceptions (traps) are
>> triggered; and if there are branch instructions that directly branch
>> on these bits, this is where the predictions are checked.
>
>The x87 FPSW also contains the 4 FP condition code bits
>in addition to the pending exception status bits.
>The condition bits are updated by overwrite however
>and so amenable to renaming.

Yes, or they could just be handled like the others.

>Yes, doing this all at Retire is beating-it-with-a-blunt-object.
>It works but it's not very subtle as any FPSW register reads or writes
>could have to drain the pipeline.

Writes would just flow into the retirement unit and do their work
there, no draining necessary.

Reading would result in the user of the value waiting for the read-sw
instruction to retire, which may appear like a significant latency of
the read-sw instruction, but not as bad as draining.

>Also delaying branch mispredict
>detection until Retire wouldn't go over well.

I don't think that's a big problem.

1) Few branches depend on the FP status.

2) Branch predictors work very well these days.

3) How many cycles early can you resolve the branch misprediction
anyway?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Approximate reciprocals

<e0808a3e-9c14-4f4d-84d2-4bcc5774d817n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24670&group=comp.arch#24670

X-Received: by 2002:a37:9243:0:b0:69b:6009:856d with SMTP id u64-20020a379243000000b0069b6009856dmr8736227qkd.274.1649615014539;
Sun, 10 Apr 2022 11:23:34 -0700 (PDT)
X-Received: by 2002:a05:6808:2018:b0:2ec:c22b:15b8 with SMTP id
q24-20020a056808201800b002ecc22b15b8mr3682901oiw.136.1649615014337; Sun, 10
Apr 2022 11:23:34 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!2.eu.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 10 Apr 2022 11:23:34 -0700 (PDT)
In-Reply-To: <t0E4K.150546$8V_7.141296@fx04.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6882:41be:f1d0:bc81;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6882:41be:f1d0:bc81
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<STA4K.349142$Gojc.88544@fx99.iad> <Y5C4K.174018$ZmJ7.153689@fx06.iad>
<2022Apr10.173246@mips.complang.tuwien.ac.at> <t0E4K.150546$8V_7.141296@fx04.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e0808a3e-9c14-4f4d-84d2-4bcc5774d817n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 10 Apr 2022 18:23:34 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 90
 by: MitchAlsup - Sun, 10 Apr 2022 18:23 UTC

On Sunday, April 10, 2022 at 11:58:05 AM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> The implied OR of the uOp status with the current status
> >> creates a serial dependency that we'd need to break up.
> >
> > It's not really serial. OR is associative, so you can flatten the
> > evaluation sequence into an arbitrarily flat tree.
> The serialization comes in when FP instruction #1 has an
> FP overflow but its uOp exception mask has overflow disabled,
> and FP instruction #2 has no overflow but enables overflow exceptions.
> So overflow from #1 is thrown at #2.
<
Exception IP points not at the instruction that caused the FP exception
but at the instruction that re-enabled the acceptance of the already
raised exception. And thus is imprecise.
>
> Ultimately there is no way to avoid this imposition of
> program order on the update of the FP status word,
> though there are opportunities to optimize this.
<
Such as an exception re-order-buffer--or a handful of bits in the
regular re-order buffer.
<
> >> It is not the same as the integer status flags because those
> >> flags overwrite the prior values so a rename can handle them.
> >>
> >> Whereas each FP exception bit has to OR with its prior state
> >> and then merge with that uOp's exception mask bits to
> >> decide whether to throw an exception.
> >> So a simple renamer for the FPSW wouldn't suffice.
> >
> > That's an interesting deviation from the usual OoO problems.
> >
> > One thought is to maintain these bits in the in-order retirement unit.
> > If multiple FP instructions are retired in a cycle, it's relatively
> > easy to OR all their FP-exception bits and then or them with the bits
> > up to now. This is also where control-flow exceptions (traps) are
> > triggered; and if there are branch instructions that directly branch
> > on these bits, this is where the predictions are checked.
> The x87 FPSW also contains the 4 FP condition code bits
> in addition to the pending exception status bits.
> The condition bits are updated by overwrite however
> and so amenable to renaming.
<
The enabled flags are also amenable to renaming.
And so are the accumulated flags--as long as they are updated in
program order as instructions retire. This causes a visible latency
effect when one reads the bits dependent upon retirement.
>
> But ultimately they must merge with the overall FPSW in proper order.
<
If one HAS a FPSW.
> >
> > The only problem with this approach is an instruction that reads the
> > FPSW; such an instruction would have to wait until it is retired (and
> > therefore until all earlier instructions are retired) to produce its
> > result for the OoO engine. If the result is only used for a
> > branch, that's not so bad, though: the branch prediction is also only
> > checked when the branch is retired.
> >
> > - anton
> Yes, doing this all at Retire is beating-it-with-a-blunt-object.
<
That is how IEEE defined it. You take the good with the bad.
<
> It works but it's not very subtle as any FPSW register reads or writes
> could have to drain the pipeline. Also delaying branch mispredict
> detection until Retire wouldn't go over well.
<
Prediction of the flags is rather easy, and one only has to become
dependent when an instruction reads or writes the flags explicitly.
The vast majority of instructions do not, and so most codes do not
suffer.
>
> A separate circular buffer, to which entries are added
> in program order, to which FP uOps output their status updates,
> and which drains at its own rate into a future copy of the
> FPSW register, might do the trick.
>
> Kind of like a Load-Store Queue, this is an FP Status Queue.
<
I always just lobbed this into the re-order buffer (or similar.)
>
> This makes updates of the FPSW synchronous and in-order
> but independent of the central instruction retire.
> Branch mispredict rollback simply resets the queue head pointer
> and essentially trims all mispredicted status updates.
<
Once again, brain dead easy to predict. Prediction may be dependent
on the current bit pattern in FPSW (if present).

Re: Approximate reciprocals

<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24671&group=comp.arch#24671
X-Received: by 2002:ac8:7f82:0:b0:2e1:caba:ad6e with SMTP id z2-20020ac87f82000000b002e1cabaad6emr23759038qtj.190.1649635784160;
Sun, 10 Apr 2022 17:09:44 -0700 (PDT)
X-Received: by 2002:a9d:4e99:0:b0:5b2:54f4:75e7 with SMTP id
v25-20020a9d4e99000000b005b254f475e7mr10364982otk.94.1649635783884; Sun, 10
Apr 2022 17:09:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 10 Apr 2022 17:09:43 -0700 (PDT)
In-Reply-To: <2022Apr10.103214@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:6947:3c86:73e1:a64e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:6947:3c86:73e1:a64e
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 11 Apr 2022 00:09:44 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 28
 by: Quadibloc - Mon, 11 Apr 2022 00:09 UTC

On Sunday, April 10, 2022 at 2:53:32 AM UTC-6, Anton Ertl wrote:

> Anyway, my guess for the reason for slow precision-setting is that
> Intel and AMD microarchitects want the precision setting to be known
> to the decoder, so it can deliver the precision as part of the uop.

Well, of course they would want _that_.

So, clearly, it's my fault for being insufficiently familiar with x86
machine language... I knew that the x87 worked with a stack,
and it wasn't until much later that more sensible non-vector float
instructions were added to the architecture as part of a set of
vector extensions... but I did not realize that the x86 architecture
worked with generic floating-point instructions, and changed their
meaning by setting some precision register.

I had assumed, instead, that like all sensible machines, the x86
would have had a Multiply Floating instruction and a separate
Multiply Double instruction. And Multiply Quad and Multiply
Temporary Real too, if needed. So if we include integer math, we get
the following set of instructions: Add Byte, Add Halfword, Add,
Add Long, Add Floating, Add Double, Add Temporary Real, Add Quad.

But then, x86 isn't even big-endian, so I guess one can't really
expect it to closely hew to the standards set by the One
Sensible Architecture as was used in Incredibly Big
Machines.

John Savard

Re: Approximate reciprocals

<a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24672&group=comp.arch#24672
X-Received: by 2002:a05:620a:2946:b0:67b:3047:6d9d with SMTP id n6-20020a05620a294600b0067b30476d9dmr20016869qkp.691.1649636177504;
Sun, 10 Apr 2022 17:16:17 -0700 (PDT)
X-Received: by 2002:a9d:644b:0:b0:5cd:a627:c439 with SMTP id
m11-20020a9d644b000000b005cda627c439mr10257983otl.112.1649636177286; Sun, 10
Apr 2022 17:16:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 10 Apr 2022 17:16:17 -0700 (PDT)
In-Reply-To: <66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6882:41be:f1d0:bc81;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6882:41be:f1d0:bc81
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 11 Apr 2022 00:16:17 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 31
 by: MitchAlsup - Mon, 11 Apr 2022 00:16 UTC

On Sunday, April 10, 2022 at 7:09:45 PM UTC-5, Quadibloc wrote:
> On Sunday, April 10, 2022 at 2:53:32 AM UTC-6, Anton Ertl wrote:
>
> > Anyway, my guess for the reason for slow precision-setting is that
> > Intel and AMD microarchitects want the precision setting to be known
> > to the decoder, so it can deliver the precision as part of the uop.
> Well, of course they would want _that_.
>
> So, clearly, it's my fault for being insufficiently familiar with x86
> machine language... I knew that the x87 worked with a stack,
> and it wasn't until much later that more sensible non-vector float
> instructions were added to the architecture as part of a set of
> vector extensions... but I did not realize that the x86 architecture
> worked with generic floating-point instructions, and changed their
> meaning by setting some precision register.
>
> I had assumed, instead, that like all sensible machines, the x86
> would have had a Multiply Floating instruction and a separate
> Multiply Double instruction. And Multiply Quad and Multiply
> Temporary Real too, if needed. So if we include integer math, we get
> the following set of instructions: Add Byte, Add Halfword, Add,
> Add Long, Add Floating, Add Double, Add Temporary Real, Add Quad.
>
> But then, x86 isn't even big-endian, so I guess one can't really
> expect it to closely hew to the standards set by the One
> Sensible Architecture as was used in Incredibly Big
> Machines.
<
I think this is the first time I have seen x86 and sensible in the same
sentence.
>
> John Savard

Re: Approximate reciprocals

<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24673&group=comp.arch#24673
X-Received: by 2002:a05:620a:2954:b0:699:c4b2:48f7 with SMTP id n20-20020a05620a295400b00699c4b248f7mr20275562qkp.706.1649655475500;
Sun, 10 Apr 2022 22:37:55 -0700 (PDT)
X-Received: by 2002:a05:6870:a2d0:b0:d9:ae66:b8e2 with SMTP id
w16-20020a056870a2d000b000d9ae66b8e2mr13092959oak.7.1649655475223; Sun, 10
Apr 2022 22:37:55 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 10 Apr 2022 22:37:55 -0700 (PDT)
In-Reply-To: <a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:6947:3c86:73e1:a64e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:6947:3c86:73e1:a64e
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com> <a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
Subject: Re: Approximate reciprocals
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 11 Apr 2022 05:37:55 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 19
 by: Quadibloc - Mon, 11 Apr 2022 05:37 UTC

On Sunday, April 10, 2022 at 6:16:18 PM UTC-6, MitchAlsup wrote:
> On Sunday, April 10, 2022 at 7:09:45 PM UTC-5, Quadibloc wrote:

> > But then, x86 isn't even big-endian, so I guess one can't really
> > expect it to closely hew to the standards set by the One
> > Sensible Architecture as was used in Incredibly Big
> > Machines.

> I think this is the first time I have seen x86 and sensible in the same
> sentence.

Good one! But that would be in the direction of _agreement_ with my
point of view, with which you are no doubt *not* totally in agreement.
My point of view expressed here being: the x86 is _not_ sensible, but
the IBM System/360 not only _is_ sensible, but is the very paradigm
of a sensible computer architecture.

Or should we say, or sing, that it is the very _model_ of a...

John Savard

Re: Approximate reciprocals

<2022Apr11.090246@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=24674&group=comp.arch#24674
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 07:02:46 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 57
Message-ID: <2022Apr11.090246@mips.complang.tuwien.ac.at>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="ce035ca74fd4b146be216bf8e6b1a182";
logging-data="23946"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+WOGqIMkicdhE3OzxkktwY"
Cancel-Lock: sha1:pjkmZIBWeENfwtq6O84NfhNyvC8=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Mon, 11 Apr 2022 07:02 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>On Sunday, April 10, 2022 at 2:53:32 AM UTC-6, Anton Ertl wrote:
>
>> Anyway, my guess for the reason for slow precision-setting is that
>> Intel and AMD microarchitects want the precision setting to be known
>> to the decoder, so it can deliver the precision as part of the uop.
>
>Well, of course they would want _that_.
>
>So, clearly, it's my fault for being insufficiently familiar with x86
>machine language... I knew that the x87 worked with a stack,
>and it wasn't until much later that more sensible non-vector float
>instructions were added to the architecture as part of a set of
>vector extensions... but I did not realize that the x86 architecture
>worked with generic floating-point instructions, and changed their
>meaning by setting some precision register.

Concerning rounding modes, this approach has been standardized by
IEEE. For precision, the 8087 ff. took the same approach. I think
the idea was that you would almost always use the 64-bit mantissa,
and only use precision control in exceptional circumstances.
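
To illustrate what a precision-control setting does to a result, here is a rough Python sketch that rounds a value to a given number of significand bits with round-to-nearest-even. The function name is hypothetical. Python floats carry only 53 significand bits, so the 64-bit setting cannot be modelled here, and the unchanged exponent range of the x87 is ignored.

```python
import math

def round_to_precision(x, bits):
    """Round x to `bits` significand bits (1 <= bits <= 53), nearest-even."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)        # x == m * 2**e with 0.5 <= |m| < 1
    scaled = m * (1 << bits)    # scaling by a power of two is exact
    # round() on a float rounds halfway cases to the even integer
    return math.ldexp(round(scaled), e - bits)

x = 1.0 + 2.0 ** -30
single = round_to_precision(x, 24)   # the small term is rounded away
double = round_to_precision(x, 53)   # the small term survives
```

The same arithmetic instruction thus yields different results depending on mode state, which is the property being discussed.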

>So if we include integer math, we get
>the following set of instructions: Add Byte, Add Halfword, Add,
>Add Long,

"Halfword" and "long" are terms with architecture-specific meanings
(if any), but I understand that you want to have add instructions for
different lengths. IA-32 and AMD64 have these, but the 8-bit and
16-bit variants require partial register updates, which cause
performance problems and require extra area for mitigating them.
AMD64 defined the 32-bit variant as zero-extending instruction (i.e.,
with a full-width result), which eliminated the partial-register
update problem. Load/Store architectures only have full-width ALU
operations. There is no advantage to partial-width operations and
little to extending variants. One issue is adding a sign- or
zero-extended 32-bit value to a 64-bit address (thanks to the I32LP64
(and IL32LLP64) brain-damage), and Aarch64 has addressing modes for
that; if you have sign- and zero-extending 32-bit adds, you don't need
these, but no architecture has both.
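
The partial-register point can be sketched in Python as a model of what ends up in a 64-bit register after a narrower result is written. The function name is hypothetical and the high-byte registers (AH and friends) are left out.

```python
MASK64 = (1 << 64) - 1

def write_reg(old, value, width):
    """Model the 64-bit register contents after a `width`-bit result is written."""
    if width == 64:
        return value & MASK64
    if width == 32:
        # AMD64: a 32-bit result is zero-extended, so `old` is not needed.
        return value & 0xFFFF_FFFF
    if width in (8, 16):
        # Partial update: the upper bits survive, so the new register value
        # depends on the old one (an extra input for the renamer).
        low = (1 << width) - 1
        return (old & ~low & MASK64) | (value & low)
    raise ValueError(width)

old = 0x1122_3344_5566_7788
after_8 = write_reg(old, 0xAB, 8)
after_32 = write_reg(old, 0xDEAD_BEEF, 32)
```

Only the 8-bit and 16-bit cases read `old`; that dependency on the previous register contents is the source of the performance problem.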

>But then, x86 isn't even big-endian, so I guess one can't really
>expect it to closely hew to the standards set by the One
>Sensible Architecture as was used in Incredibly Big
>Machines.

Maybe you should read about where the term "big-endian" comes from to
understand how ridiculous this statement is.

And given that we started with FP operations, you are probably the
only one (including everyone from IBM) who thinks that the S/360 is a
standard to be closely followed.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Approximate reciprocals

<t30r8g$48q$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=24675&group=comp.arch#24675
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 02:09:35 -0700
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <t30r8g$48q$1@dont-email.me>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
<2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
<2022Apr11.090246@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 11 Apr 2022 09:09:36 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="37e0c037c1894dab03ed1101d33a8ccf";
logging-data="4378"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19dJPKiECSH8+//gs8nnNq2"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:CwHD6954nzxisIU6Xv4qFUf/HPU=
In-Reply-To: <2022Apr11.090246@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Ivan Godard - Mon, 11 Apr 2022 09:09 UTC

On 4/11/2022 12:02 AM, Anton Ertl wrote:
> Quadibloc <jsavard@ecn.ab.ca> writes:
>> On Sunday, April 10, 2022 at 2:53:32 AM UTC-6, Anton Ertl wrote:
>>
>>> Anyway, my guess for the reason for slow precision-setting is that
>>> Intel and AMD microarchitects want the precision setting to be known
>>> to the decoder, so it can deliver the precision as part of the uop.
>>
>> Well, of course they would want _that_.
>>
>> So, clearly, it's my fault for being insufficiently familiar with x86
>> machine language... I knew that the x87 worked with a stack,
>> and it wasn't until much later that more sensible non-vector float
>> instructions were added to the architecture as part of a set of
>> vector extensions... but I did not realize that the x86 architecture
>> worked with generic floating-point instructions, and changed their
>> meaning by setting some precision register.
>
> Concerning rounding modes, this approach has been standardized by
> IEEE. For precision, the 8087 ff. took the same approach. I think
> the idea was that you would almost always use the 64-bit mantissa,
> and only use precision control in exceptional circumstances.
>
>> So if we include integer math, we get
>> the following set of instructions: Add Byte, Add Halfword, Add,
>> Add Long,
>
> "Halfword" and "long" are terms with architecture-specific meanings
> (if any), but I understand that you want to have add instructions for
> different lengths. IA-32 and AMD64 have these, but the 8-bit and
> 16-bit variants require partial register updates, which cause
> performance problems and require extra area for mitigating them.
> AMD64 defined the 32-bit variant as zero-extending instruction (i.e.,
> with a full-width result), which eliminated the partial-register
> update problem. Load/Store architectures only have full-width ALU
> operations. There is no advantage to partial-width operations and
> little to extending variants. One issue is adding a sign- or
> zero-extended 32-bit value to a 64-bit address (thanks to the I32LP64
> (and IL32LLP64) brain-damage), and Aarch64 has addressing modes for
> that; if you have sign- and zero-extending 32-bit adds, you don't need
> these, but no architecture has both.

While widening to a canonical integral width is widespread and is
blessed by C promotion rules, there are uses for exact-width integral
arithmetic: saturating arithmetic, excepting arithmetic that faults on
overflow, and wraparound semantics. Another issue is the disparity
between scalar (with widening semantics) and SIMD lane (with exact width
semantics) causing trouble in auto-vectorization.

These are admittedly specialized usages.
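
A small Python sketch of the three exact-width behaviours listed above, for signed values. The function names and the 8-bit default are only illustrative.

```python
def _bounds(bits):
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def add_wrap(a, b, bits=8):
    """Two's-complement add that wraps around at exactly `bits` bits."""
    r = (a + b) & ((1 << bits) - 1)
    return r - (1 << bits) if r >> (bits - 1) else r

def add_saturate(a, b, bits=8):
    """Add that clamps to the representable range instead of wrapping."""
    lo, hi = _bounds(bits)
    return max(lo, min(hi, a + b))

def add_trap(a, b, bits=8):
    """Add that faults (here: raises) when the exact result does not fit."""
    lo, hi = _bounds(bits)
    r = a + b
    if not lo <= r <= hi:
        raise OverflowError(f"{a} + {b} does not fit in {bits} bits")
    return r
```

A widening add followed by a later truncation reproduces the wrapping case, but the other two need to know the operand width at the point of the operation.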

Re: Approximate reciprocals

<t30unk$oih$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=24676&group=comp.arch#24676
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-3c49-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 10:08:52 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t30unk$oih$1@newsreader4.netcologne.de>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
<2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
<a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
Injection-Date: Mon, 11 Apr 2022 10:08:52 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-3c49-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:3c49:0:7285:c2ff:fe6c:992d";
logging-data="25169"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 11 Apr 2022 10:08 UTC

Quadibloc <jsavard@ecn.ab.ca> schrieb:

> My point of view expressed here being: the x86 is _not_ sensible, but
> the IBM System/360 not only _is_ sensible, but is the very paradigm
> of a sensible computer architecture.

Which is silly, unless your definition of sensible differs from
everybody else's. Floating point was a huge regression vs. the 704,
but there were also other criticisms. John Levine posted a list
recently, which I would have to dig up (using a register instead of
pc-relative addressing and the restriction to 4095 bytes for offsets
were among its faults).

They did groundbreaking work with the /360 design, but people
certainly have learned since then (and also from the mistakes they
made at the time).

Re: Approximate reciprocals

<t312ht$eig$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24677&group=comp.arch#24677
Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 13:14:03 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t312ht$eig$1@gioia.aioe.org>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
<2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
<a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
<t30unk$oih$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="14928"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Mon, 11 Apr 2022 11:14 UTC

Thomas Koenig wrote:
> Quadibloc <jsavard@ecn.ab.ca> schrieb:
>
>> My point of view expressed here being: the x86 is _not_ sensible, but
>> the IBM System/360 not only _is_ sensible, but is the very paradigm
>> of a sensible computer architecture.
>
> Which is silly, unless your definition of sensible differs from
> everybody else's. Floating point was a huge regression vs. the 704,
> but there were also other criticisms. John Levine posted a list
> recently, which I would have to dig up (using a register instead of
> pc-relative addressing and the restriction to 4095 bytes for offsets
> were among its faults).
>
> They did groundbreaking work with the /360 design, but people
> certainly have learned since then (and also from the mistakes they
> made at the time).

It is relatively easy to see the most major 360 mistakes:

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<t312pr$j34$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24678&group=comp.arch#24678
Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 13:18:17 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t312pr$j34$1@gioia.aioe.org>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
<2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
<a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
<t30unk$oih$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="19556"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Mon, 11 Apr 2022 11:18 UTC

Thomas Koenig wrote:
> Quadibloc <jsavard@ecn.ab.ca> schrieb:
>
>> My point of view expressed here being: the x86 is _not_ sensible, but
>> the IBM System/360 not only _is_ sensible, but is the very paradigm
>> of a sensible computer architecture.
>
> Which is silly, unless your definition of sensible differs from
> everybody else's. Floating point was a huge regression vs. the 704,
> but there were also other criticisms. John Levine posted a list
> recently, which I would have to dig up (using a register instead of
> pc-relative addressing and the restriction to 4095 bytes for offsets
> were among its faults).
>
> They did groundbreaking work with the /360 design, but people
> certainly have learned since then (and also from the mistakes they
> made at the time).
>
It is easy to see that they did make several mistakes, not just the
12-bit max address offsets and the hex FP, just look at all the stuff
they have modified/added alternative options on later versions of the
architecture.

24-bit addressing, which ignored the upper byte, is probably the most
obvious of those; it caused a major flag day when they went to 31-bit.
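
Why ignoring the upper byte led to a flag day can be shown with a short Python sketch. The function name and the flag value stored in the top byte are hypothetical; the point is only that software which parks data in the ignored byte stops working once more address bits become significant.

```python
ADDR24 = 0x00FF_FFFF   # 24-bit mode: the upper byte of the word is ignored
ADDR31 = 0x7FFF_FFFF   # 31-bit mode: only the top bit is not part of the address

def effective_address(word, mode_bits):
    """Address the hardware would use for a 32-bit address word."""
    if mode_bits == 24:
        return word & ADDR24
    if mode_bits == 31:
        return word & ADDR31
    raise ValueError(mode_bits)

# A program stores a (hypothetical) flag byte above a 24-bit address.
word = (0x40 << 24) | 0x012345

ok = effective_address(word, 24)       # flag byte is silently dropped
broken = effective_address(word, 31)   # flag byte now changes the address
```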

OTOH, they have been extremely good at providing user program and mostly
also OS backwards compatibility, but x86 has been at least as good here.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<4472a03e-dfce-46e0-97be-e272e93c900cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24679&group=comp.arch#24679
X-Received: by 2002:a05:6214:1cc4:b0:435:b8a0:1fe9 with SMTP id g4-20020a0562141cc400b00435b8a01fe9mr26937433qvd.54.1649678869715;
Mon, 11 Apr 2022 05:07:49 -0700 (PDT)
X-Received: by 2002:a05:6870:c595:b0:da:4ea1:991f with SMTP id
ba21-20020a056870c59500b000da4ea1991fmr14449432oab.147.1649678869460; Mon, 11
Apr 2022 05:07:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 05:07:49 -0700 (PDT)
In-Reply-To: <2022Apr11.090246@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:e5f3:8d5c:3c51:eff5;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:e5f3:8d5c:3c51:eff5
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com> <2022Apr11.090246@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4472a03e-dfce-46e0-97be-e272e93c900cn@googlegroups.com>
Subject: Re: Approximate reciprocals
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 11 Apr 2022 12:07:49 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 11
 by: Quadibloc - Mon, 11 Apr 2022 12:07 UTC

On Monday, April 11, 2022 at 1:30:07 AM UTC-6, Anton Ertl wrote:

> And given that we started with FP operations, you are probably the
> only one (including everyone from IBM) who thinks that the S/360 is a
> standard to be closely followed.

Oh, it's true that not everything about it is perfect, and its floating-point
format is one of those things that is flawed. But having separate
instructions for each operand type, unless you're really tight for opcode
space, _is_ a very reasonable decision.

John Savard

Re: Approximate reciprocals

<45f4ff49-9fd0-4c76-8e56-c53b383ca143n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24680&group=comp.arch#24680
X-Received: by 2002:a05:620a:29cb:b0:699:fee3:265a with SMTP id s11-20020a05620a29cb00b00699fee3265amr13539408qkp.513.1649679019360;
Mon, 11 Apr 2022 05:10:19 -0700 (PDT)
X-Received: by 2002:a05:6808:3009:b0:2f9:6119:d676 with SMTP id
ay9-20020a056808300900b002f96119d676mr4526501oib.205.1649679018945; Mon, 11
Apr 2022 05:10:18 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 05:10:18 -0700 (PDT)
In-Reply-To: <t312pr$j34$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:e5f3:8d5c:3c51:eff5;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:e5f3:8d5c:3c51:eff5
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com> <a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com> <t30unk$oih$1@newsreader4.netcologne.de>
<t312pr$j34$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <45f4ff49-9fd0-4c76-8e56-c53b383ca143n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 11 Apr 2022 12:10:19 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 11
 by: Quadibloc - Mon, 11 Apr 2022 12:10 UTC

On Monday, April 11, 2022 at 5:18:22 AM UTC-6, Terje Mathisen wrote:

> It is easy to see that they did make several mistakes, not just the
> 12-bit max address offsets and the hex FP, just look at all the stuff
> they have modified/added alternative options on later versions of the
> architecture.

Oh, yes. But I look at the half of the glass that is full, and find *that* to
be so far ahead of x86 that there is no comparison. I did not mean to
imply there was anything of value in the half of the glass that is empty.

John Savard

Re: Approximate reciprocals

<SaX4K.508430$Rza5.413752@fx47.iad>

https://www.novabbs.com/devel/article-flat.php?id=24681&group=comp.arch#24681
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx47.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <STA4K.349142$Gojc.88544@fx99.iad> <Y5C4K.174018$ZmJ7.153689@fx06.iad> <2022Apr10.173246@mips.complang.tuwien.ac.at> <t0E4K.150546$8V_7.141296@fx04.iad> <2022Apr10.193504@mips.complang.tuwien.ac.at>
In-Reply-To: <2022Apr10.193504@mips.complang.tuwien.ac.at>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 118
Message-ID: <SaX4K.508430$Rza5.413752@fx47.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 11 Apr 2022 14:46:10 UTC
Date: Mon, 11 Apr 2022 10:46:01 -0400
X-Received-Bytes: 6312
 by: EricP - Mon, 11 Apr 2022 14:46 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Anton Ertl wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>> The implied OR of the uOp status with the current status
>>>> creates a serial dependency that we'd need to break up.
>>> It's not really serial. OR is associative, so you can flatten the
>>> evaluation sequence into an arbitrarily flat tree.
>> The serialization comes in when FP instruction #1 has an
>> FP overflow but its uOp exception mask has overflow disabled,
>> and FP instruction #2 has no overflow but enables overflow exceptions.
>> So overflow from #1 is thrown at #2.
>
> You need to change the control word to change the exception mask, and
> that may be a slow operation anyway. But ok, let's assume it is not
> if you don't change the rounding mode.

The CW renaming allows all x87 control bits to change quickly.
Rounding bits would merge with the uOp on launch too.

SSE complicates its situation by putting the control bits and
status flags in the same MXCSR register rather than in two.
The implication is that changing the rounding bits requires
reading MXCSR, which requires syncing the status flags.
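
As a rough illustration of that dependency, the sketch below models
MXCSR as a plain 32-bit word in C. The bit positions follow the
documented MXCSR layout; the helper names mxcsr_set_rounding and
mxcsr_accumulate are made up for this example, and nothing here
touches the real register.

```c
#include <stdint.h>

/* Documented MXCSR layout: sticky exception flags in bits 0..5,
   exception masks in bits 7..12, rounding control in bits 13..14. */
#define MXCSR_FLAGS_MASK 0x003Fu
#define MXCSR_RC_SHIFT   13
#define MXCSR_RC_MASK    (0x3u << MXCSR_RC_SHIFT)

/* Hypothetical helper: change only the rounding control field.
   This is a read-modify-write of the whole word, so the result
   depends on the sticky flags left by every older FP operation. */
uint32_t mxcsr_set_rounding(uint32_t mxcsr, unsigned rc)
{
    uint32_t field = (uint32_t)(rc & 0x3u) << MXCSR_RC_SHIFT;
    return (mxcsr & ~MXCSR_RC_MASK) | field;
}

/* Hypothetical helper: an FP operation ORs its flags into the same word. */
uint32_t mxcsr_accumulate(uint32_t mxcsr, uint32_t op_flags)
{
    return mxcsr | (op_flags & MXCSR_FLAGS_MASK);
}
```

Because both helpers read and write the same word, a rounding change
cannot be computed until the flags of all older operations are known,
unless the hardware splits or renames the fields internally.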

> AFAIK what the masking bits do
> is that they are ANDed with the actual exception bits before they are
> ORed into the status word. No throwing going on, and no trapping
> (control-flow exception), either. The naming of FP exceptions is
> unfortunate; calling them flags would be more in line with computer
> architecture terminology.

The operation's status flags are ORed with the status register flags
and written back to the status register.
The status register is then ANDed with the control word mask, and if
any enabled bit is set an exception is generated at that point.
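
A minimal C sketch of that rule, under simplifying assumptions: the
type fp_state and the function fp_apply are hypothetical names, and
the "enabled" field uses 1 to mean "trap on this flag", which is the
opposite polarity from the mask bits in the real x87 and SSE control
words.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only, not a description of any real pipeline. */
typedef struct {
    uint8_t status;   /* sticky exception flags */
    uint8_t enabled;  /* 1 = this exception traps when its flag is set */
} fp_state;

/* Apply one operation's flags in program order. Returns true if an
   exception is signalled at this operation. */
bool fp_apply(fp_state *s, uint8_t op_flags)
{
    s->status |= op_flags;                 /* OR into the sticky flags */
    return (s->status & s->enabled) != 0;  /* any enabled flag pending? */
}
```

With this model, an overflow recorded while overflow is disabled stays
pending in status, and the first operation applied after overflow is
enabled reports the exception even if it did not overflow itself.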

Note that unlike integer exceptions which do not modify the
destination register or integer status codes if they occur,
FP exceptions may or may not update the dest register with a
result or substitute a default alternative result,
and update the FP status word state before signalling
the exception at the point where it was detected,
depending on the condition detected.

>>> One thought is to maintain these bits in the in-order retirement unit.
>>> If multiple FP instructions are retired in a cycle, it's relatively
>>> easy to OR all their FP-exception bits and then or them with the bits
>>> up to now. This is also where control-flow exceptions (traps) are
>>> triggered; and if there are branch instructions that directly branch
>>> on these bits, this is where the predictions are checked.
>> The x87 FPSW also contains the 4 FP condition code bits
>> in addition to the pending exception status bits.
>> The condition bits are updated by overwrite however
>> and so amenable to renaming.
>
> Yes, or they could just be handled like the others.

If they are handled like others and the status word was updated
only at Retire it would cause FP conditional branch mispredict
detection to be delayed until Retire.

Which is why I'm singling them out as worthy of special handling.

>> Yes, doing this all at Retire is beating-it-with-a-blunt-object.
>> It works but its not very subtle as any FPSW register reads or writes
>> could have to drain the pipeline.
>
> Writes would just flow into the retirement unit and do their work
> there, no draining necessary.
>
> Reading would result in the user of the value waiting for the read-sw
> instruction to retire, which may appear like a significant latency of
> the read-sw instruction, but not as bad as draining.

Yes, and that's why I'm looking for ways to go faster than that.

Reading the status doesn't have to wait until retire if
it maintains a coherent "future" copy of the FPSW.
This future copy is coherent when all older FP instructions
have executed and applied their status to it.
Later Retire repeats the status updates to the Committed FPSW.

This allows the FP status word to be read or written ASAP
and not wait until older operations retire.
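
A small C model of the idea, offered as a sketch rather than a
hardware design. All names (fpsw_model, op_execute, op_retire,
future_read) and the window size are assumptions made for the
example. The future value is computed on demand here instead of being
stored, so that flags from younger operations cannot leak into it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 8  /* illustrative number of in-flight FP ops */

typedef struct {
    uint8_t flags[WINDOW];     /* status flags produced by each op */
    bool    executed[WINDOW];  /* has the op executed yet? */
    int     retired;           /* ops [0, retired) have retired in order */
    uint8_t committed;         /* Committed FPSW, updated only at Retire */
} fpsw_model;

/* An op finishes execution, possibly out of order. */
void op_execute(fpsw_model *m, int idx, uint8_t flags)
{
    assert(idx >= m->retired && idx < WINDOW);
    m->flags[idx] = flags;
    m->executed[idx] = true;
}

/* Retire the oldest op, repeating its update on the Committed FPSW. */
void op_retire(fpsw_model *m)
{
    assert(m->retired < WINDOW && m->executed[m->retired]);
    m->committed |= m->flags[m->retired];
    m->retired++;
}

/* A status read at program position 'pos' can complete as soon as
   every older op has executed; it need not wait for them to retire. */
bool future_read(const fpsw_model *m, int pos, uint8_t *out)
{
    uint8_t value = m->committed;
    for (int i = m->retired; i < pos; ++i) {
        if (!m->executed[i])
            return false;      /* future copy is not coherent yet */
        value |= m->flags[i];
    }
    *out = value;
    return true;
}
```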

>> Also delaying branch mispredict
>> detection until Retire wouldn't go over well.
>
> I don't think that's a big problem.
>
> 1) Few branches depend on the FP status.
>
> 2) Branch predictors work very well these days.
>
> 3) How many cycles early can you resolve the branch misprediction
> anyway?

A large part of branch mispredict handling is the canceling of
younger in-flight instructions while keeping older ones,
and the recovery of their resources.

This allows the wrong path to be trimmed and correct path
fetched, parsed and even executed long before the mispredicted
branch reaches Retire.

All that checkpoint and rollback logic is in place already for
integer branches so invoking it for FP is close to zero cost.

One question is whether it is worth doing similar optimization
for exceptions. I think that since exceptions are exceptional,
those can be left until Retire but with a prefetch of the handler
code into the Fetch buffer.
Note though that Mitch says the Motorola 88110 was doing this
optimization in 1992, 30 years ago.

Re: Approximate reciprocals

<RaX4K.508429$Rza5.439882@fx47.iad>

https://www.novabbs.com/devel/article-flat.php?id=24682&group=comp.arch#24682

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder5.feed.usenet.farm!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx47.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk> <tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com> <5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at> <STA4K.349142$Gojc.88544@fx99.iad> <Y5C4K.174018$ZmJ7.153689@fx06.iad> <2022Apr10.173246@mips.complang.tuwien.ac.at> <t0E4K.150546$8V_7.141296@fx04.iad> <e0808a3e-9c14-4f4d-84d2-4bcc5774d817n@googlegroups.com>
In-Reply-To: <e0808a3e-9c14-4f4d-84d2-4bcc5774d817n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 126
Message-ID: <RaX4K.508429$Rza5.439882@fx47.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 11 Apr 2022 14:46:09 UTC
Date: Mon, 11 Apr 2022 10:40:34 -0400
X-Received-Bytes: 6949
 by: EricP - Mon, 11 Apr 2022 14:40 UTC

MitchAlsup wrote:
> On Sunday, April 10, 2022 at 11:58:05 AM UTC-5, EricP wrote:
>> Anton Ertl wrote:
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> The implied OR of the uOp status with the current status
>>>> creates a serial dependency that we'd need to break up.
>>> It's not really serial. OR is associative, so you can flatten the
>>> evaluation sequence into an arbitrarily flat tree.
>> The serialization comes in when FP instruction #1 has an
>> FP overflow but its uOp exception mask has overflow disabled,
>> and FP instruction #2 has no overflow but enables overflow exceptions.
>> So overflow from #1 is thrown at #2.
> <
> Exception IP points at instruction that did not cause FP exception
> but at the instruction that re-enabled the acceptance of the already
> raised exception. And thus is imprecise.

No, it is precise. The registers are updated for all instructions
before the exception point and none after. That is precise.

The difference for FP exceptions is that the definition of when they
are detected is different. Also, depending on the condition detected,
dest registers may or may not have been updated for the excepting operation.

Also x87 exceptions are signalled at the next WAIT/FWAIT.
The x87 "last instruction pointer" register is the faulting instruction.
The exception handler RIP is the FWAIT.

SSE exceptions are synchronous with the unmasked generating operation.

>> Ultimately there is no way to avoid this imposition of
>> program order on the update of the FP status word,
>> though there are opportunities to optimize this.
> <
> Such as an exception re-order-buffer--or a handful of bits in the
> regular re-order buffer.
> <
>>>> It is not the same as the integer status flags because those
>>>> flags overwrite the prior values so a rename can handle them.
>>>>
>>>> Whereas each FP exception bit has to OR with its prior state
>>>> and then merge with that uOp's exception mask bits to
>>>> decide whether to throw an exception.
>>>> So a simple renamer for the FPSW wouldn't suffice.
>>> That's an interesting deviation from the usual OoO problems.
>>>
>>> One thought is to maintain these bits in the in-order retirement unit.
>>> If multiple FP instructions are retired in a cycle, it's relatively
>>> easy to OR all their FP-exception bits and then or them with the bits
>>> up to now. This is also where control-flow exceptions (traps) are
>>> triggered; and if there are branch instructions that directly branch
>>> on these bits, this is where the predictions are checked.
>> The x87 FPSW also contains the 4 FP condition code bits
>> in addition to the pending exception status bits.
>> The condition bits are updated by overwrite however
>> and so amenable to renaming.
> <
> The enabled flags are also amenable to renaming
> And so are the accumulated flags--as long as they are updated in
> program order as instructions retire. This causes a visible latency
> effect when one reads the bits dependent upon retirement.

Yes, and since this is a "go as fast as possible" scenario
I am looking for ways to optimize that so the FPSW can be
calculated earlier than waiting until Retire.

>> But ultimately they must merge with the overall FPSW in proper order.
> <
> If one HAS a FPSW.

Doesn't IEEE-754 require the sticky exception flags?

>>> The only problem with this approach is an instruction that reads the
>>> FPSW; such an instruction would have to wait until it is retired (and
>>> therefore until all earlier instructions are retired) to produce its
>>> result for the OoO engine. If the result is only used for a
>>> branch, that's not so bad, though: the branch prediction is also only
>>> checked when the branch is retired.
>>>
>>> - anton
>> Yes, doing this all at Retire is beating-it-with-a-blunt-object.
> <
> That is how IEEE defined it. You take the good with the bad.

Right, but the challenge here is to go as fast as possible.

Also looking at SSE they seem to make their situation worse by
having one combined control and status register instead of two.
That looks to my eye like a serious design error as it requires
a sync with the status flags in order to read the control bits,
causing a partial pipeline drain.

> <
>> It works but its not very subtle as any FPSW register reads or writes
>> could have to drain the pipeline. Also delaying branch mispredict
>> detection until Retire wouldn't go over well.
> <
> Prediction of the flags is rather easy, and one only has to become
> dependent when an instruction reads or writes the flags explicitly.
> The vast majority of instructions do not, and so most codes do not
> suffer.
>> It could have a separate circular buffer to which entries are added
>> in program order, that FP uOps output their status updates to,
>> and which drains at its own rate into a future copy of the
>> FPSW register might do the trick.
>>
>> Kind of like a Load-Store Queue, this is an FP Status Queue.
> <
> I always just lobbed this into the re-order buffer (or similar.)

That too. This maintains a future FPSW for early reads and writes
of that register. A Committed FPSW is updated at Retire too.
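
The FP Status Queue described in the quoted text could be modelled
roughly as follows. This is a sketch under assumptions: the names
(fsq, fsq_alloc, fsq_complete, fsq_drain, fsq_rollback) and the
capacity are invented for the example, and it assumes entries younger
than an unresolved branch have not yet been drained, so a rollback
only has to move the allocation pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FSQ_SIZE 8  /* illustrative capacity */

typedef struct {
    uint8_t  flags[FSQ_SIZE];
    bool     done[FSQ_SIZE];
    unsigned alloc;    /* next entry to allocate, in program order */
    unsigned drain;    /* oldest entry not yet merged */
    uint8_t  future;   /* future FPSW: all drained entries ORed in */
} fsq;

/* Allocate an entry at dispatch, in program order; returns its tag. */
unsigned fsq_alloc(fsq *q)
{
    assert(q->alloc - q->drain < FSQ_SIZE);
    q->done[q->alloc % FSQ_SIZE] = false;
    return q->alloc++;
}

/* An FP uOp writes its status into its own entry, in any order. */
void fsq_complete(fsq *q, unsigned tag, uint8_t flags)
{
    q->flags[tag % FSQ_SIZE] = flags;
    q->done[tag % FSQ_SIZE] = true;
}

/* Merge completed entries into the future FPSW, oldest first. */
void fsq_drain(fsq *q)
{
    while (q->drain != q->alloc && q->done[q->drain % FSQ_SIZE]) {
        q->future |= q->flags[q->drain % FSQ_SIZE];
        q->drain++;
    }
}

/* Branch mispredict: discard every entry allocated at or after 'tag'. */
void fsq_rollback(fsq *q, unsigned tag)
{
    assert(tag - q->drain <= q->alloc - q->drain);
    q->alloc = tag;
}
```

The drain loop stops at the first entry that has not completed, which
is what keeps the future FPSW updates in program order even though
the uOps themselves finish out of order.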

>> This makes updates of the FPSW synchronous and in-order
>> but independent of the central instruction retire.
>> Branch mispredict rollback simply resets the queue head pointer
>> and essentially trims all mispredicted status updates.
> <
> Once again, brain dead easy to predict. Prediction may be dependent
> on the current bit pattern in FPSW (if present).

It doesn't need to predict the status flags
but it would be nice to read the FPSW ASAP.

Re: Approximate reciprocals

<1988adbe-4adc-4fe0-af2e-19bf45b43823n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24683&group=comp.arch#24683

X-Received: by 2002:a05:620a:6cc:b0:69b:dd1b:3235 with SMTP id 12-20020a05620a06cc00b0069bdd1b3235mr10557162qky.374.1649689206478;
Mon, 11 Apr 2022 08:00:06 -0700 (PDT)
X-Received: by 2002:a05:6830:22ea:b0:5b2:35c1:de3c with SMTP id
t10-20020a05683022ea00b005b235c1de3cmr11343292otc.282.1649689206018; Mon, 11
Apr 2022 08:00:06 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 08:00:05 -0700 (PDT)
In-Reply-To: <t312pr$j34$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com> <a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com> <t30unk$oih$1@newsreader4.netcologne.de>
<t312pr$j34$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1988adbe-4adc-4fe0-af2e-19bf45b43823n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Mon, 11 Apr 2022 15:00:06 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 68
 by: Michael S - Mon, 11 Apr 2022 15:00 UTC

On Monday, April 11, 2022 at 2:18:22 PM UTC+3, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Quadibloc <jsa...@ecn.ab.ca> schrieb:
> >
> >> My point of view expressed here being: the x86 is _not_ sensible, but
> >> the IBM System/360 not only _is_ sensible, but is the very paradigm
> >> of a sensible computer architecture.
> >
> > Which is silly, unless your definition of sensible differs from
> > everybody else's. Floating point was a huge regression vs. the 704,
> > but there were also other criticsims. John Levine posted a list
> > recently, which I would have to dig up (using a register instead of
> > pc-relative adressing and the restriction to 4095 bytes for offsets
> > were among its faults).
> >
> > They did groundbreaking work with the /360 design, but people
> > certainly have learned since then (and also from the mistakes they
> > made at the time).
> >
> It is easy to see that they did make several mistakes, not just the
> 12-bit max address offsets and the hex FP, just look at all the stuff
> they have modified/added alternative options on later versions of the
> architecture.
>
> 24-bit addressing which ignored the upper byte is probably the most
> obvious of those, it caused a major flag day when they went to 31-bit.
>
> OTOH, they have been extremely good at providing user program and mostly
> also OS backwards compatibility, but 86 has been at least as good here.
>
> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

IMHO, the S/360's major mistake has nothing to do with the details of the ISA.

The major mistake was a managerial decision to never enter the mass market.
Probably, at any given point in time it looked perfectly reasonable, because
at any given point entering the mass market would have meant cannibalizing
nearly all mid-range sales and adversely affecting the high end. And on a short
time scale, the profit gained in the mass market wouldn't have fully
compensated for the profit lost at the higher end.
But without entering the mass market it inevitably ended up where it ended up,
i.e. the architecture exists and implementations are even pretty good and even
quite competitive if we ignore the price tag. But it is completely irrelevant in
the Grand Scheme of Things. And, given the circumstances, it's still a relatively
good outcome. It could have been worse, much worse.

From a technical point of view, the most suitable point for expansion of
S/360 into the mass market was 1994. The 9672 family that was introduced
that year was the first S/360-compatible CPU suitable for cheap
manufacturing and for a deskside and maybe even desktop format.
For the first few years it sucked performance-wise, but considering
potential internal opposition to such a step, that was not necessarily a bad property.

I find it ironic that it happened (that is, it happened to not happen) during the
early stage of the reign of Lou Gerstner, who is commonly considered one
of the best CEOs in the history of IBM.

Re: Approximate reciprocals

<t31mgs$8s1$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=24684&group=comp.arch#24684

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-3c49-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 16:54:52 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t31mgs$8s1$1@newsreader4.netcologne.de>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
<2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
<a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
<t30unk$oih$1@newsreader4.netcologne.de> <t312pr$j34$1@gioia.aioe.org>
<1988adbe-4adc-4fe0-af2e-19bf45b43823n@googlegroups.com>
Injection-Date: Mon, 11 Apr 2022 16:54:52 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-3c49-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:3c49:0:7285:c2ff:fe6c:992d";
logging-data="9089"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Mon, 11 Apr 2022 16:54 UTC

Michael S <already5chosen@yahoo.com> schrieb:

> IMHO, the S/360's major mistake has nothing to do with the details of the ISA.
>
> The major mistake was a managerial decision to never enter the mass market.
> Probably, at any given point in time it looked perfectly reasonable, because
> at any given point entering the mass market would have meant cannibalizing
> nearly all mid-range sales and adversely affecting the high end. And on a short
> time scale, the profit gained in the mass market wouldn't have fully
> compensated for the profit lost at the higher end.

How could it have succeeded?

The mass market didn't need what the S/360 ff. architecture was
good at: running established commercial applications for large
companies doing lots of I/O.

(German Wikipedia tells me of a 68000-360 which had S/360 microcode, I'm
not sure if I believe that).

> But without entering the mass market it inevitably ended up where it ended up,
> i.e. the architecture exists and implementations are even pretty good and even
> quite competitive if we ignore the price tag. But it is completely irrelevant in
> the Grand Scheme of Things. And, given the circumstances, it's still a relatively
> good outcome. It could have been worse, much worse.
>
> From a technical point of view, the most suitable point for expansion of
> S/360 into the mass market was 1994. The 9672 family that was introduced
> that year was the first S/360-compatible CPU suitable for cheap
> manufacturing and for a deskside and maybe even desktop format.
> For the first few years it sucked performance-wise, but considering
> potential internal opposition to such a step, that was not necessarily a bad property.

Look at what was already available in 1994. The minis were
almost dead, RISC ruled the workstation market, and the VAX 7000
vs Alpha comparison had shown that, given the technology of the
day, CISC was not the way to compete for raw computing power.
And Intel had already launched the Pentium. Plus, IBM was busy
selling high-performance RS/6000 workstations at the time.

All the people who were using PCs as a hobby, for gaming or for
business purposes had no IBM software to entice them to run
a /360 compatible computer.

Plus, the operating systems that IBM was offering were... arcane,
to say the least. Which private person or small company could
have administered an MVS system?

> I find it ironic that it happened (that is, it happened to not happen) during the
> early stage of the reign of Lou Gerstner, who is commonly considered one
> of the best CEOs in the history of IBM.

If IBM had wanted to push the /360, they should have done so
much earlier, by trying to go into the mini market. But I doubt they
could have competed with the Data General Nova on price/performance
ratio with all that excess baggage they were carrying.

The 801, now...

Re: Approximate reciprocals

<memo.20220411180321.13360B@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=24685&group=comp.arch#24685

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 18:03 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <memo.20220411180321.13360B@jgd.cix.co.uk>
References: <t31mgs$8s1$1@newsreader4.netcologne.de>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="9618dbe60080bdc9b6ea7edd53a272a6";
logging-data="13508"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX195eBrhbPFpIxv5AFcAj2MVWB5Lnqby1Tg="
Cancel-Lock: sha1:fO4QR7uziWpwKzpsycPqQqMO2HQ=
 by: John Dallman - Mon, 11 Apr 2022 17:03 UTC

In article <t31mgs$8s1$1@newsreader4.netcologne.de>,
tkoenig@netcologne.de (Thomas Koenig) wrote:

> (German Wikipedia tells me of a 68000-360 which had S/360
> microcode, I'm not sure if I believe that).

It existed, but was not generally available. It was the core of
<https://en.wikipedia.org/wiki/PC-based_IBM_mainframe-compatible_systems#Personal_Computer_XT/370>

John

Re: Approximate reciprocals

<t31onr$1fo4$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24686&group=comp.arch#24686

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Approximate reciprocals
Date: Mon, 11 Apr 2022 19:32:41 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t31onr$1fo4$1@gioia.aioe.org>
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com>
<memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com>
<3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com>
<2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com>
<a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
<t30unk$oih$1@newsreader4.netcologne.de> <t312pr$j34$1@gioia.aioe.org>
<1988adbe-4adc-4fe0-af2e-19bf45b43823n@googlegroups.com>
<t31mgs$8s1$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="48900"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Mon, 11 Apr 2022 17:32 UTC

Thomas Koenig wrote:
> Michael S <already5chosen@yahoo.com> schrieb:
>
>> IMHO, the S/360's major mistake has nothing to do with the details of the ISA.
>>
>> The major mistake was a managerial decision to never enter the mass market.
>> Probably, at any given point in time it looked perfectly reasonable, because
>> at any given point entering the mass market would have meant cannibalizing
>> nearly all mid-range sales and adversely affecting the high end. And on a short
>> time scale, the profit gained in the mass market wouldn't have fully
>> compensated for the profit lost at the higher end.
>
> How could it have succeeded?
>
> The mass market didn't need what the S/360 ff. architecture was
> good at: running established commercial applications for large
> companies doing lots of I/O.
>
> (German Wikipedia tells me of a 68000-360 which had S/360 microcode, I'm
> not sure if I believe that).

Why not?

It did in fact exist, afaik it delivered pretty bad performance but did
work for sw development.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Approximate reciprocals

<4564bcbf-4bc7-4fa2-8da8-df1cd5242b48n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24687&group=comp.arch#24687

X-Received: by 2002:a37:9e55:0:b0:69b:e707:e319 with SMTP id h82-20020a379e55000000b0069be707e319mr503109qke.561.1649701854268;
Mon, 11 Apr 2022 11:30:54 -0700 (PDT)
X-Received: by 2002:a05:6808:218a:b0:2f9:65d4:898a with SMTP id
be10-20020a056808218a00b002f965d4898amr215927oib.27.1649701854046; Mon, 11
Apr 2022 11:30:54 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 11:30:53 -0700 (PDT)
In-Reply-To: <2022Apr10.103214@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4564bcbf-4bc7-4fa2-8da8-df1cd5242b48n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Mon, 11 Apr 2022 18:30:54 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 77
 by: Michael S - Mon, 11 Apr 2022 18:30 UTC

On Sunday, April 10, 2022 at 11:53:32 AM UTC+3, Anton Ertl wrote:
> Quadibloc <jsa...@ecn.ab.ca> writes:
> >You have heard the crazy way *I* would have designed a processor. With separate
> >pipelines for single precision and double precision and quad precision and one-and-a-half
> >precision.
> >
> >But if one were to put a 60-bit floating-point number down the double precision pipeline,
> >no, one would not have to drain it to change the mode. (The double precision pipeline
> >would actually be designed for 72-bit floats, which would also use it. 80 bit temporary
> >reals would go down the double precision pipeline, unless there was one for 96-bit
> >floats.)
> >
> >So mixing precisions would make your programs go *faster*, because it would utilize
> >the other pipelines that otherwise would not be used. The very opposite of Intel!
> In the last weeks I have noticed unusually many people posting
> over-long lines (i.e., longer than 80 chars, and ideally you should
> limit the lines to 70-72 chars to leave room for quoting). Is there
> some new attack on Usenet conventions coming out from Google?
>
> Anyway, my guess for the reason for slow precision-setting

I want to add a few facts to the nice theoretical discussion
about the horrible slowness of changing x87 precision.

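/* Fragment: xbuf[] and nIter are set up elsewhere. With GCC/glibc the
   names used here come from <fpu_control.h>, <quadmath.h>, <x86intrin.h>. */
uint64_t t0 = __rdtsc();
for (int i = 0; i < nIter; ++i) {
#ifdef LDCW_SQRTQ
    /* save the x87 control word, then select extended precision */
    unsigned short cw;
    _FPU_GETCW(cw);
    unsigned short new_cw = cw | _FPU_EXTENDED;
    _FPU_SETCW(new_cw);
#endif
    xbuf[i] = sqrtq(xbuf[i]);
#ifdef LDCW_SQRTQ
    /* restore the saved control word */
    _FPU_SETCW(cw);
#endif
}
uint64_t t1 = __rdtsc();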

When LDCW_SQRTQ is undefined the loop runs at 587 base clocks per iteration.
When LDCW_SQRTQ is defined it runs at 595 base clocks per iteration.

So 8 clocks for save/modify/restore in a tight loop.
Probably 5-6 clocks in less tight loops.

Processor: Zen3.
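
For readers reproducing the measurement, the per-iteration figures
can be derived from the two timestamps roughly as below. The post
does not show this step, so the helper names clocks_per_iter and
cw_overhead are assumptions, as is the use of a plain average.

```c
#include <stdint.h>

/* Average base clocks per iteration from two timestamp readings. */
double clocks_per_iter(uint64_t t0, uint64_t t1, int n_iter)
{
    return (double)(t1 - t0) / (double)n_iter;
}

/* Cost of the control-word save/modify/restore: the difference
   between the run with LDCW_SQRTQ defined and the run without it. */
double cw_overhead(double with_ldcw, double without_ldcw)
{
    return with_ldcw - without_ldcw;
}
```

With the numbers reported above, cw_overhead(595.0, 587.0) gives the
8 clocks quoted for the tight loop.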

> is that
> Intel and AMD microarchitects want the precision setting to be known
> to the decoder, so it can deliver the precision as part of the uop.
> This requires that when setting the precision, decoding of subsequent
> instructions starts from scratch. An alternative would be to deliver
> the precision as another input in the OoO engine, but that would
> require additional resources in the OoO engine, an apparently they
> thought that spending these resources elsewhere would buy more
> performance.
>
> Concerning your separate-pipelines idea, in that setup it's even more
> advantageous to know the precision early in instruction processing:
> You can have separate queues/ports for the different precisions and
> steer the instructions to these ports early, instead of having common
> FP queues, and steering the instructions to the right pipelines only
> when all the data (including precision) is in; ok, you could also have
> two stages of queues, but that introduces additional complication,
> area, and probably latency.
>
> Also, even if switching is fast, how frequent is code with mixed
> precision? E.g., in DGEMM you only use double precision operations,
> while in SGEMM you only use single-precision operations.
>
> Bottom line: There's a reason why Intel and AMD are designing their
> CPUs the way they are.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Approximate reciprocals

<2e2f3fe3-8492-4002-b953-c2c09a00a62en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24688&group=comp.arch#24688

X-Received: by 2002:ac8:440d:0:b0:2ee:329e:1e86 with SMTP id j13-20020ac8440d000000b002ee329e1e86mr536762qtn.689.1649702188467;
Mon, 11 Apr 2022 11:36:28 -0700 (PDT)
X-Received: by 2002:a05:6808:2018:b0:2ec:c22b:15b8 with SMTP id
q24-20020a056808201800b002ecc22b15b8mr241605oiw.136.1649702188216; Mon, 11
Apr 2022 11:36:28 -0700 (PDT)
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 11:36:28 -0700 (PDT)
In-Reply-To: <4564bcbf-4bc7-4fa2-8da8-df1cd5242b48n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<4564bcbf-4bc7-4fa2-8da8-df1cd5242b48n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2e2f3fe3-8492-4002-b953-c2c09a00a62en@googlegroups.com>
Subject: Re: Approximate reciprocals
From: already5...@yahoo.com (Michael S)
Injection-Date: Mon, 11 Apr 2022 18:36:28 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 79
 by: Michael S - Mon, 11 Apr 2022 18:36 UTC

On Monday, April 11, 2022 at 9:30:56 PM UTC+3, Michael S wrote:
> On Sunday, April 10, 2022 at 11:53:32 AM UTC+3, Anton Ertl wrote:
> > Quadibloc <jsa...@ecn.ab.ca> writes:
> > >You have heard the crazy way *I* would have designed a processor. With separate
> > >pipelines for single precision and double precision and quad precision and one-and-a-half
> > >precision.
> > >
> > >But if one were to put a 60-bit floating-point number down the double precision pipeline,
> > >no, one would not have to drain it to change the mode. (The double precision pipeline
> > >would actually be designed for 72-bit floats, which would also use it. 80 bit temporary
> > >reals would go down the double precision pipeline, unless there was one for 96-bit
> > >floats.)
> > >
> > >So mixing precisions would make your programs go *faster*, because it would utilize
> > >the other pipelines that otherwise would not be used. The very opposite of Intel!
> > In the last weeks I have noticed unusually many people posting
> > over-long lines (i.e., longer than 80 chars, and ideally you should
> > limit the lines to 70-72 chars to leave room for quoting). Is there
> > some new attack on Usenet conventions coming out from Google?
> >
> > Anyway, my guess for the reason for slow precision-setting
> I want to add just a little bit of facts to the nice theoretical discussion
> about horrible slowness of changing of x87 precision.
>
> uint64_t t0 = __rdtsc();
> for (int i = 0; i < nIter; ++i) {
> #ifdef LDCW_SQRTQ
>   /* save the x87 control word, then select extended precision */
>   unsigned short cw;
>   _FPU_GETCW(cw);
>   unsigned short new_cw = cw | _FPU_EXTENDED;
>   _FPU_SETCW(new_cw);
> #endif
>   xbuf[i] = sqrtq(xbuf[i]);
> #ifdef LDCW_SQRTQ
>   /* restore the saved control word */
>   _FPU_SETCW(cw);
> #endif
> }
> uint64_t t1 = __rdtsc();
>
>
> When LDCW_SQRTQ is undefined the loop runs at 587 base clocks per iteration.
> When LDCW_SQRTQ is undefined it runs at 595 base clocks per iteration.

Please read:
When LDCW_SQRTQ is defined it runs at 595 base clocks per iteration.

>
> So 8 clocks for save/modify/restore in tight loop.
> Probably, 5-6 clocks in less tight loops.
>
> Processor: Zen3.
> > is that
> > Intel and AMD microarchitects want the precision setting to be known
> > to the decoder, so it can deliver the precision as part of the uop.
> > This requires that when setting the precision, decoding of subsequent
> > instructions starts from scratch. An alternative would be to deliver
> > the precision as another input in the OoO engine, but that would
> > require additional resources in the OoO engine, and apparently they
> > thought that spending these resources elsewhere would buy more
> > performance.
> >
> > Concerning your separate-pipelines idea, in that setup it's even more
> > advantageous to know the precision early in instruction processing:
> > You can have separate queues/ports for the different precisions and
> > steer the instructions to these ports early, instead of having common
> > FP queues, and steering the instructions to the right pipelines only
> > when all the data (including precision) is in; ok, you could also have
> > two stages of queues, but that introduces additional complication,
> > area, and probably latency.
> >
> > Also, even if switching is fast, how frequent is code with mixed
> > precision? E.g., in DGEMM you only use double precision operations,
> > while in SGEMM you only use single-precision operations.
> >
> > Bottom line: There's a reason why Intel and AMD are designing their
> > CPUs the way they are.
> > - anton
> > --
> > 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> > Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
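
The 8-clock figure above is the difference between two per-iteration
averages taken from the TSC readings around the loop. A minimal sketch of
that arithmetic follows; the function names and the iteration count of
1000 in the usage note are hypothetical, not taken from the original
benchmark.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of one loop iteration in base (TSC) clocks,
   given the two timestamps read around the loop. */
double clocks_per_iter(uint64_t t0, uint64_t t1, uint64_t n_iter) {
    assert(n_iter > 0 && t1 >= t0);
    return (double)(t1 - t0) / (double)n_iter;
}

/* Clocks per iteration attributable to the control-word
   save/modify/restore: the difference between the build with
   LDCW_SQRTQ defined and the build without it. */
double cw_overhead(double with_ldcw, double without_ldcw) {
    return with_ldcw - without_ldcw;
}
```

With an assumed 1000 iterations, a TSC delta of 587000 gives 587 clocks
per iteration, and cw_overhead(595.0, 587.0) gives the 8 clocks quoted
for the tight loop.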

Re: Approximate reciprocals

<34ad8d82-bbea-4502-bab9-ceef4ef824c5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24689&group=comp.arch#24689

X-Received: by 2002:ad4:5cc3:0:b0:441:1959:cb45 with SMTP id iu3-20020ad45cc3000000b004411959cb45mr28688595qvb.93.1649705209262;
Mon, 11 Apr 2022 12:26:49 -0700 (PDT)
X-Received: by 2002:a05:6870:1697:b0:e2:a341:a2e with SMTP id
j23-20020a056870169700b000e2a3410a2emr387655oae.69.1649705207550; Mon, 11 Apr
2022 12:26:47 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 12:26:47 -0700 (PDT)
In-Reply-To: <7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:502:76e7:e8a9:7d2e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:502:76e7:e8a9:7d2e
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com> <a9d77bf9-e4c6-4afd-bbd2-b0f1d61daedan@googlegroups.com>
<7d9bed9d-b010-4ac8-b4f1-54320379920bn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <34ad8d82-bbea-4502-bab9-ceef4ef824c5n@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 11 Apr 2022 19:26:49 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 11 Apr 2022 19:26 UTC

On Monday, April 11, 2022 at 12:37:57 AM UTC-5, Quadibloc wrote:
> On Sunday, April 10, 2022 at 6:16:18 PM UTC-6, MitchAlsup wrote:
> > On Sunday, April 10, 2022 at 7:09:45 PM UTC-5, Quadibloc wrote:
>
> > > But then, x86 isn't even big-endian, so I guess one can't really
> > > expect it to closely hew to the standards set by the One
> > > Sensible Architecture as was used in Incredibly Big
> > > Machines.
>
> > I think this is the first time I have seen x86 and sensible in the same
> > sentence.
> Good one! But that would be in the direction of _agreement_ with my
> point of view, with which you are no doubt *not* totally in agreement.
> My point of view expressed here being: the x86 is _not_ sensible, but
> the IBM System/360 not only _is_ sensible, but is the very paradigm
> of a sensible computer architecture.
<
I would say IBM 360 was a sensible architecture at the time of its
creation. I am not so sure what it has morphed into remains so.
>
> Or should we say, or sing, that it is the very _model_ of a...
>
> John Savard

Re: Approximate reciprocals

<b1f3a14a-2a6e-4b8b-89d0-b128d6e0420en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24690&group=comp.arch#24690

X-Received: by 2002:a05:622a:18a4:b0:2e1:e7a5:98ba with SMTP id v36-20020a05622a18a400b002e1e7a598bamr749013qtc.424.1649705398491;
Mon, 11 Apr 2022 12:29:58 -0700 (PDT)
X-Received: by 2002:a05:6808:1451:b0:2ec:cfe4:21e with SMTP id
x17-20020a056808145100b002eccfe4021emr313894oiv.147.1649705398231; Mon, 11
Apr 2022 12:29:58 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 11 Apr 2022 12:29:58 -0700 (PDT)
In-Reply-To: <2022Apr11.090246@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:502:76e7:e8a9:7d2e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:502:76e7:e8a9:7d2e
References: <16su4hdjofh949len5eha1ncb73r4av8oe@4ax.com> <memo.20220408160456.22520T@jgd.cix.co.uk>
<tmg35h5jeb295i594psbeih9dlrjik3cvs@4ax.com> <3a582037-b580-45a1-9262-c4e0a4ced2ban@googlegroups.com>
<5fd7f105-c8ce-48cf-8cfb-13e98a584649n@googlegroups.com> <2022Apr10.103214@mips.complang.tuwien.ac.at>
<66535230-19e7-46e3-a6f8-8472ddd27ec2n@googlegroups.com> <2022Apr11.090246@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b1f3a14a-2a6e-4b8b-89d0-b128d6e0420en@googlegroups.com>
Subject: Re: Approximate reciprocals
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 11 Apr 2022 19:29:58 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Mon, 11 Apr 2022 19:29 UTC

On Monday, April 11, 2022 at 2:30:07 AM UTC-5, Anton Ertl wrote:
> Quadibloc <jsa...@ecn.ab.ca> writes:
> >On Sunday, April 10, 2022 at 2:53:32 AM UTC-6, Anton Ertl wrote:
> >
> >> Anyway, my guess for the reason for slow precision-setting is that
> >> Intel and AMD microarchitects want the precision setting to be known
> >> to the decoder, so it can deliver the precision as part of the uop.
> >
> >Well, of course they would want _that_.
> >
> >So, clearly, it's my fault for being insufficiently familiar with x86
> >machine language... I knew that the x87 worked with a stack,
> >and it wasn't until much later that more sensible non-vector float
> >instructions were added to the architecture as part of a set of
> >vector extensions... but I did not realize that the x86 architecture
> >worked with generic floating-point instructions, and changed their
> >meaning by setting some precision register.
> Concerning rounding modes, this approach has been standardized by
> IEEE. For precision, the 8087 ff. took the same approach. I think
> the idea was that you would almost always use the 64-bit mantissa,
> and only use precision control in exceptional circumstances.
> >So if we include integer math, we get
> >the following set of instructions: Add Byte, Add Halfword, Add,
> >Add Long,
<
> "Halfword" and "long" are terms with architecture-specific meanings
> (if any), but I understand that you want to have add instructions for
> different lengths. IA-32 and AMD64 have these, but the 8-bit and
> 16-bit variants require partial register updates, which cause
> performance problems and require extra area for mitigating them.
> AMD64 defined the 32-bit variant as zero-extending instruction (i.e.,
> with a full-width result), which eliminated the partial-register
> update problem. Load/Store architectures only have full-width ALU
> operations. There is no advantage to partial-width operations and
<
That applies to partial-width calculations; there will always remain the
need for partial-width memory references (which are operations).
<
> little to extending variants. One issue is adding a sign- or
> zero-extended 32-bit value to a 64-bit address (thanks to the I32LP64
> (and IL32LLP64) brain-damage), and Aarch64 has addressing modes for
> that; if you have sign- and zero-extending 32-bit adds, you don't need
> these, but no architecture has both.
> >But then, x86 isn't even big-endian, so I guess one can't really
> >expect it to closely hew to the standards set by the One
> >Sensible Architecture as was used in Incredibly Big
> >Machines.
> Maybe you should read about where the term "big-endian" comes from to
> understand how ridiculous this statement is.
>
> And given that we started with FP operations, you are probably the
> only one (including everyone from IBM) who thinks that the S/360 is a
> standard to be closely followed.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
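
The register-width points quoted above can be modelled in plain C. This
is only an illustrative sketch: the function names are hypothetical, and
the functions model the architectural result of each kind of write, not
how any particular core implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an 8-bit register write on IA-32/AMD64: only the low byte
   changes, so the new value depends on the old register contents
   (the partial register update). */
uint64_t write_low8(uint64_t old_reg, uint8_t result) {
    return (old_reg & ~(uint64_t)0xFF) | result;
}

/* Model of a 32-bit register write on AMD64: the result is
   zero-extended to 64 bits, so the old contents do not matter. */
uint64_t write_low32(uint64_t old_reg, uint32_t result) {
    (void)old_reg;  /* deliberately unused: no dependency on the old value */
    return (uint64_t)result;
}

/* 64-bit address from a base plus a signed 32-bit index (sign-extended). */
uint64_t addr_sext(uint64_t base, int32_t index) {
    return base + (uint64_t)(int64_t)index;
}

/* 64-bit address from a base plus an unsigned 32-bit index (zero-extended). */
uint64_t addr_zext(uint64_t base, uint32_t index) {
    return base + (uint64_t)index;
}

/* Small self-check showing how the two extensions diverge. */
void extension_demo(void) {
    assert(addr_sext(0x1000, -1) == 0x0FFF);
    assert(addr_zext(0x1000, 0xFFFFFFFFu) == 0x100000FFFULL);
}
```

The same 32-bit pattern 0xFFFFFFFF moves the address back by one when
sign-extended and forward by almost 4 GiB when zero-extended, which is
why the extension has to be chosen somewhere, either in the add or in
the addressing mode.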
