devel / comp.arch / Re: Ill-advised use of CMOVE

Subject -- Author
* Ill-advised use of CMOVE -- Stefan Monnier
+* Re: Ill-advised use of CMOVE -- Thomas Koenig
|`* Re: Ill-advised use of CMOVE -- Stephen Fuld
| `* Re: Ill-advised use of CMOVE -- Thomas Koenig
|  +* Re: Ill-advised use of CMOVE -- Stephen Fuld
|  |`- Re: Ill-advised use of CMOVE -- aph
|  `* Re: Ill-advised use of CMOVE -- Anton Ertl
|   +* Re: Ill-advised use of CMOVE -- Michael S
|   |`* Re: Ill-advised use of CMOVE -- Anton Ertl
|   | `* Re: Ill-advised use of CMOVE -- Michael S
|   |  +* Re: Ill-advised use of CMOVE -- Ivan Godard
|   |  |`* Re: Ill-advised use of CMOVE -- Michael S
|   |  | `* Re: Ill-advised use of CMOVE -- Ivan Godard
|   |  |  `- Re: Ill-advised use of CMOVE -- Michael S
|   |  `* Re: Ill-advised use of CMOVE -- Anton Ertl
|   |   +- Re: Ill-advised use of CMOVE -- Michael S
|   |   `* branchless binary search (was: Ill-advised use of CMOVE) -- Anton Ertl
|   |    +* Re: branchless binary search -- Stefan Monnier
|   |    |`- Re: branchless binary search -- Anton Ertl
|   |    +- Re: branchless binary search -- Terje Mathisen
|   |    `* Re: branchless binary search -- EricP
|   |     +* Re: branchless binary search -- Michael S
|   |     |+* Re: branchless binary search -- Stephen Fuld
|   |     ||`* Re: branchless binary search -- Michael S
|   |     || +- Re: branchless binary search -- Thomas Koenig
|   |     || `* Spectre fix (was: branchless binary search) -- Anton Ertl
|   |     ||  `- Re: Spectre fix (was: branchless binary search) -- Michael S
|   |     |+* Re: branchless binary search -- Stefan Monnier
|   |     ||`- Re: branchless binary search -- MitchAlsup
|   |     |`* Re: branchless binary search -- Andy Valencia
|   |     | `- Re: branchless binary search -- Terje Mathisen
|   |     `* Re: branchless binary search -- Anton Ertl
|   |      `* Re: branchless binary search -- MitchAlsup
|   |       `* Spectre and resource contention (was: branchless binary search) -- Anton Ertl
|   |        +* Re: Spectre and resource contention -- Stefan Monnier
|   |        |+- Re: Spectre and resource contention -- MitchAlsup
|   |        |`- Re: Spectre and resource contention -- Anton Ertl
|   |        `* Re: Spectre and resource contention (was: branchless binary search) -- MitchAlsup
|   |         `* Re: Spectre and resource contention (was: branchless binary search) -- Anton Ertl
|   |          `- Re: Spectre and resource contention (was: branchless binary search) -- MitchAlsup
|   +* Re: Ill-advised use of CMOVE -- Stephen Fuld
|   |`* binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Anton Ertl
|   | +* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Stephen Fuld
|   | |`* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- John Levine
|   | | `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |  `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |   `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |    `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |     `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |      `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |       `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | |        `* Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Brett
|   | |         `- Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- Michael S
|   | `- Re: binary search vs. hash tables (was: Ill-advised use of CMOVE) -- John Levine
|   +* Re: Ill-advised use of CMOVE -- EricP
|   |`* Re: Ill-advised use of CMOVE -- BGB
|   | +* Re: Ill-advised use of CMOVE -- MitchAlsup
|   | |+* Re: Ill-advised use of CMOVE -- BGB
|   | ||`* Re: Ill-advised use of CMOVE -- MitchAlsup
|   | || +* Re: Ill-advised use of CMOVE -- BGB
|   | || |`- Re: Ill-advised use of CMOVE -- MitchAlsup
|   | || `- Re: Ill-advised use of CMOVE -- Ivan Godard
|   | |`* Re: Ill-advised use of CMOVE -- Ivan Godard
|   | | `- Re: Ill-advised use of CMOVE -- MitchAlsup
|   | `- Re: Ill-advised use of CMOVE -- Ivan Godard
|   `- Re: Ill-advised use of CMOVE -- Thomas Koenig
+* Re: Ill-advised use of CMOVE -- Terje Mathisen
|+* Re: Ill-advised use of CMOVE -- MitchAlsup
||`* Re: Ill-advised use of CMOVE -- Terje Mathisen
|| `* Re: Ill-advised use of CMOVE -- MitchAlsup
||  `* Re: Ill-advised use of CMOVE -- Marcus
||   `* Re: Ill-advised use of CMOVE -- MitchAlsup
||    `* Re: Ill-advised use of CMOVE -- BGB
||     `* Re: Ill-advised use of CMOVE -- MitchAlsup
||      `- Re: Ill-advised use of CMOVE -- BGB
|+- Re: Ill-advised use of CMOVE -- Ivan Godard
|`* Re: Ill-advised use of CMOVE -- Stephen Fuld
| +* Re: Ill-advised use of CMOVE -- Michael S
| |`* Re: Ill-advised use of CMOVE -- Stephen Fuld
| | `- Re: Ill-advised use of CMOVE -- Anton Ertl
| +* Re: Ill-advised use of CMOVE -- Stefan Monnier
| |`* Re: Ill-advised use of CMOVE -- Stephen Fuld
| | `* Re: Ill-advised use of CMOVE -- Anton Ertl
| |  `* Re: Ill-advised use of CMOVE -- BGB
| |   `* Re: Ill-advised use of CMOVE -- MitchAlsup
| |    +- Re: Ill-advised use of CMOVE -- BGB
| |    `* Re: Ill-advised use of CMOVE -- Thomas Koenig
| |     +- Re: Ill-advised use of CMOVE -- Anton Ertl
| |     `- Re: Ill-advised use of CMOVE -- MitchAlsup
| +* Re: Ill-advised use of CMOVE -- Anton Ertl
| |`* Re: Ill-advised use of CMOVE -- MitchAlsup
| | `* Re: Ill-advised use of CMOVE -- EricP
| |  +* Re: Ill-advised use of CMOVE -- Michael S
| |  |`* Re: Ill-advised use of CMOVE -- EricP
| |  | +- Re: Ill-advised use of CMOVE -- MitchAlsup
| |  | +* Re: Ill-advised use of CMOVE -- Terje Mathisen
| |  | |`* Re: Ill-advised use of CMOVE -- EricP
| |  | | `* Re: Ill-advised use of CMOVE -- MitchAlsup
| |  | |  `- Re: Ill-advised use of CMOVE -- BGB
| |  | `- Re: Ill-advised use of CMOVE -- Anton Ertl
| |  `* Re: Ill-advised use of CMOVE -- Anton Ertl
| `* Re: Ill-advised use of CMOVE -- Terje Mathisen
+- Re: Ill-advised use of CMOVE -- Anton Ertl
`* Re: Ill-advised use of CMOVE -- Scott Michel

Re: Ill-advised use of CMOVE

<e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25201&group=comp.arch#25201
X-Received: by 2002:a05:620a:414d:b0:6a0:2035:f097 with SMTP id k13-20020a05620a414d00b006a02035f097mr19959436qko.458.1652290455804;
Wed, 11 May 2022 10:34:15 -0700 (PDT)
X-Received: by 2002:a05:6870:b68d:b0:de:9da7:9615 with SMTP id
cy13-20020a056870b68d00b000de9da79615mr3338552oab.117.1652290455403; Wed, 11
May 2022 10:34:15 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 May 2022 10:34:15 -0700 (PDT)
In-Reply-To: <hlPeK.4263$pqKf.2583@fx12.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at>
<0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com> <hlPeK.4263$pqKf.2583@fx12.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: already5...@yahoo.com (Michael S)
Injection-Date: Wed, 11 May 2022 17:34:15 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 5135
 by: Michael S - Wed, 11 May 2022 17:34 UTC

On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
> MitchAlsup wrote:
> > On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
> >> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >>> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
> >>>> It is extremely hard to find micro benchmarks where using CMOV to
> >>>> eliminate a branch is a win, somewhat better for larger/full programs
> >>>> but still very rare.
> >>> I am not questioning the truth of that, but I am trying to figure out
> >>> why.
> >> I would like to see some empirical support for that statement. It's
> >> pretty easy to design a micro benchmark where CMOV wins by a large
> >> margin. I guess what he meant is that in a typical microbenchmark
> >> aimed at some other characteristic using CMOV instead of branching
> >> usually is a loss.
> >>> Since the CMOV is a pretty big win when the branch is
> >>> mispredicted, it must be a loss when the prediction would have been correct.
> >>>
> >>> Is this because
> >>>
> >>> The CMOV itself is too slow?
> >> In an OoO machine, this can mean several different things.
> >>
> >> * high resource usage. On the 21264, CMOV takes two
> >> microinstructions.
> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
> problem but for full predication this approach would not do
> and it requires smarter uOps.
> >>
> >> * long latency. The numbers I have seen are a latency of 1 or 2
> >> cycles.
> >>
> >> * The instruction has to wait on results that take their time to
> >> materialize.
> Note that Alpha CMOV is only reg<-reg
> x86/x64 allows reg<-reg or reg<-mem (conditional load)
> but does not support mem<-reg (conditional store).
> Neither supports reg<-imm (conditional load immediate).
>
> So Alpha must always do the expensive loads and stores and can skip
> the cheap reg<-reg move, which greatly limits its performance improvements.
> x86 can skip some memory loads which allows slightly more performance.
>

I don't think so.
If I am not mistaken, the x86 architecture specifies that the load is always executed.
Or maybe that's just an implementation thing, but since it's documented in the manuals,
future architects will always be afraid to change it.

> But in all cases you must have prepared the data registers for the
> alternate data flow path in advance of the CMOV, which is usually the
> expensive part, and then skip the register move which is the cheap part.
> Which is why CMOV has limited benefits.
> > <
> > CMOV cannot begin executing until:
> > a) both operands are available
> > b) the condition is available
> In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
> proceed independently when the condition and its data source are ready.
> Slightly better.
> > Instructions dependent on CMOV cannot begin execution until
> > c) CMOV delivers its result(s).
> > <
> > It is often this (c) that makes CMOV appear to be slow.
> That is what I read WRT some of the Itanium predication -
> that the control-dataflow can take longer than the data-dataflow.
>
> Full predication is orders of magnitude more complicated than CMOV
> but does contain more opportunities for HW run time optimizations.
>
> This led them to propose what they called "predicate slip" whereby
> the data side can proceed as soon as its operands are ready,
> and the predicate state is checked before retire.
>
> This gets messier if one desires to allow predicate slip but
> later cancel pending uOps when the predicate resolves to disabled
> so you don't perform work that you now know you are going to toss.
> This is more complicated than a branch mispredict because a branch
> mispredict flushes the instruction queue whereas this does not.

Re: Ill-advised use of CMOVE

<t5gt0t$g39$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25203&group=comp.arch#25203
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Wed, 11 May 2022 12:49:05 -0500
Organization: A noiseless patient Spider
Lines: 266
Message-ID: <t5gt0t$g39$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
<jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org> <t5e9g7$c65$1@dont-email.me>
<2022May10.214311@mips.complang.tuwien.ac.at> <t5enf7$4mf$1@dont-email.me>
<13a8e93b-31a3-4cfc-aa2c-a5f373a2e4abn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 11 May 2022 17:50:21 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7155b3a29c990c92a41c6ff303f1405a";
logging-data="16489"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GK7mkmeJJ+Rcm7keS24qq"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:KyNfbNmZaAORKJTvhsBdlNqgELk=
In-Reply-To: <13a8e93b-31a3-4cfc-aa2c-a5f373a2e4abn@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 11 May 2022 17:49 UTC

On 5/10/2022 6:31 PM, MitchAlsup wrote:
> On Tuesday, May 10, 2022 at 5:03:23 PM UTC-5, BGB wrote:
>> On 5/10/2022 2:43 PM, Anton Ertl wrote:
>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>>>> So if you don't use hardware "condition prediction", and the compiler by
>>>> itself doesn't know how well a particular branch will be predicted, we
>>>> are left with the aforementioned programmer provided hints, or perhaps
>>>> some form of profile driven optimization.
>>>
>>> Normal profiles don't tell you how predictable a branch is. Of course
>>> always-taken is predictable, but 50% taken might be unpredictable, or
>>> perfectly predictable.
>>>
>> I can note here that my branch predictor ended up with states both for
>> predicting branches which are always the same, and for branches which
>> are nearly always the opposite.
>>
>> The "nearly always the same" case being the more common, more
>> traditional option. But, nearly always the opposite, seemed common
>> enough to be worthwhile.
>>
> As I have related in the past: The Mc 88120 had a branch predictor which
> is not based on taken/not-taken, but upon agree/disagree. This allows
> different branches which map to the same counters one taken one not-taken
> to use the same code in the prediction table.
> <
> Back when we were doing this, our branch predictor was about 90%-92%
> accurate, and this distinction was useful for getting rid of 1/3rd of the
> mispredicts. T/NT was 88%ish A/D was 92%ish.
> <
> The way one uses an A/D predictor is to have a code cache organized
> by branches, and code migrated into straightline decode strings. We
> used what we called a packet cache, but you can use a trace cache
> and there are ways of organizing an instruction buffer to have this property.

Yeah, dunno there.

>>
>>
>> Though, this still leaves patterns which are (theoretically)
>> predictable, but would require more complex predictor logic, eg:
>> 110, 110, 110, ...
>> 1110, 1110, 1110, 1110, ...
> <
> This is known of as autocorrelation predictor.
> <
> But all sorts of patterns can "fit"::
> <
> 111011010 repeating.
> <
> I used a predictor such as this in the DRAM controller to prefetch data
> to the DRAM read buffer in K9 and this went into Opteron Rev G.

If one had enough context, lots of things could be possible.

The current scheme is sorta:
Take low bits of address, xor them with the recent global branch history;
Use this as an index into the table of 3-bit states.

The predictor in this case having 64 predictor weights.
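
A rough C model of that indexing scheme (all names are hypothetical, and the plain saturating-counter update below is a placeholder that does not capture the flip-flopping states described later in the post):

#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 64                    /* 64-entry table of 3-bit states */

static uint8_t pred_state[PRED_ENTRIES];   /* 3-bit states, values 0..7 */
static uint8_t global_hist;                /* recent global branch history */

static unsigned pred_index(uint32_t pc)
{
    /* low PC bits XORed with the recent global branch history */
    return (pc ^ global_hist) & (PRED_ENTRIES - 1);
}

static bool predict_taken(uint32_t pc)
{
    return pred_state[pred_index(pc)] >= 4;   /* upper half = predict taken */
}

static void train(uint32_t pc, bool taken)
{
    unsigned i = pred_index(pc);
    if (taken && pred_state[i] < 7)
        pred_state[i]++;
    else if (!taken && pred_state[i] > 0)
        pred_state[i]--;
    global_hist = (uint8_t)((global_hist << 1) | (taken ? 1 : 0));
}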

Could potentially try to make this table bigger. It looks, from the code,
like I had previously tried using bigger tables, but then settled on a
64-entry table (probably because it fits in LUTRAM).

The other (commented out) logic looks like it was trying to map it to a
BlockRAM (but, this use-case doesn't really fit with BlockRAM).

I may not have realized it at the time, but:
32x3: Could use 1 LUTRAM
64x3: Will need 2 LUTRAMs
128x3: Will need 3 LUTRAMs (probably also viable)

I had also experimented with "finer grained" states (vs just weak/strong),
but IIRC, trying to add more fine transitions here actually worsened
branch-prediction accuracy.

So, the current pattern (weak/strong taken/not-taken and flip-flopping)
is what won out in my initial testing.

>>
>> So, we predict a period of, say, 1-3 bits, after which point the pattern
>> is expected to have one branch in the opposite direction (with longer
>> runs being expected to always branch in a single direction).
>>
>> This would likely require a fair bit more bits for the branch
>> predictors' state machine though, and it is already accurate enough that
>> this is maybe not worthwhile.
>>
>>
>> A 6-bit state could possibly pull off, say:
>> Runs of 0 (2 states, weak/strong);
>> Runs of 1 (2 states, weak/strong);
>> Runs of alternating 0/1 (~ 4 states);
>> Runs of 001 (~ 8 states);
>> Runs of 110 (~ 8 states);
>> Runs of 0001 (~ 16 states);
>> Runs of 1110 (~ 16 states);
>> Transition states (~ 8 could go here).
>>
>> The state would need to encode the position within the pattern, and the
>> relative confidence (weak/strong) that the pattern would continue as-is.
>> So, mispredict knocks strong to weak, and weak to a different pattern
>> (longer or shorter alternation, or to one of the other solid-run patterns).
>>
>> I can imagine the state graph for this, but not going to describe it
>> here as it would be annoyingly long.
>>> So: compilers are bad at knowing predictability (even with profile
>>> feedback). Programmers are bad, too, unless they use performance
>>> counter results to learn it. Looks like a good candidate for a
>>> hardware solution to me.
>>>
>> Yeah.
>>
>> It can be noted, this is one of those areas where I went with a hardware
>> solution...
>>
>> As for whether or not I would have been better off going with
>> superscalar than WEX, this is TBD.
>>
>> In theory, I could bolt superscalar onto my ISA as-is to potentially
>> grab up cases the compiler has missed (such as those which are
>> "theoretically safe" but don't match with the serial/parallel
>> equivalence rules, which would do things that aren't officially allowed
>> for the profile but the core in question is capable of doing, for code
>> not built with WEX enabled, or for code which uses 16-bit instruction
>> forms).
>>
>> I have yet to really look into whether this would scavenge enough
>> additional ILP to be worthwhile (would need to model this).
>>
>>
>> In any case, there would likely be a lot more cases which the compiler
>> would see, but the superscalar logic would miss, if I try to keep its
>> complexity modest (to keep it cheap).
> <
> It is all about the patterns needed versus the patterns Decode can recognize.

Basic pattern I started writing for the RISC-V decoding was:
Extract register fields, check whether there are register conflicts;
Classify instruction as to whether it would be allowed as a prefix or
suffix.

In this case, the prefix/suffix determination was handled by looking up
a small set of bit-flags for each instruction, then feeding both sets of
bit-flags through a case lookup.

If one is valid as a prefix, the other valid as a suffix, and the
register fields don't clash, it can set a "virtual WEX bit" for the
prefix instruction.
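
A minimal C sketch of that kind of pairing check (the flag values, the opcode classification, and all names here are assumptions for illustration, not the actual fetch/decode logic):

#include <stdint.h>
#include <stdbool.h>

#define F_CAN_PREFIX 0x01
#define F_CAN_SUFFIX 0x02

/* Hypothetical flag lookup: a stub that treats the RISC-V OP (0x33) and
   OP-IMM (0x13) major opcodes as pairable ALU-class instructions. */
static uint8_t insn_flags(uint32_t insn)
{
    uint32_t op = insn & 0x7F;
    return (op == 0x33 || op == 0x13) ? (F_CAN_PREFIX | F_CAN_SUFFIX) : 0;
}

static bool can_pair(uint32_t a, uint32_t b)
{
    if (!(insn_flags(a) & F_CAN_PREFIX) || !(insn_flags(b) & F_CAN_SUFFIX))
        return false;

    /* Standard RISC-V register fields. */
    uint8_t rd_a  = (a >> 7)  & 0x1F;
    uint8_t rd_b  = (b >> 7)  & 0x1F;
    uint8_t rs1_b = (b >> 15) & 0x1F;
    uint8_t rs2_b = (b >> 20) & 0x1F;

    /* Reject RAW/WAW clashes between the two lanes. */
    if (rd_a != 0 && (rd_a == rd_b || rd_a == rs1_b || rd_a == rs2_b))
        return false;

    return true;   /* OK to set the "virtual WEX bit" on the first insn */
}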

The logic in this case being pretty "quick and dirty", only really
recognizing things like ALU instructions and similar. It also needs to
be cheap enough to operate during instruction fetch.

But, if this logic exists for RISC-V mode, it wouldn't take that much
more to add similar logic for BJX2.

Though, the level of pattern recognition that can happen during fetch is
a lot more limited than what the compiler can do here.

For RISC-V mode, it makes a little more sense, since RISC-V lacks a WEX
bit or similar, so (absent a load-time WEXifier or similar) code could
only operate in serial.

I was previously also weighing against the load-time WEXifier in that it
would be incompatible with the use of the 'C' extension, but I still
have not really implemented RVC support yet, partly because it is a bit
of an ugly dog-chewed mess. Like, it is bad enough that, last time I
tried, I gave myself headache trying to imagine the logic for the decoder.

Its top-level list of encoding forms actually hides a detail: there are
enough internal special cases (bits whose positions move around from one
instruction to another) that the actual number of unique 16-bit
instruction forms is well into the double digits.

It is almost like the designers saw Thumb and were then like "hold my
beer", and got even more bit-twiddly.

For BJX2, it is more unclear whether superscalar would gain enough to be
worthwhile, since the only cases it would likely find (in this simple
form) would be stuff that the compiler missed or otherwise did not go
with.

Though, my compiler uses a more conservative interpretation of the
sequential/parallel equivalence than strictly required by the CPU.

Eg:
ADD R4, R5, R6 | ADD R5, R7, R4

Would execute correctly on the CPU core, but would not be bundled by the
compiler. The compiler will interpret it as-if parallel instructions
could execute in an unknown order, whereas the CPU will not produce any
results until after the bundle executes (so, the equivalence rule would
still hold under this behavior).

Would make more sense though for cases where the core is more advanced
than what is allowed in the baseline profile.


Re: Ill-advised use of CMOVE

<HDTeK.54$56e6.10@fx34.iad>

https://www.novabbs.com/devel/article-flat.php?id=25204&group=comp.arch#25204
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx34.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at> <0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com> <hlPeK.4263$pqKf.2583@fx12.iad> <e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com>
In-Reply-To: <e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 86
Message-ID: <HDTeK.54$56e6.10@fx34.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 11 May 2022 18:54:31 UTC
Date: Wed, 11 May 2022 14:53:15 -0400
X-Received-Bytes: 4332
 by: EricP - Wed, 11 May 2022 18:53 UTC

Michael S wrote:
> On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
>> MitchAlsup wrote:
>>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
>>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>>>>> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
>>>>>> It is extremely hard to find micro benchmarks where using CMOV to
>>>>>> eliminate a branch is a win, somewhat better for larger/full programs
>>>>>> but still very rare.
>>>>> I am not questioning the truth of that, but I am trying to figure out
>>>>> why.
>>>> I would like to see some empirical support for that statement. It's
>>>> pretty easy to design a micro benchmark where CMOV wins by a large
>>>> margin. I guess what he meant is that in a typical microbenchmark
>>>> aimed at some other characteristic using CMOV instead of branching
>>>> usually is a loss.
>>>>> Since the CMOV is a pretty big win when the branch is
>>>>> mispredicted, it must be a loss when the prediction would have been correct.
>>>>>
>>>>> Is this because
>>>>>
>>>>> The CMOV itself is too slow?
>>>> In an OoO machine, this can mean several different things.
>>>>
>>>> * high resource usage. On the 21264, CMOV takes two
>>>> microinstructions.
>> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
>> problem but for full predication this approach would not do
>> and it requires smarter uOps.
>>>> * long latency. The numbers I have seen are a latency of 1 or 2
>>>> cycles.
>>>>
>>>> * The instruction has to wait on results that take their time to
>>>> materialize.
>> Note that Alpha CMOV is only reg<-reg
>> x86/x64 allows reg<-reg or reg<-mem (conditional load)
>> but does not support mem<-reg (conditional store).
>> Neither supports reg<-imm (conditional load immediate).
>>
>> So Alpha must always do the expensive loads and stores and can skip
>> the cheap reg<-reg move, which greatly limits its performance improvements.
>> x86 can skip some memory loads which allows slightly more performance.
>>
>
> I don't think so.
> If I am not mistaken, x86 architecture specifies that load should be always executed.
> Or, may be, that's just an implementation thing, but since it's documented in the manuals
> future architects will be always afraid to change it.

Hmmm... looks like it could be either way.
Intel Vol-2 manual CMOV description says:

"If the condition is not satisfied, a move is not performed and execution
continues with the instruction following the CMOVcc instruction."

which seems an explicit statement of not accessing memory, but later the
pseudo-code description seems to imply the access is always performed:

Operation
temp ← SRC
IF condition TRUE
  THEN
    DEST ← temp;
  FI;
ELSE
  IF (OperandSize = 32 and IA-32e mode active)
    THEN
      DEST[63:32] ← 0;
  FI;
FI;

and there are no other descriptions of CMOV's operation in other manuals.
It also shows that on x64 if operand size==32 it updates the dest
even if the condition is false, which is weird.

The AMD manual says "may report exceptions" which could also be either way:

"For the memory-based forms of CMOVcc, memory-related exceptions
may be reported even if the condition is false."

It looks like on x64 one shouldn't implement
if (ptr) x = *ptr;

as a TEST + CMOV, as it might still take an address fault.

Re: Ill-advised use of CMOVE

<811f1e24-3e72-4360-8f38-e0519585f968n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25206&group=comp.arch#25206
X-Received: by 2002:ad4:5a12:0:b0:456:3040:6b0 with SMTP id ei18-20020ad45a12000000b00456304006b0mr24125013qvb.68.1652299522688;
Wed, 11 May 2022 13:05:22 -0700 (PDT)
X-Received: by 2002:a05:6808:1453:b0:328:ad39:c88b with SMTP id
x19-20020a056808145300b00328ad39c88bmr2699810oiv.103.1652299522377; Wed, 11
May 2022 13:05:22 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 May 2022 13:05:22 -0700 (PDT)
In-Reply-To: <HDTeK.54$56e6.10@fx34.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:cd3e:bc77:963e:ac4c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:cd3e:bc77:963e:ac4c
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at>
<0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com> <hlPeK.4263$pqKf.2583@fx12.iad>
<e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com> <HDTeK.54$56e6.10@fx34.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <811f1e24-3e72-4360-8f38-e0519585f968n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 11 May 2022 20:05:22 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5505
 by: MitchAlsup - Wed, 11 May 2022 20:05 UTC

On Wednesday, May 11, 2022 at 1:54:34 PM UTC-5, EricP wrote:
> Michael S wrote:
> > On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
> >>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >>>>> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
> >>>>>> It is extremely hard to find micro benchmarks where using CMOV to
> >>>>>> eliminate a branch is a win, somewhat better for larger/full programs
> >>>>>> but still very rare.
> >>>>> I am not questioning the truth of that, but I am trying to figure out
> >>>>> why.
> >>>> I would like to see some empirical support for that statement. It's
> >>>> pretty easy to design a micro benchmark where CMOV wins by a large
> >>>> margin. I guess what he meant is that in a typical microbenchmark
> >>>> aimed at some other characteristic using CMOV instead of branching
> >>>> usually is a loss.
> >>>>> Since the CMOV is a pretty big win when the branch is
> >>>>> mispredicted, it must be a loss when the prediction would have been correct.
> >>>>>
> >>>>> Is this because
> >>>>>
> >>>>> The CMOV itself is too slow?
> >>>> In an OoO machine, this can mean several different things.
> >>>>
> >>>> * high resource usage. On the 21264, CMOV takes two
> >>>> microinstructions.
> >> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
> >> problem but for full predication this approach would not do
> >> and it requires smarter uOps.
> >>>> * long latency. The numbers I have seen are a latency of 1 or 2
> >>>> cycles.
> >>>>
> >>>> * The instruction has to wait on results that take their time to
> >>>> materialize.
> >> Note that Alpha CMOV is only reg<-reg
> >> x86/x64 allows reg<-reg or reg<-mem (conditional load)
> >> but does not support mem<-reg (conditional store).
> >> Neither supports reg<-imm (conditional load immediate).
> >>
> >> So Alpha must always do the expensive loads and stores and can skip
> >> the cheap reg<-reg move, which greatly limits its performance improvements.
> >> x86 can skip some memory loads which allows slightly more performance.
> >>
> >
> > I don't think so.
> > If I am not mistaken, x86 architecture specifies that load should be always executed.
> > Or, may be, that's just an implementation thing, but since it's documented in the manuals
> > future architects will be always afraid to change it.
> Hmmm... looks like it could be either way.
> Intel Vol-2 manual CMOV description says:
>
> "If the condition is not satisfied, a move is not performed and execution
> continues with the instruction following the CMOVcc instruction."
>
> which seems an explicit statement of not accessing memory, but later the
> pseudo-code description seems to imply the access is always performed:
>
> Operation
> temp ← SRC
> IF condition TRUE
> THEN
> DEST ← temp;
> FI;
> ELSE
> IF (OperandSize = 32 and IA-32e mode active)
> THEN
> DEST[63:32] ← 0;
> FI;
> FI;
>
> and there is no other descriptions of CMOV's operation in other manuals.
> It also shows that on x64 if operand size==32 it updates the dest
> even if the condition is false, which is weird.
>
> The AMD manual says "may report exceptions" which could also be either way:
>
> "For the memory-based forms of CMOVcc, memory-related exceptions
> may be reported even if the condition is false."
>
> It looks on x64 one shouldn't implement
> if (ptr) x = *ptr;
<
Yes, but the very vast majority of pointer checks against zero are predicted
with great accuracy.
>
> as a TEST, CMOV as it might still address fault.
<
CMOV is for unpredictable branches not the predictable ones.

Re: Ill-advised use of CMOVE

<t5i6ij$1tam$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=25214&group=comp.arch#25214
Path: i2pn2.org!i2pn.org!aioe.org!EhtdJS5E9ITDZpJm3Uerlg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Thu, 12 May 2022 07:39:35 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t5i6ij$1tam$1@gioia.aioe.org>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
<2022May10.194427@mips.complang.tuwien.ac.at>
<0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com>
<hlPeK.4263$pqKf.2583@fx12.iad>
<e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com>
<HDTeK.54$56e6.10@fx34.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="62806"; posting-host="EhtdJS5E9ITDZpJm3Uerlg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.12
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 12 May 2022 05:39 UTC

EricP wrote:
> Michael S wrote:
>> On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
>>> MitchAlsup wrote:
>>>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
>>>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>>>>>> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
>>>>>>> It is extremely hard to find micro benchmarks where using CMOV to
>>>>>>> eliminate a branch is a win, somewhat better for larger/full
>>>>>>> programs but still very rare.
>>>>>> I am not questioning the truth of that, but I am trying to figure
>>>>>> out why.
>>>>> I would like to see some empirical support for that statement. It's
>>>>> pretty easy to design a micro benchmark where CMOV wins by a large
>>>>> margin. I guess what he meant is that in a typical microbenchmark
>>>>> aimed at some other characteristic using CMOV instead of branching
>>>>> usually is a loss.
>>>>>> Since the CMOV is a pretty big win when the branch is
>>>>>> mispredicted, it must be a loss when the prediction would have
>>>>>> been correct.
>>>>>> Is this because
>>>>>> The CMOV itself is too slow?
>>>>> In an OoO machine, this can mean several different things.
>>>>> * high resource usage. On the 21264, CMOV takes two microinstructions.
>>> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
>>> problem but for full predication this approach would not do and it
>>> requires smarter uOps.
>>>>> * long latency. The numbers I have seen are a latency of 1 or 2
>>>>> cycles.
>>>>> * The instruction has to wait on results that take their time to
>>>>> materialize.
>>> Note that Alpha CMOV is only reg<-reg x86/x64 allows reg<-reg or
>>> reg<-mem (conditional load) but does not support mem<-reg
>>> (conditional store). Neither supports reg<-imm (conditional load
>>> immediate).
>>> So Alpha must always do the expensive loads and stores and can skip
>>> the cheap reg<-reg move, which greatly limits its performance
>>> improvements. x86 can skip some memory loads which allows slightly
>>> more performance.
>>
>> I don't think so.
>> If I am not mistaken, x86 architecture specifies that load should be
>> always executed.
>> Or, may be, that's just an implementation thing, but since it's
>> documented in the manuals
>> future architects will be always afraid to change it.
>
> Hmmm... looks like it could be either way.
> Intel Vol-2 manual CMOV description says:
>
> "If the condition is not satisfied, a move is not performed and execution
> continues with the instruction following the CMOVcc instruction."
>
> which seems an explicit statement of not accessing memory, but later the
> pseudo-code description seems to imply the access is always performed:
>
>   Operation
>   temp ← SRC
>   IF condition TRUE
>     THEN
>       DEST ← temp;
>     FI;
>   ELSE
>     IF (OperandSize = 32 and IA-32e mode active)
>       THEN
>         DEST[63:32] ← 0;
>     FI;
>   FI;
>
> and there is no other descriptions of CMOV's operation in other manuals.
> It also shows that on x64 if operand size==32 it updates the dest
> even if the condition is false, which is weird.
>
> The AMD manual says "may report exceptions" which could also be either way:
>
> "For the memory-based forms of CMOVcc, memory-related exceptions
> may be reported even if the condition is false."
>
> It looks on x64 one shouldn't implement
>   if (ptr) x = *ptr;
>
> as a TEST, CMOV as it might still address fault.

CMOV is a MOV, so it always accesses the source, but then it might
discard the result?

The key here is probably that all load-op instructions on x86 perform
the load part in an earlier phase than the operation itself, so it makes
perfect sense both that it always accesses both the source and
destination and always writes to the destination (at least in the
"target-is-busy" sense).

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<t5i7pe$6qm$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=25215&group=comp.arch#25215
Path: i2pn2.org!i2pn.org!aioe.org!EhtdJS5E9ITDZpJm3Uerlg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Thu, 12 May 2022 08:00:18 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t5i7pe$6qm$1@gioia.aioe.org>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="6998"; posting-host="EhtdJS5E9ITDZpJm3Uerlg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.12
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 12 May 2022 06:00 UTC

Stephen Fuld wrote:
> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
>> Stefan Monnier wrote:
>>> We recently bumped into a funny performance behavior in Emacs.
>>> Some code computing the length of a (possibly circular) list ended up
>>> macroexpanded to something like:
>>>
>>>      for (struct for_each_tail_internal li = { list, 2, 0, 2 };
>>>           CONSP (list);
>>>           (list = XCDR (list),
>>>            ((--li.q != 0
>>>              || 0 < --li.n
>>>              || (li.q = li.n = li.max <<= 1, li.n >>=
>>> USHRT_WIDTH,
>>>                  li.tortoise = (list), false))
>>>             && EQ (list, li.tortoise))
>>>            ? (void) (list = Qnil) : (void) 0))
>>>        len++;
>>>
>>> `EQ` is currently a slightly costly operation but in this specific case
>>> it can be replaced with a plain `==`.  When we tried that, the
>>> resulting
>>> loop ended up running almost twice *slower*.
>>>
>>> The problem turned out that with the simpler comparison, both GCC and
>>> LLVM decided it would be a good idea to use CMOVE for the
>>> `?:` operation, which just ends up making the data flow's critical path
>>> longer whereas a branch works much better here since it's trivial
>>> to predict.
>>
>> This has actually been the rule for almost all code since the pentiumPro!
>>
>> It is extremely hard to find micro benchmarks where using CMOV to
>> eliminate a branch is a win, somewhat better for larger/full programs
>> but still very rare.
>
> I am not questioning the truth of that, but I am trying to figure out
> why.  Since the CMOV is a pretty big win when the branch is
> mispredicted, it must be a loss when the prediction would have been
> correct.
>
> Is this because
>
> The CMOV itself is too slow?

Mainly this! 2 cycles of forced latency, on top of typically some extra
work to set up the two alternatives, was just too long.
>
> More code needs to be generated when using the CMOV so more instructions
> are executed?

Yeah, see above.
>
> The CMOV messes up speculative execution, so the instructions after it
> cannot be executed where they would be in the branch case?

I don't think this is the case, but mainly that branch predictors are
just too good:

cmp eax,LIMIT
jb ok
mov eax,LIMIT
ok:

The code above typically runs in zero cycles, i.e. the branch is
correctly predicted to be taken so we don't even have the latency for
EAX to become valid so that the CMP can be done.

When we miss it might take up to 20 cycles to fix it, but with a 99%
branch predictor (typical for such a range boundary saturation), that
corresponds to 0.2 cycles per iteration. Even at 95% correct it is still
on average just one cycle.

CMOV otoh would be:

mov ebx,LIMIT ; Can typically be hoisted above

cmp eax,ebx
cmovae eax,ebx

which has a fixed 3-cycle latency: One cycle to do the CMP followed by
two cycles for the CMOV. Even with a single-cycle CMOV it would still be
2 cycles.
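
In source terms, both sequences implement a clamp like the following C sketch (the LIMIT value and the function name are arbitrary; whether the compiler emits the branchy form or the CMOV form is its own choice):

/* C-level view of the clamp that both asm sequences above implement. */
unsigned clamp_to_limit(unsigned x)
{
    const unsigned LIMIT = 1000;      /* arbitrary illustrative value */
    return (x < LIMIT) ? x : LIMIT;   /* jb+mov  versus  cmp+cmovae   */
}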

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<2022May12.091435@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25216&group=comp.arch#25216
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Thu, 12 May 2022 07:14:35 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 74
Message-ID: <2022May12.091435@mips.complang.tuwien.ac.at>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at> <0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com> <hlPeK.4263$pqKf.2583@fx12.iad> <e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com> <HDTeK.54$56e6.10@fx34.iad>
Injection-Info: reader02.eternal-september.org; posting-host="c2550ebb267d4b8c45122550ee0bdd31";
logging-data="3062"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ZRRJ+uBH7mN3PHcbJKvcx"
Cancel-Lock: sha1:nWb6UKhK6tGKUJ3OwDr4THofABg=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 12 May 2022 07:14 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Michael S wrote:
>> If I am not mistaken, x86 architecture specifies that load should be always executed.
>> Or, may be, that's just an implementation thing, but since it's documented in the manuals
>> future architects will be always afraid to change it.
>
>Hmmm... looks like it could be either way.
>Intel Vol-2 manual CMOV description says:
>
>"If the condition is not satisfied, a move is not performed and execution
>continues with the instruction following the CMOVcc instruction."
>
>which seems an explicit statement of not accessing memory, but later the
>pseudo-code description seems to imply the access is always performed:
>
> Operation
> temp ← SRC
> IF condition TRUE
> THEN
> DEST ← temp;
> FI;
> ELSE
> IF (OperandSize = 32 and IA-32e mode active)
> THEN
> DEST[63:32] ← 0;
> FI;
> FI;
>
>and there is no other descriptions of CMOV's operation in other manuals.
>It also shows that on x64 if operand size==32 it updates the dest
>even if the condition is false, which is weird.
>
>The AMD manual says "may report exceptions" which could also be either way:
>
>"For the memory-based forms of CMOVcc, memory-related exceptions
>may be reported even if the condition is false."
>
>It looks on x64 one shouldn't implement
> if (ptr) x = *ptr;
>
>as a TEST, CMOV as it might still address fault.

I just tried it on a Skylake, and "cmovne (%rsi), %eax" with %rsi=0
segfaults whether ZF is set or not.
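
A minimal sketch of that kind of test (assuming GCC or Clang extended inline asm on x86-64; not necessarily the exact code used here):

/* With cond == 0, TEST sets ZF and CMOVNE should not move, yet the load
   from the null pointer still faults on the hardware described above. */
#include <stdio.h>

int main(void)
{
    long *p = 0;          /* null source address */
    long dst = 42;
    long cond = 0;        /* condition false: cmovne will not move */

    __asm__ volatile("test %[c], %[c]\n\t"
                     "cmovne (%[src]), %[d]"
                     : [d] "+r"(dst)
                     : [c] "r"(cond), [src] "r"(p)
                     : "cc", "memory");

    printf("dst = %ld\n", dst);   /* not reached if the load faults */
    return 0;
}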

Suppressing the exception if the condition is false would have been
useful for the idiom above. AFAIK predicated instructions tend to
suppress the exceptions.

Concerning the implementation, it makes a lot of sense to perform the
load independent of the condition in an OoO engine: You want to insert
microinstructions into the OoO engine only based on the instruction
stream.

Even in an in-order implementation, you don't want to wait for the
condition to be ready (not necessarily at the start of the execution
stages) before starting execution.

Independent of the load issue, it would be ideal in an OoO engine if
CMOV would only wait for the condition and the selected input (rather
than both). How to achieve that? Maybe let it generate two
microinstructions, one for condition=true, and one for
condition=false; only one of them produces a result (this requires a
change in how the engine keeps track of readiness), but it is not
dependent on the other data input. In this way the CMOV would not
suffer from a dependency on both data inputs. However, my guess is
that CMOV is not common enough to merit the effort and area necessary
for this solution (and/or the complementary solution of using a
predictability predictor).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<d09fK.2957$L_b6.1412@fx33.iad>

https://www.novabbs.com/devel/article-flat.php?id=25217&group=comp.arch#25217
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx33.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at> <0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com> <hlPeK.4263$pqKf.2583@fx12.iad> <e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com> <HDTeK.54$56e6.10@fx34.iad> <t5i6ij$1tam$1@gioia.aioe.org>
In-Reply-To: <t5i6ij$1tam$1@gioia.aioe.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 97
Message-ID: <d09fK.2957$L_b6.1412@fx33.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 12 May 2022 14:41:13 UTC
Date: Thu, 12 May 2022 10:41:02 -0400
X-Received-Bytes: 4592
 by: EricP - Thu, 12 May 2022 14:41 UTC

Terje Mathisen wrote:
> EricP wrote:
>> Michael S wrote:
>>> On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
>>>> So Alpha must always do the expensive loads and stores and can skip
>>>> the cheap reg<-reg move, which greatly limits its performance
>>>> improvements. x86 can skip some memory loads which allows slightly
>>>> more performance.
>>>
>>> I don't think so.
>>> If I am not mistaken, x86 architecture specifies that load should be
>>> always executed.
>>> Or, may be, that's just an implementation thing, but since it's
>>> documented in the manuals
>>> future architects will be always afraid to change it.
>>
>> Hmmm... looks like it could be either way.
>> Intel Vol-2 manual CMOV description says:
>>
>> "If the condition is not satisfied, a move is not performed and execution
>> continues with the instruction following the CMOVcc instruction."
>>
>> which seems an explicit statement of not accessing memory, but later the
>> pseudo-code description seems to imply the access is always performed:
>>
>> Operation
>> temp ← SRC
>> IF condition TRUE
>> THEN
>> DEST ← temp;
>> FI;
>> ELSE
>> IF (OperandSize = 32 and IA-32e mode active)
>> THEN
>> DEST[63:32] ← 0;
>> FI;
>> FI;
>>
>> and there is no other descriptions of CMOV's operation in other manuals.
>> It also shows that on x64 if operand size==32 it updates the dest
>> even if the condition is false, which is weird.
>>
>> The AMD manual says "may report exceptions" which could also be either
>> way:
>>
>> "For the memory-based forms of CMOVcc, memory-related exceptions
>> may be reported even if the condition is false."
>>
>> It looks on x64 one shouldn't implement
>> if (ptr) x = *ptr;
>>
>> as a TEST, CMOV as it might still address fault.
>
> CMOV is a MOV, so it always accesses the source, but then it might
> discard the result?

It would appear so.

> The key here is probably that all load-op instructions on x86 perform
> the load part in an earlier phase than the operation itself, so it makes
> perfect sense both that it always accesses both the source and
> destination and always writes to the destination (at least in the
> "target-is-buzy" sense).
>
> Terje
>

I understand NOW that they look at it that way.
The way I saw it is that the whole point is to skip
the expensive memory access part if the condition is false.

I have LDC & STC Load & Store Conditional instructions in my ISA
(nothing to do with atomic access) and they work by skipping the
memory access so I assumed that x86 CMOV would be designed the same way.

I can see why x86 did it that way - it's a lot easier for decode to
emit 2 uOps, one to always do a normal load to a temp register and
one using the existing CMOV uOp to conditionally move a register.

The alternative, which is what I chose, is to pass the condition
to the LSQ and have it decide for itself whether to perform the op.
If the condition is false then it does not translate the address,
which could trigger a table walk, or perform any memory accesses.
For LDC Load Conditional it also passes the original value of the dest
register to LSQ so LDC can copy it to new dest if the condition is false.

The Intel manual description doesn't even mention the
possibility of throwing exceptions when the condition is false.
At least AMD mentions that it MAY happen.
That is not the only way to do this.
It could also have been designed to always perform the load but
block the memory exceptions if the condition is false.
But apparently not.

Re: Ill-advised use of CMOVE

<bB9fK.7576$pqKf.5267@fx12.iad>

https://www.novabbs.com/devel/article-flat.php?id=25219&group=comp.arch#25219
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx12.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <t5i7pe$6qm$1@gioia.aioe.org>
In-Reply-To: <t5i7pe$6qm$1@gioia.aioe.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 90
Message-ID: <bB9fK.7576$pqKf.5267@fx12.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 12 May 2022 15:20:39 UTC
Date: Thu, 12 May 2022 11:19:56 -0400
X-Received-Bytes: 3582
 by: EricP - Thu, 12 May 2022 15:19 UTC

Terje Mathisen wrote:
> Stephen Fuld wrote:
>>
>> I am not questioning the truth of that, but I am trying to figure out
>> why. Since the CMOV is a pretty big win when the branch is
>> mispredicted, it must be a loss when the prediction would have been
>> correct.
>>
>> Is this because
>>
>> The CMOV itself is too slow?
>
> Mainly this! 2 cycles of forced latency, on top of typically some extra
> work to setup the two alternatives, was just too long.
>>
>> More code needs to be generated when using the CMOV so more
>> instructions are executed?
>
> Yeah, see above.
>>
>> The CMOV messes up speculative execution, so the instructions after it
>> cannot be executed where they would be in the branch case?
>
> I don't think this is the case, but mainly that branch predictors are
> just too good:
>
> cmp eax,LIMIT
> jb ok
> mov eax,LIMIT
> ok:
>
> The code above typically runs in zero cycles, i.e. the branch is
> correctly predicted to be taken so we don't even have the latency for
> EAX to become valid so that the CMP can be done.
>
> When we miss it might take up to 20 cycles to fix it, but with a 99%
> branch predictor (typical for such a range boundary saturation), that
> corresponds to 0.2 cycles per iteration. Even at 95% correct it is still
> on average just one cycle.
>
> CMOV otoh would be:
>
> mov ebx,LIMIT ; Can typically be hoisted above
>
> cmp eax,ebx
> cmovae eax,ebx
>
> which has a fixed 3-cycle latency: One cycle to do the CMP followed by
> two cycles for the CMOV. Even with a single-cycle CMOV it would still be
> 2 cycles.
>
> Terje
>

Right, and what I understood to be the take-away from Alpha's CMOV,
and it looks like from x86 too, is that a reg<-reg CMOV by itself
is not sufficiently useful.

There are other problems, like

if (cond)
x = x + 1;

has conditional memory update. To do this with CMOV requires
- unconditional load [x]
- unconditional load immediate of 1 to reg
- unconditional zero reg
- cmov zero reg to 1 reg
- unconditional add reg-reg
- unconditional store to [x]

so no saving mostly because memory must be updated.
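
In C terms, the sequence above corresponds to something like this branchless sketch (hypothetical names; whether the 0/1 select becomes a CMOV or a SETcc is up to the compiler):

/* Branchless form of "if (cond) x = x + 1;": the load and store of x are
   unconditional, and only the register-level 0/1 select is conditional. */
#include <stdint.h>

void cond_increment(int64_t *x, int cond)
{
    int64_t old = *x;             /* unconditional load of [x]               */
    int64_t inc = cond ? 1 : 0;   /* compiler may use CMOV/SETcc for this    */
    *x = old + inc;               /* unconditional add and store back to [x] */
}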

In my hypothetical ISA I had

MVC Move Conditional reg<-reg
LIC Load Immediate Conditional reg<-imm
LDC Load Conditional reg<-mem
STC Store Conditional mem<-reg

LDC and STC have nothing to do with atomic access.
LDC and STC do not fault or access memory if condition is false.

What I don't know is if those instructions are sufficient to be useful.
Or is this predication subset just more trouble than they are worth?

Does that mean that full predication with all its bells and whistles,
and all its complexity, is an all-or-nothing feature set?

Re: Ill-advised use of CMOVE

<3e17aa42-6be4-41a4-9338-0b31e8b1b5e4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25220&group=comp.arch#25220
X-Received: by 2002:a05:620a:a44:b0:6a0:aa6:801b with SMTP id j4-20020a05620a0a4400b006a00aa6801bmr435821qka.152.1652370633618;
Thu, 12 May 2022 08:50:33 -0700 (PDT)
X-Received: by 2002:a05:6870:969e:b0:ed:9e77:8eba with SMTP id
o30-20020a056870969e00b000ed9e778ebamr6149885oaq.269.1652370633123; Thu, 12
May 2022 08:50:33 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 12 May 2022 08:50:32 -0700 (PDT)
In-Reply-To: <d09fK.2957$L_b6.1412@fx33.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:8164:3f24:ecd9:a4a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:8164:3f24:ecd9:a4a
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at>
<0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com> <hlPeK.4263$pqKf.2583@fx12.iad>
<e209a30a-6778-423a-a3b7-ee8200b36c77n@googlegroups.com> <HDTeK.54$56e6.10@fx34.iad>
<t5i6ij$1tam$1@gioia.aioe.org> <d09fK.2957$L_b6.1412@fx33.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3e17aa42-6be4-41a4-9338-0b31e8b1b5e4n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 12 May 2022 15:50:33 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Thu, 12 May 2022 15:50 UTC

On Thursday, May 12, 2022 at 9:41:17 AM UTC-5, EricP wrote:
> Terje Mathisen wrote:
> > EricP wrote:
> >> Michael S wrote:
> >>> On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
> >>>> So Alpha must always do the expensive loads and stores and can skip
> >>>> the cheap reg<-reg move, which greatly limits its performance
> >>>> improvements. x86 can skip some memory loads which allows slightly
> >>>> more performance.
> >>>
> >>> I don't think so.
> >>> If I am not mistaken, x86 architecture specifies that load should be
> >>> always executed.
> >>> Or, may be, that's just an implementation thing, but since it's
> >>> documented in the manuals
> >>> future architects will be always afraid to change it.
> >>
> >> Hmmm... looks like it could be either way.
> >> Intel Vol-2 manual CMOV description says:
> >>
> >> "If the condition is not satisfied, a move is not performed and execution
> >> continues with the instruction following the CMOVcc instruction."
> >>
> >> which seems an explicit statement of not accessing memory, but later the
> >> pseudo-code description seems to imply the access is always performed:
> >>
> >> Operation
> >> temp ← SRC
> >> IF condition TRUE
> >> THEN
> >> DEST ← temp;
> >> FI;
> >> ELSE
> >> IF (OperandSize = 32 and IA-32e mode active)
> >> THEN
> >> DEST[63:32] ← 0;
> >> FI;
> >> FI;
> >>
> >> and there is no other descriptions of CMOV's operation in other manuals.
> >> It also shows that on x64 if operand size==32 it updates the dest
> >> even if the condition is false, which is weird.
> >>
> >> The AMD manual says "may report exceptions" which could also be either
> >> way:
> >>
> >> "For the memory-based forms of CMOVcc, memory-related exceptions
> >> may be reported even if the condition is false."
> >>
> >> It looks like on x64 one shouldn't implement
> >> if (ptr) x = *ptr;
> >>
> >> as a TEST + CMOV, as it might still take an address fault.
> >
> > CMOV is a MOV, so it always accesses the source, but then it might
> > discard the result?
> It would appear so.
> > The key here is probably that all load-op instructions on x86 perform
> > the load part in an earlier phase than the operation itself, so it makes
> > perfect sense both that it always accesses both the source and
> > destination and always writes to the destination (at least in the
> > "target-is-buzy" sense).
> >
> > Terje
> >
> I understand NOW that they look at it that way.
> The way I saw it is that the whole point is to skip
> the expensive memory access part if condition is false.
>
> I have LDC & STC Load & Store Conditional instructions in my ISA
> (nothing to do with atomic access) and they work by skipping the
> memory access so I assumed that x86 CMOV would be designed
> the same way.
<
With Intel being the parent company !!! That is a BIG assumption.
>
> I can see why x86 did it that way - it's a lot easier for decode to
> emit 2 uOps, one to always do a normal load to a temp register and
> one using the existing CMOV uOp to conditionally move a register.
>

>
> The alternative, which is what I chose, is to pass the condition
> to the LSQ and have it decide for itself whether to perform the op.
<
This is how predication is defined on My 66000--result cancellation.
<
> If the condition is false then it does not translate the address,
> which could trigger a table walk, or perform any memory accesses.
> For LDC Load Conditional it also passes the original value of the dest
> register to LSQ so LDC can copy it to new dest if the condition is false.
>
> The Intel manual description doesn't even mention the
> possibility of throwing exceptions when condition is false.
> At least AMD mentions that it MAY happen.
> That is not the only way to do this.
> It also could have been designed to always perform the load but
> block the memory exceptions if the condition is false.
> But apparently not.
<
You are expecting "sane" behavior from a company like Intel ?!?!?!

Re: Ill-advised use of CMOVE

<t5jsk8$nmv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25223&group=comp.arch#25223

 by: BGB - Thu, 12 May 2022 21:00 UTC

On 5/12/2022 10:50 AM, MitchAlsup wrote:
> On Thursday, May 12, 2022 at 9:41:17 AM UTC-5, EricP wrote:
>> Terje Mathisen wrote:
>>> EricP wrote:
>>>> Michael S wrote:
>>>>> On Wednesday, May 11, 2022 at 5:01:53 PM UTC+3, EricP wrote:
>>>>>> So Alpha must always do the expensive loads and stores and can skip
>>>>>> the cheap reg<-reg move, which greatly limits its performance
>>>>>> improvements. x86 can skip some memory loads which allows slightly
>>>>>> more performance.
>>>>>
>>>>> I don't think so.
>>>>> If I am not mistaken, x86 architecture specifies that load should be
>>>>> always executed.
>>>>> Or, may be, that's just an implementation thing, but since it's
>>>>> documented in the manuals
>>>>> future architects will be always afraid to change it.
>>>>
>>>> Hmmm... looks like it could be either way.
>>>> Intel Vol-2 manual CMOV description says:
>>>>
>>>> "If the condition is not satisfied, a move is not performed and execution
>>>> continues with the instruction following the CMOVcc instruction."
>>>>
>>>> which seems an explicit statement of not accessing memory, but later the
>>>> pseudo-code description seems to imply the access is always performed:
>>>>
>>>> Operation
>>>> temp ← SRC
>>>> IF condition TRUE
>>>> THEN
>>>> DEST ← temp;
>>>> FI;
>>>> ELSE
>>>> IF (OperandSize = 32 and IA-32e mode active)
>>>> THEN
>>>> DEST[63:32] ← 0;
>>>> FI;
>>>> FI;
>>>>
>>>> and there are no other descriptions of CMOV's operation in the other manuals.
>>>> It also shows that on x64 if operand size==32 it updates the dest
>>>> even if the condition is false, which is weird.
>>>>
>>>> The AMD manual says "may report exceptions" which could also be either
>>>> way:
>>>>
>>>> "For the memory-based forms of CMOVcc, memory-related exceptions
>>>> may be reported even if the condition is false."
>>>>
>>>> It looks like on x64 one shouldn't implement
>>>> if (ptr) x = *ptr;
>>>>
>>>> as a TEST + CMOV, as it might still take an address fault.
>>>
>>> CMOV is a MOV, so it always accesses the source, but then it might
>>> discard the result?
>> It would appear so.
>>> The key here is probably that all load-op instructions on x86 perform
>>> the load part in an earlier phase than the operation itself, so it makes
>>> perfect sense both that it always accesses both the source and
>>> destination and always writes to the destination (at least in the
>>> "target-is-buzy" sense).
>>>
>>> Terje
>>>
>> I understand NOW that they look at it that way.
>> The way I saw it is that the whole point is to skip
>> the expensive memory access part if condition is false.
>>
>> I have LDC & STC Load & Store Conditional instructions in my ISA
>> (nothing to do with atomic access) and they work by skipping the
>> memory access so I assumed that x86 CMOV would be designed
>> the same way.
> <
> With Intel being the parent company !!! That is a BIG assumption.
>>
>> I can see why x86 did it that way - it's a lot easier for decode to
>> emit 2 uOps, one to always do a normal load to a temp register and
>> one using the existing CMOV uOp to conditionally move a register.
>>
>
>>
>> The alternative, which is what I chose, is to pass the condition
>> to the LSQ and have it decide for itself whether to perform the op.
> <
> This is how predication is defined on My 66000--result cancellation.

In my case, once an instruction reaches EX1, if the predicate is false
its opcode turns into a NOP (except for branches, where it turns into a
"NoBranch" instruction; the BRA/BRANB logic may then initiate a branch
depending on whether or not it agreed with the branch-predictor, with
BRANB initiating a branch to the following instruction in the case where
the branch-predictor had predicted the branch as taken).
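
A rough sketch in C of that squash-at-EX1 behaviour (illustrative only,
not the actual implementation; all type and function names here are
made up):

/* uop as seen at the start of EX1 */
typedef struct {
    int opcode;           /* decoded opcode                         */
    int is_branch;        /* set for BRA/BRANB-style ops            */
    int pred_true;        /* predicate value as evaluated by EX1    */
    int predicted_taken;  /* what the branch predictor had assumed  */
} uop_t;

enum { OP_NOP = 1000, OP_NOBRANCH = 1001 };

/* Returns nonzero if fetch must be redirected to the fall-through path. */
static int ex1_apply_predicate(uop_t *u)
{
    if (u->pred_true)
        return 0;                 /* predicate true: execute normally   */

    if (!u->is_branch) {
        u->opcode = OP_NOP;       /* squash: no result, no side effects */
        return 0;
    }

    /* Predicated-false branch turns into "NoBranch"; if the predictor
       had guessed taken, branch to the following instruction to repair. */
    u->opcode = OP_NOBRANCH;
    return u->predicted_taken;
}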

I originally also had separate RTS/RTSU instructions, but have partially
re-merged them (RTS will behave like RTSU in the case of no LR updates
happening in the pipeline).

In the case of RTSU and predicted RTS, the branch predictor will predict
an unconditional branch to whatever address is currently held in LR.

A similar special case also applies to an unconditional branch to DHR
(R1), which may be used as a secondary link register (mostly as a result
of the prolog/epilog compression mechanism).

The branch-predictor will, however, need to refrain from predicting a
branch to LR or R1 if these registers encode an Inter-ISA jump or
similar (an actual branch being needed here so that the IF/ID stages
will decode instructions using the correct ISA).

> <
>> If the condition is false then it does not translate the address,
>> which could trigger a table walk, or perform any memory accesses.
>> For LDC Load Conditional it also passes the original value of the dest
>> register to LSQ so LDC can copy it to new dest if the condition is false.
>>
>> The Intel manual description doesn't even mention the
>> possibility of throwing exceptions when condition is false.
>> At least AMD mentions that it MAY happen.
>> That is not the only way to do this.
>> It also could have been designed to always perform the load but
>> block the memory exceptions if the condition is false.
>> But apparently not.
> <
> You are expecting "sane" behavior from a company like Intel ?!?!?!

I think a lot depends on implementation specifics here.

Re: Ill-advised use of CMOVE

<t5k8uf$5fl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25224&group=comp.arch#25224

 by: Ivan Godard - Fri, 13 May 2022 00:32 UTC

On 5/12/2022 8:19 AM, EricP wrote:
> Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>>
>>> I am not questioning the truth of that, but I am trying to figure out
>>> why.  Since the CMOV is a pretty big win when the branch is
>>> mispredicted, it must be a loss when the prediction would have been
>>> correct.
>>>
>>> Is this because
>>>
>>> The CMOV itself is too slow?
>>
>> Mainly this! 2 cycles of forced latency, on top of typically some
>> extra work to setup the two alternatives, was just too long.
>>>
>>> More code needs to be generated when using the CMOV so more
>>> instructions are executed?
>>
>> Yeah, see above.
>>>
>>> The CMOV messes up speculative execution, so the instructions after
>>> it cannot be executed where they would be in the branch case?
>>
>> I don't think this is the case, but mainly that branch predictors are
>> just too good:
>>
>>   cmp eax,LIMIT
>>    jb ok
>>   mov eax,LIMIT
>> ok:
>>
>> The code above typically runs in zero cycles, i.e. the branch is
>> correctly predicted to be taken so we don't even have the latency for
>> EAX to become valid so that the CMP can be done.
>>
>> When we miss it might take up to 20 cycles to fix it, but with a 99%
>> branch predictor (typical for such a range boundary saturation), that
>> corresponds to 0.2 cycles per iteration. Even at 95% correct it is
>> still on average just one cycle.
>>
>> CMOV otoh would be:
>>
>>   mov ebx,LIMIT    ; Can typically be hoisted above
>>
>>   cmp eax,ebx
>>   cmovae eax,ebx
>>
>> which has a fixed 3-cycle latency: One cycle to do the CMP followed by
>> two cycles for the CMOV. Even with a single-cycle CMOV it would still
>> be 2 cycles.
>>
>> Terje
>>
>
> Right and for what I understood to be the take-away from
> Alpha's CMOV and it looks like x86 too is that
> a reg<-reg CMOV by itself is not sufficiently useful.
>
> There are other problems, like
>
>   if (cond)
>     x = x + 1;
>
> has conditional memory update. To do this with CMOV requires
> - unconditional load [x]
> - unconditional load immediate of 1 to reg
> - unconditional zero reg
> - cmov zero reg to 1 reg
> - unconditional add reg-reg
> - unconditional store to [x]
>
> so no saving mostly because memory must be updated.
>
> In my hypothetical ISA I had
>
>   MVC  Move Conditional reg<-reg
>   LIC  Load Immediate Conditional reg<-imm
>   LDC  Load Conditional reg<-mem
>   STC  Store Conditional mem<-reg
>
> LDC and STC have nothing to do with atomic access.
> LDC and STC do not fault or access memory if condition is false.
>
> What I don't know is if those instructions are sufficient to be useful.
> Or is this predication subset just more trouble than it is worth?

It is worth it IMO. Mill has predicated forms for both load and store.

> Does that mean that full predication, with all its bells and whistles
> and all its complexity, is an all-or-nothing feature set?

It is if you want to do aggressive speculation. How you fit it into your
instruction set can vary a lot though: some (ARM) ISAs predicate
everything; Mill has individual predicated forms for instructions that
permanently modify machine state, but not for instructions (like ADD)
that don't, and lets each be predicated on a different condition; and
Mitch lets you predicate anything but only on the same condition
(although he has talked about supporting overlapping predicate splatters).

There does seem to be merit in supporting both predication and dataflow
join (CMOV or Mill pick). Predication is for exception and side effect
control in speculation, while dataflow join is to take advantage of
width in flattening dataflows.
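
As a rough C sketch of that distinction (an illustration under the above
reading, not Mill's or My 66000's actual semantics):

/* Dataflow join: both sides may be computed speculatively, then selected
   (the CMOV / Mill "pick" style).  No machine state is changed early.   */
int join_select(int cond, int a, int b)
{
    int t = a + 1;
    int e = b * 2;
    return cond ? t : e;
}

/* Predication guards the state-changing operation itself: the store
   permanently modifies machine state, so it needs a predicated form
   (or a branch) rather than a join.                                     */
void guarded_store(int cond, int *p, int v)
{
    if (cond)
        *p = v;
}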

Re: Ill-advised use of CMOVE

<8022ca05-f9c4-4146-897d-13a70ebd7326n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25227&group=comp.arch#25227

 by: MitchAlsup - Fri, 13 May 2022 02:34 UTC

On Thursday, May 12, 2022 at 7:32:18 PM UTC-5, Ivan Godard wrote:
> On 5/12/2022 8:19 AM, EricP wrote:
> > Terje Mathisen wrote:
> >> Stephen Fuld wrote:
> >>>
> >>> I am not questioning the truth of that, but I am trying to figure out
> >>> why. Since the CMOV is a pretty big win when the branch is
> >>> mispredicted, it must be a loss when the prediction would have been
> >>> correct.
> >>>
> >>> Is this because
> >>>
> >>> The CMOV itself is too slow?
> >>
> >> Mainly this! 2 cycles of forced latency, on top of typically some
> >> extra work to setup the two alternatives, was just too long.
> >>>
> >>> More code needs to be generated when using the CMOV so more
> >>> instructions are executed?
> >>
> >> Yeah, see above.
> >>>
> >>> The CMOV messes up speculative execution, so the instructions after
> >>> it cannot be executed where they would be in the branch case?
> >>
> >> I don't think this is the case, but mainly that branch predictors are
> >> just too good:
> >>
> >> cmp eax,LIMIT
> >> jb ok
> >> mov eax,LIMIT
> >> ok:
> >>
> >> The code above typically runs in zero cycles, i.e. the branch is
> >> correctly predicted to be taken so we don't even have the latency for
> >> EAX to become valid so that the CMP can be done.
> >>
> >> When we miss it might take up to 20 cycles to fix it, but with a 99%
> >> branch predictor (typical for such a range boundary saturation), that
> >> corresponds to 0.2 cycles per iteration. Even at 95% correct it is
> >> still on average just one cycle.
> >>
> >> CMOV otoh would be:
> >>
> >> mov ebx,LIMIT ; Can typically be hoisted above
> >>
> >> cmp eax,ebx
> >> cmovae eax,ebx
> >>
> >> which has a fixed 3-cycle latency: One cycle to do the CMP followed by
> >> two cycles for the CMOV. Even with a single-cycle CMOV it would still
> >> be 2 cycles.
> >>
> >> Terje
> >>
> >
> > Right and for what I understood to be the take-away from
> > Alpha's CMOV and it looks like x86 too is that
> > a reg<-reg CMOV by itself is not sufficiently useful.
> >
> > There are other problems, like
> >
> > if (cond)
> > x = x + 1;
> >
> > has conditional memory update. To do this with CMOV requires
> > - unconditional load [x]
> > - unconditional load immediate of 1 to reg
> > - unconditional zero reg
> > - cmov zero reg to 1 reg
> > - unconditional add reg-reg
> > - unconditional store to [x]
> >
> > so no saving mostly because memory must be updated.
> >
> > In my hypothetical ISA I had
> >
> > MVC Move Conditional reg<-reg
> > LIC Load Immediate Conditional reg<-imm
> > LDC Load Conditional reg<-mem
> > STC Store Conditional mem<-reg
> >
> > LDC and STC have nothing to do with atomic access.
> > LDC and STC do not fault or access memory if condition is false.
> >
> > What I don't know is if those instructions are sufficient to be useful.
> > Or is this predication subset just more trouble than it is worth?
> It is worth it IMO. Mill has predicated forms for both load and store.
> > Does that mean that full predication, with all its bells and whistles
> > and all its complexity, is an all-or-nothing feature set?
> It is if you want to do aggressive speculation. How you fit it into your
> instruction set can vary a lot though: some (ARM) ISAs predicate
> everything; Mill has individual predicated forms for instructions that
> permanently modify machine state, but not for instructions (like ADD)
> that don't, and lets each be predicated on a different condition; and
> Mitch lets you predicate anything but only on the same condition
> (although he has talked about supporting overlapping predicate splatters).
<
While a set (handful) of instructions can have a predicate cast over them,
one can have several of these sets with different predicates for each set
(some sets partially overlapping). && and || are easily accommodated.
<
There are a couple of things you cannot predicate:
a) branches--if a branch is taken, you have left the predicate shadow
{branches, calls, returns, checks for memory interference}.
BUT technically a branch can exist under a predicate, and if it should not
be executed, it cannot transfer control. In general, anything you can branch
over, you can cast a predicate shadow over--it is just that the shadow has
limited length.
<
There are also a couple of places where you cannot branch:
1) Vector loops must use predication--same rule: if a branch is taken, you
have left the vectorized loop. Vectorization only handles the innermost loop.
<
There are a couple of places where an interrupt/exception causes strangeness:
*) an interrupt/exception in an ATOMIC event sends the event to the failure
point (and nullifies any pending participating store) prior to transfer of
control. When you get back, you need to take a fresh look at the concurrent
data structure.
<
Luckily the My 66000 compiler does not produce those kinds of sequences.
>
> There does seem to be merit in supporting both predication and dataflow
> join (CMOV or Mill pick). Predication is for exception and side effect
> control in speculation, while dataflow join is to take advantage of
> width in flattening dataflows.
<
My 66000 CMOV is of the join type.

Re: Ill-advised use of CMOVE

<t5lbqs$1udl$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=25234&group=comp.arch#25234

 by: Terje Mathisen - Fri, 13 May 2022 10:27 UTC

EricP wrote:
> Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>>
>>> I am not questioning the truth of that, but I am trying to figure out
>>> why.  Since the CMOV is a pretty big win when the branch is
>>> mispredicted, it must be a loss when the prediction would have been
>>> correct.
>>>
>>> Is this because
>>>
>>> The CMOV itself is too slow?
>>
>> Mainly this! 2 cycles of forced latency, on top of typically some
>> extra work to setup the two alternatives, was just too long.
>>>
>>> More code needs to be generated when using the CMOV so more
>>> instructions are executed?
>>
>> Yeah, see above.
>>>
>>> The CMOV messes up speculative execution, so the instructions after
>>> it cannot be executed where they would be in the branch case?
>>
>> I don't think this is the case, but mainly that branch predictors are
>> just too good:
>>
>>   cmp eax,LIMIT
>>    jb ok
>>   mov eax,LIMIT
>> ok:
>>
>> The code above typically runs in zero cycles, i.e. the branch is
>> correctly predicted to be taken so we don't even have the latency for
>> EAX to become valid so that the CMP can be done.
>>
>> When we miss it might take up to 20 cycles to fix it, but with a 99%
>> branch predictor (typical for such a range boundary saturation), that
>> corresponds to 0.2 cycles per iteration. Even at 95% correct it is
>> still on average just one cycle.
>>
>> CMOV otoh would be:
>>
>>   mov ebx,LIMIT    ; Can typically be hoisted above
>>
>>   cmp eax,ebx
>>   cmovae eax,ebx
>>
>> which has a fixed 3-cycle latency: One cycle to do the CMP followed by
>> two cycles for the CMOV. Even with a single-cycle CMOV it would still
>> be 2 cycles.
>>
>> Terje
>>
>
> Right and for what I understood to be the take-away from
> Alpha's CMOV and it looks like x86 too is that
> a reg<-reg CMOV by itself is not sufficiently useful.
>
> There are other problems, like
>
>   if (cond)
>     x = x + 1;
>
> has conditional memory update. To do this with CMOV requires
> - unconditional load [x]
> - unconditional load immediate of 1 to reg
> - unconditional zero reg
> - cmov zero reg to 1 reg
> - unconditional add reg-reg
> - unconditional store to [x]

When doing a predicated increment my preferred (asm) solution is to
place the condition in the Carry flag, then just do

ADC x,0

This is of course far faster than any CMOV version.

If the increment is more or less random then we used to do

sbb edx,edx ; Propagates Carry across the register
and edx,INCREMENT_VALUE
add x,edx

which is 3 cycles.
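
The same mask idea in portable C, as a sketch (whether a compiler turns
this into sbb/and/add, a cmov, or a branch is up to it; cond_add is a
made-up name):

#include <stdint.h>

static inline uint64_t cond_add(uint64_t x, int cond, uint64_t inc)
{
    /* mask is all ones when cond is true, zero otherwise */
    uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0);
    return x + (mask & inc);
}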

Using CMOV I would instead try to hoist as much as possible higher up:

;; x is in eax

lea ebx,[eax+INCREMENT_VALUE]

; Critical latency path
CMOVcond eax,ebx

so taking just the 2 CMOV cycles, or one in the case of a fast CMOV.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<t5m1kt$n7t$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25238&group=comp.arch#25238

 by: Ivan Godard - Fri, 13 May 2022 16:39 UTC

On 5/13/2022 3:27 AM, Terje Mathisen wrote:
> EricP wrote:
>> Terje Mathisen wrote:
>>> Stephen Fuld wrote:
>>>>
>>>> I am not questioning the truth of that, but I am trying to figure
>>>> out why.  Since the CMOV is a pretty big win when the branch is
>>>> mispredicted, it must be a loss when the prediction would have been
>>>> correct.
>>>>
>>>> Is this because
>>>>
>>>> The CMOV itself is too slow?
>>>
>>> Mainly this! 2 cycles of forced latency, on top of typically some
>>> extra work to setup the two alternatives, was just too long.
>>>>
>>>> More code needs to be generated when using the CMOV so more
>>>> instructions are executed?
>>>
>>> Yeah, see above.
>>>>
>>>> The CMOV messes up speculative execution, so the instructions after
>>>> it cannot be executed where they would be in the branch case?
>>>
>>> I don't think this is the case, but mainly that branch predictors are
>>> just too good:
>>>
>>>   cmp eax,LIMIT
>>>    jb ok
>>>   mov eax,LIMIT
>>> ok:
>>>
>>> The code above typically runs in zero cycles, i.e. the branch is
>>> correctly predicted to be taken so we don't even have the latency for
>>> EAX to become valid so that the CMP can be done.
>>>
>>> When we miss it might take up to 20 cycles to fix it, but with a 99%
>>> branch predictor (typical for such a range boundary saturation), that
>>> corresponds to 0.2 cycles per iteration. Even at 95% correct it is
>>> still on average just one cycle.
>>>
>>> CMOV otoh would be:
>>>
>>>   mov ebx,LIMIT    ; Can typically be hoisted above
>>>
>>>   cmp eax,ebx
>>>   cmovae eax,ebx
>>>
>>> which has a fixed 3-cycle latency: One cycle to do the CMP followed
>>> by two cycles for the CMOV. Even with a single-cycle CMOV it would
>>> still be 2 cycles.
>>>
>>> Terje
>>>
>>
>> Right and for what I understood to be the take-away from
>> Alpha's CMOV and it looks like x86 too is that
>> a reg<-reg CMOV by itself is not sufficiently useful.
>>
>> There are other problems, like
>>
>>    if (cond)
>>      x = x + 1;
>>
>> has conditional memory update. To do this with CMOV requires
>> - unconditional load [x]
>> - unconditional load immediate of 1 to reg
>> - unconditional zero reg
>> - cmov zero reg to 1 reg
>> - unconditional add reg-reg
>> - unconditional store to [x]
>
> When doing a predicated increment my preferred (asm) solution is to
> place the condition in the Carry flag, then just do
>
>  ADC x,0
>
> This is of course far faster than any CMOV version.

Terje, you have a dirty mind!

Re: Ill-advised use of CMOVE

<t5nojt$9q8$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25259&group=comp.arch#25259

 by: Thomas Koenig - Sat, 14 May 2022 08:18 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> As I have related in the past: The Mc 88120 had a branch predictor which
> is not based on taken/not-taken, but upon agree/disagree.

I have to confess I do not understand that.

What would agree / disagree with what?

Re: Ill-advised use of CMOVE

<2022May14.100411@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25260&group=comp.arch#25260

 by: Anton Ertl - Sat, 14 May 2022 08:04 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>MitchAlsup wrote:
>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
>>> In an OoO machine, this can mean several different things.
>>>
>>> * high resource usage. On the 21264, CMOV takes two
>>> microinstructions.
>
>For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
>problem but for full predication this approach would not do
>and it requires smarter uOps.

Yes, I doubt that the extra uOps cause a slowdown in most cases and a
significant slowdown in the rest. And I think that the independence
of the other source is so important for performance that this is a
good approach for predication, too.

>In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
>proceed independently when the condition and its data source are ready.
>Slightly better.

I think this can be a lot better, depending on the data dependencies.
In particular, if you have a loop recurrence involving a CMOV and one
input to the CMOV is loop-invariant (e.g., a constant), then every time
this input is selected the recurrence latency chain is broken. And even
if both inputs are loop variants, if one has a long latency (e.g., it
involves a load) and the other a short latency (e.g., it's just the
result from the last iteration), this is a win.
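
A sketch of such a recurrence in C (illustrative only; LIMIT and the
saturating sum are made up):

#include <stddef.h>

#define LIMIT 1000L

long saturating_sum(const long *x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        long t = s + x[i];
        /* often compiled to a cmov; whenever the loop-invariant LIMIT
           is selected, the latency chain through s is cut */
        s = (t > LIMIT) ? LIMIT : t;
    }
    return s;
}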

For predication, if you have an if-then-else instruction, and a
then-instruction with the same destination register as an
else-instruction, you can pair these up as the two complementary
microinstructions producing the same target, without needing extra
microinstructions. You can have a smart decoder that finds this out
dynamically, or encode this pairing in the if-then-else instruction
(with an exception if the target registers then don't match).

>This led them to propose what they called "predicate slip" whereby
>the data side can proceed as soon as its operands are ready,
>and the predicate state is checked before retire.

AFAIK performing the operation, but conditionally suppressing the
result(s) is the classical in-order implementation technique.

For OoO you don't want to wait for retirement, because other
instructions in the OoO engine depend on the results.

I am a little bit confused by your use of the term "retirement", which
I associate with OoO CPUs.

>This gets messier if one desires to allow predicate slip but
>later cancel pending uOps when the predicate resolves to disabled
>so you don't perform work that you now know you are going to toss.

Sounds too messy for my taste.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<2022May14.110828@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25261&group=comp.arch#25261

 by: Anton Ertl - Sat, 14 May 2022 09:08 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>MitchAlsup <MitchAlsup@aol.com> schrieb:
>
>> As I have related in the past: The Mc 88120 had a branch predictor which
>> is not based on taken/not-taken, but upon agree/disagree.
>
>I have to confess I do not understand that.
>
>What would agree / disagree with what?

Probably with a different predictor. See
<https://ict.iitk.ac.in/wp-content/uploads/CS422-Computer-Architecture-agree_predictor.pdf>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<d12f2fb1-4929-487b-8776-6b5ca5ca3945n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25267&group=comp.arch#25267

 by: MitchAlsup - Sat, 14 May 2022 17:10 UTC

On Saturday, May 14, 2022 at 3:18:08 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > As I have related in the past: The Mc 88120 had a branch predictor which
> > is not based on taken/not-taken, but upon agree/disagree.
> I have to confess I do not understand that.
>
> What would agree / disagree with what?
<
Instructions are installed into the cache of instructions in observed order.
When a branch is encountered (1 per cycle), if the predictor agrees, then
the instructions are played out in the same order as observed. If not,
then control is transferred to the alternate instruction fetch address (also
stored in the cache of instructions).
<
So we have an observed order and we have an alternate order from which
to make a choice.
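
A rough model of that choice in C (names are made up; this is only an
illustration of the idea, not the Mc 88120 hardware):

/* The cache of instructions remembers, per branch, the fetch address
   along the previously observed path and the alternate path.          */
typedef struct {
    unsigned observed_next;   /* next fetch address along observed order */
    unsigned alternate_next;  /* fetch address of the other path         */
} packet_entry_t;

/* The predictor only says whether it agrees with the observed order.  */
unsigned next_fetch(const packet_entry_t *e, int predictor_agrees)
{
    return predictor_agrees ? e->observed_next : e->alternate_next;
}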

Re: Ill-advised use of CMOVE

<49f15662-d5c6-432b-bca3-ec0b1cd820ddn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25268&group=comp.arch#25268

 by: MitchAlsup - Sat, 14 May 2022 17:18 UTC

On Saturday, May 14, 2022 at 3:39:37 AM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >MitchAlsup wrote:
> >> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
> >>> In an OoO machine, this can mean several different things.
> >>>
> >>> * high resource usage. On the 21264, CMOV takes two
> >>> microinstructions.
> >
> >For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
> >problem but for full predication this approach would not do
> >and it requires smarter uOps.
<
> Yes, I doubt that the extra uOps cause a slowdown in most cases and a
> significant slowdown in the rest. And I think that the independence
> of the other source is so important for performance that this is a
> good approach for predication, too.
<
> >In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
> >proceed independently when the condition and its data source are ready.
> >Slightly better.
<
> I think this can be a lot better, depending on the data dependencies.
> In particular, if you have a loop recurrence involving a CMOV and one
> input to the CMOV is loop-invariant (e.g., a constant), every time
> this input is selected breaks the recurrence latency chain. And even
> if both inputs are loop variants, if one has a long latency (e.g., it
> involves a load) and the other a short latency (e.g., it's just the
> result from the last iteration), this is a win.
>
> For predication, if you have an if-then-else instruction, and a
> then-instruction with the same destination register as an
> else-instruction, you can pair these up as the two complementary
> microinstructions producing the same target, without needing extra
> microinstructions.
<
I give the same physical register name to instructions in the then-clause
and in the else clause. One will get delivered, the other one will get suppressed.
<
> You can have a smart decoder that finds this out
> dynamically, or encode this pairing in the if-then-else instruction
> (with an exception if the target registers then don't match).
<
> >This led them to propose what they called "predicate slip" whereby
> >the data side can proceed as soon as its operands are ready,
> >and the predicate state is checked before retire.
<
> AFAIK performing the operation, but conditionally suppressing the
> result(s) is the classical in-order implementation technique.
>
> For OoO you don't want to wait for retirement, because other
> instructions in the OoO engine depend on the results.
<
You wait for the instruction to be consistent (not in any shadow
of an instruction that can throw an exception)
>
> I am a little bit confused by your use of the term "retirement", which
> I associate with OoO CPUs.
<
Retirement is the point resources are recycled.
Consistent is the point where you know the instruction will retire.
<
> >This gets messier if one desires to allow predicate slip but
> >later cancel pending uOps when the predicate resolves to disabled
> >so you don't perform work that you now know you are going to toss.
> Sounds too messy for my taste.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Ill-advised use of CMOVE

<5d2201a8-26a0-4904-8916-3d8b4107f99cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25269&group=comp.arch#25269

 by: MitchAlsup - Sat, 14 May 2022 18:37 UTC

On Saturday, May 14, 2022 at 3:39:37 AM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >MitchAlsup wrote:
> >> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
> >>> In an OoO machine, this can mean several different things.
> >>>
> >>> * high resource usage. On the 21264, CMOV takes two
> >>> microinstructions.
> >
> >For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
> >problem but for full predication this approach would not do
> >and it requires smarter uOps.
> Yes, I doubt that the extra uOps cause a slowdown in most cases and a
> significant slowdown in the rest. And I think that the independence
> of the other source is so important for performance that this is a
> good approach for predication, too.
> >In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
> >proceed independently when the condition and its data source are ready.
> >Slightly better.
> I think this can be a lot better, depending on the data dependencies.
> In particular, if you have a loop recurrence involving a CMOV and one
> input to the CMOV is loop-invariant (e.g., a constant), every time
> this input is selected breaks the recurrence latency chain. And even
> if both inputs are loop variants, if one has a long latency (e.g., it
> involves a load) and the other a short latency (e.g., it's just the
> result from the last iteration), this is a win.
<
You CoUlD make a CMOV predictor or you could make CMOV deliver
whichever operand arrives first, and then backup and rerun if the
prediction is wrong. With the proper infrastructure, this would not take
more cycles than simply waiting.
>
> For predication, if you have an if-then-else instruction, and a
> then-instruction with the same destination register as an
> else-instruction, you can pair these up as the two complementary
> microinstructions producing the same target, without needing extra
> microinstructions. You can have a smart decoder that finds this out
> dynamically, or encode this pairing in the if-then-else instruction
> (with an exception if the target registers then don't match).
> >This led them to propose what they called "predicate slip" whereby
> >the data side can proceed as soon as its operands are ready,
> >and the predicate state is checked before retire.
> AFAIK performing the operation, but conditionally suppressing the
> result(s) is the classical in-order implementation technique.
>
> For OoO you don't want to wait for retirement, because other
> instructions in the OoO engine depend on the results.
>
> I am a little bit confused by your use of the term "retirement", which
> I associate with OoO CPUs.
> >This gets messier if one desires to allow predicate slip but
> >later cancel pending uOps when the predicate resolves to disabled
> >so you don't perform work that you now know you are going to toss.
> Sounds too messy for my taste.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Ill-advised use of CMOVE

<2022May14.203453@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25271&group=comp.arch#25271

 by: Anton Ertl - Sat, 14 May 2022 18:34 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>On Saturday, May 14, 2022 at 3:39:37 AM UTC-5, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>> >In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
>> >proceed independently when the condition and its data source are ready.
>> >Slightly better.
><
>> I think this can be a lot better, depending on the data dependencies.
>> In particular, if you have a loop recurrence involving a CMOV and one
>> input to the CMOV is loop-invariant (e.g., a constant), every time
>> this input is selected breaks the recurrence latency chain. And even
>> if both inputs are loop variants, if one has a long latency (e.g., it
>> involves a load) and the other a short latency (e.g., it's just the
>> result from the last iteration), this is a win.
>>
>> For predication, if you have an if-then-else instruction, and a
>> then-instruction with the same destination register as an
>> else-instruction, you can pair these up as the two complementary
>> microinstructions producing the same target, without needing extra
>> microinstructions.
><
>I give the same physical register name to instructions in the then-clause
>and in the else clause. One will get delivered, the other one will get suppressed.

Yes, that's what I was thinking of (of course only if the
architectural register name is the same).

>> For OoO you don't want to wait for retirement, because other
>> instructions in the OoO engine depend on the results.
><
>You wait for the instruction to be consistent (not in any shadow
>of an instruction that can throw an exception)

No, I don't. I pass the result on as soon as it exists (and the
predicate is satisfied). If an earlier instruction traps, that will
cancel both the instruction at hand and all instructions that received
the result and ran with it.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<IZ8gK.6876$j0D5.5549@fx09.iad>

https://www.novabbs.com/devel/article-flat.php?id=25293&group=comp.arch#25293

 by: EricP - Sun, 15 May 2022 15:18 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> MitchAlsup wrote:
>>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
>>>> In an OoO machine, this can mean several different things.
>>>>
>>>> * high resource usage. On the 21264, CMOV takes two
>>>> microinstructions.
>> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
>> problem but for full predication this approach would not do
>> and it requires smarter uOps.
>
> Yes, I doubt that the extra uOps cause a slowdown in most cases and a
> significant slowdown in the rest. And I think that the independence
> of the other source is so important for performance that this is a
> good approach for predication, too.
>
>> In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
>> proceed independently when the condition and its data source are ready.
>> Slightly better.
>
> I think this can be a lot better, depending on the data dependencies.
> In particular, if you have a loop recurrence involving a CMOV and one
> input to the CMOV is loop-invariant (e.g., a constant), every time
> this input is selected breaks the recurrence latency chain. And even
> if both inputs are loop variants, if one has a long latency (e.g., it
> involves a load) and the other a short latency (e.g., it's just the
> result from the last iteration), this is a win.
>
> For predication, if you have an if-then-else instruction, and a
> then-instruction with the same destination register as an
> else-instruction, you can pair these up as the two complementary
> microinstructions producing the same target, without needing extra
> microinstructions. You can have a smart decoder that finds this out
> dynamically, or encode this pairing in the if-then-else instruction
> (with an exception if the target registers then don't match).

It looks like a uOp fusion. It might help in limited situations.

>> This led them to propose what they called "predicate slip" whereby
>> the data side can proceed as soon as its operands are ready,
>> and the predicate state is checked before retire.
>
> AFAIK performing the operation, but conditionally suppressing the
> result(s) is the classical in-order implementation technique.
>
> For OoO you don't want to wait for retirement, because other
> instructions in the OoO engine depend on the results.
>
> I am a little bit confused by your use of the term "retirement", which
> I associate with OoO CPUs.

This was proposed for OoO predication.
It frees up reservation stations ASAP by executing uOps when their
data operands are ready rather than waiting for the guarding predicate.
The results are held in the ROB and cannot be forwarded until
the predicate resolves.

The status of the predicate is later checked by retire
to ensure the uOp is all done.

For longer-latency ops like DIV it makes sense.
Also for ops like MUL, which are fast but whose FU
is expensive, so there is only one and it becomes a bottleneck.
And FP ops, which take multiple cycles and also bottleneck.

But for the average 1-cycle integer ALU op, where FUs are cheap
and you have multiple units, this is unnecessary.

>> This gets messier if one desires to allow predicate slip but
>> later cancel pending uOps when the predicate resolves to disabled
>> so you don't perform work that you now know you are going to toss.
>
> Sounds too messy for my taste.
>
> - anton

Just making OoO predication work at all looks pretty challenging to me.

Re: Ill-advised use of CMOVE

<JZ8gK.6877$j0D5.5934@fx09.iad>

https://www.novabbs.com/devel/article-flat.php?id=25294&group=comp.arch#25294

 by: EricP - Sun, 15 May 2022 15:27 UTC

MitchAlsup wrote:
> On Saturday, May 14, 2022 at 3:39:37 AM UTC-5, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> MitchAlsup wrote:
>>>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
>>>>> In an OoO machine, this can mean several different things.
>>>>>
>>>>> * high resource usage. On the 21264, CMOV takes two
>>>>> microinstructions.
>>> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
>>> problem but for full predication this approach would not do
>>> and it requires smarter uOps.
> <
>> Yes, I doubt that the extra uOps cause a slowdown in most cases and a
>> significant slowdown in the rest. And I think that the independence
>> of the other source is so important for performance that this is a
>> good approach for predication, too.
> <
>>> In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
>>> proceed independently when the condition and its data source are ready.
>>> Slightly better.
> <
>> I think this can be a lot better, depending on the data dependencies.
>> In particular, if you have a loop recurrence involving a CMOV and one
>> input to the CMOV is loop-invariant (e.g., a constant), every time
>> this input is selected breaks the recurrence latency chain. And even
>> if both inputs are loop variants, if one has a long latency (e.g., it
>> involves a load) and the other a short latency (e.g., it's just the
>> result from the last iteration), this is a win.
>>
>> For predication, if you have an if-then-else instruction, and a
>> then-instruction with the same destination register as an
>> else-instruction, you can pair these up as the two complementary
>> microinstructions producing the same target, without needing extra
>> microinstructions.
> <
> I give the same physical register name to instructions in the then-clause
> and in the else clause. One will get delivered, the other one will get suppressed.

So this handles things like:

x = cond? y + z : a * b;

This merges a PHI and two alternate uOps to eliminate the
temp register allocations and copies. But it requires all the
source operands y,z,a,b to be loaded, so the savings are minimal.

I would want to skip loading either y,z or a,b, as that is
where all the big savings are to be had.
That would be much harder for a decoder to detect and optimize,
as the sequence is too long.

PRED { 111000 }
LD r1,[y]
LD r2,[z]
ADD r3 = r1 + r2
LD r1,[a]
LD r2,[b]
MUL r3 = r1 * r2
ST [x],r3
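
For reference, a branchy rendering of the same statement (my own
sketch, with the operands written as explicit loads to match the LD/ST
sequence above): each arm performs only its own two loads, which is
where the big savings being discussed come from. Whether the masked-off
loads in the predicated form still cost anything is exactly the
implementation question here.

/* Illustrative sketch only; function and names are my own. */
void select_store(int cond, int *x, const int *y, const int *z,
                  const int *a, const int *b)
{
    if (cond)
        *x = *y + *z;   /* touches only y and z */
    else
        *x = *a * *b;   /* touches only a and b */
}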

Re: Ill-advised use of CMOVE

 by: MitchAlsup - Sun, 15 May 2022 16:53 UTC

On Sunday, May 15, 2022 at 10:27:42 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, May 14, 2022 at 3:39:37 AM UTC-5, Anton Ertl wrote:
> >> EricP <ThatWould...@thevillage.com> writes:
> >>> MitchAlsup wrote:
> >>>> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
> >>>>> In an OoO machine, this can mean several different things.
> >>>>>
> >>>>> * high resource usage. On the 21264, CMOV takes two
> >>>>> microinstructions.
> >>> For CMOV 1 or 2 extra uOps in a 100+ instruction queue is not a
> >>> problem but for full predication this approach would not do
> >>> and it requires smarter uOps.
> > <
> >> Yes, I doubt that the extra uOps cause a slowdown in most cases and a
> >> significant slowdown in the rest. And I think that the independence
> >> of the other source is so important for performance that this is a
> >> good approach for predication, too.
> > <
> >>> In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
> >>> proceed independently when the condition and its data source are ready.
> >>> Slightly better.
> > <
> >> I think this can be a lot better, depending on the data dependencies.
> >> In particular, if you have a loop recurrence involving a CMOV and one
> >> input to the CMOV is loop-invariant (e.g., a constant), every time
> >> this input is selected breaks the recurrence latency chain. And even
> >> if both inputs are loop variants, if one has a long latency (e.g., it
> >> involves a load) and the other a short latency (e.g., it's just the
> >> result from the last iteration), this is a win.
> >>
> >> For predication, if you have an if-then-else instruction, and a
> >> then-instruction with the same destination register as an
> >> else-instruction, you can pair these up as the two complementary
> >> microinstructions producing the same target, without needing extra
> >> microinstructions.
> > <
> > I give the same physical register name to instructions in the then-clause
> > and in the else clause. One will get delivered, the other one will get suppressed.
> So this handles things like:
>
> x = cond? y + z : a * b;
>
> This merges a PHI and two alternate uOps to eliminate the
> temp register allocations and copies. But it requires all the
> source operands y,z,a,b to be loaded, so the savings are minimal.
>
> I would want to skip loading either y,z or a,b, as that is
> where all the big savings are to be had.
> That would be much harder for a decoder to detect and optimize,
> as the sequence is too long.
>
> PRED { 111000 }
> LD r1,[y]
> LD r2,[z]
> ADD r3 = r1 + r2
> LD r1,[a]
> LD r2,[b]
> MUL r3 = r1 * r2
> ST [x],r3
<
Yes, and especially when written this way::
<
PREDcnd {TE}
ADD R3,Ry,Rz
MUL R3,Ra,Rb
