Ill-advised use of CMOVE

<jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=25128&group=comp.arch#25128

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Ill-advised use of CMOVE
Date: Sun, 08 May 2022 11:05:47 -0400
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="181e1cbda87ea73d7fdeb928c0e858fa";
logging-data="6686"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19egltkXF2D/S90gsEQmv0/"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:/5CQB96zY1qADDfW+Poj8o0bJzM=
sha1:4sUJb1hVwxzVdhD7gzyL+yhe4iY=
 by: Stefan Monnier - Sun, 8 May 2022 15:05 UTC

We recently bumped into a funny performance behavior in Emacs.
Some code computing the length of a (possibly circular) list ended up
macroexpanded to something like:

    for (struct for_each_tail_internal li = { list, 2, 0, 2 };
         CONSP (list);
         (list = XCDR (list),
          ((--li.q != 0
            || 0 < --li.n
            || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
                li.tortoise = (list), false))
           && EQ (list, li.tortoise))
          ? (void) (list = Qnil) : (void) 0))
      len++;

`EQ` is currently a slightly costly operation but in this specific case
it can be replaced with a plain `==`. When we tried that, the resulting
loop ended up running almost twice *slower*.

The problem turned out to be that, with the simpler comparison, both
GCC and LLVM decided it would be a good idea to use CMOVE for the
`?:` operation. That just makes the data flow's critical path longer,
whereas a branch works much better here since it's trivial to predict.
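
The effect described above can be sketched in standalone C. This is only
an illustration under stated assumptions, not the Emacs code: `struct
node`, `length_select` and `length_branch` are hypothetical names, and the
cycle check is a simplified teleporting tortoise that keeps the `q`/`max`
doubling from the loop above but drops the `n` counter. In the select form
the next `list->next` load depends on the result of the compare-and-select,
so a conditional move would sit on the pointer-chasing dependency chain. In
the branch form a correctly predicted branch leaves the load chain alone.
Whether a given compiler emits a conditional move for either form depends
on the compiler, its version and its flags.

```c
#include <stddef.h>

struct node { struct node *next; };   /* hypothetical stand-in for a cons cell */

/* Cycle check written as a select: the value of `list` used by the next
   iteration's load depends on the outcome of the comparison. */
size_t length_select(struct node *list)
{
    struct node *tortoise = list;
    size_t len = 0, q = 2, max = 2;
    while (list != NULL) {
        len++;
        list = list->next;
        if (--q == 0) {               /* teleport the tortoise, double the period */
            max <<= 1;
            q = max;
            tortoise = list;
        } else {
            list = (list == tortoise) ? NULL : list;
        }
    }
    return len;
}

/* Same check written as an early exit: the comparison only feeds a
   branch, which is almost never taken and so easy to predict. */
size_t length_branch(struct node *list)
{
    struct node *tortoise = list;
    size_t len = 0, q = 2, max = 2;
    while (list != NULL) {
        len++;
        list = list->next;
        if (--q == 0) {
            max <<= 1;
            q = max;
            tortoise = list;
        } else if (list == tortoise) {
            break;                    /* cycle detected */
        }
    }
    return len;
}
```

Both functions return the same count for acyclic and circular lists; only
the shape of the data dependency differs. Comparing the generated assembly
(for example with `-S`) is the way to see which instruction a particular
compiler actually picked.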

Stefan

Re: Ill-advised use of CMOVE

<t58mug$ref$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25129&group=comp.arch#25129

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-ee5-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Sun, 8 May 2022 15:17:36 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <t58mug$ref$2@newsreader4.netcologne.de>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
Injection-Date: Sun, 8 May 2022 15:17:36 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-ee5-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:ee5:0:7285:c2ff:fe6c:992d";
logging-data="28111"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 8 May 2022 15:17 UTC

Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
> We recently bumped into a funny performance behavior in Emacs.
> Some code computing the length of a (possibly circular) list ended up
> macroexpanded to something like:
>
> for (struct for_each_tail_internal li = { list, 2, 0, 2 };
> CONSP (list);
> (list = XCDR (list),
> ((--li.q != 0
> || 0 < --li.n
> || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
> li.tortoise = (list), false))
> && EQ (list, li.tortoise))
> ? (void) (list = Qnil) : (void) 0))
> len++;
>
> `EQ` is currently a slightly costly operation but in this specific case
> it can be replaced with a plain `==`. When we tried that, the resulting
> loop ended up running almost twice *slower*.
>
> The problem turned out that with the simpler comparison, both GCC and
> LLVM decided it would be a good idea to use CMOVE for the
> `?:` operation, which just ends up making the data flow's critical path
> longer whereas a branch works much better here since it's trivial
> to predict.

It would probably be good to submit a PR for this. I'm not overly
optimistic about the chances of finding a good heuristic for when
code can be well predicted by a branch predictor, but it's worth
a shot.

Re: Ill-advised use of CMOVE

<t58n6r$1riu$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=25130&group=comp.arch#25130

Path: i2pn2.org!i2pn.org!aioe.org!10O9MudpjwoXIahOJRbDvA.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Sun, 8 May 2022 17:22:11 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t58n6r$1riu$1@gioia.aioe.org>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="61022"; posting-host="10O9MudpjwoXIahOJRbDvA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.12
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Sun, 8 May 2022 15:22 UTC

Stefan Monnier wrote:
> We recently bumped into a funny performance behavior in Emacs.
> Some code computing the length of a (possibly circular) list ended up
> macroexpanded to something like:
>
> for (struct for_each_tail_internal li = { list, 2, 0, 2 };
> CONSP (list);
> (list = XCDR (list),
> ((--li.q != 0
> || 0 < --li.n
> || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
> li.tortoise = (list), false))
> && EQ (list, li.tortoise))
> ? (void) (list = Qnil) : (void) 0))
> len++;
>
> `EQ` is currently a slightly costly operation but in this specific case
> it can be replaced with a plain `==`. When we tried that, the resulting
> loop ended up running almost twice *slower*.
>
> The problem turned out that with the simpler comparison, both GCC and
> LLVM decided it would be a good idea to use CMOVE for the
> `?:` operation, which just ends up making the data flow's critical path
> longer whereas a branch works much better here since it's trivial
> to predict.

This has actually been the rule for almost all code since the Pentium Pro!

It is extremely hard to find microbenchmarks where using CMOV to
eliminate a branch is a win. The odds are somewhat better for
larger/full programs, but wins are still very rare.

The one good counter-example is branching on encrypted/compressed data,
where the branch is pretty much fully random.
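
A minimal sketch of that counter-example, assuming the input bytes are
effectively random (as compressed or encrypted data would be). The
function names and the thresholding task are hypothetical, chosen only to
give the condition something to control. The first version takes one
data-dependent branch per element; the second replaces the branch with a
mask, which is the kind of code a conditional move or select implements.
A compiler is free to transform or vectorize either loop, so the source
form alone does not determine the generated instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* One conditional branch per element; on random data the branch is
   close to a coin flip for the predictor. */
uint64_t sum_above_branch(const uint8_t *data, size_t n, uint8_t threshold)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= threshold)
            sum += data[i];
    return sum;
}

/* Branch-free form: mask is all ones when the condition holds and zero
   otherwise, so the addition always executes. */
uint64_t sum_above_select(const uint8_t *data, size_t n, uint8_t threshold)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t mask = -(uint64_t)(data[i] >= threshold);
        sum += data[i] & mask;
    }
    return sum;
}
```

The two functions compute the same result; any speed difference has to be
measured on the target machine with realistic input.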

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<2022May8.195312@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25132&group=comp.arch#25132

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Sun, 08 May 2022 17:53:12 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 32
Message-ID: <2022May8.195312@mips.complang.tuwien.ac.at>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
Injection-Info: reader02.eternal-september.org; posting-host="6e4f0170b8a8cb5da03a1add0c771453";
logging-data="18553"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX196rNiuAKp3+8oYABwGit41"
Cancel-Lock: sha1:z4zPyStkmJZlkWouAlPNxNXlKHo=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 8 May 2022 17:53 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>We recently bumped into a funny performance behavior in Emacs.
>Some code computing the length of a (possibly circular) list ended up
>macroexpanded to something like:
>
> for (struct for_each_tail_internal li = { list, 2, 0, 2 };
> CONSP (list);
> (list = XCDR (list),
> ((--li.q != 0
> || 0 < --li.n
> || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
> li.tortoise = (list), false))
> && EQ (list, li.tortoise))
> ? (void) (list = Qnil) : (void) 0))
> len++;
>
>`EQ` is currently a slightly costly operation but in this specific case
>it can be replaced with a plain `==`. When we tried that, the resulting
>loop ended up running almost twice *slower*.
>
>The problem turned out that with the simpler comparison, both GCC and
>LLVM decided it would be a good idea to use CMOVE for the
>`?:` operation, which just ends up making the data flow's critical path
>longer whereas a branch works much better here since it's trivial
>to predict.

Does it help to use __builtin_expect?
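
For reference, a sketch of what such an annotation could look like. The
`UNLIKELY` macro, `struct cell` and `length_until` are hypothetical names
for illustration and are not taken from Emacs. `__builtin_expect` is a
GCC and Clang builtin that tells the compiler which outcome is expected.
It is only a hint: whether it makes a compiler prefer a branch over a
conditional move for the loop in the original post is not established
here and would have to be checked in the generated code.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)   /* condition expected false */
#else
#define UNLIKELY(x) (x)                          /* no hint on other compilers */
#endif

struct cell { struct cell *cdr; };               /* hypothetical list cell */

/* Counts the cells visited before reaching `stop` or the end of the
   list. The comparison is marked as rarely true. */
size_t length_until(struct cell *list, const struct cell *stop)
{
    size_t len = 0;
    while (list != NULL) {
        len++;
        list = list->cdr;
        if (UNLIKELY(list == stop))
            list = NULL;
    }
    return len;
}
```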

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25133&group=comp.arch#25133

X-Received: by 2002:a0c:f192:0:b0:45a:9a55:5df8 with SMTP id m18-20020a0cf192000000b0045a9a555df8mr10756135qvl.118.1652033040755;
Sun, 08 May 2022 11:04:00 -0700 (PDT)
X-Received: by 2002:a05:6830:1d92:b0:606:a1e:946a with SMTP id
y18-20020a0568301d9200b006060a1e946amr4763437oti.294.1652033040526; Sun, 08
May 2022 11:04:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 8 May 2022 11:04:00 -0700 (PDT)
In-Reply-To: <t58n6r$1riu$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:c058:497c:25c3:26c7;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:c058:497c:25c3:26c7
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 08 May 2022 18:04:00 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sun, 8 May 2022 18:04 UTC

On Sunday, May 8, 2022 at 10:22:10 AM UTC-5, Terje Mathisen wrote:
> Stefan Monnier wrote:
> > We recently bumped into a funny performance behavior in Emacs.
> > Some code computing the length of a (possibly circular) list ended up
> > macroexpanded to something like:
> >
> > for (struct for_each_tail_internal li = { list, 2, 0, 2 };
> > CONSP (list);
> > (list = XCDR (list),
> > ((--li.q != 0
> > || 0 < --li.n
> > || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
> > li.tortoise = (list), false))
> > && EQ (list, li.tortoise))
> > ? (void) (list = Qnil) : (void) 0))
> > len++;
> >
> > `EQ` is currently a slightly costly operation but in this specific case
> > it can be replaced with a plain `==`. When we tried that, the resulting
> > loop ended up running almost twice *slower*.
> >
> > The problem turned out that with the simpler comparison, both GCC and
> > LLVM decided it would be a good idea to use CMOVE for the
> > `?:` operation, which just ends up making the data flow's critical path
> > longer whereas a branch works much better here since it's trivial
> > to predict.
> This has actually been the rule for almost all code since the pentiumPro!
>
> It is extremely hard to find micro benchmarks where using CMOV to
> eliminate a branch is a win, somewhat better for larger/full programs
> but still very rare.
<
Conversely: it is not hard to find places where predication to conditionally
execute a "few" instructions (or not) is beneficial.
>
> The one good counter-example is branching on encrypted/compressed data,
> where the branch is pretty much fully random.
>
> Terje
>
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<t5943b$ump$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25135&group=comp.arch#25135

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Sun, 8 May 2022 12:02:03 -0700
Organization: A noiseless patient Spider
Lines: 48
Message-ID: <t5943b$ump$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 8 May 2022 19:02:04 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ce911e635529211858546be849e14b2d";
logging-data="31449"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/FjJwvvb105CNnu6liKz+8"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:NGm5iAOny/uO7Fl3bj2xTICo/zY=
In-Reply-To: <t58n6r$1riu$1@gioia.aioe.org>
Content-Language: en-US
 by: Ivan Godard - Sun, 8 May 2022 19:02 UTC

On 5/8/2022 8:22 AM, Terje Mathisen wrote:
> Stefan Monnier wrote:
>> We recently bumped into a funny performance behavior in Emacs.
>> Some code computing the length of a (possibly circular) list ended up
>> macroexpanded to something like:
>>
>>      for (struct for_each_tail_internal li = { list, 2, 0, 2 };
>>           CONSP (list);
>>           (list = XCDR (list),
>>            ((--li.q != 0
>>              || 0 < --li.n
>>              || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
>>                  li.tortoise = (list), false))
>>             && EQ (list, li.tortoise))
>>            ? (void) (list = Qnil) : (void) 0))
>>        len++;
>>
>> `EQ` is currently a slightly costly operation but in this specific case
>> it can be replaced with a plain `==`.  When we tried that, the resulting
>> loop ended up running almost twice *slower*.
>>
>> The problem turned out that with the simpler comparison, both GCC and
>> LLVM decided it would be a good idea to use CMOVE for the
>> `?:` operation, which just ends up making the data flow's critical path
>> longer whereas a branch works much better here since it's trivial
>> to predict.
>
> This has actually been the rule for almost all code since the pentiumPro!
>
> It is extremely hard to find micro benchmarks where using CMOV to
> eliminate a branch is a win, somewhat better for larger/full programs
> but still very rare.
>
> The one good counter-example is branching on encrypted/compressed data,
> where the branch is pretty much fully random.
>
> Terje
>
>

This depends too much on other things in the app, the architecture, and
the hardware to generalize. Pipe depth, issue, dispatch, and LS queue
sizes affect miss cost. Loop-heavy vs. open code affects miss frequency.
Micro- vs. real benchmarks affect predictor saturation and miss
percentage. Esoteric details of how cmove data retires affect the
dataflow impact.

Re: Ill-advised use of CMOVE

<t595aj$ase$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=25136&group=comp.arch#25136

Path: i2pn2.org!i2pn.org!aioe.org!10O9MudpjwoXIahOJRbDvA.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Sun, 8 May 2022 21:22:59 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t595aj$ase$1@gioia.aioe.org>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="11150"; posting-host="10O9MudpjwoXIahOJRbDvA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.12
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Sun, 8 May 2022 19:22 UTC

MitchAlsup wrote:
> On Sunday, May 8, 2022 at 10:22:10 AM UTC-5, Terje Mathisen wrote:
>> Stefan Monnier wrote:
>>> We recently bumped into a funny performance behavior in Emacs.
>>> Some code computing the length of a (possibly circular) list ended up
>>> macroexpanded to something like:
>>>
>>> for (struct for_each_tail_internal li = { list, 2, 0, 2 };
>>> CONSP (list);
>>> (list = XCDR (list),
>>> ((--li.q != 0
>>> || 0 < --li.n
>>> || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
>>> li.tortoise = (list), false))
>>> && EQ (list, li.tortoise))
>>> ? (void) (list = Qnil) : (void) 0))
>>> len++;
>>>
>>> `EQ` is currently a slightly costly operation but in this specific case
>>> it can be replaced with a plain `==`. When we tried that, the resulting
>>> loop ended up running almost twice *slower*.
>>>
>>> The problem turned out that with the simpler comparison, both GCC and
>>> LLVM decided it would be a good idea to use CMOVE for the
>>> `?:` operation, which just ends up making the data flow's critical path
>>> longer whereas a branch works much better here since it's trivial
>>> to predict.
>> This has actually been the rule for almost all code since the pentiumPro!
>>
>> It is extremely hard to find micro benchmarks where using CMOV to
>> eliminate a branch is a win, somewhat better for larger/full programs
>> but still very rare.
> <
> Conversely: it is not hard to find places where predication to conditionally
> execute a "few" instructions (or not) is beneficial.

It particularly helps when your cpu is wide enough to execute both
branches at the same time, and they are of approximately equal length.

The final requirement is that the branch join operation needs to be
fast, and CMOV has had 2 cycles of latency for a long time. (Is it
fixed by now?)

Mill has the ultimate version, where phasing makes such an operation
effectively free.
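
The "execute both branches, then join" idea can be sketched at source
level as follows. The function names and the even/odd step are
hypothetical, picked because the two arms are short and of similar
length. Both arms are computed unconditionally and `select_u32` plays the
role of the join. A wide machine can overlap the two arms, and the cost
that remains is the latency of the join itself.

```c
#include <stdint.h>

/* Join operation: returns a when cond is nonzero, otherwise b. */
static inline uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = -(uint32_t)(cond != 0);   /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}

/* Both arms are evaluated, then joined. */
uint32_t step_both_arms(uint32_t x)
{
    uint32_t if_even = x >> 1;                /* arm taken for even x */
    uint32_t if_odd  = 3u * x + 1u;           /* arm taken for odd x, wraps on overflow */
    return select_u32((x & 1u) == 0, if_even, if_odd);
}

/* Reference version with an ordinary branch. */
uint32_t step_branch(uint32_t x)
{
    if ((x & 1u) == 0)
        return x >> 1;
    return 3u * x + 1u;
}
```

This only works because neither arm has side effects or can fault, so
evaluating the arm that is not selected is harmless.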

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25139&group=comp.arch#25139

X-Received: by 2002:a05:620a:2886:b0:699:bab7:ae78 with SMTP id j6-20020a05620a288600b00699bab7ae78mr9710974qkp.618.1652045298102;
Sun, 08 May 2022 14:28:18 -0700 (PDT)
X-Received: by 2002:a54:4f83:0:b0:324:f58f:4b95 with SMTP id
g3-20020a544f83000000b00324f58f4b95mr9225982oiy.4.1652045297825; Sun, 08 May
2022 14:28:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 8 May 2022 14:28:17 -0700 (PDT)
In-Reply-To: <t595aj$ase$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:c058:497c:25c3:26c7;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:c058:497c:25c3:26c7
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com> <t595aj$ase$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 08 May 2022 21:28:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sun, 8 May 2022 21:28 UTC

On Sunday, May 8, 2022 at 2:23:03 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Sunday, May 8, 2022 at 10:22:10 AM UTC-5, Terje Mathisen wrote:
> >> Stefan Monnier wrote:
> >>> We recently bumped into a funny performance behavior in Emacs.
> >>> Some code computing the length of a (possibly circular) list ended up
> >>> macroexpanded to something like:
> >>>
> >>> for (struct for_each_tail_internal li = { list, 2, 0, 2 };
> >>> CONSP (list);
> >>> (list = XCDR (list),
> >>> ((--li.q != 0
> >>> || 0 < --li.n
> >>> || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
> >>> li.tortoise = (list), false))
> >>> && EQ (list, li.tortoise))
> >>> ? (void) (list = Qnil) : (void) 0))
> >>> len++;
> >>>
> >>> `EQ` is currently a slightly costly operation but in this specific case
> >>> it can be replaced with a plain `==`. When we tried that, the resulting
> >>> loop ended up running almost twice *slower*.
> >>>
> >>> The problem turned out that with the simpler comparison, both GCC and
> >>> LLVM decided it would be a good idea to use CMOVE for the
> >>> `?:` operation, which just ends up making the data flow's critical path
> >>> longer whereas a branch works much better here since it's trivial
> >>> to predict.
> >> This has actually been the rule for almost all code since the pentiumPro!
> >>
> >> It is extremely hard to find micro benchmarks where using CMOV to
> >> eliminate a branch is a win, somewhat better for larger/full programs
> >> but still very rare.
> > <
> > Conversely: it is not hard to find places where predication to conditionally
> > execute a "few" instructions (or not) is beneficial.
<
> It particularly helps when your cpu is wide enough to execute both
> branches at the same time, and they are of approximately equal length.
<
Predication works by conservation of fetch bandwidth. One has a certain
amount of code that can be read each cycle, and a certain sized repository
of that code (instruction buffer). When code needs some flow control, but
the span of the flow is less than the fetch momentum, then it becomes
less <time> expensive to search the execution window and eliminate
instructions individually, rather than to redirect the fetch-point and suffer
the loss of fetched instructions in the repository.
<
Predication actually has NOTHING to do with the width of the fetch or
execute pipelines, although it tends to help wider pipelines more than
narrower ones. But when instructions have more than one size,
predication also helps 1-wide in-order machines!
>
> The final requirement is that the branch join operation needs to be
> fast, and CMOV have been 2 cycles of latency for a long time. (Is it
> fixed by now?)
<
Note: CMOVEs can be effectively implemented with predication,
<
PNE {T}
MOV Rd,Rs1
<
but it is seldom that predication can be implemented with CMOVEs
<
PNE {T}
LDD Rd,[Rp+disp] // will pagefault when Rp ≡ 0
<
(especially when one of the potentially skipped over instructions
is "long running" such as DIV).
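
The faulting-load case can be illustrated in C. This is a sketch with
hypothetical function names, not code from any of the architectures
discussed. A guarded load such as `p ? *p : 0` cannot simply be turned
into "load unconditionally, then conditionally move", because the load
itself may fault when `p` is null. Predication suppresses the load; with
only a conditional move available, the usual source-level workaround is
to select a safe address first and load afterwards.

```c
#include <stddef.h>

/* Guarded load: the dereference happens only when p is non-null. */
long load_or_zero_branch(const long *p)
{
    if (p != NULL)
        return *p;
    return 0;
}

/* Select-friendly form: the condition picks between two valid
   addresses, so the load that follows can run unconditionally. */
long load_or_zero_select(const long *p)
{
    static const long zero = 0;               /* hypothetical dummy location */
    const long *safe = (p != NULL) ? p : &zero;
    return *safe;
}
```

The second form trades the possible fault for an extra address selection
on the path to the load, and it has no equivalent when the skipped
instruction is something long running that should not execute at all.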
>
> Mill has the ultimate version, with phasing makes such the operation
> effectively free.
<
<
> Terje
>
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<t5dtvr$oaf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25168&group=comp.arch#25168

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 16:48:27 +0200
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <t5dtvr$oaf$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com>
<t595aj$ase$1@gioia.aioe.org>
<76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 10 May 2022 14:48:27 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="71e002fc747b07748e986b5f39cf85a2";
logging-data="24911"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19yI6DlX1YF/IhZ5zDu3dvI/PQB70+f6f8="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.8.1
Cancel-Lock: sha1:oEx2005izpa7XxR9o314u7vOihk=
In-Reply-To: <76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com>
Content-Language: en-US
 by: Marcus - Tue, 10 May 2022 14:48 UTC

On 2022-05-08, MitchAlsup wrote:
> On Sunday, May 8, 2022 at 2:23:03 PM UTC-5, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> On Sunday, May 8, 2022 at 10:22:10 AM UTC-5, Terje Mathisen wrote:
>>>> Stefan Monnier wrote:
>>>>> We recently bumped into a funny performance behavior in Emacs.
>>>>> Some code computing the length of a (possibly circular) list ended up
>>>>> macroexpanded to something like:
>>>>>
>>>>> for (struct for_each_tail_internal li = { list, 2, 0, 2 };
>>>>> CONSP (list);
>>>>> (list = XCDR (list),
>>>>> ((--li.q != 0
>>>>> || 0 < --li.n
>>>>> || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
>>>>> li.tortoise = (list), false))
>>>>> && EQ (list, li.tortoise))
>>>>> ? (void) (list = Qnil) : (void) 0))
>>>>> len++;
>>>>>
>>>>> `EQ` is currently a slightly costly operation but in this specific case
>>>>> it can be replaced with a plain `==`. When we tried that, the resulting
>>>>> loop ended up running almost twice *slower*.
>>>>>
>>>>> The problem turned out that with the simpler comparison, both GCC and
>>>>> LLVM decided it would be a good idea to use CMOVE for the
>>>>> `?:` operation, which just ends up making the data flow's critical path
>>>>> longer whereas a branch works much better here since it's trivial
>>>>> to predict.
>>>> This has actually been the rule for almost all code since the pentiumPro!
>>>>
>>>> It is extremely hard to find micro benchmarks where using CMOV to
>>>> eliminate a branch is a win, somewhat better for larger/full programs
>>>> but still very rare.
>>> <
>>> Conversely: it is not hard to find places where predication to conditionally
>>> execute a "few" instructions (or not) is beneficial.
> <
>> It particularly helps when your cpu is wide enough to execute both
>> branches at the same time, and they are of approximately equal length.
> <
> Predication works by conservation of fetch bandwidth. One has a certain
> amount of code that can be read each cycle, and a certain sized repository
> of that code (instruction buffer). When code needs some flow control, but
> the span of the flow is less than the fetch momentum, then it becomes
> less <time> expensive to search the execution window and eliminate
> instructions individually, rather than to redirect the fetch-point and suffer
> the loss of fetched instructions in the repository.
> <
> Predication actually has NOTHING to do with the width of the fetch or
> execute pipelines--although it tends to help wider pipelines more than
> narrower ones. But, when instructions have more than one size,
> predication also helps 1-wide in-order machines ! ! !
>>
>> The final requirement is that the branch join operation needs to be
>> fast, and CMOV has had 2 cycles of latency for a long time. (Is it
>> fixed by now?)
> <
> Note: CMOVEs can be effectively implemented with predication,
> <
> PNE {T}
> MOV Rd,Rs1
> <

Any particular reason why an explicit PNE (Predicate Not Equal, I
assume) instruction is better than repurposing a branch instruction and
interpreting it as a predication instruction in the front end? E.g:

BEQ 1f // Skip next instruction if EQual
MOV Rd,Rs1
1:

/Marcus

> but it is seldom that predication can be implemented with CMOVEs
> <
> PNE {T}
> LDD Rd,[Rp+disp] // will pagefault when Rp ≡ 0
> <
> (especially when one of the potentially skipped over instructions
> is "long running" such as DIV).
>>
>> Mill has the ultimate version: phasing makes such an operation
>> effectively free.
> <
> <
>> Terje
>>
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"

Re: Ill-advised use of CMOVE

<t5e1ek$s7c$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25169&group=comp.arch#25169

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 08:47:30 -0700
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <t5e1ek$s7c$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 10 May 2022 15:47:32 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b4d0dcd736aec6967bf7ae197a73af0f";
logging-data="28908"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+HJHZigR4B585ZTwYpOpDPKHEcn6b+bDk="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:7RdE324PdMSOD8fjace0AE9gxi8=
In-Reply-To: <t58n6r$1riu$1@gioia.aioe.org>
Content-Language: en-US
 by: Stephen Fuld - Tue, 10 May 2022 15:47 UTC

On 5/8/2022 8:22 AM, Terje Mathisen wrote:
> Stefan Monnier wrote:
>> We recently bumped into a funny performance behavior in Emacs.
>> Some code computing the length of a (possibly circular) list ended up
>> macroexpanded to something like:
>>
>>      for (struct for_each_tail_internal li = { list, 2, 0, 2 };
>>           CONSP (list);
>>           (list = XCDR (list),
>>            ((--li.q != 0
>>              || 0 < --li.n
>>              || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
>>                  li.tortoise = (list), false))
>>             && EQ (list, li.tortoise))
>>            ? (void) (list = Qnil) : (void) 0))
>>        len++;
>>
>> `EQ` is currently a slightly costly operation but in this specific case
>> it can be replaced with a plain `==`.  When we tried that, the resulting
>> loop ended up running almost twice *slower*.
>>
>> The problem turned out that with the simpler comparison, both GCC and
>> LLVM decided it would be a good idea to use CMOVE for the
>> `?:` operation, which just ends up making the data flow's critical path
>> longer whereas a branch works much better here since it's trivial
>> to predict.
>
> This has actually been the rule for almost all code since the pentiumPro!
>
> It is extremely hard to find micro benchmarks where using CMOV to
> eliminate a branch is a win, somewhat better for larger/full programs
> but still very rare.

I am not questioning the truth of that, but I am trying to figure out
why. Since the CMOV is a pretty big win when the branch is
mispredicted, it must be a loss when the prediction would have been correct.

Is this because

The CMOV itself is too slow?

More code needs to be generated when using the CMOV so more instructions
are executed?

The CMOV messes up speculative execution, so the instructions after it
cannot be executed where they would be in the branch case?

Something else?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Ill-advised use of CMOVE

<29701bf7-7d80-493d-8822-4604e68f4e79n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25170&group=comp.arch#25170

X-Received: by 2002:a05:6214:4104:b0:42c:1db0:da28 with SMTP id kc4-20020a056214410400b0042c1db0da28mr18356451qvb.67.1652198719617;
Tue, 10 May 2022 09:05:19 -0700 (PDT)
X-Received: by 2002:a05:6808:c2:b0:325:eb87:c26f with SMTP id
t2-20020a05680800c200b00325eb87c26fmr391207oic.117.1652198719353; Tue, 10 May
2022 09:05:19 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 May 2022 09:05:19 -0700 (PDT)
In-Reply-To: <t5e1ek$s7c$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<t5e1ek$s7c$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <29701bf7-7d80-493d-8822-4604e68f4e79n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: already5...@yahoo.com (Michael S)
Injection-Date: Tue, 10 May 2022 16:05:19 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Michael S - Tue, 10 May 2022 16:05 UTC

On Tuesday, May 10, 2022 at 6:47:35 PM UTC+3, Stephen Fuld wrote:
> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
> > Stefan Monnier wrote:
> >> We recently bumped into a funny performance behavior in Emacs.
> >> Some code computing the length of a (possibly circular) list ended up
> >> macroexpanded to something like:
> >>
> >>      for (struct for_each_tail_internal li = { list, 2, 0, 2 };
> >>           CONSP (list);
> >>           (list = XCDR (list),
> >>            ((--li.q != 0
> >>              || 0 < --li.n
> >>              || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
> >>                  li.tortoise = (list), false))
> >>             && EQ (list, li.tortoise))
> >>            ? (void) (list = Qnil) : (void) 0))
> >>        len++;
> >>
> >> `EQ` is currently a slightly costly operation but in this specific case
> >> it can be replaced with a plain `==`. When we tried that, the resulting
> >> loop ended up running almost twice *slower*.
> >>
> >> The problem turned out that with the simpler comparison, both GCC and
> >> LLVM decided it would be a good idea to use CMOVE for the
> >> `?:` operation, which just ends up making the data flow's critical path
> >> longer whereas a branch works much better here since it's trivial
> >> to predict.
> >
> > This has actually been the rule for almost all code since the pentiumPro!
> >
> > It is extremely hard to find micro benchmarks where using CMOV to
> > eliminate a branch is a win, somewhat better for larger/full programs
> > but still very rare.
> I am not questioning the truth of that, but I am trying to figure out
> why. Since the CMOV is a pretty big win when the branch is
> mispredicted, it must be a loss when the prediction would have been correct.
>
> Is this because
>
> The CMOV itself is too slow?

It used to be a significant reason on the Pentium 4. Since then the answer
has pretty much always been "No".

>
> More code needs to be generated when using the CMOV so more instructions
> are executed?

Yes, but that would cause a slowdown of 10-20% rather than 2x, unless the
compiler does something obviously stupid, which also happens.
That was the subject of one of my own gcc bug reports. IIRC, it is the only
one of more than a dozen of my gcc bug reports that was fully fixed.

>
> The CMOV messes up speculative execution, so the instructions after it
> cannot be executed where they would be in the branch case?

Yes, that's the main reason for all the 1.5x to 2x slowdowns.
Control dependencies can be speculated around; data dependencies cannot,
at least with the current state of the art.
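
A minimal C sketch of the two forms under discussion. The function names are made up for illustration, and the compiler remains free to emit a branch or a CMOV for either of them; the comments describe what happens if it picks the obvious lowering.

```c
/* Hypothetical helper written with `if`. If the compiler keeps this as a
   branch around a move, the hardware predicts the branch and users of the
   result can start before `cond` is actually known. */
long pick_branch(int cond, long x, long replacement)
{
    if (cond)
        x = replacement;
    return x;
}

/* Same result written as a select, which compilers often lower to CMOV.
   The result then depends, as data, on cond, x and replacement, so users
   of the result wait until all three are available. */
long pick_select(int cond, long x, long replacement)
{
    return cond ? replacement : x;
}
```

Both functions return the same value for every input; only the dependency structure seen by an out-of-order core differs.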

>
> Something else?
>
>
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Ill-advised use of CMOVE

<c8ffd9e0-3913-48f1-b7d6-8678a8c2374fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25172&group=comp.arch#25172

X-Received: by 2002:a05:620a:4054:b0:6a0:6f88:9cd9 with SMTP id i20-20020a05620a405400b006a06f889cd9mr9920605qko.747.1652201976158;
Tue, 10 May 2022 09:59:36 -0700 (PDT)
X-Received: by 2002:a05:6870:d1cd:b0:e1:e7ee:faa0 with SMTP id
b13-20020a056870d1cd00b000e1e7eefaa0mr571875oac.5.1652201975932; Tue, 10 May
2022 09:59:35 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 May 2022 09:59:35 -0700 (PDT)
In-Reply-To: <t5dtvr$oaf$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:dc2e:8c1f:457a:8c1c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:dc2e:8c1f:457a:8c1c
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com> <t595aj$ase$1@gioia.aioe.org>
<76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com> <t5dtvr$oaf$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c8ffd9e0-3913-48f1-b7d6-8678a8c2374fn@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 10 May 2022 16:59:36 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2837
 by: MitchAlsup - Tue, 10 May 2022 16:59 UTC

On Tuesday, May 10, 2022 at 9:48:30 AM UTC-5, Marcus wrote:
> On 2022-05-08, MitchAlsup wrote:
> > On Sunday, May 8, 2022 at 2:23:03 PM UTC-5, Terje Mathisen wrote:
> >> MitchAlsup wrote:

> > Predication actually has NOTHING to do with the width of the fetch or
> > execute pipelines--although it tends to help wider pipelines more than
> > narrower ones. But, when instructions have more than one size,
> > predication also helps 1-wide in-order machines ! ! !
> >>
> >> The final requirement is that the branch join operation needs to be
> >> fast, and CMOV has had 2 cycles of latency for a long time. (Is it
> >> fixed by now?)
> > <
> > Note: CMOVEs can be effectively implemented with predication,
> > <
> > PNE {T}
> > MOV Rd,Rs1
> > <
> Any particular reason why an explicit PNE (Predicate Not Equal, I
> assume) instruction is better than repurposing a branch instruction and
> interpreting it as a predication instruction in the front end? E.g:
<
Yes::
>
One can issue through the predicated set of instructions, and sort out
which ones should be executed later.
<
Whereas:
<
The general philosophy of branches is to predict them and then attempt
to get to the target rapidly. Thus, here, one could not issue through the
predicated set of instructions.
>
> BEQ 1f // Skip next instruction if EQual
> MOV Rd,Rs1
> 1:
>
> /Marcus
<
Thus, by the time you get to decoding of the instructions at 1: at least
1 clock has transpired.

Re: Ill-advised use of CMOVE

<t5e6q4$nhr$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25173&group=comp.arch#25173

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 10:18:58 -0700
Organization: A noiseless patient Spider
Lines: 69
Message-ID: <t5e6q4$nhr$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
<29701bf7-7d80-493d-8822-4604e68f4e79n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 10 May 2022 17:19:00 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b4d0dcd736aec6967bf7ae197a73af0f";
logging-data="24123"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+AjL6zipH+OFQjPqOOfOWWYp9vQQv55zo="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:3iSZ7jAbNP4uByhU2KdACXr9+yg=
In-Reply-To: <29701bf7-7d80-493d-8822-4604e68f4e79n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 10 May 2022 17:18 UTC

On 5/10/2022 9:05 AM, Michael S wrote:
> On Tuesday, May 10, 2022 at 6:47:35 PM UTC+3, Stephen Fuld wrote:
>> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
>>> Stefan Monnier wrote:
>>>> We recently bumped into a funny performance behavior in Emacs.
>>>> Some code computing the length of a (possibly circular) list ended up
>>>> macroexpanded to something like:
>>>>
>>>>      for (struct for_each_tail_internal li = { list, 2, 0, 2 };
>>>>           CONSP (list);
>>>>           (list = XCDR (list),
>>>>            ((--li.q != 0
>>>>              || 0 < --li.n
>>>>              || (li.q = li.n = li.max <<= 1, li.n >>= USHRT_WIDTH,
>>>>                  li.tortoise = (list), false))
>>>>             && EQ (list, li.tortoise))
>>>>            ? (void) (list = Qnil) : (void) 0))
>>>>        len++;
>>>>
>>>> `EQ` is currently a slightly costly operation but in this specific case
>>>> it can be replaced with a plain `==`. When we tried that, the resulting
>>>> loop ended up running almost twice *slower*.
>>>>
>>>> The problem turned out that with the simpler comparison, both GCC and
>>>> LLVM decided it would be a good idea to use CMOVE for the
>>>> `?:` operation, which just ends up making the data flow's critical path
>>>> longer whereas a branch works much better here since it's trivial
>>>> to predict.
>>>
>>> This has actually been the rule for almost all code since the pentiumPro!
>>>
>>> It is extremely hard to find micro benchmarks where using CMOV to
>>> eliminate a branch is a win, somewhat better for larger/full programs
>>> but still very rare.
>> I am not questioning the truth of that, but I am trying to figure out
>> why. Since the CMOV is a pretty big win when the branch is
>> mispredicted, it must be a loss when the prediction would have been correct.
>>
>> Is this because
>>
>> The CMOV itself is too slow?
>
> It used to be a significant reason on the Pentium 4. Since then the answer
> has pretty much always been "No".
>
>>
>> More code needs to be generated when using the CMOV so more instructions
>> are executed?
>
> Yes, but that would cause a slowdown of 10-20% rather than 2x, unless the
> compiler does something obviously stupid, which also happens.
> That was the subject of one of my own gcc bug reports. IIRC, it is the only
> one of more than a dozen of my gcc bug reports that was fully fixed.
>
>>
>> The CMOV messes up speculative execution, so the instructions after it
>> cannot be executed where they would be in the branch case?
>
> Yes, that's the main reason for all the 1.5x to 2x slowdowns.
> Control dependencies can be speculated around; data dependencies cannot,
> at least with the current state of the art.

Thank you. So, would it make sense to develop some kind of "CMOV
predictor", sort of like a branch predictor?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Ill-advised use of CMOVE

<jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=25174&group=comp.arch#25174

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 13:36:15 -0400
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="65fcb808b06b8df1ca0fac3b0af055c0";
logging-data="5320"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/jN9Zax57bAjqXHPPmvVTa"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:+ul/hL5q5c37nSX0qUq6JnEWEtU=
sha1:ZNUpfw3TibXnxoABGpGp9bD4ZU8=
 by: Stefan Monnier - Tue, 10 May 2022 17:36 UTC

> The CMOV messes up speculative execution, so the instructions after it
> cannot be executed where they would be in the branch case?

In the present case it seems that indeed the problem is that the condition
depends on an amount of code comparable to (if not larger than) the rest of
the iteration. With branch prediction, the computation of those
conditions can be performed concurrently with the rest of the code,
whereas with CMOVE this is turned into a data dependency which
significantly lengthens the critical path.

Of course, CMOVE could also rely on "condition prediction" to break this
data dependency and recover the speed of the code using branch, but
CMOVEs tend to be optimized for the case where branches don't work well,
i.e. when prediction doesn't work well.
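
As a rough illustration, here is a simplified plain-C restatement of such a cycle-detecting length loop. It is a sketch in the style of Brent's algorithm, not the actual Emacs macro: `struct cell`, `list_length`, `budget` and `left` are names assumed for the example, and NULL stands in for `Qnil`.

```c
#include <stddef.h>

struct cell { struct cell *cdr; };

/* Count cells until the end of the list, or until a cycle is detected.
   For a circular list the returned count is not a meaningful length;
   the point is only that the loop terminates. */
size_t list_length(const struct cell *list)
{
    const struct cell *tortoise = list;
    size_t len = 0, budget = 2, left = 2;

    while (list != NULL) {
        list = list->cdr;      /* the pointer chase: the real critical path */
        len++;
        if (--left == 0) {
            /* Rare path: move the tortoise up and double the budget. */
            budget *= 2;
            left = budget;
            tortoise = list;
        } else if (list == tortoise) {
            /* Cycle detected: stop, as `list = Qnil` does in the macro. */
            list = NULL;
        }
    }
    return len;
}
```

With a predicted branch, the counter bookkeeping and the comparison against `tortoise` sit off to the side of the `list->cdr` chain. If the final assignment to `list` is compiled as a CMOVE instead, the next `list->cdr` load has to wait for that bookkeeping on every iteration.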

Stefan

Re: Ill-advised use of CMOVE

<t5e9g7$c65$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25175&group=comp.arch#25175

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 11:04:55 -0700
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <t5e9g7$c65$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
<jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 10 May 2022 18:04:56 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b4d0dcd736aec6967bf7ae197a73af0f";
logging-data="12485"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/sYiEhusfBUVNyyUbUduJSUt0ONI4OtkU="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:RA4zt6ggeQ/Af+OGYDLZAYd7X90=
In-Reply-To: <jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Stephen Fuld - Tue, 10 May 2022 18:04 UTC

On 5/10/2022 10:36 AM, Stefan Monnier wrote:
>> The CMOV messes up speculative execution, so the instructions after it
>> cannot be executed where they would be in the branch case?
>
> In the present case it seems that indeed the problem is that the condition
> depends on an amount of code comparable to (if not larger than) the rest of
> the iteration. With branch prediction, the computation of those
> conditions can be performed concurrently with the rest of the code,
> whereas with CMOVE this is turned into a data dependency which
> significantly lengthens the critical path.
>
> Of course, CMOVE could also rely on "condition prediction" to break this
> data dependency and recover the speed of the code using branch, but
> CMOVEs tend to be optimized for the case where branches don't work well,
> i.e. when prediction doesn't work well.

So if you don't use hardware "condition prediction", and the compiler by
itself doesn't know how well a particular branch will be predicted, we
are left with the aforementioned programmer-provided hints, or perhaps
some form of profile-driven optimization.
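
For reference, a minimal sketch of what a programmer-provided hint can look like with GCC or Clang, using `__builtin_expect`. The macro names `LIKELY`/`UNLIKELY` and the function `count_until` are assumptions made for the example. Note that the hint states the expected direction of the condition, not how predictable it is, and it does not force the compiler to choose a branch over a CMOV.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)   /* no hint available: plain condition */
#  define UNLIKELY(x) (x)
#endif

/* Return the index of the first occurrence of `sentinel`, or n if absent.
   The sentinel is assumed to be rare, so the exit test is marked unlikely. */
size_t count_until(const int *v, size_t n, int sentinel)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (UNLIKELY(v[i] == sentinel))
            break;
    }
    return i;
}
```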

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Ill-advised use of CMOVE

<2022May10.194427@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25176&group=comp.arch#25176

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 17:44:27 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 58
Message-ID: <2022May10.194427@mips.complang.tuwien.ac.at>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="71587b239d0e242d49e2e31f45159f77";
logging-data="23382"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+7b3sbgYZABzQNWMMc1qWx"
Cancel-Lock: sha1:luthVl1TKMwDHwgd6Ehdg/r2rE4=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 10 May 2022 17:44 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 5/8/2022 8:22 AM, Terje Mathisen wrote:
>> It is extremely hard to find micro benchmarks where using CMOV to
>> eliminate a branch is a win, somewhat better for larger/full programs
>> but still very rare.
>
>I am not questioning the truth of that, but I am trying to figure out
>why.

I would like to see some empirical support for that statement. It's
pretty easy to design a micro benchmark where CMOV wins by a large
margin. I guess what he meant is that in a typical microbenchmark
aimed at some other characteristic, using CMOV instead of branching
usually is a loss.
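
A sketch of the kind of kernel such a micro benchmark could be built around, assuming random byte data and a threshold in the middle of the range so that the branch is close to a coin flip. All names here (`xorshift32`, `fill_random`, `sum_below_branch`, `sum_below_select`) are made up for the example; timing harness and compiler flags are left out, and the compiler may well turn both kernels into the same (possibly vectorized) code, so the generated assembly has to be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift generator, used only to produce hard-to-predict data.
   The seed must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

void fill_random(uint8_t *v, size_t n, uint32_t seed)
{
    for (size_t i = 0; i < n; i++)
        v[i] = (uint8_t)(xorshift32(&seed) >> 24);
}

/* Branchy form: if kept as a branch, it is hard to predict on random data. */
uint64_t sum_below_branch(const uint8_t *v, size_t n, uint8_t limit)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] < limit)
            sum += v[i];
    return sum;
}

/* Select form: a candidate for CMOV, with nothing to mispredict. */
uint64_t sum_below_select(const uint8_t *v, size_t n, uint8_t limit)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (v[i] < limit) ? v[i] : 0;
    return sum;
}
```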

>Since the CMOV is a pretty big win when the branch is
>mispredicted, it must be a loss when the prediction would have been correct.
>
>Is this because
>
>The CMOV itself is too slow?

In an OoO machine, this can mean several different things.

* high resource usage. On the 21264, CMOV takes two
microinstructions.

* long latency. The numbers I have seen are a latency of 1 or 2
cycles.

* The instruction has to wait on results that take their time to
materialize.

My guess is that most cases of CMOV slowness are due to the last
effect. CMOV has a data dependence on both of its data inputs, and on
the control input. By contrast, a correctly predicted non-taken branch
around a MOV incurs only the latency of one data input, and that's
unavoidable. And if the latency chain of that MOV is short (e.g., it
moves a constant), this cuts down the latency of all the instructions
depending on the target of the MOV.

>More code needs to be generated when using the CMOV so more instructions
>are executed?

I would hope that compilers would not go for CMOV in that case. I
certainly have not seen cases where compilers produced a lot of code
just to use CMOV.

>The CMOV messes up speculative execution, so the instructions after it
>cannot be executed where they would be in the branch case?

In a way, you could see the waiting effect in that category, but I am
not sure if you had that in mind.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<2022May10.201129@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25177&group=comp.arch#25177

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 18:11:29 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 25
Message-ID: <2022May10.201129@mips.complang.tuwien.ac.at>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <29701bf7-7d80-493d-8822-4604e68f4e79n@googlegroups.com> <t5e6q4$nhr$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="71587b239d0e242d49e2e31f45159f77";
logging-data="23382"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/MVC+VAMGhy7fzSfRWEyNU"
Cancel-Lock: sha1:J5UmPjuLY+FsE/fH3aysCHTR+K8=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 10 May 2022 18:11 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>So, would it make sense to develop some kind of "CMOV
>predictor", sort of like a branch predictor?

Probably not enough bad CMOVs around for that to make sense, although
maybe with such a feature that might change:

For every CMOV, keep track of how predictable it is (and maybe when
its three inputs become available relative to the input that makes it
to the target). If it's expected to be faster to predict the CMOV, do
so.

Programmers and compilers are pretty bad at predicting the
predictability of conditionals, so here hardware could benefit from
knowing what happens at run-time.
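
A toy software model of the bookkeeping described above, purely as an assumption-laden sketch: one entry per CMOV with a small saturating confidence counter. The structure and function names are hypothetical, the threshold of 3 is arbitrary, and the model ignores the second idea (tracking when the inputs arrive).

```c
#include <stdbool.h>

/* Hypothetical per-CMOV tracking entry. */
struct cmov_entry {
    bool last_cond;        /* most recently observed condition */
    unsigned confidence;   /* saturating counter, 0..3 */
};

/* Called when a CMOV resolves: record the actual condition. */
void cmov_train(struct cmov_entry *e, bool cond)
{
    if (cond == e->last_cond) {
        if (e->confidence < 3)
            e->confidence++;
    } else {
        e->confidence = 0;   /* the condition flipped: start over */
        e->last_cond = cond;
    }
}

/* Speculate on the condition only when it has been stable for a while;
   otherwise execute the CMOV as an ordinary data-flow operation. */
bool cmov_should_predict(const struct cmov_entry *e)
{
    return e->confidence == 3;
}
```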

The converse could also be done: convert hard-to-predict branches and
their control-dependent instructions into data flow if that is
expected to be beneficial. I expect that this is harder to realize in
hardware, but the benefits on existing code may be bigger.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<2022May10.214311@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=25179&group=comp.arch#25179

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 19:43:11 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 19
Message-ID: <2022May10.214311@mips.complang.tuwien.ac.at>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org> <t5e9g7$c65$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="71587b239d0e242d49e2e31f45159f77";
logging-data="30589"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/O3QXUFKKzL0m7KfW1ZHJx"
Cancel-Lock: sha1:8zdwm16oxS4PANx/neAiFSIc5Cw=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 10 May 2022 19:43 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>So if you don't use hardware "condition prediction", and the compiler by
>itself doesn't know how well a particular branch will be predicted, we
>are left with the aforementioned programmer provided hints, or perhaps
>some form of profile driven optimization.

Normal profiles don't tell you how predictable a branch is. Of course
always-taken is predictable, but 50% taken might be unpredictable, or
perfectly predictable.

So: compilers are bad at knowing predictability (even with profile
feedback). Programmers are bad, too, unless they use performance
counter results to learn it. Looks like a good candidate for a
hardware solution to me.
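
To make the "50% taken" point concrete, here is a small sketch that counts the mispredictions of a toy predictor on a recorded outcome stream. The predictor (two 2-bit saturating counters selected by the previous outcome) is an assumption chosen for brevity and is far simpler than real hardware; `mispredictions` is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Count mispredictions over n recorded outcomes (nonzero = taken). */
size_t mispredictions(const uint8_t *taken, size_t n)
{
    unsigned counter[2] = { 1, 1 };  /* 0..1 predict not taken, 2..3 taken */
    unsigned history = 0;            /* previous outcome selects the counter */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned outcome = taken[i] != 0;
        unsigned predicted = counter[history] >= 2;

        if (predicted != outcome)
            misses++;

        /* Train the selected counter toward the actual outcome. */
        if (outcome && counter[history] < 3)
            counter[history]++;
        else if (!outcome && counter[history] > 0)
            counter[history]--;

        history = outcome;
    }
    return misses;
}
```

A strictly alternating taken/not-taken stream is 50% taken, yet this predictor misses only during warm-up. A stream of independent coin flips is also 50% taken, and no predictor can be expected to do much better than missing about half of the time. An ordinary edge-count profile records the same numbers for both.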

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Ill-advised use of CMOVE

<0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25180&group=comp.arch#25180

X-Received: by 2002:ac8:7e94:0:b0:2f3:ce2b:c320 with SMTP id w20-20020ac87e94000000b002f3ce2bc320mr16775967qtj.670.1652212681441;
Tue, 10 May 2022 12:58:01 -0700 (PDT)
X-Received: by 2002:a05:6808:140e:b0:326:6a85:44f5 with SMTP id
w14-20020a056808140e00b003266a8544f5mr887006oiv.109.1652212681217; Tue, 10
May 2022 12:58:01 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 10 May 2022 12:58:01 -0700 (PDT)
In-Reply-To: <2022May10.194427@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0da:736:be6c:6d2c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0da:736:be6c:6d2c
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 10 May 2022 19:58:01 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4133
 by: MitchAlsup - Tue, 10 May 2022 19:58 UTC

On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >On 5/8/2022 8:22 AM, Terje Mathisen wrote:
> >> It is extremely hard to find micro benchmarks where using CMOV to
> >> eliminate a branch is a win, somewhat better for larger/full programs
> >> but still very rare.
> >
> >I am not questioning the truth of that, but I am trying to figure out
> >why.
> I would like to see some empirical support for that statement. It's
> pretty easy to design a micro benchmark where CMOV wins by a large
> margin. I guess what he meant is that in a typical microbenchmark
> aimed at some other characteristic, using CMOV instead of branching
> usually is a loss.
> >Since the CMOV is a pretty big win when the branch is
> >mispredicted, it must be a loss when the prediction would have been correct.
> >
> >Is this because
> >
> >The CMOV itself is too slow?
> In an OoO machine, this can mean several different things.
>
> * high resource usage. On the 21264, CMOV takes two
> microinstructions.
>
> * long latency. The numbers I have seen are a latency of 1 or 2
> cycles.
>
> * The instruction has to wait on results that take their time to
> materialize.
<
CMOV cannot begin executing until:
a) both operands are available
b) the condition is available
>
Instructions dependent on CMOV cannot begin execution until
c) CMOV delivers its result(s).
<
It is often this (c) that makes CMOV appear to be slow.
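
Points (a) to (c) can be written down as a toy timing model. This is only a sketch: times are in cycles, the function names are invented for the example, and the cost of the MOV itself in the branch case is assumed to be negligible.

```c
static long max2(long a, long b) { return a > b ? a : b; }

/* Cycle at which the CMOV result becomes available to dependent
   instructions: it waits for both operands and the condition, then
   pays its own latency. */
long cmov_result_time(long t_a, long t_b, long t_cond, long cmov_latency)
{
    return max2(max2(t_a, t_b), t_cond) + cmov_latency;
}

/* Correctly predicted branch around a MOV: dependents only wait for the
   operand on the predicted path; the condition is checked off to the side. */
long predicted_branch_result_time(long t_selected)
{
    return t_selected;
}
```

For example, with both operands ready at cycle 1, a condition that arrives at cycle 10 and a 1-cycle CMOV, dependents of the CMOV start at cycle 11, while dependents in the predicted-branch version can start at cycle 1.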
>
> My guess is that most cases of CMOV slowness are due to the last
> effect. CMOV has a data dependence on both of its data inputs, and on
> the control input. By contrast, a correctly predicted non-taken branch
> around a MOV incurs only the latency of one data input, and that's
> unavoidable. And if the latency chain of that MOV is short (e.g., it
> moves a constant), this cuts down the latency of all the instructions
> depending on the target of the MOV.
<
> >More code needs to be generated when using the CMOV so more instructions
> >are executed?
> I would hope that compilers would not go for CMOV in that case. I
> certainly have not seen cases where compilers produced a lot of code
> just to use CMOV.
<
In My 66000 ISA using CMOV takes no more instructions (and often fewer)
than either predication or branching.
<
> >The CMOV messes up speculative execution, so the instructions after it
> >cannot be executed where they would be in the branch case?
<
No, you just create data dependencies and suffer the result latency of the
CMOV.
<
> In a way, you could see the waiting effect in that category, but I am
> not sure if you had that in mind.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Ill-advised use of CMOVE

<t5ei5p$mjo$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25183&group=comp.arch#25183

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 15:31:42 -0500
Organization: A noiseless patient Spider
Lines: 94
Message-ID: <t5ei5p$mjo$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com>
<t595aj$ase$1@gioia.aioe.org>
<76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com>
<t5dtvr$oaf$1@dont-email.me>
<c8ffd9e0-3913-48f1-b7d6-8678a8c2374fn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 10 May 2022 20:32:57 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="6fad1b8eaaa2f0d0449235de8e2e6b99";
logging-data="23160"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19GwgLU2qXxd+9uyojeJpcP"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:6RC8gLHBqkxzqQlLGx/+vIe60NQ=
In-Reply-To: <c8ffd9e0-3913-48f1-b7d6-8678a8c2374fn@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 10 May 2022 20:31 UTC

On 5/10/2022 11:59 AM, MitchAlsup wrote:
> On Tuesday, May 10, 2022 at 9:48:30 AM UTC-5, Marcus wrote:
>> On 2022-05-08, MitchAlsup wrote:
>>> On Sunday, May 8, 2022 at 2:23:03 PM UTC-5, Terje Mathisen wrote:
>>>> MitchAlsup wrote:
>
>>> Predication actually has NOTHING to do with the width of the fetch or
>>> execute pipelines--although it tends to help wider pipelines more than
>>> narrower ones. But, when instructions have more than one size,
>>> predication also helps 1-wide in-order machines ! ! !
>>>>
>>>> The final requirement is that the branch join operation needs to be
>>>> fast, and CMOV have been 2 cycles of latency for a long time. (Is it
>>>> fixed by now?)
>>> <
>>> Note: CMOVEs can be effectively implemented with predication,
>>> <
>>> PNE {T}
>>> MOV Rd,Rs1
>>> <
>> Any particular reason why an explicit PNE (Predicate Not Equal, I
>> assume) instruction is better than repurposing a branch instruction and
>> interpreting it as a predication instruction in the front end? E.g:
> <
> Yes::
>>
> One can issue through the predicated set of instructions, and sort out
> which ones should be executed later.
> <
> Whereas:
> <
> The general philosophy of branches is to predict them and then attempt
> to get to the target rapidly. Thus, here, one could not issue through the
> predicated set of instructions.
>>
>> BEQ 1f // Skip next instruction if EQual
>> MOV Rd,Rs1
>> 1:
>>
>> /Marcus
> <
> Thus, by the time you get to decoding of the instructions at 1: at least
> 1 clock has transpired.

In my case, something like:
CMPEQ R4, R5
ADD?F R4, 1, R4
Will take 2 cycles.

And:
CMPEQ R4, R5
BT .L0
ADD R4, 1, R4
.L0:

Will take either 3 or 4 cycles (if predicted correctly), or 8 cycles
(mispredict).

In a lot of cases, this difference can be fairly noticeable (and is part
of why predication ended up being promoted to a core ISA feature).
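
As a rough illustration of those numbers, the average cost of the branch version can be sketched as a weighted sum of the predicted and mispredicted cases. This simple averaging model and the function name are assumptions made for the example, not something measured on the core:

```c
#include <assert.h>

/* Expected cycles for a branch sequence, given the fraction of executions
 * that mispredict. A plain weighted average; it ignores everything else
 * going on in the pipeline. */
double expected_branch_cycles(double mispredict_rate,
                              double predicted_cycles,
                              double mispredicted_cycles)
{
    assert(mispredict_rate >= 0.0 && mispredict_rate <= 1.0);
    return (1.0 - mispredict_rate) * predicted_cycles
         + mispredict_rate * mispredicted_cycles;
}
```

With the figures above (3 cycles predicted, 8 mispredicted), even a perfectly predicted branch costs 3 cycles against 2 for the predicated form, and every mispredict widens the gap.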

I am mostly ignoring OoO and speculative execution, as these fall
outside the scope of what I am targeting.

It can also be used for "Range Coders that aren't dead slow", but range
coding is still pretty slow.

At present my core has enough L1 cache to deal with 12-bit-limited
Huffman with 1 or 2 tables within a "reasonable" performance window.

For longer symbol lengths, some options are:
Direct 15 or 16 bit lookup;
Short 8 or 10 bit lookup, with a branch for dealing with longer symbols.

On both my CPU core, and on x86, the latter option tends to be faster
IME (though, still not as fast as limiting the symbol to 12 or 13 bits).
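
A toy sketch of the "short lookup plus branch" option, scaled down to a 2-bit primary table so it fits in a few lines. The code set (A=0, B=10, C=110, D=111), the table layout and all names are made up for the example; a real decoder would use the 8 or 10 bit primary table mentioned above:

```c
#include <assert.h>
#include <stdint.h>

#define PRIMARY_BITS 2            /* toy size; stands in for 8 or 10 */

typedef struct {
    uint8_t len;                  /* code length in bits, 0 = escape */
    char    sym;
} Entry;

/* Indexed by the next PRIMARY_BITS bits of the stream (MSB first). */
static const Entry primary_table[1 << PRIMARY_BITS] = {
    {1, 'A'}, {1, 'A'},           /* 00, 01: code "0"               */
    {2, 'B'},                     /* 10:     code "10"              */
    {0, 0},                       /* 11:     longer code, slow path */
};

/* Indexed by the one extra bit that follows the "11" prefix. */
static const Entry long_table[2] = { {3, 'C'}, {3, 'D'} };

/* Decode one symbol from a 32-bit MSB-first bit buffer, starting at bit
 * offset *pos, and advance *pos by the length of the code consumed. */
char decode_symbol(uint32_t bits, unsigned *pos)
{
    assert(*pos < 32);
    uint32_t window = bits << *pos;
    Entry e = primary_table[window >> (32 - PRIMARY_BITS)];
    if (e.len == 0)               /* rare branch for long symbols */
        e = long_table[(window >> (32 - PRIMARY_BITS - 1)) & 1];
    *pos += e.len;
    return e.sym;
}
```

The common short codes take one table load and no taken branch; only the escape entry pays for the second lookup.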

But, why not a limit of 10 or 11 bits?... Because at this point, it has
a more significant adverse effect on Huffman's ability to "actually do
anything" (one needs some spill-over space for longer symbols to give
shorter symbols a space to exist).

I have also noted that on both my ISA, and on x86, implementing a PNG
style Paeth filter or similar is faster with CMOV (x86) or predicated
instructions (my ISA), than it is with branches.
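
For reference, a sketch of the PNG Paeth predictor written both ways. The second version is only one way to express the selection without control flow; whether the compiler turns it into CMOV or predicated instructions is up to the compiler, and the function names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* Paeth predictor with branches, following the usual PNG definition:
 * a = left, b = above, c = upper-left. */
int paeth_branchy(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc)
        return a;
    if (pb <= pc)
        return b;
    return c;
}

/* Same result expressed as two value selects with no short-circuit
 * logic, which is the shape a compiler can map onto conditional moves. */
int paeth_select(int a, int b, int c)
{
    assert(a >= 0 && a <= 255 && b >= 0 && b <= 255 && c >= 0 && c <= 255);
    int p  = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    int bc     = (pb <= pc) ? b : c;          /* better of b and c     */
    int a_wins = (pa <= pb) & (pa <= pc);     /* bitwise AND, no jump  */
    return a_wins ? a : bc;
}
```

The three distances depend on the pixel data, so the branches in the first form are hard to predict on typical images, which fits the observation above.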

...

Re: Ill-advised use of CMOVE

<5a0d9ded-735a-4c4c-9953-4f462860f5a8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25185&group=comp.arch#25185

Newsgroups: comp.arch
Date: Tue, 10 May 2022 14:21:55 -0700 (PDT)
In-Reply-To: <t5ei5p$mjo$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0da:736:be6c:6d2c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0da:736:be6c:6d2c
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com> <t595aj$ase$1@gioia.aioe.org>
<76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com> <t5dtvr$oaf$1@dont-email.me>
<c8ffd9e0-3913-48f1-b7d6-8678a8c2374fn@googlegroups.com> <t5ei5p$mjo$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5a0d9ded-735a-4c4c-9953-4f462860f5a8n@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 10 May 2022 21:21:56 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4960
 by: MitchAlsup - Tue, 10 May 2022 21:21 UTC

On Tuesday, May 10, 2022 at 3:33:00 PM UTC-5, BGB wrote:
> On 5/10/2022 11:59 AM, MitchAlsup wrote:
> > On Tuesday, May 10, 2022 at 9:48:30 AM UTC-5, Marcus wrote:
> >> On 2022-05-08, MitchAlsup wrote:
> >>> On Sunday, May 8, 2022 at 2:23:03 PM UTC-5, Terje Mathisen wrote:
> >>>> MitchAlsup wrote:
> >
> >>> Predication actually has NOTHING to do with the width of the fetch or
> >>> execute pipelines--although it tends to help wider pipelines more than
> >>> narrower ones. But, when instructions have more than one size,
> >>> predication also helps 1-wide in-order machines ! ! !
> >>>>
> >>>> The final requirement is that the branch join operation needs to be
> >>>> fast, and CMOV have been 2 cycles of latency for a long time. (Is it
> >>>> fixed by now?)
> >>> <
> >>> Note: CMOVEs can be effectively implemented with predication,
> >>> <
> >>> PNE {T}
> >>> MOV Rd,Rs1
> >>> <
> >> Any particular reason why an explicit PNE (Predicate Not Equal, I
> >> assume) instruction is better than repurposing a branch instruction and
> >> interpreting it as a predication instruction in the front end? E.g:
> > <
> > Yes::
> >>
> > One can issue through the predicated set of instructions, and sort out
> > which ones should be executed later.
> > <
> > Whereas:
> > <
> > The general philosophy of branches is to predict them and then attempt
> > to get to the target rapidly. Thus, here, one could not issue through the
> > predicated set of instructions.
> >>
> >> BEQ 1f // Skip next instruction if EQual
> >> MOV Rd,Rs1
> >> 1:
> >>
> >> /Marcus
> > <
> > Thus, by the time you get to decoding of the instructions at 1: at least
> > 1 clock has transpired.
> In my case, something like:
> CMPEQ R4, R5
> ADD?F R4, 1, R4
> Will take 2 cycles.
>
> And:
> CMPEQ R4, R5
> BT .L0
> ADD R4, 1, R4
> .L0:
>
> Will take either 3 or 4 cycles (if predicted correctly), or 8 cycles
> (mispredict).
>
>
> In a lot of cases, this difference can be fairly noticeable (and is part
> of why predication ended up being promoted to a core ISA feature).
>
>
> I am mostly ignoring OoO and speculative execution, as these fall
> outside the scope of what I am targeting.
>
>
>
> It can also be used for "Range Coders that aren't dead slow", but range
> coding is still pretty slow.
<
if( 0 <= x && x <= MAX ) {then-clause}
<
CMP Rt,Rx,Rmax
BRIN Rt,end-if // RIN is Really In
// then-clause
end-if:
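
In C terms, the same two-sided test can be folded into one unsigned compare, which is roughly what a single range-check instruction buys you. This is a sketch of the generic trick, not of the My 66000 encoding, and the function names are made up:

```c
#include <assert.h>

/* Two compares and a logical AND, as written in the source. */
int in_range_two_compares(int x, int max)
{
    return 0 <= x && x <= max;
}

/* One compare: a negative x converts to a huge unsigned value, so it
 * fails the same test that rejects x > max. Requires max >= 0. */
int in_range_one_compare(int x, int max)
{
    assert(max >= 0);
    return (unsigned)x <= (unsigned)max;
}
```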
>
> At present my core has enough L1 cache to deal with 12-bit-limited
> Huffman with 1 or 2 tables within a "reasonable" performance window.
>
> For longer symbol lengths, some options are:
> Direct 15 or 16 bit lookup;
> Short 8 or 10 bit lookup, with a branch for dealing with longer symbols.
>
> On both my CPU core, and on x86, the latter option tends to be faster
> IME (though, still not as fast as limiting the symbol to 12 or 13 bits).
>
>
> But, why not a limit of 10 or 11 bits?... Because at this point, it has
> a more significant adverse effect on Huffman's ability to "actually do
> anything" (one needs some spill-over space for longer symbols to give
> shorter symbols a space to exist).
>
>
>
> I have also noted that on both my ISA, and on x86, implementing a PNG
> style Paeth filter or similar is faster with CMOV (x86) or predicated
> instructions (my ISA), than it is with branches.
>
> ...

Re: Ill-advised use of CMOVE

<t5enf7$4mf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25186&group=comp.arch#25186

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 17:02:04 -0500
Organization: A noiseless patient Spider
Lines: 95
Message-ID: <t5enf7$4mf$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me>
<jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org> <t5e9g7$c65$1@dont-email.me>
<2022May10.214311@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 10 May 2022 22:03:20 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7155b3a29c990c92a41c6ff303f1405a";
logging-data="4815"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/adxWdguXEM2iuwE06ib9d"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:pG3TMa/e1+NGKDR0e1C7OMGUAv0=
In-Reply-To: <2022May10.214311@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: BGB - Tue, 10 May 2022 22:02 UTC

On 5/10/2022 2:43 PM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> So if you don't use hardware "condition prediction", and the compiler by
>> itself doesn't know how well a particular branch will be predicted, we
>> are left with the aforementioned programmer provided hints, or perhaps
>> some form of profile driven optimization.
>
> Normal profiles don't tell you how predictable a branch is. Of course
> always-taken is predictable, but 50% taken might be unpredictable, or
> perfectly predictable.
>

I can note here that my branch predictor ended up with states both for
predicting branches which nearly always go the same way as last time,
and for branches which nearly always go the opposite way.

The "nearly always the same" case is the more common, more traditional
option, but "nearly always the opposite" seemed common enough to be
worth handling.

Though, this still leaves patterns which are (theoretically)
predictable, but would require more complex predictor logic, eg:
110, 110, 110, ...
1110, 1110, 1110, 1110, ...

So, we predict a period of, say, 1-3 bits, after which point the pattern
is expected to have one branch in the opposite direction (with longer
runs being expected to always branch in a single direction).

This would likely require quite a few more bits for the branch
predictor's state machine, though, and the predictor is already accurate
enough that this may not be worthwhile.

A 6-bit state could possibly pull off, say:
Runs of 0 (2 states, weak/strong);
Runs of 1 (2 states, weak/strong);
Runs of alternating 0/1 (~ 4 states);
Runs of 001 (~ 8 states);
Runs of 110 (~ 8 states);
Runs of 0001 (~ 16 states);
Runs of 1110 (~ 16 states);
Transition states (~ 8 could go here).

The state would need to encode the position within the pattern, and the
relative confidence (weak/strong) that the pattern would continue as-is.
So, mispredict knocks strong to weak, and weak to a different pattern
(longer or shorter alternation, or to one of the other solid-run patterns).

I can imagine the state graph for this, but not going to describe it
here as it would be annoyingly long.
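
As a stand-in for that state graph, here is a small simulation that makes the same point with a different mechanism: a table of 2-bit counters indexed by the last 3 outcomes (a generic local-history scheme, swapped in because it is short; it is not the 6-bit state machine described above). All names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define HIST_BITS 3

/* 2-bit saturating counter step; values 2 and 3 mean "predict taken". */
static unsigned sat_step(unsigned ctr, int taken)
{
    if (taken)
        return ctr < 3 ? ctr + 1 : 3;
    return ctr > 0 ? ctr - 1 : 0;
}

/* Mispredict count for a single 2-bit counter. */
size_t misses_plain_counter(const int *outcome, size_t n)
{
    unsigned ctr = 2;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        assert(outcome[i] == 0 || outcome[i] == 1);
        misses += ((ctr >= 2) != outcome[i]);
        ctr = sat_step(ctr, outcome[i]);
    }
    return misses;
}

/* Mispredict count when the last HIST_BITS outcomes pick the counter. */
size_t misses_local_history(const int *outcome, size_t n)
{
    unsigned table[1u << HIST_BITS];
    unsigned hist = 0;
    size_t misses = 0;
    for (size_t i = 0; i < (1u << HIST_BITS); i++)
        table[i] = 2;                        /* start weakly taken */
    for (size_t i = 0; i < n; i++) {
        assert(outcome[i] == 0 || outcome[i] == 1);
        misses += ((table[hist] >= 2) != outcome[i]);
        table[hist] = sat_step(table[hist], outcome[i]);
        hist = ((hist << 1) | (unsigned)outcome[i]) & ((1u << HIST_BITS) - 1);
    }
    return misses;
}
```

On a repeating 110 pattern the single counter mispredicts every third branch, while the history-indexed table stops mispredicting after a short warm-up.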

> So: compilers are bad at knowing predictability (even with profile
> feedback). Programmers are bad, too, unless they use performance
> counter results to learn it. Looks like a good candidate for a
> hardware solution to me.
>

Yeah.

It can be noted, this is one of those areas where I went with a hardware
solution...

As for whether or not I would have been better off going with
superscalar rather than WEX, this is TBD.

In theory, I could bolt superscalar onto my ISA as-is to potentially
grab up cases the compiler has missed (such as those which are
"theoretically safe" but don't match with the serial/parallel
equivalence rules, which would do things that aren't officially allowed
for the profile but the core in question is capable of doing, for code
not built with WEX enabled, or for code which uses 16-bit instruction
forms).

I have yet to really look into whether this would scavenge enough
additional ILP to be worthwhile (would need to model this).

In any case, there would likely be a lot more cases which the compiler
would see, but the superscalar logic would miss, if I try to keep its
complexity modest (to keep it cheap).

It is most likely that a hardware version would be fairly limited and
conservative (probably only looking for 2-wide cases in non-bundled
instruction sequences; which it would handle by behaving "as-if" the
instruction had its WEX bit set).

This is partly because I had already been looking at superscalar as a
possibility for RISC-V mode (and if I added it for RISC-V, I may as
well support it for BJX2 too).

...

Re: Ill-advised use of CMOVE

<t5eps6$o8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25188&group=comp.arch#25188

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
Date: Tue, 10 May 2022 17:43:07 -0500
Organization: A noiseless patient Spider
Lines: 123
Message-ID: <t5eps6$o8$1@dont-email.me>
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org>
<t58n6r$1riu$1@gioia.aioe.org>
<f1e1057a-966e-4838-885e-9f8a0bee12b8n@googlegroups.com>
<t595aj$ase$1@gioia.aioe.org>
<76560d31-593c-4afb-bce4-c96ae6953b26n@googlegroups.com>
<t5dtvr$oaf$1@dont-email.me>
<c8ffd9e0-3913-48f1-b7d6-8678a8c2374fn@googlegroups.com>
<t5ei5p$mjo$1@dont-email.me>
<5a0d9ded-735a-4c4c-9953-4f462860f5a8n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 10 May 2022 22:44:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7155b3a29c990c92a41c6ff303f1405a";
logging-data="776"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Qe2V82LTvdb0FfTOKauhI"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.8.0
Cancel-Lock: sha1:/3CsG6lv/KuL5A6IMNjk4hE5TQw=
In-Reply-To: <5a0d9ded-735a-4c4c-9953-4f462860f5a8n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 10 May 2022 22:43 UTC

On 5/10/2022 4:21 PM, MitchAlsup wrote:
> On Tuesday, May 10, 2022 at 3:33:00 PM UTC-5, BGB wrote:
>> On 5/10/2022 11:59 AM, MitchAlsup wrote:
>>> On Tuesday, May 10, 2022 at 9:48:30 AM UTC-5, Marcus wrote:
>>>> On 2022-05-08, MitchAlsup wrote:
>>>>> On Sunday, May 8, 2022 at 2:23:03 PM UTC-5, Terje Mathisen wrote:
>>>>>> MitchAlsup wrote:
>>>
>>>>> Predication actually has NOTHING to do with the width of the fetch or
>>>>> execute pipelines--although it tends to help wider pipelines more than
>>>>> narrower ones. But, when instructions have more than one size,
>>>>> predication also helps 1-wide in-order machines ! ! !
>>>>>>
>>>>>> The final requirement is that the branch join operation needs to be
>>>>>> fast, and CMOV have been 2 cycles of latency for a long time. (Is it
>>>>>> fixed by now?)
>>>>> <
>>>>> Note: CMOVEs can be effectively implemented with predication,
>>>>> <
>>>>> PNE {T}
>>>>> MOV Rd,Rs1
>>>>> <
>>>> Any particular reason why an explicit PNE (Predicate Not Equal, I
>>>> assume) instruction is better than repurposing a branch instruction and
>>>> interpreting it as a predication instruction in the front end? E.g:
>>> <
>>> Yes::
>>>>
>>> One can issue through the predicated set of instructions, and sort out
>>> which ones should be executed later.
>>> <
>>> Whereas:
>>> <
>>> The general philosophy of branches is to predict them and then attempt
>>> to get to the target rapidly. Thus, here, one could not issue through the
>>> predicated set of instructions.
>>>>
>>>> BEQ 1f // Skip next instruction if EQual
>>>> MOV Rd,Rs1
>>>> 1:
>>>>
>>>> /Marcus
>>> <
>>> Thus, by the time you get to decoding of the instructions at 1: at least
>>> 1 clock has transpired.
>> In my case, something like:
>> CMPEQ R4, R5
>> ADD?F R4, 1, R4
>> Will take 2 cycles.
>>
>> And:
>> CMPEQ R4, R5
>> BT .L0
>> ADD R4, 1, R4
>> .L0:
>>
>> Will take either 3 or 4 cycles (if predicted correctly), or 8 cycles
>> (mispredict).
>>
>>
>> In a lot of cases, this difference can be fairly noticeable (and is part
>> of why predication ended up being promoted to a core ISA feature).
>>
>>
>> I am mostly ignoring OoO and speculative execution, as these fall
>> outside the scope of what I am targeting.
>>
>>
>>
>> It can also be used for "Range Coders that aren't dead slow", but range
>> coding is still pretty slow.
> <
> if( 0 <= x && x <= MAX ) {then-clause}
> <
> CMP Rt,Rx,Rmax
> BRIN Rt,end-if // RIN is Really In
> // then-clause
> end-if:

It is mostly that range-coders need to use pipelined multiplies and
conditional re-normalization (say, when the high order bits of the high
and low range become equal).

It isn't as bad with predication as it is with branches, but it still
doesn't exactly manage to match the speed of a Huffman or Rice coder.

This is assuming a "bit at a time" Range Coder (like in LZMA or
similar), the ones which implement "symbol at a time" via integer
division, are basically "no chance"...

Though, with the general metric of "viability" being the ability to (on
a 50MHz core) get much over 1 million symbols per second or so...

A Range Coder is still hard-pressed to be able to cross this mark.
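
For anyone who has not looked at one, this is roughly what the per-bit work looks like in an LZMA-style binary range decoder. Note that this variant renormalizes when the range drops below 2^24 rather than by comparing the high bits of the low and high bounds. The constants follow the usual LZMA conventions; the struct and function names are made up for the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROB_BITS 11              /* probabilities are scaled to 2048 */
#define MOVE_BITS 5               /* adaptation speed                 */
#define TOP_VALUE (1u << 24)      /* renormalize below this range     */

typedef struct {
    uint32_t range, code;
    const uint8_t *in;            /* input bytes                      */
    size_t pos, len;
} RangeDec;

static uint8_t next_byte(RangeDec *rc)
{
    return rc->pos < rc->len ? rc->in[rc->pos++] : 0;
}

/* Decode one bit with an adaptive probability *prob (chance of a 0). */
int rc_decode_bit(RangeDec *rc, uint16_t *prob)
{
    assert(*prob > 0 && *prob < (1u << PROB_BITS));
    uint32_t bound = (rc->range >> PROB_BITS) * *prob;   /* the multiply */
    int bit;
    if (rc->code < bound) {       /* data-dependent, hard to predict */
        rc->range = bound;
        *prob += ((1u << PROB_BITS) - *prob) >> MOVE_BITS;
        bit = 0;
    } else {
        rc->range -= bound;
        rc->code  -= bound;
        *prob -= *prob >> MOVE_BITS;
        bit = 1;
    }
    if (rc->range < TOP_VALUE) {  /* conditional renormalization */
        rc->range <<= 8;
        rc->code = (rc->code << 8) | next_byte(rc);
    }
    return bit;
}
```

Each decoded bit carries a multiply, a data-dependent compare and a conditional renormalization on its critical path, which is why predication helps but does not close the gap to a table-driven Huffman decoder.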

>>
>> At present my core has enough L1 cache to deal with 12-bit-limited
>> Huffman with 1 or 2 tables within a "reasonable" performance window.
>>
>> For longer symbol lengths, some options are:
>> Direct 15 or 16 bit lookup;
>> Short 8 or 10 bit lookup, with a branch for dealing with longer symbols.
>>
>> On both my CPU core, and on x86, the latter option tends to be faster
>> IME (though, still not as fast as limiting the symbol to 12 or 13 bits).
>>
>>
>> But, why not a limit of 10 or 11 bits?... Because at this point, it has
>> a more significant adverse effect on Huffman's ability to "actually do
>> anything" (one needs some spill-over space for longer symbols to give
>> shorter symbols a space to exist).
>>
>>
>>
>> I have also noted that on both my ISA, and on x86, implementing a PNG
>> style Paeth filter or similar is faster with CMOV (x86) or predicated
>> instructions (my ISA), than it is with branches.
>>
>> ...

Re: Ill-advised use of CMOVE

<13a8e93b-31a3-4cfc-aa2c-a5f373a2e4abn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25189&group=comp.arch#25189

Newsgroups: comp.arch
Date: Tue, 10 May 2022 16:31:23 -0700 (PDT)
In-Reply-To: <t5enf7$4mf$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:a0da:736:be6c:6d2c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:a0da:736:be6c:6d2c
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org>
<t5e1ek$s7c$1@dont-email.me> <jwv4k1xs3ai.fsf-monnier+comp.arch@gnu.org>
<t5e9g7$c65$1@dont-email.me> <2022May10.214311@mips.complang.tuwien.ac.at> <t5enf7$4mf$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <13a8e93b-31a3-4cfc-aa2c-a5f373a2e4abn@googlegroups.com>
Subject: Re: Ill-advised use of CMOVE
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 10 May 2022 23:31:23 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Tue, 10 May 2022 23:31 UTC

On Tuesday, May 10, 2022 at 5:03:23 PM UTC-5, BGB wrote:
> On 5/10/2022 2:43 PM, Anton Ertl wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >> So if you don't use hardware "condition prediction", and the compiler by
> >> itself doesn't know how well a particular branch will be predicted, we
> >> are left with the aforementioned programmer provided hints, or perhaps
> >> some form of profile driven optimization.
> >
> > Normal profiles don't tell you how predictable a branch is. Of course
> > always-taken is predictable, but 50% taken might be unpredictable, or
> > perfectly predictable.
> >
> I can note here that my branch predictor ended up with states both for
> predicting branches which are always the same, and for branches which
> are nearly always the opposite.
>
> The "nearly always the same" case being the more common, more
> traditional option. But, nearly always the opposite, seemed common
> enough to be worthwhile.
>
As I have related in the past: the Mc 88120 had a branch predictor which
was based not on taken/not-taken but on agree/disagree. This allows
different branches which map to the same counter, one taken and one
not-taken, to use the same code in the prediction table.
<
Back when we were doing this, our branch predictor was about 90%-92%
accurate, and this distinction was useful for getting rid of 1/3rd of the
mispredicts. T/NT was 88%ish A/D was 92%ish.
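
A small simulation of the aliasing case described above: two branches, one always taken and one never taken, sharing one 2-bit counter. The per-branch bias bit is assumed to come from somewhere else (a static hint, say); how the 88120 actually obtained it is not stated here, and the names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int taken;   /* actual outcome, 0 or 1                   */
    int bias;    /* assumed per-branch bias direction, 0 or 1 */
} BranchEvent;

static unsigned sat_update(unsigned ctr, int up)
{
    if (up)
        return ctr < 3 ? ctr + 1 : 3;
    return ctr > 0 ? ctr - 1 : 0;
}

/* Shared counter trained on taken/not-taken. */
size_t misses_taken_counter(const BranchEvent *ev, size_t n)
{
    unsigned ctr = 2;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        assert(ev[i].taken == 0 || ev[i].taken == 1);
        misses += ((ctr >= 2) != ev[i].taken);
        ctr = sat_update(ctr, ev[i].taken);
    }
    return misses;
}

/* Shared counter trained on agree/disagree with the bias bit. */
size_t misses_agree_counter(const BranchEvent *ev, size_t n)
{
    unsigned ctr = 2;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        assert(ev[i].bias == 0 || ev[i].bias == 1);
        int predict_agree = (ctr >= 2);
        int prediction = predict_agree ? ev[i].bias : !ev[i].bias;
        misses += (prediction != ev[i].taken);
        ctr = sat_update(ctr, ev[i].taken == ev[i].bias);
    }
    return misses;
}
```

With the two branches interleaved, the T/NT counter is pulled in opposite directions and misses half the time, while the A/D counter sees "agree" from both and stays saturated.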
<
The way one uses an A/D predictor is to have a code cache organized
by branches, and code migrated into straightline decode strings. We
used what we called a packet cache, but you can use a trace cache,
and there are ways of organizing an instruction buffer to have this property.
>
>
> Though, this still leaves patterns which are (theoretically)
> predictable, but would require more complex predictor logic, eg:
> 110, 110, 110, ...
> 1110, 1110, 1110, 1110, ...
<
This is known as an autocorrelation predictor.
<
But all sorts of patterns can "fit"::
<
111011010 repeating.
<
I used a predictor such as this in the DRAM controller to prefetch data
to the DRAM read buffer in K9 and this went into Opteron Rev G.
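
A software sketch of the idea, assuming the simplest possible form: look for the shortest lag at which the recorded history matches itself exactly, then predict the outcome seen one period ago. Real hardware would do this with bounded state rather than a full history scan, and the function names are made up:

```c
#include <assert.h>
#include <stddef.h>

/* Smallest lag in 1..max_lag at which history h[0..n-1] repeats exactly,
 * or 0 if no such lag exists. */
size_t find_period(const int *h, size_t n, size_t max_lag)
{
    assert(h != NULL);
    for (size_t lag = 1; lag <= max_lag && lag < n; lag++) {
        size_t i;
        for (i = lag; i < n; i++)
            if (h[i] != h[i - lag])
                break;
        if (i == n)
            return lag;
    }
    return 0;
}

/* Predict the next outcome from the one seen a full period earlier;
 * use the fallback when no period was found. */
int predict_next(const int *h, size_t n, size_t max_lag, int fallback)
{
    size_t period = find_period(h, n, max_lag);
    return period ? h[n - period] : fallback;
}
```

On the 111011010 pattern this finds a period of 9 once enough history has been recorded, after which every prediction is just a lookup nine outcomes back.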
>
> So, we predict a period of, say, 1-3 bits, after which point the pattern
> is expected to have one branch in the opposite direction (with longer
> runs being expected to always branch in a single direction).
>
> This would likely require a fair bit more bits for the branch
> predictors' state machine though, and it is already accurate enough that
> this is maybe not worthwhile.
>
>
> A 6-bit state could possibly pull off, say:
> Runs of 0 (2 states, weak/strong);
> Runs of 1 (2 states, weak/strong);
> Runs of alternating 0/1 (~ 4 states);
> Runs of 001 (~ 8 states);
> Runs of 110 (~ 8 states);
> Runs of 0001 (~ 16 states);
> Runs of 1110 (~ 16 states);
> Transition states (~ 8 could go here).
>
> The state would need to encode the position within the pattern, and the
> relative confidence (weak/strong) that the pattern would continue as-is.
> So, mispredict knocks strong to weak, and weak to a different pattern
> (longer or shorter alternation, or to one of the other solid-run patterns).
>
> I can imagine the state graph for this, but not going to describe it
> here as it would be annoyingly long.
> > So: compilers are bad at knowing predictability (even with profile
> > feedback). Programmers are bad, too, unless they use performance
> > counter results to learn it. Looks like a good candidate for a
> > hardware solution to me.
> >
> Yeah.
>
> It can be noted, this is one of those areas where I went with a hardware
> solution...
>
> As for whether or not I would have been better off going with
> superscalar than WEX, this is TBD.
>
> In theory, I could bolt superscalar onto my ISA as-is to potentially
> grab up cases the compiler has missed (such as those which are
> "theoretically safe" but don't match with the serial/parallel
> equivalence rules, which would do things that aren't officially allowed
> for the profile but the core in question is capable of doing, for code
> not built with WEX enabled, or for code which uses 16-bit instruction
> forms).
>
> I have yet to really look into whether this would scavenge enough
> additional ILP to be worthwhile (would need to model this).
>
>
> In any case, there would likely be a lot more cases which the compiler
> would see, but the superscalar logic would miss, if I try to keep its
> complexity modest (to keep it cheap).
<
It is all about the patterns needed versus the patterns Decode can recognize.
>
> It is most likely that a hardware version would be fairly limited and
> conservative (probably only looking for 2-wide cases in non-bundled
> instruction sequences; which it would handle by behaving "as-if" the
> instruction had its WEX bit set).
>
> This is partly because I had already been looking at superscalar as a
> possibility for RISC-V mode (and if I added it for RISC-V, almost may as
> well support it for BJX2 as well).
>
> ...

Re: Ill-advised use of CMOVE

<hlPeK.4263$pqKf.2583@fx12.iad>

https://www.novabbs.com/devel/article-flat.php?id=25200&group=comp.arch#25200

From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Ill-advised use of CMOVE
References: <jwv1qx4xe9z.fsf-monnier+comp.arch@gnu.org> <t58n6r$1riu$1@gioia.aioe.org> <t5e1ek$s7c$1@dont-email.me> <2022May10.194427@mips.complang.tuwien.ac.at> <0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com>
In-Reply-To: <0555194d-edc8-48fb-b304-7f78d62255d3n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 80
Message-ID: <hlPeK.4263$pqKf.2583@fx12.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 11 May 2022 14:01:49 UTC
Date: Wed, 11 May 2022 10:00:52 -0400
X-Received-Bytes: 4169
 by: EricP - Wed, 11 May 2022 14:00 UTC

MitchAlsup wrote:
> On Tuesday, May 10, 2022 at 1:09:01 PM UTC-5, Anton Ertl wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>>> On 5/8/2022 8:22 AM, Terje Mathisen wrote:
>>>> It is extremely hard to find micro benchmarks where using CMOV to
>>>> eliminate a branch is a win, somewhat better for larger/full programs
>>>> but still very rare.
>>> I am not questioning the truth of that, but I am trying to figure out
>>> why.
>> I would like to see some empirical support for that statement. It's
>> pretty easy to design a micro benchmark where CMOV wins by a large
>> margin. I guess what he meant is that in a typical microbenchmark
>> aimed at some other characteristic using CMOV instead of branching
>> usually is a loss.
>>> Since the CMOV is a pretty big win when the branch is
>>> mispredicted, it must be a loss when the prediction would have been correct.
>>>
>>> Is this because
>>>
>>> The CMOV itself is too slow?
>> In an OoO machine, this can mean several different things.
>>
>> * high resource usage. On the 21264, CMOV takes two
>> microinstructions.

For CMOV, 1 or 2 extra uOps in a 100+ entry instruction queue is not a
problem, but for full predication this approach would not do;
that requires smarter uOps.

>>
>> * long latency. The numbers I have seen are a latency of 1 or 2
>> cycles.
>>
>> * The instruction has to wait on results that take their time to
>> materialize.

Note that Alpha CMOV is only reg<-reg
x86/x64 allows reg<-reg or reg<-mem (conditional load)
but does not support mem<-reg (conditional store).
Neither supports reg<-imm (conditional load immediate).

So Alpha must always do the expensive loads and stores and can only skip
the cheap reg<-reg move, which greatly limits its performance improvements.
x86 can skip some memory loads, which allows slightly more performance.

But in all cases you must have prepared the data registers for the
alternate data flow path in advance of the CMOV, which is usually the
expensive part, and then skip the register move which is the cheap part.
Which is why CMOV has limited benefits.
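
One common workaround for the missing conditional store, sketched in C: select the address instead of the data, then store unconditionally. This is a generic trick rather than anything specific to Alpha or x86, the function name is made up, and whether the compiler emits a CMOV for the address select is not guaranteed:

```c
#include <assert.h>
#include <stddef.h>

/* Store value to *dst only when cond is non-zero, without branching
 * around the store: the write always happens, but it lands in a local
 * scratch slot when the condition is false. */
void cond_store(int *dst, int value, int cond)
{
    assert(dst != NULL);
    int scratch;                           /* dummy store target */
    int *target = cond ? dst : &scratch;   /* reg<-reg select    */
    *target = value;
}
```

The value and both candidate addresses have to be ready before the select, which is the "prepare everything in advance" cost described above.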

> <
> CMOV cannot begin executing until:
> a) both operands are available
> b) the condition is available

In the Alpha 21264 case with its 2-uOp CMOV approach, each uOp can
proceed independently when the condition and its data source are ready.
Slightly better.

> Instructions dependent on CMOV cannot begin execution until
> c) CMOV delivers its result(s).
> <
> It is often this (c) that makes CMOV appear to be slow.

That is what I read WRT some of the Itanium predication -
that the control-dataflow can take longer than the data-dataflow.

Full predication is orders of magnitude more complicated than CMOV
but does contain more opportunities for HW run time optimizations.

This led them to propose what they called "predicate slip" whereby
the data side can proceed as soon as its operands are ready,
and the predicate state is checked before retire.

This gets messier if one desires to allow predicate slip but
later cancel pending uOps when the predicate resolves to disabled
so you don't perform work that you now know you are going to toss.
This is more complicated than a branch mispredict because a branch
mispredict flushes the instruction queue whereas this does not.
