Subject (Author)
* Compiling predicated insts to dataflow (Stefan Monnier)
+* Re: Compiling predicated insts to dataflow (MitchAlsup)
|+* Re: Compiling predicated insts to dataflow (Stefan Monnier)
||`- Re: Compiling predicated insts to dataflow (Ivan Godard)
|+- Re: Compiling predicated insts to dataflow (Ivan Godard)
|`* Re: Compiling predicated insts to dataflow (EricP)
| +* Re: Compiling predicated insts to dataflow (MitchAlsup)
| |`* Re: Compiling predicated insts to dataflow (EricP)
| | `* Re: Compiling predicated insts to dataflow (MitchAlsup)
| |  +* Re: Compiling predicated insts to dataflow (EricP)
| |  |`- Re: Compiling predicated insts to dataflow (MitchAlsup)
| |  `* Re: Compiling predicated insts to dataflow (Stephen Fuld)
| |   `* Re: Compiling predicated insts to dataflow (MitchAlsup)
| |    `* Re: Compiling predicated insts to dataflow (Stephen Fuld)
| |     `- Re: Compiling predicated insts to dataflow (MitchAlsup)
| `- Re: Compiling predicated insts to dataflow (EricP)
`- Re: Compiling predicated insts to dataflow (Ivan Godard)

Compiling predicated insts to dataflow

<jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23586&group=comp.arch#23586
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Compiling predicated insts to dataflow
Date: Wed, 16 Feb 2022 14:07:22 -0500
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="78f761c51fcb7be8c116fec35f477bf1";
logging-data="29386"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19wavNzNxzz81AnQt/jqH3P"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:1WcQlmKTYQRCQzKocJMflIFIQoc=
sha1:N5q84cjCmNrTsRAZImn/MgjVs48=
 by: Stefan Monnier - Wed, 16 Feb 2022 19:07 UTC

Regarding predication, I was wondering how it's handled in an OoO CPU.
E.g.

if (foo)
    x = a + 1;
else
    x = b + 2;
y = x[3];

say we compile this to something like:

PRED foo {True, False}
ADD x <- a, 1
ADD x <- b, 2
LD y <- x, 3

What will this turn into in the dataflow?
Will it be treated as:

x <- foo ? a + 1 : x
x <- foo ? x : b + 2
y <- x[3]

If so, that implies that the two ADDs can't be executed concurrently.

But if not, then what do we use as the "input node" for the `x` passed to
LD, since we will only know which node to use after foo is resolved?

Stefan
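
The trade-off in the question can be made concrete with a toy timing model. The sketch below is an illustration only: it assumes every operation takes one cycle, that a select-style op waits for all of its inputs, and that `foo` becomes available at cycle `t_foo`. The function names are made up for this example.

```python
def serial_select_form(t_foo, t_a=0, t_b=0, t_x0=0, latency=1):
    """Each predicated ADD is a 3-input select that also reads the old x."""
    add1 = max(t_foo, t_a, t_x0) + latency   # x1 <- foo ? a + 1 : x0
    add2 = max(t_foo, t_b, add1) + latency   # x2 <- foo ? x1 : b + 2
    load = add2 + latency                    # y  <- x2[3]
    return load


def phi_form(t_foo, t_a=0, t_b=0, latency=1):
    """Both ADDs run unpredicated; a separate phi/select picks one result."""
    add1 = t_a + latency                     # r0 <- a + 1
    add2 = t_b + latency                     # r1 <- b + 2
    phi = max(t_foo, add1, add2) + latency   # x  <- foo ? r0 : r1
    load = phi + latency                     # y  <- x[3]
    return load


if __name__ == "__main__":
    for t_foo in (0, 5):
        print(t_foo, serial_select_form(t_foo), phi_form(t_foo))
```

Under these assumptions the two forms tie when `foo` is ready early, and the serialized form loses one cycle when `foo` arrives late, because the second ADD cannot start until the first has finished.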

Re: Compiling predicated insts to dataflow

<7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23589&group=comp.arch#23589
X-Received: by 2002:a1c:c907:0:b0:37b:f983:5d4e with SMTP id f7-20020a1cc907000000b0037bf9835d4emr3175172wmb.174.1645045301078;
Wed, 16 Feb 2022 13:01:41 -0800 (PST)
X-Received: by 2002:a05:6870:a447:b0:d2:ca49:2a73 with SMTP id
n7-20020a056870a44700b000d2ca492a73mr1284947oal.21.1645045300511; Wed, 16 Feb
2022 13:01:40 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 16 Feb 2022 13:01:40 -0800 (PST)
In-Reply-To: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3db1:25d2:322a:440e;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3db1:25d2:322a:440e
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
Subject: Re: Compiling predicated insts to dataflow
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 16 Feb 2022 21:01:41 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Wed, 16 Feb 2022 21:01 UTC

On Wednesday, February 16, 2022 at 1:07:28 PM UTC-6, Stefan Monnier wrote:
> Regarding predication, I was wondering how it's handled in an OoO CPU.
> E.g.
>
> if (foo)
> x = a + 1;
> else
> x = b + 2;
> y = x[3];
<
x was scalar above and now vector ?!?
>
> say we compile this to something like:
>
> PRED foo {True, False}
> ADD x <- a, 1
> ADD x <- b, 2
> LD y <- x, 3
>
> What will this turn into in the dataflow.
> Will it be treated as:
>
> x <- foo ? a + 1: x
> x <- foo ? x : b + 2
> y <- x[3]
>
> If so, that implies that the two ADDs can't be executed concurrently.
>
> But if we don't, then what to put as "input node" for the `x` passed to
> LD since we will only know which node to use after foo is resolved?
>
>
> Stefan
<
You need to make a distinction between getting a calculation started
(dependent on operands) and getting a calculation finished (dependent
on WAR or WAW, and whether this instruction was supposed to execute.)
<
Sure you could delay the start until the predicate arrives, but you have
other options:
--------------------
a) Scoreboard style
have a <ahem> belt at the calculation unit where results reside until
it's known whether they should execute
So in your above example:
result [0] x = a+1
result [1] x = b+2
When predicate resolves you can deliver the now unique result to Rx.
Rx = PRED ? result[0] : result[1];
here only 1 result is delivered
In DECODE, you see that Rx is a destination of 2 instructions and you
give it the same name.
--------------------
b) Reservation station style:
you give Rx a different name for a+1 and for b+2
here, DECODE inserts PHI operations to choose which to route forward.
everyone dependent on Rx is waiting for PHI to deliver. PHI is dependent
on PRED, and all/both produced results.
Here both results are delivered and an extra op executes.
--------------------
c) Mill style:
I'll let Ivan do this one.

--------------------
It just depends on how the other parts of the execution window are
already working.
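
Options (a) and (b) can be contrasted with a small functional model. This is a sketch under stated assumptions, not a description of any real pipeline: the function names and the `Rx_then` / `Rx_else` register names are hypothetical, and "delivery" is modeled simply as an entry in a list of register writes.

```python
def scoreboard_style(foo, a, b):
    """Option (a): both results wait at the unit; one is delivered to Rx."""
    held = [a + 1, b + 2]             # result[0], result[1], not yet delivered
    rx = held[0] if foo else held[1]  # predicate resolves, pick the survivor
    deliveries = [("Rx", rx)]         # only one register write happens
    return rx, deliveries


def reservation_station_style(foo, a, b):
    """Option (b): two renamed destinations plus an inserted PHI op."""
    deliveries = [("Rx_then", a + 1), ("Rx_else", b + 2)]  # both delivered
    rx = deliveries[0][1] if foo else deliveries[1][1]     # the extra PHI op
    deliveries.append(("Rx", rx))
    return rx, deliveries
```

Both functions return the same value for Rx; they differ only in how many results are written, which is the cost difference the two options trade on.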

Re: Compiling predicated insts to dataflow

<jwvmtiq3469.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23596&group=comp.arch#23596
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
Date: Wed, 16 Feb 2022 16:25:56 -0500
Organization: A noiseless patient Spider
Lines: 55
Message-ID: <jwvmtiq3469.fsf-monnier+comp.arch@gnu.org>
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
<7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="78f761c51fcb7be8c116fec35f477bf1";
logging-data="17352"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19pL/1O0FmWyADk8THwQ06p"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:+lNZ33TqNVu6opiVlwxGvFhoGUY=
sha1:8XLIdFh7AZMUE5p8RlHY8psrBpk=
 by: Stefan Monnier - Wed, 16 Feb 2022 21:25 UTC

Thanks Mitch.

MitchAlsup [2022-02-16 13:01:40] wrote:
> Sure you could delay the start until the predicate arrives, but you have
> other options:
> --------------------
> a) Scoreboard style
> have a <ahem> belt at the calculation unit where results reside until
> ....its known that they should execute
> So in your above example:
> result [0] x = a+1
> result [1] x = b+2
> When predicate resolves you can deliver the now unique result to Rx.
> Rx = PRED ? result[0] : result[1];
> here only 1 result is delivered
> In DECODE, you see that Rx is a destination of 2 instructions and you
> give it the same name.

[ For the sake of simplicity, I'll call this "PRED ? result[0] : result[1];"
a "phi" node as well in the text below, tho it is admittedly something else. ]

> --------------------
> b) Reservation station style:
> you give Rx a different name for a+1 and for b+2
> here, DECODE insert PHI operations to choose which to route forward.
> everyone dependent on Rx is waiting for PHI to deliver. PHI is dependent
> on PRED, and all/both produced results.
> Here both results are delivered and an extra op executes.
> --------------------

Hmm... so in both cases you need to insert some kind of "phi" node.
If you can have several different predicates apply to different
instructions modifying the same register, it seems this could become
fairly complex (i.e. a "phi" node with 3 args or more, or a cascade of
several 2-arg "phi" nodes).

Does that mean that you have to impose a limit on the number of PREDs
that can apply to a given instruction (there is a natural limit that
comes from the max size of the shadow of a PRED, but I'd assume this
natural limit could result in too much complexity, handling phi nodes
that depend on too many predicates)?

> c) Mill style:
> I let Ivan do this one.

AFAIK Mill only has

x <- (foo ? a : b)

rather than predicated instructions, so the "phi" nodes are explicitly
present in the machine code so they don't need to infer/insert them at
run-time.

Stefan
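
The "cascade of 2-arg phi nodes versus one wide phi node" question can be illustrated functionally. The sketch below assumes each predicated writer of a register is described by an (enabled, result) pair in program order; both function names are invented for this example.

```python
from functools import reduce


def cascade_of_phis(old_value, writers):
    """Fold 2-input phi nodes over the writers in program order.

    writers is a list of (enabled, result) pairs, one per predicated
    instruction that targets the register.
    """
    return reduce(lambda value, w: w[1] if w[0] else value, writers, old_value)


def wide_phi(old_value, writers):
    """Single N-input phi: the last enabled writer wins, else the old value."""
    for enabled, result in reversed(writers):
        if enabled:
            return result
    return old_value
```

The two forms compute the same value. The cascade needs one 2-input phi per writer and forms a serial chain, while the wide form needs a single node whose input count grows with the number of writers.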

Re: Compiling predicated insts to dataflow

<sulguo$kpa$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23616&group=comp.arch#23616
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
Date: Thu, 17 Feb 2022 05:01:10 -0800
Organization: A noiseless patient Spider
Lines: 53
Message-ID: <sulguo$kpa$1@dont-email.me>
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 17 Feb 2022 13:01:12 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4cdbdab641c35097c9e27d444e25d5dd";
logging-data="21290"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+xCRLrSQZUwVMcbl8NX6qM"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:JiMdBVLiDcM+4r7ktvRVLTcNKFQ=
In-Reply-To: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Ivan Godard - Thu, 17 Feb 2022 13:01 UTC

On 2/16/2022 11:07 AM, Stefan Monnier wrote:
> Regarding predication, I was wondering how it's handled in an OoO CPU.
> E.g.
>
> if (foo)
> x = a + 1;
> else
> x = b + 2;
> y = x[3];
>
> say we compile this to something like:
>
> PRED foo {True, False}
> ADD x <- a, 1
> ADD x <- b, 2
> LD y <- x, 3
>
> What will this turn into in the dataflow.
> Will it be treated as:
>
> x <- foo ? a + 1: x
> x <- foo ? x : b + 2
> y <- x[3]
>
> If so, that implies that the two ADDs can't be executed concurrently.
>
> But if we don't, then what to put as "input node" for the `x` passed to
> LD since we will only know which node to use after foo is resolved?
>
>
> Stefan

Integer add is idempotent on most ISAs, so there's no real reason to use
predication for this on anything but an extremely narrow (e.g. 1-wide)
micro-architecture. You use whatever instructions the ISA has for
if-conversion - SELECT, CMOVE, PICK, ... The same is true for
static-scheduled machines.

You'd only use predication if the expressions were not idempotent, such
as if they contained stores:
    if (foo)
        x = a + 1;
    else
        x = z = b + 2;
    y = x[3];
or instructions that can fault:
    float x, a, b;
    if (foo)
        x = a + 1;
    else
        x = b + 2;
    y = x[3];
although that problem can be avoided by moving fault state into metadata.
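
One way to read "moving fault state into metadata" is that a faulting operation returns a marked value instead of trapping, and the fault is raised only if that value is later consumed. The sketch below illustrates the idea in Python. The `Poison` class and the helper names are invented for this example, division stands in for a floating-point operation that can fault, and nothing here is claimed to match any real ISA's mechanism.

```python
class Poison:
    """Marker standing in for a result whose computation faulted."""

    def __init__(self, reason):
        self.reason = reason


def speculative_div(num, den):
    """Divide without trapping: record the fault in the value instead."""
    if den == 0:
        return Poison("divide by zero")
    return num / den


def select(cond, if_true, if_false):
    """Plain select: a poisoned value is passed along like any other."""
    return if_true if cond else if_false


def consume(value):
    """Fault only when a poisoned value is actually used."""
    if isinstance(value, Poison):
        raise ArithmeticError(value.reason)
    return value
```

With this shape, both arms of the `if` can be computed unconditionally and merged with a select; the arm that would have faulted is harmless unless it is the one selected.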

Re: Compiling predicated insts to dataflow

<sulhle$2t1$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23617&group=comp.arch#23617
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
Date: Thu, 17 Feb 2022 05:13:18 -0800
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <sulhle$2t1$1@dont-email.me>
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
<7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 17 Feb 2022 13:13:18 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4cdbdab641c35097c9e27d444e25d5dd";
logging-data="2977"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Vymz7Omq/nH9AMGqSul36"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:JmlIjzomuYbZRsNv4Jc9zjtUAd0=
In-Reply-To: <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Thu, 17 Feb 2022 13:13 UTC

On 2/16/2022 1:01 PM, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 1:07:28 PM UTC-6, Stefan Monnier wrote:
>> Regarding predication, I was wondering how it's handled in an OoO CPU.
>> E.g.
>>
>> if (foo)
>> x = a + 1;
>> else
>> x = b + 2;
>> y = x[3];
> <
> x was scalar above and now vector ?!?
>>
>> say we compile this to something like:
>>
>> PRED foo {True, False}
>> ADD x <- a, 1
>> ADD x <- b, 2
>> LD y <- x, 3
>>
>> What will this turn into in the dataflow.
>> Will it be treated as:
>>
>> x <- foo ? a + 1: x
>> x <- foo ? x : b + 2
>> y <- x[3]
>>
>> If so, that implies that the two ADDs can't be executed concurrently.
>>
>> But if we don't, then what to put as "input node" for the `x` passed to
>> LD since we will only know which node to use after foo is resolved?
>>
>>
>> Stefan
> <

<snip>

> --------------------
> c) Mill style:
> I let Ivan do this one.

// commas separate instructions in a bundle; semicolons separate bundles
%0<-add(%a, 1), %1<-add(%b,2), %2<-pick(%foo, %0, %1);
#0:load(%2, 3, <width>);
.... // some time later, the load hoisted as early as possible
%y<-loadRetire(#0)

The if-conversion is done by the specializer.

>
> --------------------
> It just depends on how the other parts of the execution window are
> already working.

Re: Compiling predicated insts to dataflow

<sulj3i$1kh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23618&group=comp.arch#23618
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
Date: Thu, 17 Feb 2022 05:37:52 -0800
Organization: A noiseless patient Spider
Lines: 84
Message-ID: <sulj3i$1kh$1@dont-email.me>
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
<7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<jwvmtiq3469.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 17 Feb 2022 13:37:54 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4cdbdab641c35097c9e27d444e25d5dd";
logging-data="1681"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18fc1W736sZ+wn/ITnpAFYy"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:Gofi4J6WxWJ94d0ALWfuwUvYi00=
In-Reply-To: <jwvmtiq3469.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Ivan Godard - Thu, 17 Feb 2022 13:37 UTC

On 2/16/2022 1:25 PM, Stefan Monnier wrote:
> Thanks Mitch.
>
> MitchAlsup [2022-02-16 13:01:40] wrote:
>> Sure you could delay the start until the predicate arrives, but you have
>> other options:
>> --------------------
>> a) Scoreboard style
>> have a <ahem> belt at the calculation unit where results reside until
>> ....its known that they should execute
>> So in your above example:
>> result [0] x = a+1
>> result [1] x = b+2
>> When predicate resolves you can deliver the now unique result to Rx.
>> Rx = PRED ? result[0] : result[1];
>> here only 1 result is delivered
>> In DECODE, you see that Rx is a destination of 2 instructions and you
>> give it the same name.
>
> [ For the sake of simplicity, I'll call this "PRED ? result[0] : result[1];"
> a "phi" node as well in the text below, tho it is admittedly something else. ]
>
>> --------------------
>> b) Reservation station style:
>> you give Rx a different name for a+1 and for b+2
>> here, DECODE insert PHI operations to choose which to route forward.
>> everyone dependent on Rx is waiting for PHI to deliver. PHI is dependent
>> on PRED, and all/both produced results.
>> Here both results are delivered and an extra op executes.
>> --------------------
>
> Hmm... so in both cases you need to insert some kind of "phi" node.
> If you can have several different predicates apply to different
> instructions modifying the same register, it seems this could become
> fairly complex (i.e. a "phi" node with 3 args or more, or a cascade of
> several 2-arg "phi" nodes).
>
> Does that mean that you have to impose a limit on the number of PREDs
> that can apply to a given instruction (there's is a natural a limit that
> comes from the max size of the shadow of a PRED, but I'd assume this
> natural limit could result in too much complexity, handing phi nodes
> that depend on too many predicates)?
>
>> c) Mill style:
>> I let Ivan do this one.
>
> AFAIK Mill only has
>
> x <- (foo ? a : b)
>
> rather than predicated instructions, so the "phi" nodes are explicitly
> present in the machine code so they don't need to infer/insert them at
> run-time.

Mill has both ?: (called "pick()") and also predicated forms of all
non-idempotent instructions. Pick ?: will give better code than
predication when it can be used, because a belt doesn't let two
instruction results drop to the same belt position the way Mitch can use
the same result register.

However, Mill can if-convert even blocks that are not idempotent, such
as if they contain control flow or store. For example:
    if (foo)
        x = z[i] = a+1;
    else
        bar(y, x = b+2);
the specializer would if-convert the "x=..." part using pick, but the
store "z[i]=" and the call "bar(...)" would be predicated. The conAsm
code is:
%0<-add(%a, 1),
%1<-add(%b, 2),
store(%foo, &z, %i, %0),
%x<-pick(%foo, %0, %1),
call(%foo, &bar, %y, %1);

Yes, that's one bundle and one cycle, not counting the bar() body if the
call is taken.

If-conversion is a big win over branching if the predicate (foo) is even
mildly unpredictable. It is a win over predication if the HW has enough
width to execute the merged blocks without running out of FUs, but the
instructions have to be idempotent, which rules out FP on most ISAs, and
should rule out most integer instructions too, rather than abusing the UB definitions.
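
The mixed pick-plus-predication example above can be checked against the branching source with a small functional model. This is only a sketch: it assumes the store is predicated on `foo` being true and the call on `foo` being false (the listing passes `%foo` to both, so the sense of each is an assumption here), and `branchy` / `if_converted` are names invented for the illustration.

```python
def branchy(foo, a, b, i, y, z, bar):
    """Reference semantics of the source fragment."""
    if foo:
        x = z[i] = a + 1
    else:
        x = b + 2
        bar(y, x)
    return x


def if_converted(foo, a, b, i, y, z, bar):
    """Both adds always run; only the store and the call are predicated."""
    t0 = a + 1                 # %0 <- add(%a, 1)
    t1 = b + 2                 # %1 <- add(%b, 2)
    if foo:                    # store predicated on foo
        z[i] = t0
    x = t0 if foo else t1      # %x <- pick(%foo, %0, %1)
    if not foo:                # call predicated on !foo (assumed sense)
        bar(y, t1)
    return x
```

The adds have no side effects, so running both is harmless; the store and the call are the only operations whose effects must be suppressed on the untaken side.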

Re: Compiling predicated insts to dataflow

<NVuPJ.17969$V7da.14032@fx13.iad>

https://www.novabbs.com/devel/article-flat.php?id=23638&group=comp.arch#23638
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!feeder1.feed.usenet.farm!feed.usenet.farm!peer02.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx13.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
In-Reply-To: <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 133
Message-ID: <NVuPJ.17969$V7da.14032@fx13.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 17 Feb 2022 16:42:21 UTC
Date: Thu, 17 Feb 2022 11:42:06 -0500
X-Received-Bytes: 5490
 by: EricP - Thu, 17 Feb 2022 16:42 UTC

MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 1:07:28 PM UTC-6, Stefan Monnier wrote:
>> Regarding predication, I was wondering how it's handled in an OoO CPU.
>> E.g.
>>
>> if (foo)
>> x = a + 1;
>> else
>> x = b + 2;
>> y = x[3];
> <
> x was scalar above and now vector ?!?
>> say we compile this to something like:
>>
>> PRED foo {True, False}
>> ADD x <- a, 1
>> ADD x <- b, 2
>> LD y <- x, 3
>>
>> What will this turn into in the dataflow.
>> Will it be treated as:
>>
>> x <- foo ? a + 1: x
>> x <- foo ? x : b + 2
>> y <- x[3]
>>
>> If so, that implies that the two ADDs can't be executed concurrently.
>>
>> But if we don't, then what to put as "input node" for the `x` passed to
>> LD since we will only know which node to use after foo is resolved?
>>
>>
>> Stefan
> <
> You need to make a distinction between getting a calculation started
> (dependent on operands) and getting a calculation finished (dependent
> on WAR or WAW, and whether this instruction was supposed to execute.)
> <
> Sure you could delay the start until the predicate arrives, but you have
> other options:
> --------------------
> a) Scoreboard style
> have a <ahem> belt at the calculation unit where results reside until
> .....its known that they should execute
> So in your above example:
> result [0] x = a+1
> result [1] x = b+2
> When predicate resolves you can deliver the now unique result to Rx.
> Rx = PRED ? result[0] : result[1];
> here only 1 result is delivered
> In DECODE, you see that Rx is a destination of 2 instructions and you
> give it the same name.

This optimization would be tricky as Rename must look at the
whole shadow, up to 8 instructions in My66k case, to detect
the dynamic-MUX (or whatever we call this) as the PRED mask could be
{ T, T, T, T, T, T, F, F } or be { T, T, T, F, T, T, T, F }
or any variation thereof, as all produce the same result.

To look at the whole window of 8 instructions and edit them
all at once, it would have to stall the front end pipeline,
just in case this edit was possible.

Of course that's not practical, so we limit the edit to masks of length 2,
provided the two uOps sit side by side in the front end buffers.
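
The restricted check can be sketched as a simple predicate over the shadow. This is a hypothetical model, not My66k's or any simulator's actual logic: the shadow is assumed to be a list of (sense, dest) pairs, and the function name is invented.

```python
def can_share_destination(shadow):
    """Check the restricted case: a 2-entry shadow forming a then/else pair.

    shadow is a list of (sense, dest) tuples, where sense is True for an
    instruction in the then-clause and False for one in the else-clause.
    """
    if len(shadow) != 2:
        return False
    (sense0, dest0), (sense1, dest1) = shadow
    return sense0 != sense1 and dest0 == dest1
```

For example, `can_share_destination([(True, "x"), (False, "x")])` holds, while any longer shadow is rejected outright rather than searched for matching pairs.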

> --------------------
> b) Reservation station style:
> you give Rx a different name for a+1 and for b+2
> here, DECODE insert PHI operations to choose which to route forward.
> everyone dependent on Rx is waiting for PHI to deliver. PHI is dependent
> on PRED, and all/both produced results.
> Here both results are delivered and an extra op executes.
> --------------------
> c) Mill style:
> I let Ivan do this one.
>
> --------------------
> It just depends on how the other parts of the execution window are
> already working.

D) When PRED executes and predicate value resolves,
dynamically elide dependency chains for disabled parts of uOps
so at least the uOps don't wait for prior results they no longer need.

Below I add the version number on X_n that Rename would create,
assuming the above optimization was NOT performed:

ld foo
ld a
ld b
x_2 <- foo ? a + 1 : x_1     ADD#1
x_3 <- foo ? x_2 : b + 2     ADD#2
y <- x_3[3]

The renamer makes ADD#2 serially dependent on ADD#1.

As soon as foo is known and PRED executes, it forwards the predicate
value to all dependent instructions, many of which are hopefully
already waiting in reservation stations (they could be anywhere
in the front end, including not yet decoded).

Each RS uOp can prune the input dependency:
- if foo == 0 then
  - ADD#1 prunes the dependency on 'a' and keeps 'x_1'
  - ADD#2 prunes the dependency on 'x_2' and keeps 'b'
- if foo == 1 then
  - ADD#1 prunes the dependency on 'x_1' and keeps 'a'
  - ADD#2 prunes the dependency on 'b' and keeps 'x_2'

Net result is that if foo == 0 then ADD#2 can launch as soon as 'b' is
valid without waiting for 'x_1', 'a' or 'x_2'. However, if foo == 1 then
ADD#2 is still serially dependent on ADD#1 to copy the x_2 value into x_3.

With the rename optimization x_3 is eliminated:

ld foo
ld a
ld b
x_2 <- foo ? a + 1 : x_1     ADD#1
x_2 <- foo ? x_1 : b + 2     ADD#2
y <- x_2[3]

Both ADD#1 and ADD#2 output to x_2, and ADD#2 is no longer
serially dependent on ADD#1 if foo == 0.

(In my simulated uArch, canceling the RS uOp dependency requires
matching a forwarding tag id assigned to each predicate value,
and flipping an RS field from Valid or MatchTag to Ignore.
The predicate value can have its own 1-bit forwarding bus,
and would have its own wake-up matrix to alert waiting RS uOps.)
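
A minimal sketch of the pruning step for the non-optimized x_1 / x_2 / x_3 renaming, assuming each RS entry tracks a per-source state of Valid, MatchTag or Ignore as in the note above. The class, its fields and its method names are hypothetical, and the predicate's own forwarding bus and wake-up matrix are not modeled.

```python
VALID, MATCH_TAG, IGNORE = "Valid", "MatchTag", "Ignore"


class RsEntry:
    """A reservation-station uOp for one predicated instruction."""

    def __init__(self, name, sense, enabled_srcs, passthrough_srcs):
        self.name = name
        self.sense = sense  # predicate value for which the op is enabled
        self.enabled_srcs = enabled_srcs          # operands of the real op
        self.passthrough_srcs = passthrough_srcs  # old dest value to copy
        # Every source starts out waiting for its producer's tag.
        self.state = {s: MATCH_TAG for s in enabled_srcs + passthrough_srcs}

    def predicate_resolved(self, value):
        """Drop the sources that the resolved predicate makes unnecessary."""
        if value == self.sense:
            unused = self.passthrough_srcs
        else:
            unused = self.enabled_srcs
        for src in unused:
            self.state[src] = IGNORE

    def wakeup(self, src):
        """A producer broadcast its tag: mark the operand as available."""
        if self.state.get(src) == MATCH_TAG:
            self.state[src] = VALID

    def ready(self):
        return all(st in (VALID, IGNORE) for st in self.state.values())

    def live_sources(self):
        return [s for s, st in self.state.items() if st != IGNORE]


def make_example():
    """ADD#1 is enabled when foo is true, ADD#2 when foo is false."""
    add1 = RsEntry("ADD#1", True, ["a"], ["x_1"])
    add2 = RsEntry("ADD#2", False, ["b"], ["x_2"])
    return add1, add2
```

Calling `predicate_resolved(False)` on both entries leaves ADD#1 waiting only on x_1 and ADD#2 waiting only on b, which matches the foo == 0 case in the list above.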

Re: Compiling predicated insts to dataflow

<48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23641&group=comp.arch#23641
X-Received: by 2002:a05:600c:4f08:b0:37b:e830:d231 with SMTP id l8-20020a05600c4f0800b0037be830d231mr4221195wmq.144.1645129137822;
Thu, 17 Feb 2022 12:18:57 -0800 (PST)
X-Received: by 2002:a9d:708e:0:b0:5ac:fa7e:be84 with SMTP id
l14-20020a9d708e000000b005acfa7ebe84mr1472377otj.129.1645129134767; Thu, 17
Feb 2022 12:18:54 -0800 (PST)
Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 17 Feb 2022 12:18:54 -0800 (PST)
In-Reply-To: <NVuPJ.17969$V7da.14032@fx13.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2029:4bfa:a509:1325;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2029:4bfa:a509:1325
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
Subject: Re: Compiling predicated insts to dataflow
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 17 Feb 2022 20:18:57 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 17 Feb 2022 20:18 UTC

On Thursday, February 17, 2022 at 10:42:26 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, February 16, 2022 at 1:07:28 PM UTC-6, Stefan Monnier wrote:
> >> Regarding predication, I was wondering how it's handled in an OoO CPU.
> >> E.g.
> >>
> >> if (foo)
> >> x = a + 1;
> >> else
> >> x = b + 2;
> >> y = x[3];
> > <
> > x was scalar above and now vector ?!?
> >> say we compile this to something like:
> >>
> >> PRED foo {True, False}
> >> ADD x <- a, 1
> >> ADD x <- b, 2
> >> LD y <- x, 3
> >>
> >> What will this turn into in the dataflow.
> >> Will it be treated as:
> >>
> >> x <- foo ? a + 1: x
> >> x <- foo ? x : b + 2
> >> y <- x[3]
> >>
> >> If so, that implies that the two ADDs can't be executed concurrently.
> >>
> >> But if we don't, then what to put as "input node" for the `x` passed to
> >> LD since we will only know which node to use after foo is resolved?
> >>
> >>
> >> Stefan
> > <
> > You need to make a distinction between getting a calculation started
> > (dependent on operands) and getting a calculation finished (dependent
> > on WAR or WAW, and whether this instruction was supposed to execute.)
> > <
> > Sure you could delay the start until the predicate arrives, but you have
> > other options:
> > --------------------
> > a) Scoreboard style
> > have a <ahem> belt at the calculation unit where results reside until
> > .....its known that they should execute
> > So in your above example:
> > result [0] x = a+1
> > result [1] x = b+2
> > When predicate resolves you can deliver the now unique result to Rx.
> > Rx = PRED ? result[0] : result[1];
> > here only 1 result is delivered
> > In DECODE, you see that Rx is a destination of 2 instructions and you
> > give it the same name.
> This optimization would be tricky as Rename must look at the
> whole shadow, up to 8 instructions in My66k case, to detect
> the dynamic-MUX (or whatever we call this) as the PRED mask could be
> { T, T, T, T, T, T, F, F } or be { T, T, T, F, T, T, T, F }
> or any variation thereof, as all produce the same result.
>
> To look at the whole window of 8 instructions and edit them
> all at once, it would have to stall the front end pipeline,
> just in case this edit was possible.
>
> Of course that's not practical so we limit the edit to masks of length 2,
> provided it has two uOps side by side in the front end buffers.
> > --------------------
> > b) Reservation station style:
> > you give Rx a different name for a+1 and for b+2
> > here, DECODE insert PHI operations to choose which to route forward.
> > everyone dependent on Rx is waiting for PHI to deliver. PHI is dependent
> > on PRED, and all/both produced results.
> > Here both results are delivered and an extra op executes.
> > --------------------
> > c) Mill style:
> > I let Ivan do this one.
> >
> > --------------------
> > It just depends on how the other parts of the execution window are
> > already working.
> D) When PRED executes and predicate value resolves,
> dynamically elide dependency chains for disabled parts of uOps
> so at least the uOps don't wait for prior results they no longer need.
>
> Below I add the version number on X_n that Rename would create,
> assuming the above optimization was NOT performed:
>
> ld foo
> ld a
> ld b
> x_2 <- foo ? a + 1: x_1 ADD#1
> x_3 <- foo ? x_2 : b + 2 ADD#2
> y <- x_3[3]
>
> The renamer makes ADD#2 serially dependent on ADD#1.
<
OK, first, you are considering a reservation-station-like machine;
scoreboards don't (typically) use renamers.
The renamer needs to take the PRED mask into consideration when renaming:
<
result [0] x = a+1 then-clause
result [1] x = b+2 else-clause
Here, the renamer sees that it can assign a physical register to x just once.
When I looked at it, it looked quadratically hard; that is, something you can
do in HW only with a limited horizon.
>
> As soon as foo is known and PRED executes, it forwards the predicate
> value to all dependent instructions, many of which are hopefully
> already waiting in reservation Stations (they could be anywhere
> in the front end, including not yet decoded).
<
Ideally, they would be waiting at the end of the calculation unit to deliver
their results (i.e., already out of the station). PREDs resolve in order to
complete instructions in their shadow, not to initiate their execution.
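
The difference between gating the start and gating the completion on the predicate can be put in numbers with a two-line model. This is a sketch with invented function names; times are in cycles and the unit is assumed to have a fixed latency.

```python
def complete_when_start_is_gated(operands_ready, pred_ready, latency):
    """The op may not begin until the predicate has resolved."""
    return max(operands_ready, pred_ready) + latency


def complete_when_delivery_is_gated(operands_ready, pred_ready, latency):
    """The op runs as soon as operands arrive; only delivery waits."""
    return max(operands_ready + latency, pred_ready)
```

With operands ready at cycle 0, a 3-cycle unit and a predicate that resolves at cycle 5, the first form completes at cycle 8 and the second at cycle 5; when the predicate is early the two coincide.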
>
> Each RS uOp can prune the input dependency:
> - if foo == 0 then
> - ADD#1 prunes the dependency on 'a' and keeps 'x_1'
> - ADD#2 prunes the dependency on x_2 and keeps 'b'
> - if foo == 1 then
> - ADD#1 prunes the dependency on 'x_1' and keeps 'a'
> - ADD#2 prunes the dependency on 'b' and keeps 'x_2'
>
> Net result is that if foo == 0 then ADD#2 can launch as soon as 'b' is
> valid without waiting for x_1, or 'a' or x_2. However if foo == 1 then
> ADD#2 is still serially dependent on ADD#1 to copy the x_2 value into x_3.
>
> With the rename optimization x_3 is eliminated
>
> ld foo
> ld a
> ld b
> x_2 <- foo ? a + 1: x_1 ADD#1
> x_2 <- foo ? x_1 : b + 2 ADD#2
> y <- x_2[3]
>
> both ADD#1 and ADD#2 output to x_2 and ADD2 is no longer
> serially dependent on ADD#1 if foo == 0.
>
> (In my simulated uArch, canceling the RS uOp dependency requires
> matching a forwarding tag id assigned to each predicate value,
> and flipping a RS field from Valid or MatchTag to Ignore.
> The predicate value can have its own 1-bit forwarding bus,
> and would have its own wake-up matrix to alert waiting RS uOps.)

Re: Compiling predicated insts to dataflow

<9byPJ.6566$uW1.501@fx27.iad>

https://www.novabbs.com/devel/article-flat.php?id=23643&group=comp.arch#23643
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!npeer.as286.net!npeer-ng0.as286.net!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx27.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com> <NVuPJ.17969$V7da.14032@fx13.iad>
In-Reply-To: <NVuPJ.17969$V7da.14032@fx13.iad>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 46
Message-ID: <9byPJ.6566$uW1.501@fx27.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 17 Feb 2022 20:25:41 UTC
Date: Thu, 17 Feb 2022 15:25:16 -0500
X-Received-Bytes: 2637
 by: EricP - Thu, 17 Feb 2022 20:25 UTC

EricP wrote:
> MitchAlsup wrote:
>> In DECODE, you see that Rx is a destination of 2 instructions and you
>> give it the same name.
>
> This optimization would be tricky as Rename must look at the
> whole shadow, up to 8 instructions in My66k case, to detect
> the dynamic-MUX (or whatever we call this) as the PRED mask could be
> { T, T, T, T, T, T, F, F } or be { T, T, T, F, T, T, T, F }
> or any variation thereof, as all produce the same result.
>
> To look at the whole window of 8 instructions and edit them
> all at once, it would have to stall the front end pipeline,
> just in case this edit was possible.
>
> Of course that's not practical so we limit the edit to masks of length 2,
> provided it has two uOps side by side in the front end buffers.
>
>
> With the rename optimization x_3 is eliminated
>
> ld foo
> ld a
> ld b
> x_2 <- foo ? a + 1: x_1 ADD#1
> x_2 <- foo ? x_1 : b + 2 ADD#2
> y <- x_2[3]
>
> both ADD#1 and ADD#2 output to x_2 and ADD#2 is no longer
> serially dependent on ADD#1 if foo == 0.

And with the rename optimization, an interrupt or exception cannot be taken
between ADD#1 and ADD#2, as that would leave x_2 with a non-existent value.
(This only happens with this Rename optimization, and not for sequences
of PRED shadow instructions which remain serially dependent in order to
propagate unchanged values and therefore leave the physical register
with a valid value if the chain is interrupted.)

Not sure how to handle that cheaply...
maybe mark ADD#1 uOp as "not a retire point"
and ADD#2 as "is a retire point" so Retire steps over
ADD#1 marked Done and sees if ADD#2 is marked Done and retires both
together by adding both instruction lengths to committed RIP,
and incrementing the Instruction Queue tail pointer by 2.
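
A minimal Python sketch of that retire rule. The names (UOp, retire) and the 4-byte instruction lengths are assumptions made for illustration, not details of My66k or of any simulator mentioned here; the instruction queue is modelled as a deque whose left end is the oldest uOp.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class UOp:
    name: str
    length: int          # instruction length in bytes (assumed value)
    retire_point: bool   # False: must retire together with a later uOp
    done: bool = False


def retire(queue, rip):
    """Retire as many complete groups as possible; return the new committed RIP."""
    while queue:
        group = []
        for uop in queue:
            group.append(uop)
            if uop.retire_point:
                break
        else:
            return rip   # no retire point queued yet: the group is incomplete
        if not all(u.done for u in group):
            return rip   # some member is not Done: nothing in the group retires
        for u in group:
            queue.popleft()      # advance the queue past each retired uOp
            rip += u.length      # add each instruction length to the committed RIP
    return rip


iq = deque([
    UOp("ADD#1", 4, retire_point=False, done=True),
    UOp("ADD#2", 4, retire_point=True, done=False),
])
rip = retire(iq, 0x1000)   # ADD#1 is Done but cannot retire on its own
iq[1].done = True
rip = retire(iq, rip)      # both retire together
```

An interrupt taken while the pair is incomplete would then see the committed RIP still pointing at ADD#1, so neither half of the shared x_2 write is architecturally visible.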

Re: Compiling predicated insts to dataflow

<MGyPJ.38899$yi_7.34012@fx39.iad>

https://www.novabbs.com/devel/article-flat.php?id=23645&group=comp.arch#23645

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx39.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com> <NVuPJ.17969$V7da.14032@fx13.iad> <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
In-Reply-To: <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 60
Message-ID: <MGyPJ.38899$yi_7.34012@fx39.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 17 Feb 2022 20:59:24 UTC
Date: Thu, 17 Feb 2022 15:59:14 -0500
X-Received-Bytes: 3216
 by: EricP - Thu, 17 Feb 2022 20:59 UTC

MitchAlsup wrote:
> On Thursday, February 17, 2022 at 10:42:26 AM UTC-6, EricP wrote:
>> D) When PRED executes and predicate value resolves,
>> dynamically elide dependency chains for disabled parts of uOps
>> so at least the uOps don't wait for prior results they no longer need.
>>
>> Below I add the version number on X_n that Rename would create,
>> assuming the above optimization was NOT performed:
>>
>> ld foo
>> ld a
>> ld b
>> x_2 <- foo ? a + 1: x_1 ADD#1
>> x_3 <- foo ? x_2 : b + 2 ADD#2
>> y <- x_3[3]
>>
>> The renamer makes ADD#2 serially dependent on ADD#1.
> <
> OK, first you are considering a reservation station like machine;
> ScoreBoards don't (typically) use renamers.

Right, I'm looking at this as OoO because the original question
was if the two ADDs can/can't be executed concurrently.

A Scoreboard would also have RAW/WAR/WAW dependencies that can be pruned.

> Renamer needs to take the PRED mask into consideration for renaming
> <
> result [0] x = a+1 then-clause
> result [1] x = b+2 else-clause
> here, the renamer sees that it can assign physical register to x just once.

Yes.

> When I looked at it it looked quadratically hard--that is something you can
> do in HW only with limited horizon.

Ok.

>> As soon as foo is known and PRED executes, it forwards the predicate
>> value to all dependent instructions, many of which are hopefully
>> already waiting in reservation Stations (they could be anywhere
>> in the front end, including not yet decoded).
> <
> Ideally, they would be waiting at the end of the calculation unit to deliver
> their results (i.e., already out of station). PREDs resolve to complete
> instructions in their shadow, not resolve to initiate execution.

For cheapo ALU ops yes, that seems reasonable. But for MUL, DIV
and floats one might want to wait rather than commit the FU.
For LD I would not want to launch until the predicate resolves.
And ST can't launch until retire which means predicate is resolved.

Both LD and ST might prefetch while their predicate is unresolved.
If LD or ST speculates past an unresolved predicate that creates
Spectre possibilities. One could potentially have option flags on
the PRED instruction to indicate whether such speculation was allowed.

Re: Compiling predicated insts to dataflow

<472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23656&group=comp.arch#23656

Newsgroups: comp.arch
X-Received: by 2002:a5d:6d8d:0:b0:1e3:3de4:e0e6 with SMTP id l13-20020a5d6d8d000000b001e33de4e0e6mr4268217wrs.159.1645149193260;
Thu, 17 Feb 2022 17:53:13 -0800 (PST)
X-Received: by 2002:a05:6830:448e:b0:5a4:c845:9869 with SMTP id
r14-20020a056830448e00b005a4c8459869mr1862104otv.112.1645149192430; Thu, 17
Feb 2022 17:53:12 -0800 (PST)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!pasdenom.info!news.ortolo.eu!fdn.fr!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 17 Feb 2022 17:53:12 -0800 (PST)
In-Reply-To: <MGyPJ.38899$yi_7.34012@fx39.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:451:3ac0:1f65:c745;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:451:3ac0:1f65:c745
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad> <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
<MGyPJ.38899$yi_7.34012@fx39.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
Subject: Re: Compiling predicated insts to dataflow
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 18 Feb 2022 01:53:13 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Fri, 18 Feb 2022 01:53 UTC

On Thursday, February 17, 2022 at 2:59:28 PM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Thursday, February 17, 2022 at 10:42:26 AM UTC-6, EricP wrote:
> >> D) When PRED executes and predicate value resolves,
> >> dynamically elide dependency chains for disabled parts of uOps
> >> so at least the uOps don't wait for prior results they no longer need.
> >>
> >> Below I add the version number on X_n that Rename would create,
> >> assuming the above optimization was NOT performed:
> >>
> >> ld foo
> >> ld a
> >> ld b
> >> x_2 <- foo ? a + 1: x_1 ADD#1
> >> x_3 <- foo ? x_2 : b + 2 ADD#2
> >> y <- x_3[3]
> >>
> >> The renamer makes ADD#2 serially dependent on ADD#1.
> > <
> > OK, first you are considering a reservation station like machine;
> > ScoreBoards don't (typically) use renamers.
> Right, I'm looking at this as OoO because the original question
> was if the two ADDs can/can't be executed concurrently.
>
> A Scoreboard would also have RAW/WAR/WAW dependencies that can be pruned.
> > Renamer needs to take the PRED mask into consideration for renaming
> > <
> > result [0] x = a+1 then-clause
> > result [1] x = b+2 else-clause
> > here, the renamer sees that it can assign physical register to x just once.
> Yes.
> > When I looked at it it looked quadratically hard--that is something you can
> > do in HW only with limited horizon.
> Ok.
> >> As soon as foo is known and PRED executes, it forwards the predicate
> >> value to all dependent instructions, many of which are hopefully
> >> already waiting in reservation Stations (they could be anywhere
> >> in the front end, including not yet decoded).
> > <
> > Ideally, they would be waiting at the end of the calculation unit to deliver
> > their results (i.e., already out of station). PREDs resolve to complete
> > instructions in their shadow, not resolve to initiate execution.
> For cheapo ALU ops yes, that seems reasonable. But for MUL, DIV
> and floats one might want to wait rather than commit the FU.
<
MUL and DIV are the least expensive units to add extra pending result
containers. The default position is 1 container which blocks starting
new calculations. But when you have <say> 4 containers, you can hold
several waiting on PRED.
<
> For LD I would not want to launch until the predicate resolves.
<
You can go for L1, but you cannot let a L1 miss propagate.
<
> And ST can't launch until retire which means predicate is resolved.
<
I don't even read the value to be stored until it is "known to retire"
I call this point "complete".
<
>
> Both LD and ST might prefetch while their predicate is unresolved.
<
Not past L1 or you open up Spectre attacks.
<
> If LD or ST speculates past an unresolved predicate that creates
> Spectre possibilities. One could potentially have option flags on
> the PRED instruction to indicate whether such speculation was allowed.

Re: Compiling predicated insts to dataflow

<3RQPJ.42313$yi_7.26320@fx39.iad>

https://www.novabbs.com/devel/article-flat.php?id=23675&group=comp.arch#23675

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!feeder1.feed.usenet.farm!feed.usenet.farm!peer01.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx39.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com> <NVuPJ.17969$V7da.14032@fx13.iad> <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com> <MGyPJ.38899$yi_7.34012@fx39.iad> <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
In-Reply-To: <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 95
Message-ID: <3RQPJ.42313$yi_7.26320@fx39.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 18 Feb 2022 17:39:11 UTC
Date: Fri, 18 Feb 2022 12:38:26 -0500
X-Received-Bytes: 5255
 by: EricP - Fri, 18 Feb 2022 17:38 UTC

MitchAlsup wrote:
> On Thursday, February 17, 2022 at 2:59:28 PM UTC-6, EricP wrote:
>> MitchAlsup wrote:
>>> On Thursday, February 17, 2022 at 10:42:26 AM UTC-6, EricP wrote:
>>>> As soon as foo is known and PRED executes, it forwards the predicate
>>>> value to all dependent instructions, many of which are hopefully
>>>> already waiting in reservation Stations (they could be anywhere
>>>> in the front end, including not yet decoded).
>>> <
>>> Ideally, they would be waiting at the end of the calculation unit to deliver
>>> their results (i.e., already out of station). PREDs resolve to complete
>>> instructions in their shadow, not resolve to initiate execution.
>> For cheapo ALU ops yes, that seems reasonable. But for MUL, DIV
>> and floats one might want to wait rather than commit the FU.
> <
> MUL and DIV are the least expensive units to add extra pending result
> containers. The default position is 1 container which blocks starting
> new calculations. But when you have <say> 4 containers, you can hold
> several waiting on PRED.

I saw that eager launch was possible but it wasn't clear that it
would be a benefit worth the extra complexity.

Eager launch is valid when all the operands to the OP function are present,
but not the predicate value, and ignoring the state of the propagate value.
It's not clear how often this occurs.

The temp eager result could be stored back in the RS entry,
but that adds complication to the RS as it likely needs extra ports
to route signals to the result bus as well as the calculation units.
It also needs to maintain a watch for forwarding of the predicate value
and alternate propagate value while the calculation is running,
and as the RS is already doing this, it should just keep doing it.

For ALU it is easy to do but it didn't look like it saved anything
as it probably costs a clock to recognize the predicate ready wake-up
and that is the same latency cost as doing the ALU op.

For MUL with, say, latency of 4 and throughput of 1 (L:4,T:1) then
we get a head start on the latency, and the throughput of 1 means
MUL doesn't get plugged up doing potentially unnecessary work.

For a radix-4 DIV with, say, (L:17,T:17) then it does get plugged up
if you launch a speculative DIV and a useful one shows up.
If the DIV uses the MUL for a Newton-Raphson (or whatever)
then MUL is plugged up too for the duration.
So the FU scheduler gets slightly more complicated to choose
entries with resolved predicates first, then unresolved ones.
But once a DIV is started, it is committed.

This only benefits when the RS OP operands are all valid,
and the predicate is not, and there is not also work to do that
is queued for the same FU and not speculative and is ready.
It seems like an optimization best left to version 2.
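
A minimal Python sketch of the scheduling preference described above: among entries whose operands are valid, pick non-speculative work first and eager (unresolved-predicate) work only when nothing else is ready. The names (Entry, pick) and the single fu_busy flag are assumptions for illustration, not part of any design in this thread.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    age: int                 # smaller means older
    operands_ready: bool     # all operands of the OP function are valid
    pred_resolved: bool      # predicate value has been forwarded


def pick(entries, fu_busy):
    """Choose the next entry for one function unit, or None if nothing can launch."""
    if fu_busy:
        return None          # e.g. an unpipelined DIV is still running
    ready = [e for e in entries if e.operands_ready]
    if not ready:
        return None
    # Resolved predicates sort first (False < True); ties go to the oldest entry.
    return min(ready, key=lambda e: (not e.pred_resolved, e.age))


rs = [
    Entry("DIV eager", age=0, operands_ready=True, pred_resolved=False),
    Entry("DIV useful", age=1, operands_ready=True, pred_resolved=True),
]
choice = pick(rs, fu_busy=False)   # the resolved DIV wins despite being younger
```

The sketch does not model cancelling an eager operation after launch, which is consistent with the point that a started DIV is committed.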

> <
>> For LD I would not want to launch until the predicate resolves.
> <
> You can go for L1, but you cannot let a L1 miss propagate.

Ok, I'm being too conservative.
This allows launching a speculative LD that hits
TLB and either load-store forwarding or L1.

That would most likely be a big win.
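
A minimal Python sketch of that policy: a load whose predicate is unresolved may complete only if it hits the TLB and then either store-to-load forwarding or L1; anything that would need a table walk or a fill from beyond L1 waits. The function name, the result strings, and the use of plain dictionaries for the TLB, store queue and L1 are assumptions for illustration, and replacement-state updates are not modelled.

```python
def try_load(vaddr, pred_resolved, tlb, store_queue, l1):
    """Return (status, value) for one load attempt.

    tlb maps virtual to physical addresses; store_queue and l1 map physical
    addresses to data. Page and line granularity are ignored in this sketch.
    """
    paddr = tlb.get(vaddr)
    if paddr is None:
        # A table walk is visible outside the core: only when non-speculative.
        return ("walk", None) if pred_resolved else ("wait", None)
    if paddr in store_queue:
        return ("hit", store_queue[paddr])   # store-to-load forwarding
    if paddr in l1:
        return ("hit", l1[paddr])            # L1 hit
    # L1 miss: let it propagate only once the predicate has resolved.
    return ("miss", None) if pred_resolved else ("wait", None)


tlb = {0x10: 0x90, 0x20: 0xA0}
l1 = {0x90: 7}
status, value = try_load(0x10, False, tlb, {}, l1)   # speculative L1 hit
```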

> <
>> And ST can't launch until retire which means predicate is resolved.
> <
> I don't even read the value to be stored until it is "known to retire"
> I call this point "complete".
> <
>> Both LD and ST might prefetch while their predicate is unresolved.
> <
> Not past L1 or you open up Spectre attacks.

ST could eager translate VA->PA as long as it hits TLB.
Not sure how much that saves later though
and whether it is worth the extra logic to support it.

> <
>> If LD or ST speculates past an unresolved predicate that creates
>> Spectre possibilities. One could potentially have option flags on
>> the PRED instruction to indicate whether such speculation was allowed.

What do you think of the idea of having a bit on a PRED instruction
to enable/disable speculation? It's similar to my idea of having
branch hints that block conditional branch speculation.

It seems overkill to block all speculation if only
a small subset is susceptible to use in a Spectre attack.
Which begs the question: is only a small subset of
conditional branches susceptible to Spectre use?

Re: Compiling predicated insts to dataflow

<suonb2$9qk$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23676&group=comp.arch#23676

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
Date: Fri, 18 Feb 2022 10:08:32 -0800
Organization: A noiseless patient Spider
Lines: 63
Message-ID: <suonb2$9qk$1@dont-email.me>
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
<7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad>
<48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
<MGyPJ.38899$yi_7.34012@fx39.iad>
<472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Feb 2022 18:08:35 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="e37a38c8e9b3a9de719ecd143107053e";
logging-data="10068"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18++z6Cr1mzG2B8r8TzQ6QXURpunIFWf5U="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:dOEUvTB+0VggVaXEk2mMfF9Ymi4=
In-Reply-To: <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Fri, 18 Feb 2022 18:08 UTC

On 2/17/2022 5:53 PM, MitchAlsup wrote:
> On Thursday, February 17, 2022 at 2:59:28 PM UTC-6, EricP wrote:
>> MitchAlsup wrote:
>>> On Thursday, February 17, 2022 at 10:42:26 AM UTC-6, EricP wrote:
>>>> D) When PRED executes and predicate value resolves,
>>>> dynamically elide dependency chains for disabled parts of uOps
>>>> so at least the uOps don't wait for prior results they no longer need.
>>>>
>>>> Below I add the version number on X_n that Rename would create,
>>>> assuming the above optimization was NOT performed:
>>>>
>>>> ld foo
>>>> ld a
>>>> ld b
>>>> x_2 <- foo ? a + 1: x_1 ADD#1
>>>> x_3 <- foo ? x_2 : b + 2 ADD#2
>>>> y <- x_3[3]
>>>>
>>>> The renamer makes ADD#2 serially dependent on ADD#1.
>>> <
>>> OK, first you are considering a reservation station like machine;
>>> ScoreBoards don't (typically) use renamers.
>> Right, I'm looking at this as OoO because the original question
>> was if the two ADDs can/can't be executed concurrently.
>>
>> A Scoreboard would also have RAW/WAR/WAW dependencies that can be pruned.
>>> Renamer needs to take the PRED mask into consideration for renaming
>>> <
>>> result [0] x = a+1 then-clause
>>> result [1] x = b+2 else-clause
>>> here, the renamer sees that it can assign physical register to x just once.
>> Yes.
>>> When I looked at it it looked quadratically hard--that is something you can
>>> do in HW only with limited horizon.
>> Ok.
>>>> As soon as foo is known and PRED executes, it forwards the predicate
>>>> value to all dependent instructions, many of which are hopefully
>>>> already waiting in reservation Stations (they could be anywhere
>>>> in the front end, including not yet decoded).
>>> <
>>> Ideally, they would be waiting at the end of the calculation unit to deliver
>>> their results (i.e., already out of station). PREDs resolve to complete
>>> instructions in their shadow, not resolve to initiate execution.
>> For cheapo ALU ops yes, that seems reasonable. But for MUL, DIV
>> and floats one might want to wait rather than commit the FU.
> <
> MUL and DIV are the least expensive units to add extra pending result
> containers. The default position is 1 container which blocks starting
> new calculations. But when you have <say> 4 containers, you can hold
> several waiting on PRED.
> <
>> For LD I would not want to launch until the predicate resolves.
> <
> You can go for L1, but you cannot let a L1 miss propagate.

If you have a multiway L1 cache, and you do some sort of LRU on the
ways, does that potentially expose information through a side channel?
Or do you delay the LRU update until the predicate is resolved?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Compiling predicated insts to dataflow

<96f9c5ed-fb24-40e4-902c-b43be255a00fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23691&group=comp.arch#23691

Newsgroups: comp.arch
X-Received: by 2002:adf:fe0d:0:b0:1e3:3f5e:7469 with SMTP id n13-20020adffe0d000000b001e33f5e7469mr7681908wrr.61.1645231675589;
Fri, 18 Feb 2022 16:47:55 -0800 (PST)
X-Received: by 2002:a05:6870:1c8:b0:d3:6d9a:8fd8 with SMTP id
n8-20020a05687001c800b000d36d9a8fd8mr4192058oad.333.1645231675044; Fri, 18
Feb 2022 16:47:55 -0800 (PST)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!feeder1.cambriumusenet.nl!feed.tweak.nl!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 16:47:54 -0800 (PST)
In-Reply-To: <3RQPJ.42313$yi_7.26320@fx39.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:807a:9c3e:6af0:c510;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:807a:9c3e:6af0:c510
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad> <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
<MGyPJ.38899$yi_7.34012@fx39.iad> <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
<3RQPJ.42313$yi_7.26320@fx39.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <96f9c5ed-fb24-40e4-902c-b43be255a00fn@googlegroups.com>
Subject: Re: Compiling predicated insts to dataflow
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 00:47:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sat, 19 Feb 2022 00:47 UTC

On Friday, February 18, 2022 at 11:39:16 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Thursday, February 17, 2022 at 2:59:28 PM UTC-6, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Thursday, February 17, 2022 at 10:42:26 AM UTC-6, EricP wrote:
> >>>> As soon as foo is known and PRED executes, it forwards the predicate
> >>>> value to all dependent instructions, many of which are hopefully
> >>>> already waiting in reservation Stations (they could be anywhere
> >>>> in the front end, including not yet decoded).
> >>> <
> >>> Ideally, they would be waiting at the end of the calculation unit to deliver
> >>> their results (i.e., already out of station). PREDs resolve to complete
> >>> instructions in their shadow, not resolve to initiate execution.
> >> For cheapo ALU ops yes, that seems reasonable. But for MUL, DIV
> >> and floats one might want to wait rather than commit the FU.
> > <
> > MUL and DIV are the least expensive units to add extra pending result
> > containers. The default position is 1 container which blocks starting
> > new calculations. But when you have <say> 4 containers, you can hold
> > several waiting on PRED.
> I saw that eager launch was possible but it wasn't clear that it
> would be a benefit worth the extra complexity.
>
> Eager launch is valid when all the operands to the OP function are present,
> but not the predicate value, and ignoring the state of the propagate value.
> It's not clear how often this occurs.
>
> The temp eager result could be stored back in the RS entry,
<
unlikely to be the proper place to store it.
<
> but that adds complication to the RS as it likely needs extra ports
> to route signals to the result bus as well as the calculation units.
> It also needs to maintain a watch for forwarding of the predicate value
> and alternate propagate value while the calculation is running,
> and as the RS is already doing this, it should just keep doing it.
<
Luke stored eager results in the FUs but forwarded the result (without
writing the register file), writing the RF in order of PRED resolution.
I remember him saying that 90%+ of results would get to the
operands they needed to get to with 1 early forward only, followed
by 1 result+write later on.
>
> For ALU it is easy to do but it didn't look like it saved anything
> as it probably costs a clock to recognize the predicate ready wake-up
> and that is the same latency cost as doing the ALU op.
>
> For MUL with, say, latency of 4 and throughput of 1 (L:4,T:1) then
> we get a head start on the latency, and the throughput of 1 means
> MUL doesn't get plugged up doing potentially unnecessary work.
>
> For a radix-4 DIV with, say, (L:17,T:17) then it does get plugged up
> if you launch a speculative DIV and a useful one shows up.
<
The complexities of pipeline design.....
<
> If the DIV uses the MUL for a Newton-Raphson (or whatever)
> then MUL is plugged up too for the duration.
<
There is a free multiplier path every 3 cycles while the Goldschmidt
iteration is calculated: Denominator, Numerator, free, Denominator, ...
So you could get 1/3 FMUL throughput while performing FDIVs,
1/4 when performing SQRTs.
<
Newton-Raphson, you get 1/2 doing FDIV and 2/3rds doing SQRT.
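
A minimal Python sketch of the textbook Goldschmidt division iteration, for readers who want to see where the Denominator and Numerator multiplies come from. It assumes the divisor has already been scaled into [0.5, 1) and uses no initial table lookup, so it is not a model of any particular FDIV unit; the function name and iteration count are illustrative.

```python
def goldschmidt_divide(n, d, iterations=5):
    """Approximate n / d for d in [0.5, 1) by driving the denominator toward 1."""
    if not 0.5 <= d < 1.0:
        raise ValueError("this sketch expects a divisor pre-scaled into [0.5, 1)")
    for _ in range(iterations):
        f = 2.0 - d   # correction factor from the current denominator
        d = d * f     # Denominator multiply
        n = n * f     # Numerator multiply, independent of the one above
    return n


q = goldschmidt_divide(0.3, 0.75)   # close to 0.4
```

Writing d = 1 - e, each step gives d * (2 - d) = 1 - e^2, so the error squares every iteration. The two multiplies in one step do not depend on each other, which is what lets a single multiplier be shared in a fixed slot pattern.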
<
> So the FU scheduler gets slightly more complicated to choose
> entries with resolved predicates first, then unresolved ones.
> But once a DIV is started, it is committed.
<
The FU sends out a signal called "busy"; the scheduler does not perform
a pick for the FU's next cycle when the signal is asserted.
It is really easy. The FU's busy signal is easy logic too.
>
> This only benefits when the RS OP operands are all valid,
> and the predicate is not, and there is not also work to do that
> is queued for the same FU and not speculative and is ready.
> It seems like an optimization best left to version 2.
> > <
> >> For LD I would not want to launch until the predicate resolves.
> > <
> > You can go for L1, but you cannot let a L1 miss propagate.
<
> Ok, I'm being too conservative.
> This allows launching a speculative LD that hits
> TLB and either load-store forwarding or L1.
>
> That would most likely be a big win.
<
Which is why you don't want it to propagate further--that
opens up attack vectors for Spectre.
> > <
> >> And ST can't launch until retire which means predicate is resolved.
> > <
> > I don't even read the value to be stored until it is "known to retire"
> > I call this point "complete".
> > <
> >> Both LD and ST might prefetch while their predicate is unresolved.
> > <
> > Not past L1 or you open up Spectre attacks.
<
> ST could eager translate VA->PA as long as it hits TLB.
> Not sure how much that saves later though
> and whether it is worth the extra logic to support it.
> > <
> >> If LD or ST speculates past an unresolved predicate that creates
> >> Spectre possibilities. One could potentially have option flags on
> >> the PRED instruction to indicate whether such speculation was allowed.
<
> What do you think of the idea of having a bit on a PRED instruction
> to enable/disable speculation? It's similar to my idea of having
> branch hints that block conditional branch speculation.
<
I don't see how the compiler could use it. I can't see the compiler being the
only one using it, and it opens the door to Spectre for little gain.
>
> It seems overkill to block all speculation if only
> a small subset is susceptible to use in a Spectre attack.
> Which begs the question: is only a small subset of
> conditional branches susceptible to Spectre use?
<
Spend the logic making the caches perform better.

Re: Compiling predicated insts to dataflow

<51a99412-017a-45c0-947a-936bb3a8ac47n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23692&group=comp.arch#23692

Newsgroups: comp.arch
X-Received: by 2002:a5d:6884:0:b0:1e4:ed7b:fd71 with SMTP id h4-20020a5d6884000000b001e4ed7bfd71mr7893601wru.550.1645232077613;
Fri, 18 Feb 2022 16:54:37 -0800 (PST)
X-Received: by 2002:a05:6870:884:b0:d3:120d:fb4a with SMTP id
fx4-20020a056870088400b000d3120dfb4amr5245198oab.327.1645232077135; Fri, 18
Feb 2022 16:54:37 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 16:54:36 -0800 (PST)
In-Reply-To: <suonb2$9qk$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:807a:9c3e:6af0:c510;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:807a:9c3e:6af0:c510
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad> <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
<MGyPJ.38899$yi_7.34012@fx39.iad> <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
<suonb2$9qk$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <51a99412-017a-45c0-947a-936bb3a8ac47n@googlegroups.com>
Subject: Re: Compiling predicated insts to dataflow
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 00:54:37 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 19 Feb 2022 00:54 UTC

On Friday, February 18, 2022 at 12:08:38 PM UTC-6, Stephen Fuld wrote:
> On 2/17/2022 5:53 PM, MitchAlsup wrote:
> >> For LD I would not want to launch until the predicate resolves.
> > <
> > You can go for L1, but you cannot let a L1 miss propagate.
> If you have a multiway L1 cache, and you do some sort of LRU on the
> ways, does that potentially expose information through a side channel?
> Or do you delay the LRU update until the predicate is resolved?
<
We don't use LRU; we use Not-Recently-Used. Each time a cache way is
touched, its bit gets set. When all cache way bits are set, clear them all.
Have a find-first circuit scan the bits. When the cache takes a miss,
find-first tells you which way gets assigned.
<
For 4-way set-associative caches this is 4 SR flip-flops, 6 gates of find-first,
one 4-input NAND gate, and 3-4 control gates, per set.
<
A LRU for 4 way would have 6 SR flip-flops, and 20-odd logic gates.
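
A minimal Python sketch of the Not-Recently-Used scheme as described above, for one cache set. The class and method names are hypothetical, and the sketch takes "clear them all" literally, so the bit of the way whose touch filled the vector is cleared too; a variant that keeps that one bit set would be a different design choice.

```python
class NruSet:
    """Not-Recently-Used state for one cache set: one 'used' bit per way."""

    def __init__(self, ways=4):
        self.used = [False] * ways

    def touch(self, way):
        """Record an access; when every bit becomes set, clear them all."""
        self.used[way] = True
        if all(self.used):
            self.used = [False] * len(self.used)

    def victim(self):
        """Find-first: the lowest-numbered way whose bit is clear."""
        return self.used.index(False)


s = NruSet()
s.touch(0)
s.touch(2)
way = s.victim()   # way 1 is the first with a clear bit
```

Because the all-set state is cleared as soon as it is reached, victim() always finds a clear bit. The sketch says nothing about when touch() is called relative to predicate resolution, which is the timing question that determines any side-channel exposure.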
<
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Compiling predicated insts to dataflow

<supqmm$j5u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23694&group=comp.arch#23694

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Compiling predicated insts to dataflow
Date: Fri, 18 Feb 2022 20:12:05 -0800
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <supqmm$j5u$1@dont-email.me>
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org>
<7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad>
<48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
<MGyPJ.38899$yi_7.34012@fx39.iad>
<472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
<suonb2$9qk$1@dont-email.me>
<51a99412-017a-45c0-947a-936bb3a8ac47n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Feb 2022 04:12:07 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="1866c87decec77ac97b72b35e5a58cae";
logging-data="19646"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1921zAPe2p4mbsi4mGsa6tOZm2ISW8rHVc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:79fa+rkdoCL8aFXEnJZ1PklADdo=
In-Reply-To: <51a99412-017a-45c0-947a-936bb3a8ac47n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Sat, 19 Feb 2022 04:12 UTC

On 2/18/2022 4:54 PM, MitchAlsup wrote:
> On Friday, February 18, 2022 at 12:08:38 PM UTC-6, Stephen Fuld wrote:
>> On 2/17/2022 5:53 PM, MitchAlsup wrote:
>>>> For LD I would not want to launch until the predicate resolves.
>>> <
>>> You can go for L1, but you cannot let a L1 miss propagate.
>> If you have a multiway L1 cache, and you do some sort of LRU on the
>> ways, does that potentially expose information through a side channel?
>> Or do you delay the LRU update until the predicate is resolved?
> <
> We don't use LRU, we use Not-Recently-Used. Each time a cache way is
> touched its bit gets set. When all cache way bits are set, clear them all.
> Have a find first scan the bits. When cache takes a miss, find first tells
> you which way gets assigned.
> <
> For 4-way set caches this is 4-SR flip flops, 6-gates of find first, 1 4-in NAND
> gate, and 3-4 control gates; per set.
> <
> A LRU for 4 way would have 6 SR flip-flops, and 20-odd logic gates.

OK, but you didn't answer my question. If you use not recently used,
and you set the bit before the predicate is resolved, does that open the
way for a side channel attack?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Compiling predicated insts to dataflow

<94a7f3a9-8452-4c94-8517-e7c400d47b5bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23698&group=comp.arch#23698

X-Received: by 2002:a5d:5983:0:b0:1e5:7dd6:710 with SMTP id n3-20020a5d5983000000b001e57dd60710mr10050287wri.392.1645288487581;
Sat, 19 Feb 2022 08:34:47 -0800 (PST)
X-Received: by 2002:a05:6870:9688:b0:d2:9fd8:ca84 with SMTP id
o8-20020a056870968800b000d29fd8ca84mr4771976oaq.337.1645288487025; Sat, 19
Feb 2022 08:34:47 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 19 Feb 2022 08:34:46 -0800 (PST)
In-Reply-To: <supqmm$j5u$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b1ac:886e:a458:f180;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b1ac:886e:a458:f180
References: <jwv1r0263mr.fsf-monnier+comp.arch@gnu.org> <7e844fbc-7998-4c87-a17c-3f4b16c9629dn@googlegroups.com>
<NVuPJ.17969$V7da.14032@fx13.iad> <48ef89d3-1397-4145-ae8f-dfd165c67f06n@googlegroups.com>
<MGyPJ.38899$yi_7.34012@fx39.iad> <472f80e0-2d73-41b7-ae44-a13407b9160an@googlegroups.com>
<suonb2$9qk$1@dont-email.me> <51a99412-017a-45c0-947a-936bb3a8ac47n@googlegroups.com>
<supqmm$j5u$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <94a7f3a9-8452-4c94-8517-e7c400d47b5bn@googlegroups.com>
Subject: Re: Compiling predicated insts to dataflow
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 16:34:47 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sat, 19 Feb 2022 16:34 UTC

On Friday, February 18, 2022 at 10:12:11 PM UTC-6, Stephen Fuld wrote:
> On 2/18/2022 4:54 PM, MitchAlsup wrote:
> > On Friday, February 18, 2022 at 12:08:38 PM UTC-6, Stephen Fuld wrote:
> >> On 2/17/2022 5:53 PM, MitchAlsup wrote:
> >>>> For LD I would not want to launch until the predicate resolves.
> >>> <
> >>> You can go for L1, but you cannot let a L1 miss propagate.
> >> If you have a multiway L1 cache, and you do some sort of LRU on the
> >> ways, does that potentially expose information through a side channel?
> >> Or do you delay the LRU update until the predicate is resolved?
> > <
> > We don't use LRU, we use Not-Recently-Used. Each time a cache way is
> > touched its bit gets set. When all cache way bits are set, clear them all.
> > Have a find first scan the bits. When cache takes a miss, find first tells
> > you which way gets assigned.
> > <
> > For 4-way set caches this is 4-SR flip flops, 6-gates of find first, 1 4-in NAND
> > gate, and 3-4 control gates; per set.
> > <
> > A LRU for 4 way would have 6 SR flip-flops, and 20-odd logic gates.
<
> OK, but you didn't answer my question. If you use not recently used,
> and you set the bit before the predicate is resolved, does that open the
> way for a side channel attack?
<
Maybe, maybe not.
<
You have not changed which cache lines are present and in what states.
But you did possibly change the order of replacement.
<
A Spectre-tolerant design will hold the new replacement lines in the miss
buffer and will not have modified the cache. The cache is not updated until
the instruction that caused the update retires.
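
As a rough illustration of that rule, the sketch below keeps speculative fills in a side buffer and only writes them into the cache when the owning instruction retires. Everything here is hypothetical (the class, the method names, and keying the buffer by an instruction id), and it models only the visible-state rule, not timing or replacement.

```python
class DeferredFillCache:
    """Model of a cache whose contents change only at instruction retire."""

    def __init__(self):
        self.lines = {}        # committed cache contents: address -> data
        self.miss_buffer = {}  # speculative fills: instruction id -> (address, data)

    def speculative_miss(self, inst_id, addr, data):
        """Park the incoming line in the miss buffer; the cache is untouched."""
        self.miss_buffer[inst_id] = (addr, data)

    def retire(self, inst_id):
        """The instruction is no longer speculative: commit its fill, if any."""
        entry = self.miss_buffer.pop(inst_id, None)
        if entry is not None:
            addr, data = entry
            self.lines[addr] = data

    def squash(self, inst_id):
        """The instruction was cancelled: drop its fill without a trace."""
        self.miss_buffer.pop(inst_id, None)


cache = DeferredFillCache()
cache.speculative_miss(1, 0x40, "A")
cache.squash(1)                 # cancelled: cache stays empty
cache.speculative_miss(2, 0x80, "B")
cache.retire(2)                 # retired: the line becomes visible
print(cache.lines)
```

A squashed instruction leaves `lines` exactly as it was, so neither the set of resident lines nor their states can reveal that the access happened.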
<
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)
