Re: Branch prediction hints

<s8ed3m$m3c$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17092&group=comp.arch#17092

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
In-Reply-To: <s8e9pg$lnc$1@dont-email.me>
 by: BGB - Sun, 23 May 2021 20:14 UTC

On 5/23/2021 2:18 PM, Ivan Godard wrote:
> On 5/23/2021 8:56 AM, MitchAlsup wrote:
>> On Sunday, May 23, 2021 at 5:26:22 AM UTC-5, Ivan Godard wrote:
>>> On 5/23/2021 1:33 AM, BGB wrote:
>>>> On 5/23/2021 1:24 AM, Ivan Godard wrote:
>>
>>>> Bundle creation in my case is explicit and handled by the compiler (or
>>>> ASM programmer).
>>>>
>>>>
>>>> But, I am only dealing with predication for simple branches, eg:
>>>> if(x>0)
>>>> x--;
>>>> Or:
>>>> if(x>0)
>>>> z=x-5;
>>>> else
>>>> z=x+13;
>>>> ...
>>            PLT0     Rx,{1,1}
>>            ADD      Rz,Rx,#-1
>> ...
>>            PLT0     Rx,{2,10}
>>            ADD      Rz,Rx,#-5
>>            ADD      Rz,Rx,#13
>>>>
>>> consider:
>>> if(x>0)
>>> z=x-5;
>>> else
>>> z=foo(x);
>> <
>>            PLT0     Rx,{4,1000}
>>            ADD      Rz,Rx,#-1
>>            MOV     R1,Rx
>>            CALL    foo
>>            MOV     Rz,R1
>> <
>>> if you have exceptions under control, this can become:
>>> t1=x-5;t2=(x>0)?nil:foo(x);
>>> z=x>0?t1:t2;
>>> the architectural challenge is how to implement "t2=(x>0)?nil:foo(x);",
>>> i.e. predicated calls.
>> <
>> You also have to predicate the argument setup and the result delivery.
>
> No, unless the argument setup is potentially excepting; another example
> that predication must keep exceptions under control, or you get a
> predication explosion and wind up with an ARM. Once you have Mill's NaR
> bits or equivalent the only ops that get predicated are control flow
> (including call and return) and store.
>
> As for call result: whether a not-taken predicate clears or leaves alone
> the result reg is a matter for  architectural design; either can work.
>

One may need to predicate any "phi" style operations at the end
(assuming they are not using conditional select), ...

Though it is simplest to "just predicate everything", a sufficiently clever
compiler could potentially also split things like register allocation along
the true/false/always paths.
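
To make the "phi" point concrete, here is a hedged C rendering (generic
source-level form, not BJX2 or My 66000 code) of the diamond from earlier in
the thread, showing where a conditional select removes the need to predicate
the merge:

int diamond(int x)
{
    int t1 = x - 5;    /* then-path value, computed speculatively */
    int t2 = x + 13;   /* else-path value, computed speculatively */
    /* (a) a conditional-select op picks the result, so the merge itself
           needs no predicate;
       (b) without a select, "z = t1" runs under the predicate and
           "z = t2" under its complement, i.e. the phi is predicated. */
    return (x > 0) ? t1 : t2;
}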

>>>> But, not for anything much more than a few instructions, or anything
>>>> involving a function call, ..., since presumably in this case a branch
>>>> is cheaper (and the core isn't really wide nor has enough registers to
>>>> really justify any IA-64 style modular scheduling trickery, ...).
>>> That's a problem with any if-conversions or other speculative
>>> scheduling: you have to have enough FUs to get useful parallelism.
>>> There's a sweet spot in speculative width: too narrow and you lose the
>>> benefit; too wide and the power and area cost more than the overhead
>>> of OOO
>> <
>> The majority of the benefit has already accrued when you can predicate
>> as far as you FETCH width. SO if you FETCH 4-wide, you get the majority
>> of the benefit by predication at least 4 instructions. More than 2× this
>> distance and you should be using branches to avoid tracking instructions
>> that don't execute.
>
> That assumes that consecutive instructuins all predicate the same way,
> which works when you have hardware dynamic OOO, and doesn't when you
> have static scheduling with interleaving from multiple paths.
>

Agreed.

To be effective, it doesn't make sense to predicate the true branch
followed by the false branch, but rather, when possible, to run
both branches in parallel.

This turns out to be the major feature of PrWEX, as it provides a way
to express these sorts of "run the true and false branches in parallel" cases.

If the decoder were clever enough, one could potentially allow extra
bundle width in these cases. However, my core is not smart enough for this.

> There really doesn't seem to be any middle ground: either you work out
> all the implications of static scheduling and wind up with a Mill, or
> you work out all those of OOO and wind up with a MY66.
>

As I see it, I don't know where the paths lead.

>> <
>>>> Also, the state of SR.T would not be preserved across a function call,
>>>> so any logic following the function's return could not be predicated.
>>> This can be architected around; ours is not the only possible way to
>>> do it.
>> <
>> If SR.T is in a preserved register, you just use PRED again after return.
>>
>

This is not currently true of BJX2, where 'SR' is not preserved on
function calls. However, if some of the execution state were copied into
LR (which is then effectively widened to 64 bits), this could be another
option for this.

BSR:
  Copies PC(47:0) -> LR(47:0)
  Copies SR(15:0) -> LR(63:48)

RTS:
  Copies LR(47:0) -> PC
  Copies LR(63:48) -> SR

This would then save predicate state, pred-stack flags, and SIMD
predicate state, across function calls (as opposed to the current
approach, where they are undefined).
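
As a hedged sketch (the struct and field widths below just model what is
described above; they are not actual BJX2 implementation code), the packing
would look roughly like:

#include <stdint.h>

typedef struct { uint64_t pc, lr, sr; } cpu_state;   /* hypothetical model */

static void bsr(cpu_state *c, uint64_t target)
{
    /* pack the return PC (low 48 bits) and SR(15:0) into the 64-bit LR */
    c->lr = (c->pc & 0x0000FFFFFFFFFFFFull) | ((c->sr & 0xFFFFull) << 48);
    c->pc = target;
}

static void rts(cpu_state *c)
{
    /* restore PC from LR(47:0) and SR(15:0) from LR(63:48) */
    c->pc = c->lr & 0x0000FFFFFFFFFFFFull;
    c->sr = (c->sr & ~0xFFFFull) | (c->lr >> 48);
}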

Similarly, it would also make sense to put the WXE in there (in place of
the IRQ level bits), since the WEX profile is function-scoped rather than
global.

....

Re: Branch prediction hints

<007244df-66ca-4422-ab33-fc045c307002n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17093&group=comp.arch#17093

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8e9pg$lnc$1@dont-email.me>
 by: MitchAlsup - Sun, 23 May 2021 20:26 UTC

On Sunday, May 23, 2021 at 2:18:10 PM UTC-5, Ivan Godard wrote:
> On 5/23/2021 8:56 AM, MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 5:26:22 AM UTC-5, Ivan Godard wrote:
> >> On 5/23/2021 1:33 AM, BGB wrote:
> >>> On 5/23/2021 1:24 AM, Ivan Godard wrote:
> >
> >>> Bundle creation in my case is explicit and handled by the compiler (or
> >>> ASM programmer).
> >>>
> >>>
> >>> But, I am only dealing with predication for simple branches, eg:
> >>> if(x>0)
> >>> x--;
> >>> Or:
> >>> if(x>0)
> >>> z=x-5;
> >>> else
> >>> z=x+13;
> >>> ...
> > PLT0 Rx,{1,1}
> > ADD Rz,Rx,#-1
> > ...
> > PLT0 Rx,{2,10}
> > ADD Rz,Rx,#-5
> > ADD Rz,Rx,#13
> >>>
> >> consider:
> >> if(x>0)
> >> z=x-5;
> >> else
> >> z=foo(x);
> > <
> > PLT0 Rx,{4,1000}
> > ADD Rz,Rx,#-1
> > MOV R1,Rx
> > CALL foo
> > MOV Rz,R1
> > <
> >> if you have exceptions under control, this can become:
> >> t1=x-5;t2=(x>0)?nil:foo(x);
> >> z=x>0?t1:t2;
> >> the architectural challenge is how to implement "t2=(x>0)?nil:foo(x);",
> >> i.e. predicated calls.
> > <
> > You also have to predicate the argument setup and the result delivery.
> No, unless the argument setup is potentially excepting; another example
> that predication must keep exceptions under control, or you get a
> predication explosion and wind up with an ARM. Once you have Mill's NaR
> bits or equivalent the only ops that get predicated are control flow
> (including call and return) and store.
>
> As for call result: whether a not-taken predicate clears or leaves alone
> the result reg is a matter for architectural design; either can work.
> >>> But, not for anything much more than a few instructions, or anything
> >>> involving a function call, ..., since presumably in this case a branch
> >>> is cheaper (and the core isn't really wide nor has enough registers to
> >>> really justify any IA-64 style modular scheduling trickery, ...).
> >> That's a problem with any if-conversions or other speculative
> >> scheduling: you have to have enough FUs to get useful parallelism.
> >> There's a sweet spot in speculative width: too narrow and you lose the
> >> benefit; too wide and the power and area cost more than the overhead of OOO
> > <
> > The majority of the benefit has already accrued when you can predicate
> > as far as you FETCH width. SO if you FETCH 4-wide, you get the majority
> > of the benefit by predication at least 4 instructions. More than 2× this
> > distance and you should be using branches to avoid tracking instructions
> > that don't execute.
<
> That assumes that consecutive instructuins all predicate the same way,
<
If you mean all instructions under the shadow of the predicate are either
all in the then-clause or all in the else-clause, I have to disagree because
My 66000 predication can be nested to a minor extent (seldom done in
practice). But:: "if then (if then else) else (if then else)" works just fine in My
66000.
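(In C terms the shape being described is roughly the following; this is only an
illustration of the nesting, not My 66000 code:)

int nested(int a, int b, int c)
{
    int x;
    if (a) { if (b) x = 1; else x = 2; }
    else   { if (c) x = 3; else x = 4; }
    return x;
}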
<
> which works when you have hardware dynamic OOO, and doesn't when you
> have static scheduling with interleaving from multiple paths.
<
Yes, I remain in the GBOoO class.
>
> There really doesn't seem to be any middle ground: either you work out
> all the implications of static scheduling and wind up with a Mill, or
> you work out all those of OOO and wind up with a MY66.
<
Agreed.
> > <
> >>> Also, the state of SR.T would not be preserved across a function call,
> >>> so any logic following the function's return could not be predicated.
> >> This can be architected around; ours is not the only possible way to do it.
> > <
> > If SR.T is in a preserved register, you just use PRED again after return.
> >

Re: Branch prediction hints

<13fa2553-eaf2-43df-a87a-3559a45d88a0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17094&group=comp.arch#17094

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8eabb$ig$1@dont-email.me>
 by: MitchAlsup - Sun, 23 May 2021 20:29 UTC

On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
> On 5/23/2021 11:13 AM, BGB wrote:
> > On 5/23/2021 5:26 AM, Ivan Godard wrote:
> >> On 5/23/2021 1:33 AM, BGB wrote:
> >
> > Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
> > Compare ops could have a 2-bit predicate-destination field.
> >
> > It is possible that 01:00 (Never Execute) could be used to encode a
> > Jumbo Prefix or similar (or, maybe a few unconditional large-immed
> > instructions or similar).
<
> Having predication for most ops is just entropy clutter and a waste of
> power: it costs more to *not* do an ADD than to do it, so always do it,
> and junk the predicates. You need predicates for ops that might do a
> hard throw, or that change persistent state like store and control flow;
> nowhere else.
<
If the standard word size were 36 bits I would disagree here, but since it is
32 bits, I have to agree.

Re: Branch prediction hints

<40d0a738-6741-4f96-b608-7b507fe12b90n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17095&group=comp.arch#17095

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8eajo$ig$2@dont-email.me>
 by: MitchAlsup - Sun, 23 May 2021 20:31 UTC

On Sunday, May 23, 2021 at 2:32:10 PM UTC-5, Ivan Godard wrote:
> On 5/23/2021 6:53 AM, EricP wrote:
> > Thomas Koenig wrote:
> >> To quote the POWER9 User Manual:

> > Without an explicit "predict never" hint, in the case of HTM
> > this looked to me like that speculation might have to shut off
> > while a transaction was in progress because there is no way
> > to deduce which loads are guarded by a particular condition.
> > At a minimum, in an HTM it looked like no loads performed while
> > any prior branch was unresolved, not even prefetched into cache
> > (or maybe that's a good thing, I don't know).
> >
> > Predict-never can also be used for rarely executed error handling code.
> >
> > Predict-always is the complementary case for branching around
> > rarely executed error handling code that one wants inline,
> > and it doesn't matter what it did the last million times it executed.
> >
> >
> Your HTM breaks on a load? Why? We break only on a colliding store.
<
Under a critical circumstance, I allow the offending load to be NAKed, allowing
the one in the critical section to make forward progress and preventing the
one not yet there from interfering.
>
> Perhaps your HTM's intra-transaction state is visible from outside?

Re: Branch prediction hints

<s8ee40$ab4$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17096&group=comp.arch#17096

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
In-Reply-To: <s8eabb$ig$1@dont-email.me>
 by: BGB - Sun, 23 May 2021 20:31 UTC

On 5/23/2021 2:27 PM, Ivan Godard wrote:
> On 5/23/2021 11:13 AM, BGB wrote:
>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
>>> On 5/23/2021 1:33 AM, BGB wrote:
>>>> On 5/23/2021 1:24 AM, Ivan Godard wrote:
>>>>> On 5/22/2021 9:50 PM, BGB wrote:
>>>>>> On 5/22/2021 5:28 PM, Thomas Koenig wrote:
>>>>>>> To quote the POWER9 User Manual:
>>>>>>>
>>>>>>> # The POWER9 core normally ignores any software that attempts to
>>>>>>> # override the dynamic branch prediction by setting the “a” bit
>>>>>>> # in the BO field. This is done because historically programmers
>>>>>>> # and compilers have made poor choices for setting the “a” bit,
>>>>>>> # which limited the performance of codes where the hardware can
>>>>>>> # do a superior job of predicting the branches.
>>>>>>>
>>>>>>> Having read this: Are branching hints actually useful today?
>>>>>>>
>>>>>>> I could see some use in a "almost never used" hint for branches
>>>>>>> for fatal error messages, maybe.
>>>>>>>
>>>>>>
>>>>>> Scenario 1:
>>>>>>    Core is too cheap to do branch prediction:
>>>>>>      Branch hints are useless.
>>>>>>    Core only does a fixed prediction with no context:
>>>>>>      Maybe relevant.
>>>>>>    Core does branch prediction, has context:
>>>>>>      This is useless.
>>>>>>
>>>>>> Predictable branches:
>>>>>> A hardware branch predictor can predict them fairly easily, this
>>>>>> is useless.
>>>>>>
>>>>>> Unpredictable branches:
>>>>>> Can't be predicted either way, this is useless.
>>>>>>
>>>>>> So, general leaning:
>>>>>> Branch direction hints are "kinda useless"...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nevermind if my ISA has a few encodings which could potentially be
>>>>>> interpreted this way, in my defense, these encodings arrived as a
>>>>>> historical accident (predicated ops were added on after branches
>>>>>> already existed, so some redundant encodings appeared, ...).
>>>>>>
>>>>>> It is likely I might reclaim some of this space eventually and use
>>>>>> it for something else (maybe more space for PrWEX ops?...).
>>>>>>
>>>>>> As noted, some encodings technically exist (for Disp20 branches),
>>>>>> but I don't really consider them "valid":
>>>>>>    BSR?T / BSR?F (1);
>>>>>>    BT?T / BT?F / BF?T / BF?F;
>>>>>>    WEX encoded branch ops (2).
>>>>>>
>>>>>> *1: These operations "actually work", but predicated subroutine
>>>>>> calls aren't really an operation which "makes sense". So, fall
>>>>>> into a sort of "invalid de-facto because the operation itself is
>>>>>> kinda absurd" category.
>>>>>
>>>>> Predicated calls are common in if-converted code. Of course, if you
>>>>> are doing hardware bundle creation as in Mitch's then you don't
>>>>> need static predication of any form.
>>>>>
>>>>
>>>> Bundle creation in my case is explicit and handled by the compiler
>>>> (or ASM programmer).
>>>>
>>>>
>>>> But, I am only dealing with predication for simple branches, eg:
>>>>    if(x>0)
>>>>      x--;
>>>> Or:
>>>>    if(x>0)
>>>>      z=x-5;
>>>>    else
>>>>      z=x+13;
>>>> ...
>>>>
>>>
>>> consider:
>>>      if(x>0)
>>>        z=x-5;
>>>      else
>>>        z=foo(x);
>>> if you have exceptions under control, this can become:
>>>      t1=x-5;t2=(x>0)?nil:foo(x);
>>>      z=x>0?t1:t2;
>>> the architectural challenge is how to implement
>>> "t2=(x>0)?nil:foo(x);", i.e. predicated calls.
>>>
>>
>> This case could be done as-is with some register-use trickery, though
>> unclear how useful it would be in general.
>>
>>
>>>> But, not for anything much more than a few instructions, or anything
>>>> involving a function call, ..., since presumably in this case a
>>>> branch is cheaper (and the core isn't really wide nor has enough
>>>> registers to really justify any IA-64 style modular scheduling
>>>> trickery, ...).
>>>
>>> That's a problem with any if-conversions or other speculative
>>> scheduling: you have to have enough FUs to get useful parallelism.
>>> There's a sweet spot in speculative width: too narrow and you lose
>>> the benefit; too wide and the power and area cost more than the
>>> overhead of OOO
>>>
>>
>> Yeah. In my own uses, what I was able to leverage in hand-written ASM
>> seems to imply an optimal width of ~ 2 or 3. Any wider, and I run out
>> of stuff that could be run in parallel, or run out of registers to put
>> stuff in. Some modular-loop scheduling was done manually in a few
>> cases, but is a rarity, and only sometimes pays off.
>>
>>
>> My C compiler still falls well short of this though...
>>
>>
>> It looks to me like making use of a 4 or 5 wide core would effectively
>> require a rather different approach:
>>    multiple predication registers
>>     with ops being able to select a src/dst predicate
>>    bigger register file
>>    ...
>>
>> At this point, it would start to look more like an Itanium.
>>
>> Though, something kinda like Itanium, but with say 64 GPRs and 4 or 8
>> predicate registers, and variable-length bundles, could make some
>> sense (goal being to still use 32-bit instruction words and still have
>> a "plausible" code density).
>>
>> One possibility for predication is that ops are predicated by default,
>> just one of the predicate flags is hard-wired (to allow for "always
>> execute" ops), then ops fall into a mode:
>>    00 Scalar/End-Of-Bundle, Execute True
>>    01 Scalar/End-Of-Bundle, Execute False
>>    10 Wide, Execute True
>>    11 Wide, Execute False
>> With a predicate register (Source):
>>    00: Hard wired as True
>>    01: Predicate 1
>>    10: Predicate 2
>>    11: Predicate 3
>>
>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
>> Compare ops could have a 2-bit predicate-destination field.
>>
>> It is possible that 01:00 (Never Execute) could be used to encode a
>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
>> instructions or similar).
>
> Having predication for most ops is just entropy clutter and a waste of
> power: it costs more to *not* do an ADD than to do it, so always do it,
> and junk the predicates. You need predicates for ops that might do a
> hard throw, or that change persistent state like store and control flow;
> nowhere else.
>

The predicates mostly just serve to replace the contents of an opcode
field with a NOP or similar... (Well, except branches, which might need
both BRA and BRA_NB cases to deal with the branch predictor, where
BRA_NB is an unconditional branch to the following instruction for cases
where the branch predictor had predicted that the branch would be taken,
but it was not).

The logic to deal with predicates also overlaps with that of pipeline
flushes (during a branch), which are essentially unavoidable. Not doing
this would lead to more garbage in the register file, higher register
pressure, ...

And, by the time one has paid what it costs to have a pipeline flush,
they have already paid most of the cost of the predication.

The only "real" cost here IMO is that it eats slightly more encoding space.


Re: Branch prediction hints

<e977b069-6ce0-46bf-ac2a-f7fb85ef9f6cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17097&group=comp.arch#17097

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8earo$a9k$1@dont-email.me>
 by: MitchAlsup - Sun, 23 May 2021 20:33 UTC

On Sunday, May 23, 2021 at 2:36:26 PM UTC-5, Ivan Godard wrote:
> On 5/23/2021 9:07 AM, MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 8:54:03 AM UTC-5, EricP wrote:
> >
> >> There is a case for a "predict never" hint where it doesn't matter
> >> what this conditional branch did the last million times it executed,
> >> always predict not-taken.
> >>
> >> In a spinlock which normally test the lock condition with
> >> a load before attempting the atomic sequence,
> >> you never want to speculatively execute into the atomic sequence.
> >> At a minimum it could cause ping-pong'ing the cache lines.
> > <
> > Yes, the classical test-and-test-and-set. But this only decreases
> > bus traffic from BigO(n^3) to BigO(N^2). There are ways to
> > decrease bus traffic to BigO( N+3 )
> >>
> >> With Hardware Transactional Memory HTM reading a memory location
> >> even speculatively might abort another's processors active transaction,
> >> you don't want to even touch data memory without explicit permission,
> >> not even prefetching any load or store addresses.
> > <
> > This is one of the things WRONG about HTM.
> > <
> > in My 66000 there is a Exotic Synchronization Method (ESM) which is not
> > an HTM but can be used to create HTMs. In ESM, if the ATOMIC event has
> > reached a critical juncture (i.e., can complete) the CPUs reaching those
> > points gain the ability to NAK interference, allowing these CPUs to complete
> > the ATOMIC event, and making the interferers run slower!
> >>
> >> If you don't have a hint to explicitly block speculation at the
> >> branch then the design would have to use more complicated and
> >> probably error prone dynamic logic to "deduce" what to do.
> > <
> > You do not want Naked memory refs to be used to setup or complete
> > ATOMIC events. You need to "mark" their participation in the event
> > so the machine knows that such an event is going on from the out-
> > set.
<
> This also permits you to have intra-transaction stores that are not part
> of the transaction, say for logging and debugging, where you don't lock
> the log memory.
<
Yes, exactly--and that is where the word "participating" came into use
when doing the ASF at AMD. Since you CANNOT single-step through
an ATOMIC event, you need some way of figuring out what is going
on; lobbing registers into a memory buffer for later printing is the
prescribed means.
<
> >> Without an explicit "predict never" hint, in the case of HTM
> >> this looked to me like that speculation might have to shut off
> >> while a transaction was in progress because there is no way
> >> to deduce which loads are guarded by a particular condition.
> >> At a minimum, in an HTM it looked like no loads performed while
> >> any prior branch was unresolved, not even prefetched into cache
> >> (or maybe that's a good thing, I don't know).
> > <
> > Should an ATOMIC event fail, the compiler needs to know that all
> > of the participating memory references are not viable containers
> > of data! And not ever use those units of stale data. The only use
> > that should be allowed is to print the values that failed.
> >>
> >> Predict-never can also be used for rarely executed error handling code.
> >>
> >> Predict-always is the complementary case for branching around
> >> rarely executed error handling code that one wants inline,
> >> and it doesn't matter what it did the last million times it executed.

Re: Branch prediction hints

<c36e1f5b-cb14-42cf-9ef7-2d6e39657712n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17099&group=comp.arch#17099

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8ed3m$m3c$1@dont-email.me>
 by: MitchAlsup - Sun, 23 May 2021 20:39 UTC

On Sunday, May 23, 2021 at 3:14:49 PM UTC-5, BGB wrote:
> On 5/23/2021 2:18 PM, Ivan Godard wrote:
> > On 5/23/2021 8:56 AM, MitchAlsup wrote:
> >> On Sunday, May 23, 2021 at 5:26:22 AM UTC-5, Ivan Godard wrote:

> >> <
> >> You also have to predicate the argument setup and the result delivery.
> >
> > No, unless the argument setup is potentially excepting; another example
> > that predication must keep exceptions under control, or you get a
> > predication explosion and wind up with an ARM. Once you have Mill's NaR
> > bits or equivalent the only ops that get predicated are control flow
> > (including call and return) and store.
<
You not being Mill, but other more-or-less std RISC architectures.
> >
> > As for call result: whether a not-taken predicate clears or leaves alone
> > the result reg is a matter for architectural design; either can work.
> >
> One may need to predicate any "phi" style operations at the end
> (assuming they are not using conditional select), ...
>
> Though, it is simplest to "just predicate everything", if the compiler
> were clever, it could potentially also split things like
> register-allocation along true/false/always paths.
> >>>> But, not for anything much more than a few instructions, or anything
> >>>> involving a function call, ..., since presumably in this case a branch
> >>>> is cheaper (and the core isn't really wide nor has enough registers to
> >>>> really justify any IA-64 style modular scheduling trickery, ...).
> >>> That's a problem with any if-conversions or other speculative
> >>> scheduling: you have to have enough FUs to get useful parallelism.
> >>> There's a sweet spot in speculative width: too narrow and you lose the
> >>> benefit; too wide and the power and area cost more than the overhead
> >>> of OOO
> >> <
> >> The majority of the benefit has already accrued when you can predicate
> >> as far as you FETCH width. SO if you FETCH 4-wide, you get the majority
> >> of the benefit by predication at least 4 instructions. More than 2× this
> >> distance and you should be using branches to avoid tracking instructions
> >> that don't execute.
> >
> > That assumes that consecutive instructuins all predicate the same way,
> > which works when you have hardware dynamic OOO, and doesn't when you
> > have static scheduling with interleaving from multiple paths.
> >
> Agreed.
>
> To be effective, it doesn't make sense to predicate the true branch
> followed by the false branch, but rather when possible to try to run
> both branches in parallel.
<
That is effectively what My 66000 predicate instructions do::
In the FETCH and DECODE stages, they plow through the instruction
stream as if there were no predication going on; but more importantly
as if there was no flow control encountered, either.
<
Flexibility in executing/flushing these instructions is available across a wide
range of microarchitectures.
<
Instructions that were not supposed to execute are prevented from damaging
architectural or microarchitectural state.
>
> This turns out to be the major features of PrWEX, as it provides a way
> to express these sorts of "running true and false branch in parallel" cases.
<
OK
>
> If the decoder were clever enough, one could potentially allow extra
> bundle width in these cases. However, my core is not smart enough for this.
<
> > There really doesn't seem to be any middle ground: either you work out
> > all the implications of static scheduling and wind up with a Mill, or
> > you work out all those of OOO and wind up with a MY66.
> >
> As I see it, I don't know where the paths lead.
<
You are still young, Grasshopper........
> >> <
> >>>> Also, the state of SR.T would not be preserved across a function call,
> >>>> so any logic following the function's return could not be predicated..
> >>> This can be architected around; ours is not the only possible way to
> >>> do it.
> >> <
> >> If SR.T is in a preserved register, you just use PRED again after return.
> >>
> >
> This is not currently true of BJX2, where 'SR' is not preserved on
> function calls. However, if some of the execution state were copied into
> LR (which is then effectively widened to 64 bits), this could be another
> option for this.
>
> BSR:
> Copies PC(47:0) -> LR(47: 0)
> Copies SR(15:0) -> LR(63:48)
>
> RTS:
> Copies LR(47:0) -> PC
> Copies LR(63:48) -> SR
>
> This would then save predicate state, pred-stack flags, and SIMD
> predicate state, across function calls (as opposed to the current
> approach, where they are undefined).
>
> Similarly, would also make sense to put the WXE in there (in place of
> the IRQ level bits), since WEX profile is function-scoped rather than
> global.
>
>
> ...

Re: Branch prediction hints

<690f26f4-7c01-4ad4-a0e0-de52fba6c732n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17100&group=comp.arch#17100

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8ee40$ab4$1@dont-email.me>
 by: MitchAlsup - Sun, 23 May 2021 20:43 UTC

On Sunday, May 23, 2021 at 3:32:03 PM UTC-5, BGB wrote:
> On 5/23/2021 2:27 PM, Ivan Godard wrote:
> > On 5/23/2021 11:13 AM, BGB wrote:
> >> On 5/23/2021 5:26 AM, Ivan Godard wrote:
> >
> > Having predication for most ops is just entropy clutter and a waste of
> > power: it costs more to *not* do an ADD than to do it, so always do it,
> > and junk the predicates. You need predicates for ops that might do a
> > hard throw, or that change persistent state like store and control flow;
> > nowhere else.
> >
> The predicates mostly just serve to replace an the contents of an opcode
> field with a NOP or similar... (Well, except branches, which might need
> both BRA and BRA_NB cases to deal with the branch predictor, where
> BRA_NB is an unconditional branch to the following instruction for cases
> where the branch predictor had predicted that the branch would be taken,
> but it was not).
<
In My 66000, a taken branch under the shadow of predication causes the
predication control bits in the PS OctoWord to get cleared. So you can
nest predication (a tiny bit), but if you effect real control flow, the shadow
is erased.
>
> The logic to deal with predicates also overlaps with that of pipeline
> flushes (during a branch), which are essentially unavoidable. Not doing
> this would lead to more garbage in the register file, higher register
> pressure, ...
>
>
> And, by the time one has paid what it costs to have a pipeline flush,
> they have already paid most of the cost of the predication.
<
And this is why I term predication as "valuable" when it is between the size
of the FETCH access width and 2× this width.
>
> The only "real" cost here IMO is that it eats slightly more encoding space.
> >>>> Also, the state of SR.T would not be preserved across a function
> >>>> call, so any logic following the function's return could not be
> >>>> predicated.
> >>>
> >>> This can be architected around; ours is not the only possible way to
> >>> do it.
> >>
> >>
> >> Yeah. Most likely option would be a callee-save register containing
> >> predicates or similar. As opposed to a single predicate flag which is
> >> treated as a scratch value (and only ISRs need to bother with
> >> preserving it).
> >>
> >

Re: Branch prediction hints

<s8eqko$aqc$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17105&group=comp.arch#17105

Newsgroups: comp.arch
From: iva...@millcomputing.com (Ivan Godard)
In-Reply-To: <40d0a738-6741-4f96-b608-7b507fe12b90n@googlegroups.com>
 by: Ivan Godard - Mon, 24 May 2021 00:05 UTC

On 5/23/2021 1:31 PM, MitchAlsup wrote:
> On Sunday, May 23, 2021 at 2:32:10 PM UTC-5, Ivan Godard wrote:
>> On 5/23/2021 6:53 AM, EricP wrote:
>>> Thomas Koenig wrote:
>>>> To quote the POWER9 User Manual:
>
>>> Without an explicit "predict never" hint, in the case of HTM
>>> this looked to me like that speculation might have to shut off
>>> while a transaction was in progress because there is no way
>>> to deduce which loads are guarded by a particular condition.
>>> At a minimum, in an HTM it looked like no loads performed while
>>> any prior branch was unresolved, not even prefetched into cache
>>> (or maybe that's a good thing, I don't know).
>>>
>>> Predict-never can also be used for rarely executed error handling code.
>>>
>>> Predict-always is the complementary case for branching around
>>> rarely executed error handling code that one wants inline,
>>> and it doesn't matter what it did the last million times it executed.
>>>
>>>
>> Your HTM breaks on a load? Why? We break only on a colliding store.
> <
> I allow, under a critical circumstance, to NAK the offending load, allowing
> the one in the critical section to make forward progress and prevent the
> one not yet there from interfering.

But that breaks the load; it doesn't break the transaction. Guess we're
in violent agreement.

>>
>> Perhaps your HTM's intra-transaction state is visible from outside?

Re: Branch prediction hints

<490869ad-4aca-4be5-8c7e-e882069683f9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17107&group=comp.arch#17107

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8eqko$aqc$2@dont-email.me>
 by: MitchAlsup - Mon, 24 May 2021 01:05 UTC

On Sunday, May 23, 2021 at 7:05:46 PM UTC-5, Ivan Godard wrote:
> On 5/23/2021 1:31 PM, MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 2:32:10 PM UTC-5, Ivan Godard wrote:
> >> On 5/23/2021 6:53 AM, EricP wrote:
> >>> Thomas Koenig wrote:
> >>>> To quote the POWER9 User Manual:
> >
> >>> Without an explicit "predict never" hint, in the case of HTM
> >>> this looked to me like that speculation might have to shut off
> >>> while a transaction was in progress because there is no way
> >>> to deduce which loads are guarded by a particular condition.
> >>> At a minimum, in an HTM it looked like no loads performed while
> >>> any prior branch was unresolved, not even prefetched into cache
> >>> (or maybe that's a good thing, I don't know).
> >>>
> >>> Predict-never can also be used for rarely executed error handling code.
> >>>
> >>> Predict-always is the complementary case for branching around
> >>> rarely executed error handling code that one wants inline,
> >>> and it doesn't matter what it did the last million times it executed.
> >>>
> >>>
> >> Your HTM breaks on a load? Why? We break only on a colliding store.
> > <
> > I allow, under a critical circumstance, to NAK the offending load, allowing
> > the one in the critical section to make forward progress and prevent the
> > one not yet there from interfering.
> But that breaks the load; it doesn't break the transaction. Guess we're
> in violent agreement.
<
Punish the interference and allow the innocent to make forward progress.
> >>
> >> Perhaps your HTM's intra-transaction state is visible from outside?

Re: Branch prediction hints

<s8ev0d$t2t$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17108&group=comp.arch#17108

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
In-Reply-To: <13fa2553-eaf2-43df-a87a-3559a45d88a0n@googlegroups.com>
 by: BGB - Mon, 24 May 2021 01:20 UTC

On 5/23/2021 3:29 PM, MitchAlsup wrote:
> On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
>> On 5/23/2021 11:13 AM, BGB wrote:
>>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
>>>> On 5/23/2021 1:33 AM, BGB wrote:
>>>
>>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
>>> Compare ops could have a 2-bit predicate-destination field.
>>>
>>> It is possible that 01:00 (Never Execute) could be used to encode a
>>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
>>> instructions or similar).
> <
>> Having predication for most ops is just entropy clutter and a waste of
>> power: it costs more to *not* do an ADD than to do it, so always do it,
>> and junk the predicates. You need predicates for ops that might do a
>> hard throw, or that change persistent state like store and control flow;
>> nowhere else.
> <
> If the standard word size was 36-bits I would disagree, here, but since it is
> 32-bits, I have to agree.
>

I did start writing up some ideas for an ISA spec (as an idea, I called
it BSR4W-A).

Relative to BJX2, it would gain 3 encoding bits due to not having any
16-bit ops, but then lose 5 encoding bits (due to 6-bit register IDs and
a 2-bit predicate-register field), meaning a net loss of 2 bits.
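
As a hedged back-of-envelope check (assuming the 5 lost bits decompose as
three register fields growing from 5 to 6 bits, consistent with the
"3 registers (18 bits)" form mentioned earlier, plus the 2-bit
predicate-register field):

#include <stdio.h>

int main(void)
{
    int gained = 3;                 /* no 16-bit ops left to distinguish    */
    int lost   = 3 * (6 - 5) + 2;   /* wider register IDs + predicate field */
    printf("net encoding bits: %d\n", gained - lost);   /* prints -2 */
    return 0;
}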

Compared to BJX2, this would mean somewhat less usable encoding space
for opcodes, meaning it is likely that either:
I have fewer, or smaller, ops with immediate and displacement fields;
or parts of the core ISA would need to be encoded using jumbo encodings.

Ideally I would like to keep the Disp9+Jumbo24 -> Disp33s pattern for
Loads/Stores, so I would likely need to "squeeze things" somewhere
else to make room.

My current estimate is that if I populated the encoding space, it would
start out basically "already full".

Still not fully settled on instruction layouts yet, and don't feel
particularly inclined at the moment to pursue this, since the main way
to "actually take advantage of it" would like require use of modulo loop
scheduling or clever function inlining or similar (or, basically, one of
the same issues which Itanium had to deal with).

Some possible debate is whether code would benefit from a move from 32
to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
(namely parallel edge walking over a bunch of parameters or similar), I
have doubts.

It is more likely to pay off for a wider core, but this would assume
having a compiler which is effective enough to use the additional width
(whereas, as-is, my compiler can't even really manage 3-wide effectively).

....

Re: Branch prediction hints

<23b797e3-2809-4cd4-a5b4-2085a35f98cen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17109&group=comp.arch#17109

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
In-Reply-To: <s8ev0d$t2t$1@dont-email.me>
 by: MitchAlsup - Mon, 24 May 2021 01:32 UTC

On Sunday, May 23, 2021 at 8:20:16 PM UTC-5, BGB wrote:
> On 5/23/2021 3:29 PM, MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
> >> On 5/23/2021 11:13 AM, BGB wrote:
> >>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
> >>>> On 5/23/2021 1:33 AM, BGB wrote:
> >>>
> >>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
> >>> Compare ops could have a 2-bit predicate-destination field.
> >>>
> >>> It is possible that 01:00 (Never Execute) could be used to encode a
> >>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
> >>> instructions or similar).
> > <
> >> Having predication for most ops is just entropy clutter and a waste of
> >> power: it costs more to *not* do an ADD than to do it, so always do it,
> >> and junk the predicates. You need predicates for ops that might do a
> >> hard throw, or that change persistent state like store and control flow;
> >> nowhere else.
> > <
> > If the standard word size was 36-bits I would disagree, here, but since it is
> > 32-bits, I have to agree.
> >
> I did start writing up some ideas for an ISA spec (as an idea, I called
> is BSR4W-A).
>
> Relative to BJX2, it would gains 3 encoding bits due to not having any
> 16-bit ops, but then lose 5 encoding bits (due to 6-bit register IDs and
> a 2-bit predicate-register field), meaning a net loss of 2 bits.
>
>
> Compared to BJX2, this would mean somewhat less usable encoding space
> for opcodes, meaning it is likely either:
> I have fewer, or smaller, ops with immediate and displacement fields;
> Parts of the core ISA would need to be encoded using jumbo-encodings.
>
>
> Ideally I would like to keep the Disp9+Jumbo24 -> Disp33s pattern for
> Loads/Stores, so it is likely would need to "squeeze things" somewhere
> else to make room.
>
> My current estimate is that if I populated the encoding space, it would
> start out basically "already full".
<
Would I be out of line to state that this sounds like a poor starting point?
<
My 66000 has 1/3rd of its Major OpCode space unallocated,
a bit less than 1/2 of its memory reference OpCode Space allocated,
a bit less than 1/2 of its 2-operand OpCode Space allocated,
a bit less than 1/128 of its 1-operand OpCode Space allocated,
and 1/4 of its 3-operand OpCode Space unallocated.
>
>
> Still not fully settled on instruction layouts yet, and don't feel
> particularly inclined at the moment to pursue this, since the main way
> to "actually take advantage of it" would like require use of modulo loop
> scheduling or clever function inlining or similar (or, basically, one of
> the same issues which Itanium had to deal with).
>
> Some possible debate is whether code would benefit from a move from 32
> to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
> (namely parallel edge walking over a bunch of parameters or similar), I
> have doubts.
>
>
> It is more likely to pay off for a wider core, but this would assume
> having a compiler which is effective enough to use the additional width
> (whereas, as-is, my compiler can't even really manage 3-wide effectively).
>
I have lived under the assumption that the wider cores have the HW resources
to do many of these things for themselves, so that code written, compiled, and
scheduled for the 1-wide cores runs within spitting distance of the best compiled
code one could target at the GBOoO core. I developed this assumption from the
Mc 88120 effort, where we even achieved 2.0 IPC running SPEC 89 XLISP! and
5.99 IPC running MATRIX300.
> ...

Re: Branch prediction hints

<s8f9bo$dup$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17111&group=comp.arch#17111

 by: BGB - Mon, 24 May 2021 04:16 UTC

On 5/23/2021 8:32 PM, MitchAlsup wrote:
> On Sunday, May 23, 2021 at 8:20:16 PM UTC-5, BGB wrote:
>> On 5/23/2021 3:29 PM, MitchAlsup wrote:
>>> On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
>>>> On 5/23/2021 11:13 AM, BGB wrote:
>>>>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
>>>>>> On 5/23/2021 1:33 AM, BGB wrote:
>>>>>
>>>>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
>>>>> Compare ops could have a 2-bit predicate-destination field.
>>>>>
>>>>> It is possible that 01:00 (Never Execute) could be used to encode a
>>>>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
>>>>> instructions or similar).
>>> <
>>>> Having predication for most ops is just entropy clutter and a waste of
>>>> power: it costs more to *not* do an ADD than to do it, so always do it,
>>>> and junk the predicates. You need predicates for ops that might do a
>>>> hard throw, or that change persistent state like store and control flow;
>>>> nowhere else.
>>> <
>>> If the standard word size was 36-bits I would disagree, here, but since it is
>>> 32-bits, I have to agree.
>>>
>> I did start writing up some ideas for an ISA spec (as an idea, I called
>> is BSR4W-A).
>>
>> Relative to BJX2, it would gains 3 encoding bits due to not having any
>> 16-bit ops, but then lose 5 encoding bits (due to 6-bit register IDs and
>> a 2-bit predicate-register field), meaning a net loss of 2 bits.
>>
>>
>> Compared to BJX2, this would mean somewhat less usable encoding space
>> for opcodes, meaning it is likely either:
>> I have fewer, or smaller, ops with immediate and displacement fields;
>> Parts of the core ISA would need to be encoded using jumbo-encodings.
>>
>>
>> Ideally I would like to keep the Disp9+Jumbo24 -> Disp33s pattern for
>> Loads/Stores, so it is likely would need to "squeeze things" somewhere
>> else to make room.
>>
>> My current estimate is that if I populated the encoding space, it would
>> start out basically "already full".
> <
> Would I be out of line to state that this sounds like a poor starting point?

Probably.

It is more akin to designing for a 16-bit ISA, where it doesn't take
much to eat through pretty much all of it.

> <
> My 66000 has 1/3rd of its Major OpCode space unallocated,
> a bit less than 1/2 of its memory reference OpCode Space allocated,
> a bit less than 1/2 of its 2-operand OpCode Space allocated,
> a bit less than 1/128 of its 1-operand Op[Code Apace allocated,
> and 1/4 of its 3-operand OpCode Space unallocated.

Started looking at it a little more, and am realizing encoding space may be
a more serious problem than I initially thought...

I can't really map BJX2 to this new space, it just doesn't fit...

Then again, maybe it might win more points with the "RISC means small
ISA listing" crowd... Because one runs out of encoding bits before they
can fit all that much into it...

"Well, Imma define some Disp9 Load/Store Ops...",
"Oh-Noes, that was 1/4 of the encoding space!",
"How about some 3R Load/Store ops and 3R ALU ops and 2R space",
"Now it at 1/2 of the opcode space!"

Then one has to struggle to fit some useful 3RI ALU ops, 2RI ops, and
Branch ops, before realizing they are already basically out of encoding
space...

Yeah, a shortfall of several bits seems to make a pretty big difference...

It goes a little further if one does Load/Store and 3RI ops using
Disp6/Imm6 instead of Disp9/Imm9.

Not enough bits to encode an Imm33/Disp33 in a 64-bit pair, and not
enough bits to encode Imm64 in 96-bits, ...

Yeah, "poor starting point" is starting to seem fairly evident...

>>
>>
>> Still not fully settled on instruction layouts yet, and don't feel
>> particularly inclined at the moment to pursue this, since the main way
>> to "actually take advantage of it" would like require use of modulo loop
>> scheduling or clever function inlining or similar (or, basically, one of
>> the same issues which Itanium had to deal with).
>>
>> Some possible debate is whether code would benefit from a move from 32
>> to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
>> (namely parallel edge walking over a bunch of parameters or similar), I
>> have doubts.
>>
>>
>> It is more likely to pay off for a wider core, but this would assume
>> having a compiler which is effective enough to use the additional width
>> (whereas, as-is, my compiler can't even really manage 3-wide effectively).
>>
> I have lived under the assumption that the wider cores have the HW resources
> to do many of these things for themselves, so that code written, compiled, and
> scheduled for the 1-wide cores run within spitting distance of the best compiled
> code one could target at the GBOoO core. I developed this assumption from the
> Mc 88120 effort where we even achieved 2.0 IPC running SPEC 89 XLISP ! and
> 5.99 IPC running MATRIX300.

I am assuming a lack of any OoO or GBOoO capabilities, and instead a
strictly in-order bundle-at-a-time core more like the existing BJX2
pipeline, just possibly widened from 3 to 5 or similar.

But, the amount of heavy lifting the compiler would need to do to make
this worthwhile is a problem.

Some stuff I have read does seem to imply that GCC may have some of the
needed optimizations to be able to make this workable, though.

Likewise, going the other direction, a 1-wide RISC-like core (vs 3-wide):
I can clock it at 75 or 100MHz;
I still need to reduce the L1 cache sizes at these speeds.

With smaller L1's, I can have a 100MHz core that runs slower than a
3-wide core running at 50MHz...
Which is kinda how I ended up in the current boat to begin with.

And, a 3-wide core at 75MHz is faster than a 1-wide core at 100MHz.

And the 50MHz core rolls in with its ability to have massively larger L1
caches and similar, and "owns it".

Decided to leave out going off onto a tangent about my ongoing battles
with DRAM bandwidth... (ATM, it appears to be mostly a "death by 1000
paper cuts" situation, though now mostly confined to the L2 cache and L2
Cache <-> DDR Controller interface and similar).

If there were some good way to predict cache misses before they
happened, this could be useful...

May also consider a "sweep L2 and evict old dirty cache lines"
mechanism, ...
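
A rough software model of what such a sweeper could look like (sketch only:
the set/way geometry, the aging scheme, and writeback_line() are assumptions,
not the actual L2):

/* Toy model of a background "sweep and evict old dirty lines" pass over
   an L2 directory.  Geometry, ages, and writeback_line() are placeholders. */
#include <stdbool.h>
#include <stdint.h>

#define L2_SETS 512                    /* assumed geometry */
#define L2_WAYS 4

struct line { bool valid, dirty; uint32_t tag; uint32_t age; };
static struct line l2[L2_SETS][L2_WAYS];

static void writeback_line(int set, int way) { l2[set][way].dirty = false; }

/* Called periodically (or from otherwise-idle memory-bus cycles): push old
   dirty lines back toward DRAM so later misses don't also pay for the
   writeback. */
void l2_sweep(uint32_t age_threshold)
{
    for (int set = 0; set < L2_SETS; set++)
        for (int way = 0; way < L2_WAYS; way++) {
            struct line *ln = &l2[set][way];
            ln->age++;                            /* crude aging */
            if (ln->valid && ln->dirty && ln->age > age_threshold)
                writeback_line(set, way);         /* stays valid, now clean */
        }
}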

Re: Branch prediction hints

<XPQqI.613760$nn2.276769@fx48.iad>

https://www.novabbs.com/devel/article-flat.php?id=17115&group=comp.arch#17115

 by: EricP - Mon, 24 May 2021 16:49 UTC

Ivan Godard wrote:
> On 5/23/2021 6:53 AM, EricP wrote:
>>
>> With Hardware Transactional Memory HTM reading a memory location
>> even speculatively might abort another's processors active transaction,
>> you don't want to even touch data memory without explicit permission,
>> not even prefetching any load or store addresses.
>
> Your HTM breaks on a load? Why? We break only on a colliding store.
>
> Perhaps your HTM's intra-transaction state is visible from outside?

Current implementations by Intel and IBM (and maybe Sparc too)
use the coherence protocol and exclusive cache line ownership
to detect transaction contention.
A load triggers a cache line read-share, which causes the current
owner to lose exclusive state and aborts the current line owner's
transaction (contender-wins, no forward progress guarantee).

They work that way because it was easier to incorporate some form
of HTM into their existing cache hierarchy. However the results
might be problematic, fragile, or have performance issues.
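
This is easy to see on the Intel flavor; a minimal sketch, assuming a
TSX/RTM-capable part and GCC or Clang built with -mrtm and -pthread (whether
the abort actually fires is timing-dependent): a thread that merely loads a
location in another thread's write set will typically kill that transaction.

#include <immintrin.h>   /* _xbegin/_xend: compile with -mrtm */
#include <pthread.h>
#include <stdio.h>

static volatile long x;

static void *writer(void *arg)
{
    (void)arg;
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        x = 42;                                       /* x is now in the write set */
        for (volatile int i = 0; i < 1000000; i++) ;  /* widen the window */
        _xend();
        puts("writer: committed");
    } else {
        /* Requester-wins: the reader's read-share request stripped our
           exclusive ownership and the whole transaction was thrown away. */
        printf("writer: aborted, status=0x%x\n", status);
    }
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    printf("reader: saw %ld\n", x);                   /* a plain load, nothing more */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, reader, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}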

My original HTM design ideas do not work the same way,
and get fairness and a forward progress guarantee.
That original design would abort a transaction if an
exclusive cache line was taken away from its HTM owner.

That would not occur between participants in an HTM transaction.
However it could occur if there is false sharing of a cache line
with transaction non-participants, as might occur in, for example,
a non cache line aware memory heap.

If such false sharing did occur, it broke my HTM forward progress
guarantee as the false sharer could repeatedly abort a transaction,
possibly without knowing they were doing so.
So in the case of false sharing, my design degraded to equivalent
to the current implementations.

I have some ideas on how to fix the false sharing problem
- negotiate at the line level and if there is a line collision then
renegotiate at the byte level - but I have not followed up on them.

Re: Branch prediction hints

<ejRqI.173866$lyv9.146449@fx35.iad>

https://www.novabbs.com/devel/article-flat.php?id=17116&group=comp.arch#17116

 by: EricP - Mon, 24 May 2021 17:24 UTC

MitchAlsup wrote:
> On Sunday, May 23, 2021 at 8:54:03 AM UTC-5, EricP wrote:
>
>> There is a case for a "predict never" hint where it doesn't matter
>> what this conditional branch did the last million times it executed,
>> always predict not-taken.
>>
>> In a spinlock which normally test the lock condition with
>> a load before attempting the atomic sequence,
>> you never want to speculatively execute into the atomic sequence.
>> At a minimum it could cause ping-pong'ing the cache lines.
> <
> Yes, the classical test-and-test-and-set. But this only decreases
> bus traffic from BigO(n^3) to BigO(N^2). There are ways to
> decrease bus traffic to BigO( N+3 )

Right, like the CLH spinlock queues discussed here many times.

However if it speculates past a spinlock, it can start touching the
data objects guarded by those spinlocks and ping-pong those lines too.
And if you have multiple waiting contenders doing this,
it could flood the coherence bus with useless traffic.

To prevent this, it must never predict that the lock grant is true and
speculatively execute load instructions after the conditional branch
(and some in-order designs can do this too).
A "predict-never" hint seems like a simple solution to this problem.

>> With Hardware Transactional Memory HTM reading a memory location
>> even speculatively might abort another's processors active transaction,
>> you don't want to even touch data memory without explicit permission,
>> not even prefetching any load or store addresses.
> <
> This is one of the things WRONG about HTM.

Yes, but it is the real, current state of affairs.

> <
> in My 66000 there is a Exotic Synchronization Method (ESM) which is not
> an HTM but can be used to create HTMs. In ESM, if the ATOMIC event has
> reached a critical juncture (i.e., can complete) the CPUs reaching those
> points gain the ability to NAK interference, allowing these CPUs to complete
> the ATOMIC event, and making the interferers run slower!
>> If you don't have a hint to explicitly block speculation at the
>> branch then the design would have to use more complicated and
>> probably error prone dynamic logic to "deduce" what to do.
> <
> You do not want Naked memory refs to be used to setup or complete
> ATOMIC events. You need to "mark" their participation in the event
> so the machine knows that such an event is going on from the out-
> set.
>> Without an explicit "predict never" hint, in the case of HTM
>> this looked to me like that speculation might have to shut off
>> while a transaction was in progress because there is no way
>> to deduce which loads are guarded by a particular condition.
>> At a minimum, in an HTM it looked like no loads performed while
>> any prior branch was unresolved, not even prefetched into cache
>> (or maybe that's a good thing, I don't know).
> <
> Should an ATOMIC event fail, the compiler needs to know that all
> of the participating memory references are not viable containers
> of data! And not ever use those units of stale data. The only use
> that should be allowed is to print the values that failed.

Above again I was thinking about line ownership interference.
For example, something like an HTM on an AVL binary tree
where the branch decisions are inside the transaction.
It has left and right child pointers, but it should not touch
_either_ child object until it decides which path to follow.

A "predict-never" hint is simple and could solve this problem too.

Re: Branch prediction hints

<sBRqI.613761$nn2.535143@fx48.iad>

https://www.novabbs.com/devel/article-flat.php?id=17118&group=comp.arch#17118

 by: EricP - Mon, 24 May 2021 17:43 UTC

MitchAlsup wrote:
> On Sunday, May 23, 2021 at 2:36:26 PM UTC-5, Ivan Godard wrote:
>> On 5/23/2021 9:07 AM, MitchAlsup wrote:
>>> On Sunday, May 23, 2021 at 8:54:03 AM UTC-5, EricP wrote:
>>>> If you don't have a hint to explicitly block speculation at the
>>>> branch then the design would have to use more complicated and
>>>> probably error prone dynamic logic to "deduce" what to do.
>>> <
>>> You do not want Naked memory refs to be used to setup or complete
>>> ATOMIC events. You need to "mark" their participation in the event
>>> so the machine knows that such an event is going on from the out-
>>> set.
> <
>> This also permits you to have intra-transaction stores that are not part
>> of the transaction, say for logging and debugging, where you don't lock
>> the log memory.
> <
> Yes, exactly--and that is where the word "participating" came into use
> when doing the ASF as AMD. Since you CANNOT single step through
> and ATOMIC event, you need some way of figuring out what is going
> on, lobbing registers into a memory buffer for later printing is the
> prescribed means.
> <

I also did not want to checkpoint the register set at the start
of a transaction and roll them all back on abort.
I want memory protected but not registers.

When I played about (on paper) with the ASF design I found
it quite inconvenient that there was no way to communicate
anything from inside a transaction to outside.

In my design when a transaction starts, it remembers the start PC.
If an abort occurs, it tosses the protected memory changes,
and jumps back to that PC and sets a status into a register,
but any other registers already retired have their values retained.
The abort handler would then reload the registers it wants.
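
The resulting control flow looks roughly like this; tx_begin/tx_store/
tx_commit are hypothetical stand-ins for the primitives described above, not
a real API, and the stubs below always succeed so the snippet stays
self-contained:

#include <stdio.h>

enum { TX_OK = 0 };                           /* nonzero = abort status */

static unsigned tx_begin(void)            { return TX_OK; }   /* stub */
static void tx_store(long *p, long v)     { *p = v; }         /* stub */
static void tx_commit(void)               { }                 /* stub */

/* On hardware with the described semantics, an abort discards the
   participating stores and resumes at the tx_begin() point with a nonzero
   status in a register; already-retired registers keep their values, and
   the handler simply re-reads from memory whatever it still needs.        */
long swap_top(long *top, long newval)
{
    for (;;) {
        unsigned status = tx_begin();         /* remembers this point (start PC) */
        if (status != TX_OK)
            continue;                         /* stale values must not be reused */
        long old = *top;                      /* participating load  */
        tx_store(top, newval);                /* participating store */
        tx_commit();
        return old;
    }
}

int main(void)
{
    long top = 1;
    long old = swap_top(&top, 2);
    printf("old=%ld new=%ld\n", old, top);
    return 0;
}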

Re: Branch prediction hints

<1430abb7-231c-4f12-985a-44b623c2fcafn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17119&group=comp.arch#17119

 by: MitchAlsup - Mon, 24 May 2021 17:52 UTC

On Sunday, May 23, 2021 at 11:16:58 PM UTC-5, BGB wrote:
> On 5/23/2021 8:32 PM, MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 8:20:16 PM UTC-5, BGB wrote:
> >> On 5/23/2021 3:29 PM, MitchAlsup wrote:
> >>> On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
> >>>> On 5/23/2021 11:13 AM, BGB wrote:
> >>>>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
> >>>>>> On 5/23/2021 1:33 AM, BGB wrote:
> >>>>>
> >>>>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
> >>>>> Compare ops could have a 2-bit predicate-destination field.
> >>>>>
> >>>>> It is possible that 01:00 (Never Execute) could be used to encode a
> >>>>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
> >>>>> instructions or similar).
> >>> <
> >>>> Having predication for most ops is just entropy clutter and a waste of
> >>>> power: it costs more to *not* do an ADD than to do it, so always do it,
> >>>> and junk the predicates. You need predicates for ops that might do a
> >>>> hard throw, or that change persistent state like store and control flow;
> >>>> nowhere else.
> >>> <
> >>> If the standard word size was 36-bits I would disagree, here, but since it is
> >>> 32-bits, I have to agree.
> >>>
> >> I did start writing up some ideas for an ISA spec (as an idea, I called
> >> is BSR4W-A).
> >>
> >> Relative to BJX2, it would gains 3 encoding bits due to not having any
> >> 16-bit ops, but then lose 5 encoding bits (due to 6-bit register IDs and
> >> a 2-bit predicate-register field), meaning a net loss of 2 bits.
> >>
> >>
> >> Compared to BJX2, this would mean somewhat less usable encoding space
> >> for opcodes, meaning it is likely either:
> >> I have fewer, or smaller, ops with immediate and displacement fields;
> >> Parts of the core ISA would need to be encoded using jumbo-encodings.
> >>
> >>
> >> Ideally I would like to keep the Disp9+Jumbo24 -> Disp33s pattern for
> >> Loads/Stores, so it is likely would need to "squeeze things" somewhere
> >> else to make room.
> >>
> >> My current estimate is that if I populated the encoding space, it would
> >> start out basically "already full".
> > <
> > Would I be out of line to state that this sounds like a poor starting point?
> Probably.
>
> It is more akin to designing for a 16-bit ISA, where it doesn't take
> much to eat through pretty much all of it.
> > <
> > My 66000 has 1/3rd of its Major OpCode space unallocated,
> > a bit less than 1/2 of its memory reference OpCode Space allocated,
> > a bit less than 1/2 of its 2-operand OpCode Space allocated,
> > a bit less than 1/128 of its 1-operand Op[Code Apace allocated,
> > and 1/4 of its 3-operand OpCode Space unallocated.
> Starts looking at it a little more, and realizing encoding space may be
> a more serious problem than I realized initially...
>
>
> I can't really map BJX2 to this new space, it just doesn't fit...
>
>
> Then again, maybe it might win more points with the "RISC means small
> ISA listing" crowd... Because one runs out of encoding bits before they
> can fit all that much into it...
>
>
> "Well, Imma define some Disp9 Load/Store Ops...",
> "Oh-Noes, that was 1/4 of the encoding space!",
> "How about some 3R Load/Store ops and 3R ALU ops and 2R space",
> "Now it at 1/2 of the opcode space!"
<
To be fair, I made a lot of these mistakes in Mc 88K, and corrected the
vast majority of them in My 66000.
>
> Then one has to struggle to fit some useful 3RI ALU ops, 2RI ops, and
> Branch ops, before realizing they are already basically out of encoding
> space...
<
The important thing to remember is that the most precious resource is
the Major OpCode space--and the reason is that this gives you access
to the other spaces.
<
In My 66000, the Major OpCode space consists of all the 16-bit-immediate
instructions, the branches with IP-relative offsets, and the extension
OpCodes, of which there are 6 {Predication, Shifts, 2R+Disp memory refs,
2-Operand, 3-Operand, and 1-Operand}.
<
For all of the extended instructions, My 66000 has 3-bits to control the
signs of the operands and access to long immediates, and access to
5-bit immediates in Src1. This supports things like 1<<k in a single instruction.
<
The second most important resource is the 3-operand space because
there are only 8 available entries and we need FMAC (single and double),
CMOV, and INSert.
<
The other spaces are so partially populated that one has a pretty free
rein.
>
>
> Yeah, a shortfall of several bits seems to make a pretty big difference...
>
>
> It goes a little further if one does Load/Store and 3RI ops using
> Disp6/Imm6 instead of Disp9/Imm9.
>
> Not enough bits to encode an Imm33/Disp33 in a 64-bit pair, and not
> enough bits to encode Imm64 in 96-bits, ...
>
>
> Yeah, "poor starting point" is starting to seem fairly evident...
> >>
> >>
> >> Still not fully settled on instruction layouts yet, and don't feel
> >> particularly inclined at the moment to pursue this, since the main way
> >> to "actually take advantage of it" would like require use of modulo loop
> >> scheduling or clever function inlining or similar (or, basically, one of
> >> the same issues which Itanium had to deal with).
> >>
> >> Some possible debate is whether code would benefit from a move from 32
> >> to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
> >> (namely parallel edge walking over a bunch of parameters or similar), I
> >> have doubts.
> >>
> >>
> >> It is more likely to pay off for a wider core, but this would assume
> >> having a compiler which is effective enough to use the additional width
> >> (whereas, as-is, my compiler can't even really manage 3-wide effectively).
> >>
> > I have lived under the assumption that the wider cores have the HW resources
> > to do many of these things for themselves, so that code written, compiled, and
> > scheduled for the 1-wide cores run within spitting distance of the best compiled
> > code one could target at the GBOoO core. I developed this assumption from the
> > Mc 88120 effort where we even achieved 2.0 IPC running SPEC 89 XLISP ! and
> > 5.99 IPC running MATRIX300.
<
> I am assuming a lack of any OoO or GBOoO capabilities, and instead a
> strictly in-order bundle-at-a-time core more like the existing BJX2
> pipeline, just possibly widened from 3 to 5 or similar.
<
Yes, you are targeting a particular chip to hold your design, while I am
designing from the very small (1-wide In Order) to the moderately large
(8-wide Out of Order)
>
> But, the amount of heavy lifting the compiler would need to do to make
> this worthwhile is a problem.
<
Having designed a 6-wide GBOoO and understanding your medium of
expression, I can understand why you are not.
>
> Some stuff I have read does seem to imply that GCC may have some of
> needed optimizations to be able to make this workable though.
>
>
>
> Likewise, going the other direction, a 1-wide RISC-like core (vs 3-wide):
> I can clock it at 75 or 100MHz;
> I still need to reduce the L1 cache sizes at these speeds.
<
What would the trade-off be if you added a pipe stage to LD so you
could run at the higher frequency and have the larger cache?
>
>
> With smaller L1's, I can have a 100MHz core that runs slower than a
> 3-wide core running at 50MHz...
> Which is kinda how I ended up in the current boat to begin with.
<
What about adding a pipe stage to LD and making the cache 75%
of a cycle longer?
<
I don't remember the sizes of your L1s, but this sounds like the classical
cache pipe-stage dilemma: 8K 2-cycle versus 64K 3-cycle. For high miss
rate applications the larger, slower cache is better, especially if the core
frequency improves.
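
A quick average-memory-access-time comparison makes the trade concrete; the
miss rates and the L2 penalty below are assumed for illustration, not
measurements from anybody's core:

#include <stdio.h>

int main(void)
{
    /* Assumed numbers, for illustration only. */
    double l2_penalty = 20.0;                     /* cycles to L2 on an L1 miss */

    double amat_small = 2.0 + 0.10 * l2_penalty;  /*  8K, 2-cycle hit, 10% miss */
    double amat_large = 3.0 + 0.03 * l2_penalty;  /* 64K, 3-cycle hit,  3% miss */

    printf("8K/2-cycle : %.1f cycles per access\n", amat_small);  /* 4.0 */
    printf("64K/3-cycle: %.1f cycles per access\n", amat_large);  /* 3.6 */
    return 0;
}

And that is before crediting the larger, deeper-pipelined cache with whatever
core-frequency headroom the extra pipe stage buys.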
>
> And, a 3-wide core at 75MHz is faster than a 1-wide core at 100MHz.
<
Back when we had the 1-wides up and running and were designing the
2-wides, we simulated a bunch of the design space and found, in general:
1-wide could get 0.7 IPC, 2-wide 0.95 IPC, 3-wide 1.1 IPC, 4-wide 1.2 IPC.
{Note these were NOT the OoO machines.}
<
So, based on those numbers your 3-wide should be about 20% faster.
<
How fast is it relative to the 1-wide?
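
(Plugging those figures into the scenario upthread, and assuming they roughly
carry over: 1.1 IPC x 75 MHz = 82.5 M instructions/s for the 3-wide versus
0.7 IPC x 100 MHz = 70 M instructions/s for the 1-wide, i.e. about 18% in the
3-wide's favor.)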
>
> And the 50MHz core rolls in with its ability to have massively larger L1
> caches and similar, and "owns it".
>
>
>
>
> Decided to leave out going off onto a tangent about my ongoing battles
> with DRAM bandwidth... (ATM, it appears to be mostly a "death by 1000
> paper cuts" situation, though now mostly confined to the L2 cache and L2
> Cache <-> DDR Controller interface and similar).
>
>
> If there were some good way to predict cache misses before they
> happened, this could be useful...
>
> May also consider a "sweep L2 and evict old dirty cache lines"
> mechanism, ...


Re: Branch prediction hints

<7fb7aaf5-f3d2-42c0-b258-f670603330f5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17120&group=comp.arch#17120

 by: MitchAlsup - Mon, 24 May 2021 17:57 UTC

On Monday, May 24, 2021 at 12:43:55 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 2:36:26 PM UTC-5, Ivan Godard wrote:
> >> On 5/23/2021 9:07 AM, MitchAlsup wrote:
> >>> On Sunday, May 23, 2021 at 8:54:03 AM UTC-5, EricP wrote:
> >>>> If you don't have a hint to explicitly block speculation at the
> >>>> branch then the design would have to use more complicated and
> >>>> probably error prone dynamic logic to "deduce" what to do.
> >>> <
> >>> You do not want Naked memory refs to be used to setup or complete
> >>> ATOMIC events. You need to "mark" their participation in the event
> >>> so the machine knows that such an event is going on from the out-
> >>> set.
> > <
> >> This also permits you to have intra-transaction stores that are not part
> >> of the transaction, say for logging and debugging, where you don't lock
> >> the log memory.
> > <
> > Yes, exactly--and that is where the word "participating" came into use
> > when doing the ASF as AMD. Since you CANNOT single step through
> > and ATOMIC event, you need some way of figuring out what is going
> > on, lobbing registers into a memory buffer for later printing is the
> > prescribed means.
> > <
> I also did not want to checkpoint the register set at the start
> of a transaction and roll them all back on abort.
> I want memory protected but not registers.
>
> When I played about (on paper) with the ASF design I found
> it quite inconvenient that there was no way to communicate
> anything from inside a transaction to outside.
>
> In my design when a transaction starts, it remembers the start PC.
<
ESM does this too, recording the starting IP; if interference happens
the ATOMIC event restarts there. ESM also uses the Branch on memory
interference instruction to reset this point as an escape point should the
event fail later.
<
I admit that that was a problem in ASF.
<
> If an abort occurs, it tosses the protected memory changes,
> and jumps back to that PC and sets a status into a register,
> but any other registers already retired have their values retained.
<
ESM does nothing to the registers upon fail either, but it specifically
states that the compiler is not allowed to use the values in the
"participating" registers. The non-participating registers may not
have been updated in a von Neumann order, either.
<
> The abort handler would then reload the registers it wants.
<
Yes, this is the "compiler cannot use" part.

Re: Branch prediction hints

<s8gu6d$udd$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=17124&group=comp.arch#17124

 by: Thomas Koenig - Mon, 24 May 2021 19:18 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:
> Ivan Godard wrote:
>> On 5/23/2021 6:53 AM, EricP wrote:
>>>
>>> With Hardware Transactional Memory HTM reading a memory location
>>> even speculatively might abort another's processors active transaction,
>>> you don't want to even touch data memory without explicit permission,
>>> not even prefetching any load or store addresses.
>>
>> Your HTM breaks on a load? Why? We break only on a colliding store.
>>
>> Perhaps your HTM's intra-transaction state is visible from outside?
>
> Current implementations by Intel and IBM (and maybe Sparc too)
> use the coherence protocol and exclusive cache line ownership
> to detect transaction contention.
> A load triggers a cache line read-share, which causes the current
> owner to loose exclusive state and aborts the current line owner's
> transaction (contender-wins, no forward progress guarantee).
>
> They work that way because it was easier to incorporate some form
> of HTM into their existing cache hierarchy. However the results
> might be problematic, fragile, or have performance issues.

If I remember correctly, hardware transactional memory
worked on POWER8, but due to a design fault IBM did not get
it to work correctly on POWER9. There are some hints in
https://www.kernel.org/doc/html/latest/powerpc/transactional_memory.html#power9
and "POWER9 Processor DD 2.1 Use Restrictions" flatly
states "Limitation Not Supported".

Re: Branch prediction hints

<s8h1je$dhj$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17127&group=comp.arch#17127

 by: Marcus - Mon, 24 May 2021 20:16 UTC

On 2021-05-23, MitchAlsup wrote:
> On Sunday, May 23, 2021 at 12:15:04 PM UTC-5, robf...@gmail.com wrote:
>> Speaking of the usefulness of branch hints for prediction I have to agree
>> that they are not that useful. As a gag though I added the ability to supply
>> branch predictor hints in ‘if’ statements that also allowed the branch
>> predictor to be selected. How useful is it to be able to select the branch
>> predictor to use (assuming multiple predictors are present)?
>> The only case I can think of is maybe power savings.
> <
> I might note that Virtual Vector Method loops do not use the branch predictor
> but are executed in advance of the loop iteration to effectively perform as if
> the branch took zero cycles when the loop terminates (and zero cycles when
> the loop continues.)
> <
> This improves the prediction accuracy of the "rest of the branches".
> <
> PREDication also does not use the branch predictor getting the HW setup
> to execute either then-clause or else-clause. This also improves the prediction
> accuracy of the "rest of the branches".
>

Those are nice properties, and some of it reminds me of DSP style
"hardware assisted loops" (e.g. SPLOOP in TI 320C66x).

MRISC32 style vector loops still have regular loop branch instructions,
and although they execute less frequently (the core of the vector loop
is "hardware assisted"), they still occupy slots in the branch
predictor.

BTW, apart from VVM, are there any good examples of ISAs with loop
instructions that are easy to predict ahead of time (thus effectively
unrolling loops, eliminating compares/branches, and reducing branch
predictor load)?

/Marcus

Re: Branch prediction hints

<s8h31l$vv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17128&group=comp.arch#17128

 by: BGB - Mon, 24 May 2021 20:41 UTC

On 5/24/2021 12:52 PM, MitchAlsup wrote:
> On Sunday, May 23, 2021 at 11:16:58 PM UTC-5, BGB wrote:
>> On 5/23/2021 8:32 PM, MitchAlsup wrote:
>>> On Sunday, May 23, 2021 at 8:20:16 PM UTC-5, BGB wrote:
>>>> On 5/23/2021 3:29 PM, MitchAlsup wrote:
>>>>> On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
>>>>>> On 5/23/2021 11:13 AM, BGB wrote:
>>>>>>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
>>>>>>>> On 5/23/2021 1:33 AM, BGB wrote:
>>>>>>>
>>>>>>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
>>>>>>> Compare ops could have a 2-bit predicate-destination field.
>>>>>>>
>>>>>>> It is possible that 01:00 (Never Execute) could be used to encode a
>>>>>>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
>>>>>>> instructions or similar).
>>>>> <
>>>>>> Having predication for most ops is just entropy clutter and a waste of
>>>>>> power: it costs more to *not* do an ADD than to do it, so always do it,
>>>>>> and junk the predicates. You need predicates for ops that might do a
>>>>>> hard throw, or that change persistent state like store and control flow;
>>>>>> nowhere else.
>>>>> <
>>>>> If the standard word size was 36-bits I would disagree, here, but since it is
>>>>> 32-bits, I have to agree.
>>>>>
>>>> I did start writing up some ideas for an ISA spec (as an idea, I called
>>>> is BSR4W-A).
>>>>
>>>> Relative to BJX2, it would gains 3 encoding bits due to not having any
>>>> 16-bit ops, but then lose 5 encoding bits (due to 6-bit register IDs and
>>>> a 2-bit predicate-register field), meaning a net loss of 2 bits.
>>>>
>>>>
>>>> Compared to BJX2, this would mean somewhat less usable encoding space
>>>> for opcodes, meaning it is likely either:
>>>> I have fewer, or smaller, ops with immediate and displacement fields;
>>>> Parts of the core ISA would need to be encoded using jumbo-encodings.
>>>>
>>>>
>>>> Ideally I would like to keep the Disp9+Jumbo24 -> Disp33s pattern for
>>>> Loads/Stores, so it is likely would need to "squeeze things" somewhere
>>>> else to make room.
>>>>
>>>> My current estimate is that if I populated the encoding space, it would
>>>> start out basically "already full".
>>> <
>>> Would I be out of line to state that this sounds like a poor starting point?
>> Probably.
>>
>> It is more akin to designing for a 16-bit ISA, where it doesn't take
>> much to eat through pretty much all of it.

Clarification:
I had meant, "it probably was a poor starting point" rather than "it was
probably out of line"...

>>> <
>>> My 66000 has 1/3rd of its Major OpCode space unallocated,
>>> a bit less than 1/2 of its memory reference OpCode Space allocated,
>>> a bit less than 1/2 of its 2-operand OpCode Space allocated,
>>> a bit less than 1/128 of its 1-operand Op[Code Apace allocated,
>>> and 1/4 of its 3-operand OpCode Space unallocated.
>> Starts looking at it a little more, and realizing encoding space may be
>> a more serious problem than I realized initially...
>>
>>
>> I can't really map BJX2 to this new space, it just doesn't fit...
>>
>>
>> Then again, maybe it might win more points with the "RISC means small
>> ISA listing" crowd... Because one runs out of encoding bits before they
>> can fit all that much into it...
>>
>>
>> "Well, Imma define some Disp9 Load/Store Ops...",
>> "Oh-Noes, that was 1/4 of the encoding space!",
>> "How about some 3R Load/Store ops and 3R ALU ops and 2R space",
>> "Now it at 1/2 of the opcode space!"
> <
> To be fair, I made a loot of these mistakes in Mc 88K, and corrected the
> vast majority of them in My 66000.
>>
>> Then one has to struggle to fit some useful 3RI ALU ops, 2RI ops, and
>> Branch ops, before realizing they are already basically out of encoding
>> space...
> <
> The important thing to remember is that the most precious resource is
> the Major OpCode space--and the reason is that this gives you access
> to the other spaces.
> <
> In My 66000, the Major OpCode space consists of all 16-bit immediates
> The branches with IP relative offsets, and the extension OpCodes, of
> which there are 6 {Predication, Shifts, 2R+Disp memory refs, 2-Operand,
> 3-Operand, and 1-Operand.}
> <
> For all of the extended instructions, My 66000 has 3-bits to control the
> signs of the operands and access to long immediates, and access to
> 5-bit immediates in Src1. This supports things like 1<<k in a single instruction.
> <
> The second most important resource is the 3-operand space because
> there are only 8 available entries and we need FMAC (single and double),
> CMOV, and INSert.
> <
> The other spaces are so partially populated that one has a pretty free
> reign.

OK.

In my initial layout, I was starting from a 6-bit major space, with a
4-bit minor space for 3R ops, and an additional 6 bits for 2R ops.

Doing a Disp9 or Imm9 op would only have the 6-bit major opcode, which
doesn't really go all that far.

Meanwhile, in this top-level space, in BJX2 it was 3+4+1 (8) bits, with
3R ops adding 4-bits, and 2R ops adding 8 bits.

The new design was seriously choked in the top-level space, but could
have more space for 2R ops.

The result would be to mostly drop Imm9 and use Imm6 instead, but:
There is a lot more that doesn't fit into 6 bits that would have fit into 9;
It basically precludes being able to encode an arbitrary 32 or 33 bit
value in a 64-bit pair (since the Jumbo prefix also needs its chunk of
encoding space, and a 27-bit jumbo prefix is basically no-go).

For my existing ISA, I did go and make a tweak:
Some of the flag bits from SR are saved in the high-order bits of LR
during function calls (and restored on function return);
This means predication should now work across function calls, and also
resolves a few potential ISA semantics issues involving WEX (the WEX
Enable state and similar is also now preserved across function calls).
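
Conceptually something like the following, where the 48-bit address
assumption and the exact flag position are hypothetical rather than the
actual BJX2 layout; the point is just that the upper bits of a 64-bit LR are
spare when the address space is narrower:

#include <stdint.h>

#define LR_ADDR_MASK  0x0000FFFFFFFFFFFFull   /* assumed 48-bit addresses */
#define LR_FLAG_SHIFT 48                      /* assumed flag position    */

/* On call: fold the live SR flag bits into the spare high bits of LR.    */
static uint64_t lr_pack(uint64_t ret_addr, unsigned sr_flags)
{
    return (ret_addr & LR_ADDR_MASK) | ((uint64_t)sr_flags << LR_FLAG_SHIFT);
}

/* On return: split them back out and restore SR from the saved bits.     */
static uint64_t lr_addr (uint64_t lr) { return lr & LR_ADDR_MASK; }
static unsigned lr_flags(uint64_t lr) { return (unsigned)(lr >> LR_FLAG_SHIFT); }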

Though, it does result in LUT cost increasing by a few %, which isn't
ideal. Can't really tell how much of this is due to an actual/significant
cost increase vs random fluctuation.

WNS hasn't really changed either way, and the WNS value usually
indicates if one has poked at something serious. Likewise, the slow
paths still seem to be mostly stuff within the memory subsystem.

>>
>>
>> Yeah, a shortfall of several bits seems to make a pretty big difference...
>>
>>
>> It goes a little further if one does Load/Store and 3RI ops using
>> Disp6/Imm6 instead of Disp9/Imm9.
>>
>> Not enough bits to encode an Imm33/Disp33 in a 64-bit pair, and not
>> enough bits to encode Imm64 in 96-bits, ...
>>
>>
>> Yeah, "poor starting point" is starting to seem fairly evident...
>>>>
>>>>
>>>> Still not fully settled on instruction layouts yet, and don't feel
>>>> particularly inclined at the moment to pursue this, since the main way
>>>> to "actually take advantage of it" would like require use of modulo loop
>>>> scheduling or clever function inlining or similar (or, basically, one of
>>>> the same issues which Itanium had to deal with).
>>>>
>>>> Some possible debate is whether code would benefit from a move from 32
>>>> to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
>>>> (namely parallel edge walking over a bunch of parameters or similar), I
>>>> have doubts.
>>>>
>>>>
>>>> It is more likely to pay off for a wider core, but this would assume
>>>> having a compiler which is effective enough to use the additional width
>>>> (whereas, as-is, my compiler can't even really manage 3-wide effectively).
>>>>
>>> I have lived under the assumption that the wider cores have the HW resources
>>> to do many of these things for themselves, so that code written, compiled, and
>>> scheduled for the 1-wide cores run within spitting distance of the best compiled
>>> code one could target at the GBOoO core. I developed this assumption from the
>>> Mc 88120 effort where we even achieved 2.0 IPC running SPEC 89 XLISP ! and
>>> 5.99 IPC running MATRIX300.
> <
>> I am assuming a lack of any OoO or GBOoO capabilities, and instead a
>> strictly in-order bundle-at-a-time core more like the existing BJX2
>> pipeline, just possibly widened from 3 to 5 or similar.
> <
> Yes, you are targeting a particular chip to hold your design, while I am
> designing from the very small (1-wide In Order) to the moderately large
> (8-wide Out of Order)


Re: Branch prediction hints

<4d114eb0-d49c-478c-b49c-7270cb39a687n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17132&group=comp.arch#17132

 by: MitchAlsup - Mon, 24 May 2021 21:36 UTC

On Monday, May 24, 2021 at 3:41:28 PM UTC-5, BGB wrote:
> On 5/24/2021 12:52 PM, MitchAlsup wrote:
> > On Sunday, May 23, 2021 at 11:16:58 PM UTC-5, BGB wrote:
> >> On 5/23/2021 8:32 PM, MitchAlsup wrote:

> >>> Would I be out of line to state that this sounds like a poor starting point?
> >> Probably.
> >>
> >> It is more akin to designing for a 16-bit ISA, where it doesn't take
> >> much to eat through pretty much all of it.
> Clarification:
> I had meant, "it probably was a poor starting point" rather than "it was
> probably out of line"...
<
I knew that.......
> >>> <
> >>> My 66000 has 1/3rd of its Major OpCode space unallocated,
> >>> a bit less than 1/2 of its memory reference OpCode Space allocated,
> >>> a bit less than 1/2 of its 2-operand OpCode Space allocated,
> >>> a bit less than 1/128 of its 1-operand Op[Code Apace allocated,
> >>> and 1/4 of its 3-operand OpCode Space unallocated.
> >> Starts looking at it a little more, and realizing encoding space may be
> >> a more serious problem than I realized initially...
> >>
> >>
> >> I can't really map BJX2 to this new space, it just doesn't fit...
> >>
> >>
> >> Then again, maybe it might win more points with the "RISC means small
> >> ISA listing" crowd... Because one runs out of encoding bits before they
> >> can fit all that much into it...
> >>
> >>
> >> "Well, Imma define some Disp9 Load/Store Ops...",
> >> "Oh-Noes, that was 1/4 of the encoding space!",
> >> "How about some 3R Load/Store ops and 3R ALU ops and 2R space",
> >> "Now it at 1/2 of the opcode space!"
> > <
> > To be fair, I made a loot of these mistakes in Mc 88K, and corrected the
> > vast majority of them in My 66000.
> >>
> >> Then one has to struggle to fit some useful 3RI ALU ops, 2RI ops, and
> >> Branch ops, before realizing they are already basically out of encoding
> >> space...
> > <
> > The important thing to remember is that the most precious resource is
> > the Major OpCode space--and the reason is that this gives you access
> > to the other spaces.
> > <
> > In My 66000, the Major OpCode space consists of all 16-bit immediates
> > The branches with IP relative offsets, and the extension OpCodes, of
> > which there are 6 {Predication, Shifts, 2R+Disp memory refs, 2-Operand,
> > 3-Operand, and 1-Operand.}
> > <
> > For all of the extended instructions, My 66000 has 3-bits to control the
> > signs of the operands and access to long immediates, and access to
> > 5-bit immediates in Src1. This supports things like 1<<k in a single instruction.
> > <
> > The second most important resource is the 3-operand space because
> > there are only 8 available entries and we need FMAC (single and double),
> > CMOV, and INSert.
> > <
> > The other spaces are so partially populated that one has a pretty free
> > reign.
> OK.
>
> In my initial layout, was starting from a 6-bit major space, with a
> 4-bit minor space for 3R ops, and an additional 6 bits for 2R ops.
<
6-bit major: check.
I only got a 3-bit 3-operand OpCode field because I use 3 other bits for sign
control and access to immediates:
<
FMAC Rd, R1,±R2,±R3
So you can change the sign associated with multiplication or with adding
and get 4 flavors of MACing. This seems to work usefully well with the
bit field INSert instruction as bit-inversion rather than negation.
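
Assuming the usual Rd = R1*R2 + R3 operand roles (a sketch, not the actual
encoding), the four flavors work out to:

/* Sign bits on R2 and R3 give the four MAC flavors:
     (+,+)  d =  r1*r2 + r3
     (-,+)  d = -r1*r2 + r3
     (+,-)  d =  r1*r2 - r3
     (-,-)  d = -r1*r2 - r3   */
double mac(double r1, double r2, double r3, int neg_r2, int neg_r3)
{
    double p = r1 * (neg_r2 ? -r2 : r2);
    return p + (neg_r3 ? -r3 : r3);
}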
>
> Doing a Disp9 or Imm9 op would only have the 6-bit major opcode, which
> doesn't really go all that far.
<
I guess I am missing something, here, as I get Imm16 and DISP16<<2 for both
of these, and for unconditional branches (or CALL) I get DISP26<<2. Must have
something to do with packing or unpacking of WEX.....
>
>
> Meanwhile, in this top-level space, in BJX2 it was 3+4+1 (8) bits, with
> 3R ops adding 4-bits, and 2R ops adding 8 bits.
>
>
> The new design was seriously choked in the top-level space, but could
> have more space for 2R ops.
<
All of my 2-operand stuff went under 1 Major OpCode ( 001010 )
All of the 3-agen stuff went under 1 Major OpCode ( 001001 )
So I have 6 (of 64) Major OpCodes burned for everything not in the Major
OpCode group. One can tell if it has a chance of being an extension OpCode
(XOP) by looking at the top 2 bits (00), then the second top bit (xx0) is
for operand+immediate and (001) is for operand+operand.
>
>
> The result would be to mostly drop Imm9 and use Imm6 instead, but:
> There is a lot more that doesn't fit into 6 bits that would have fit into 9;
> It basically precludes being able to encode an arbitrary 32 or 33 bit
> value in a 64-bit pair (since the Jumbo prefix also needs its chunk of
> encoding space, and a 27-bit jumbo prefix is basically no-go).
>
>
>
> For my existing ISA, I did go and make a tweak:
> Some of the flag bits from SR are saved in the high-order bits of LR
> during function calls (and restored on function return);
> This means predication should now work across function calls, and also
> resolves a few potential ISA semantics issues involving WEX (the WEX
> Enable state and similar is also now preserved across function calls).
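
(A minimal C sketch of the save/restore idea described above, with made-up
bit positions and masks; it only illustrates packing a few SR flag bits into
the otherwise-unused high bits of LR on a call and merging them back on
return.)

    #include <stdint.h>

    #define LR_ADDR_MASK   0x0000FFFFFFFFFFFFull  /* assumed: low 48 bits hold the return address */
    #define SR_SAVED_MASK  0xFFFFull              /* assumed: the SR bits worth preserving        */

    /* On call: the return address and the saved SR flags share LR. */
    static uint64_t lr_on_call(uint64_t return_pc, uint64_t sr)
    {
        return (return_pc & LR_ADDR_MASK) | ((sr & SR_SAVED_MASK) << 48);
    }

    /* On return: split LR back into a PC and restore the saved flag bits. */
    static void lr_on_return(uint64_t lr, uint64_t *pc, uint64_t *sr)
    {
        *pc = lr & LR_ADDR_MASK;
        *sr = (*sr & ~SR_SAVED_MASK) | (lr >> 48);
    }
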
>
>
> Though, it does result in LUT cost increasing by a few %, which isn't
> ideal. Can't really tell how much of this is due to an actual/significant
> cost increase vs random fluctuation.
>
> WNS hasn't really changed either way, and the WNS value usually
> indicates if one has poked at something serious. Likewise, the slow
> paths still seem to be mostly stuff within the memory subsystem.
> >>
> >>
> >> Yeah, a shortfall of several bits seems to make a pretty big difference...
> >>
> >>
> >> It goes a little further if one does Load/Store and 3RI ops using
> >> Disp6/Imm6 instead of Disp9/Imm9.
> >>
> >> Not enough bits to encode an Imm33/Disp33 in a 64-bit pair, and not
> >> enough bits to encode Imm64 in 96-bits, ...
> >>
> >>
> >> Yeah, "poor starting point" is starting to seem fairly evident...
> >>>>
> >>>>
> >>>> Still not fully settled on instruction layouts yet, and don't feel
> >>>> particularly inclined at the moment to pursue this, since the main way
> >>>> to "actually take advantage of it" would like require use of modulo loop
> >>>> scheduling or clever function inlining or similar (or, basically, one of
> >>>> the same issues which Itanium had to deal with).
> >>>>
> >>>> Some possible debate is whether code would benefit from a move from 32
> >>>> to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
> >>>> (namely parallel edge walking over a bunch of parameters or similar), I
> >>>> have doubts.
> >>>>
> >>>>
> >>>> It is more likely to pay off for a wider core, but this would assume
> >>>> having a compiler which is effective enough to use the additional width
> >>>> (whereas, as-is, my compiler can't even really manage 3-wide effectively).
> >>>>
> >>> I have lived under the assumption that the wider cores have the HW resources
> >>> to do many of these things for themselves, so that code written, compiled, and
> >>> scheduled for the 1-wide cores runs within spitting distance of the best compiled
> >>> code one could target at the GBOoO core. I developed this assumption from the
> >>> Mc 88120 effort where we even achieved 2.0 IPC running SPEC 89 XLISP ! and
> >>> 5.99 IPC running MATRIX300.
> > <
> >> I am assuming a lack of any OoO or GBOoO capabilities, and instead a
> >> strictly in-order bundle-at-a-time core more like the existing BJX2
> >> pipeline, just possibly widened from 3 to 5 or similar.
> > <
> > Yes, you are targeting a particular chip to hold your design, while I am
> > designing from the very small (1-wide In Order) to the moderately large
> > (8-wide Out of Order)
> Granted.
>
>
> For me, significantly smaller FPGA's fall into the "not generally sold
> on FPGA dev boards on Amazon or similar" territory (1).
>
> And, bigger FPGA's in the "they are too expensive and I don't have money
> to afford them" territory.
>
> Similarly, custom ASICs are likely well outside anything I would be able
> to afford (and the ability to "actually do things" is a limiting factor).


Re: Branch prediction hints

<s8h69l$2tt$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=17133&group=comp.arch#17133

 by: Thomas Koenig - Mon, 24 May 2021 21:36 UTC

Marcus <m.delete@this.bitsnbites.eu> schrieb:

> BTW, apart from VVM, are there any good examples of ISA:s with loop
> instructions that are easy to predict ahead of time (thus effectively
> unrolling loops, eliminating compares/branches, and reducing branch
> predictor load)?

POWER might count with its separate count register (sorry for
the pun).

It has instructions like Decrement the CTR, then branch if the
decremented CTR equals or does not equal zero. It is also possible
to combine this with conditions, just to make sure the
branch predictors still have something to do :-)

Still, it is a pretty good match for Fortran's DO loops
or any kind of loop where you know the number of iterations
beforehand.
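
(For illustration, the loop-closing semantics written out in C; this is just
a sketch of what a decrement-CTR-and-branch sequence does for a counted
loop, not compiler output.)

    /* The trip count goes into CTR once, and the loop-ending branch
       ("bdnz"-style) decrements CTR and branches while it is nonzero,
       so there is no separate compare instruction inside the loop. */
    void scale(double *x, double s, unsigned long n)
    {
        unsigned long ctr = n;      /* mtctr n   */
        if (ctr == 0)
            return;
    loop:
        *x++ *= s;                  /* loop body */
        if (--ctr != 0)
            goto loop;              /* bdnz loop */
    }
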

Re: Branch prediction hints

<s8ha7s$r7k$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17135&group=comp.arch#17135

 by: Stephen Fuld - Mon, 24 May 2021 22:44 UTC

On 5/24/2021 1:16 PM, Marcus wrote:
> On 2021-05-23, MitchAlsup wrote:
>> On Sunday, May 23, 2021 at 12:15:04 PM UTC-5, robf...@gmail.com wrote:
>>> Speaking of the usefulness of branch hints for prediction I have to
>>> agree
>>> that they are not that useful. As a gag though I added the ability to
>>> supply
>>> branch predictor hints in ‘if’ statements that also allowed the branch
>>> predictor to be selected. How useful is it to be able to select the
>>> branch
>>> predictor to use (assuming multiple predictors are present)?
>>> The only case I can think of is maybe power savings.
>> <
>> I might note that Virtual Vector Method loops do not use the branch
>> predictor
>> but are executed in advance of the loop iteration to effectively
>> perform as if
>> the branch took zero cycles when the loop terminates (and zero cycles
>> when
>> the loop continues.)
>> <
>> This improves the prediction accuracy of the "rest of the branches".
>> <
>> PREDication also does not use the branch predictor getting the HW setup
>> to execute either then-clause or else-clause. This also improves the
>> prediction
>> accuracy of the "rest of the branches".
>>
>
> Those are nice properties, and some of it reminds me of DSP style
> "hardware assisted loops" (e.g. SPLOOP in TI 320C66x).
>
> MRISC32 style vector loops still have regular loop branch instructions,
> and although they execute less frequently (the core of the vector loop
> is "hardware assisted"), they still occupy slots in the branch
> predictor.
>
> BTW, apart from VVM, are there any good examples of ISA:s with loop
> instructions that are easy to predict ahead of time (thus effectively
> unrolling loops, eliminating compares/branches, and reducing branch
> predictor load)?

I think IBM S/360 had something like this. The Univac 1100 series has
an instruction called Jump Greater Than Zero and Decrement (JGD). It is
designed for the end of loops. It takes a register operand and compares
its value to zero. If greater than, it jumps to the address given in
the instruction (immediate or from a register), then decrements the
value in the register by 1. So the vast majority of the time, you can
correctly predict taken. If you want to, you could recognize that
the decremented value was zero and predict not taken the next time
the instruction is executed. I don't know if the hardware did that.
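
(A small C model of the JGD behaviour as described, assuming the register
is decremented on every execution; the helper and loop names are made up.)

    /* Jump if the register is greater than zero, then decrement it by one.
       Returns whether the branch would be taken. */
    static int jgd(long *reg)
    {
        int taken = (*reg > 0);
        *reg -= 1;
        return taken;
    }

    /* Loop-closing use: with reg started at n-1, the body runs n times (n >= 1),
       and the branch is taken every time except the last. */
    void demo(double *a, double *b, long n)
    {
        long reg = n - 1;
        long i = 0;
        do {
            a[i] += b[i];           /* loop body */
            i++;
        } while (jgd(&reg));
    }
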

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Branch prediction hints

<s8hf4q$mrr$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17137&group=comp.arch#17137

 by: BGB - Tue, 25 May 2021 00:07 UTC

On 5/24/2021 4:36 PM, MitchAlsup wrote:
> On Monday, May 24, 2021 at 3:41:28 PM UTC-5, BGB wrote:
>> On 5/24/2021 12:52 PM, MitchAlsup wrote:
>>> On Sunday, May 23, 2021 at 11:16:58 PM UTC-5, BGB wrote:
>>>> On 5/23/2021 8:32 PM, MitchAlsup wrote:
>
>>>>> Would I be out of line to state that this sounds like a poor starting point?
>>>> Probably.
>>>>
>>>> It is more akin to designing for a 16-bit ISA, where it doesn't take
>>>> much to eat through pretty much all of it.
>> Clarification:
>> I had meant, "it probably was a poor starting point" rather than "it was
>> probably out of line"...
> <
> I knew that.......
>>>>> <
>>>>> My 66000 has 1/3rd of its Major OpCode space unallocated,
>>>>> a bit less than 1/2 of its memory reference OpCode Space allocated,
>>>>> a bit less than 1/2 of its 2-operand OpCode Space allocated,
>>>>> a bit less than 1/128 of its 1-operand OpCode Space allocated,
>>>>> and 1/4 of its 3-operand OpCode Space unallocated.
>>>> Starts looking at it a little more, and realizing encoding space may be
>>>> a more serious problem than I realized initially...
>>>>
>>>>
>>>> I can't really map BJX2 to this new space, it just doesn't fit...
>>>>
>>>>
>>>> Then again, maybe it might win more points with the "RISC means small
>>>> ISA listing" crowd... Because one runs out of encoding bits before they
>>>> can fit all that much into it...
>>>>
>>>>
>>>> "Well, Imma define some Disp9 Load/Store Ops...",
>>>> "Oh-Noes, that was 1/4 of the encoding space!",
>>>> "How about some 3R Load/Store ops and 3R ALU ops and 2R space",
>>>> "Now it at 1/2 of the opcode space!"
>>> <
>>> To be fair, I made a lot of these mistakes in Mc 88K, and corrected the
>>> vast majority of them in My 66000.
>>>>
>>>> Then one has to struggle to fit some useful 3RI ALU ops, 2RI ops, and
>>>> Branch ops, before realizing they are already basically out of encoding
>>>> space...
>>> <
>>> The important thing to remember is that the most precious resource is
>>> the Major OpCode space--and the reason is that this gives you access
>>> to the other spaces.
>>> <
>>> In My 66000, the Major OpCode space consists of all 16-bit immediates
>>> The branches with IP relative offsets, and the extension OpCodes, of
>>> which there are 6 {Predication, Shifts, 2R+Disp memory refs, 2-Operand,
>>> 3-Operand, and 1-Operand.}
>>> <
>>> For all of the extended instructions, My 66000 has 3-bits to control the
>>> signs of the operands and access to long immediates, and access to
>>> 5-bit immediates in Src1. This supports things like 1<<k in a single instruction.
>>> <
>>> The second most important resource is the 3-operand space because
>>> there are only 8 available entries and we need FMAC (single and double),
>>> CMOV, and INSert.
>>> <
>>> The other spaces are so partially populated that one has a pretty free
>>> rein.
>> OK.
>>
>> In my initial layout, was starting from a 6-bit major space, with a
>> 4-bit minor space for 3R ops, and an additional 6 bits for 2R ops.
> <
> 6-bit major:: check
> I only got 3-bit 3-operand OpCode because I use 3 other bits for sign
> control and access to immediates::
> <
> FMAC Rd, R1,±R2,±R3
> So you can change the sign associated with multiplication or with adding
> and get 4 flavors of MACing. This seems to work usefully well with the
> bit field INSert instruction as bit-inversion rather than negation.
>>
>> Doing a Disp9 or Imm9 op would only have the 6-bit major opcode, which
>> doesn't really go all that far.
> <
> I guess I am missing something, here, as I get Imm16 and DISP16<<2 for both
> of these, and for unconditional branches (or CALL) I get DISP26<<2. Must have
> something to do with packing or unpacking of WEX.....

Making another layout attempt...
This one keeps Imm9 where appropriate.

Where, ppqq!=0100:

* ppqq-0000 00nn-nnnn ssss-sstt tttt-0000 MOV.B Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0001 LEA.B Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0010 MOV.W Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0011 LEA.W Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0100 MOV.L Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0101 LEA.L Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0110 MOV.Q Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-0111 LEA.Q Rn, (Rs, Rt)
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1000 MOV.B (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1001 MOVU.B (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1010 MOV.W (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1011 MOVU.W (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1100 MOV.L (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1101 MOVU.L (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1110 MOV.Q (Rs, Rt), Rn
* ppqq-0000 00nn-nnnn ssss-sstt tttt-1111 -

* ppqq-0001 00nn-nnnn ssss-sstt tttt-0000 ADD Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0001 SUB Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0010 MULS Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0011 MULU Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0100 -
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0101 AND Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0110 OR Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-0111 XOR Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1000 SHAD Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1001 SHLD Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1010 SHADQ Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1011 SHLDQ Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1100 ADC Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1101 SBB Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1110 DMULS Rs, Rt, Rn
* ppqq-0001 00nn-nnnn ssss-sstt tttt-1111 DMULU Rs, Rt, Rn

* ppqq-0010 00nn-nnnn ssss-ssii iiii-0000 SHAD Rs, Imm6u, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0001 SHAD Rs, Imm6n, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0010 SHADQ Rs, Imm6u, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0011 SHADQ Rs, Imm6n, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0100 SHLD Rs, Imm6u, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0101 SHLD Rs, Imm6n, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0110 SHLDQ Rs, Imm6u, Rn
* ppqq-0010 00nn-nnnn ssss-ssii iiii-0111 SHLDQ Rs, Imm6n, Rn

Then fill in FPU, ALUX, SIMD, ... operations.

....

* ppqq-0011 00nn-nnnn ssss-ssoo oooo-oooo 1R and 2R spaces.

....

This '00' block is basically where all the 3R and 2R ops go.

* ppqq-0000 01nn-nnnn ssss-ss0i iiii-iiii MOV.B Rn, (Rs, Disp9)
* ppqq-0000 01nn-nnnn ssss-ss1i iiii-iiii LEA.B Rn, (Rs, Disp9)
* ppqq-0001 01nn-nnnn ssss-ss0i iiii-iiii MOV.W Rn, (Rs, Disp9)
* ppqq-0001 01nn-nnnn ssss-ss1i iiii-iiii LEA.W Rn, (Rs, Disp9)
* ppqq-0010 01nn-nnnn ssss-ss0i iiii-iiii MOV.L Rn, (Rs, Disp9)
* ppqq-0010 01nn-nnnn ssss-ss1i iiii-iiii LEA.L Rn, (Rs, Disp9)
* ppqq-0011 01nn-nnnn ssss-ss0i iiii-iiii MOV.Q Rn, (Rs, Disp9)
* ppqq-0011 01nn-nnnn ssss-ss1i iiii-iiii LEA.Q Rn, (Rs, Disp9)
* ppqq-0100 01nn-nnnn ssss-ss0i iiii-iiii MOV.B (Rs, Disp9), Rn
* ppqq-0100 01nn-nnnn ssss-ss1i iiii-iiii MOVU.B (Rs, Disp9), Rn
* ppqq-0101 01nn-nnnn ssss-ss0i iiii-iiii MOV.W (Rs, Disp9), Rn
* ppqq-0101 01nn-nnnn ssss-ss1i iiii-iiii MOVU.W (Rs, Disp9), Rn
* ppqq-0110 01nn-nnnn ssss-ss0i iiii-iiii MOV.L (Rs, Disp9), Rn
* ppqq-0110 01nn-nnnn ssss-ss1i iiii-iiii MOVU.L (Rs, Disp9), Rn
* ppqq-0111 01nn-nnnn ssss-ss0i iiii-iiii MOV.Q (Rs, Disp9), Rn
* ppqq-0111 01nn-nnnn ssss-ss1i iiii-iiii -
....
* ppqq-1111 0100-iiii iiii-iiii iiii-iiii BRA Disp20
* ppqq-1111 0101-iiii iiii-iiii iiii-iiii BSR Disp20

....

* ppqq-0000 10nn-nnnn ssss-ss0i iiii-iiii ADD Rs, Imm9u, Rn
* ppqq-0000 10nn-nnnn ssss-ss1i iiii-iiii ADD Rs, Imm9n, Rn
* ppqq-0001 10nn-nnnn ssss-ss0i iiii-iiii MULS Rs, Imm9u, Rn
* ppqq-0001 10nn-nnnn ssss-ss1i iiii-iiii MULU Rs, Imm9n, Rn
* ppqq-0010 10nn-nnnn ssss-ss0i iiii-iiii ADDSL Rs, Imm9u, Rn
* ppqq-0010 10nn-nnnn ssss-ss1i iiii-iiii ADDSL Rs, Imm9n, Rn
* ppqq-0011 10nn-nnnn ssss-ss0i iiii-iiii ADDUL Rs, Imm9u, Rn
* ppqq-0011 10nn-nnnn ssss-ss1i iiii-iiii ADDUL Rs, Imm9n, Rn
* ppqq-0100 -
* ppqq-0101 10nn-nnnn ssss-ss0i iiii-iiii AND Rs, Imm9u, Rn
* ppqq-0110 10nn-nnnn ssss-ss0i iiii-iiii OR Rs, Imm9u, Rn
* ppqq-0111 10nn-nnnn ssss-ss0i iiii-iiii XOR Rs, Imm9u, Rn
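
(To make the bit layout above concrete, a hedged C helper that packs one
instruction word of the Disp9 load/store format from the listing; bits are
numbered here with the leftmost listed bit as bit 31, and the function name
and argument names are made up for the example.)

    #include <stdint.h>

    /* Packs:  ppqq-zzzz 01nn-nnnn ssss-ss?i iiii-iiii  */
    static uint32_t enc_disp9(unsigned ppqq, unsigned zzzz, unsigned sel,
                              unsigned rn, unsigned rs, unsigned disp9)
    {
        return ((uint32_t)(ppqq  & 0xF)   << 28)   /* ppqq predicate/bundle bits   */
             | ((uint32_t)(zzzz  & 0xF)   << 24)   /* selects the Disp9 op pair    */
             | (0x1u                      << 22)   /* the fixed '01' block bits    */
             | ((uint32_t)(rn    & 0x3F)  << 16)   /* Rn                           */
             | ((uint32_t)(rs    & 0x3F)  << 10)   /* Rs                           */
             | ((uint32_t)(sel   & 0x1)   <<  9)   /* picks e.g. MOV vs LEA / MOVU */
             | ((uint32_t)(disp9 & 0x1FF) <<  0);  /* 9-bit displacement           */
    }

    /* e.g. the "MOV.L (Rs, Disp9), Rn" row above would use zzzz = 6 (0110), sel = 0. */
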

