Rocksolid Light


devel / comp.arch / In-order vs Out-of-order

Subject  (Author)
* In-order vs Out-of-order  (Chingu)
+* Re: In-order vs Out-of-order  (MitchAlsup)
|+* Re: In-order vs Out-of-order  (Chingu)
||`- Re: In-order vs Out-of-order  (MitchAlsup)
|+- Re: In-order vs Out-of-order  (BGB)
|`* Re: In-order vs Out-of-order  (Stefan Monnier)
| +* Re: In-order vs Out-of-order  (BGB)
| |`* Re: In-order vs Out-of-order  (MitchAlsup)
| | +* Re: In-order vs Out-of-order  (Michael S)
| | |`- Re: In-order vs Out-of-order  (Anton Ertl)
| | `* Re: In-order vs Out-of-order  (BGB)
| |  `* Re: In-order vs Out-of-order  (MitchAlsup)
| |   `* Re: In-order vs Out-of-order  (BGB)
| |    `* Re: In-order vs Out-of-order  (MitchAlsup)
| |     +- Re: In-order vs Out-of-order  (robf...@gmail.com)
| |     +- Re: In-order vs Out-of-order  (BGB)
| |     `* Re: In-order vs Out-of-order  (Terje Mathisen)
| |      `* Re: In-order vs Out-of-order  (Bernd Linsel)
| |       `- Re: In-order vs Out-of-order  (BGB)
| `- Re: In-order vs Out-of-order  (MitchAlsup)
`* Re: In-order vs Out-of-order  (nedbrek)
 +- Re: In-order vs Out-of-order  (Marcus)
 `* Re: In-order vs Out-of-order  (Anton Ertl)
  +- Re: In-order vs Out-of-order  (MitchAlsup)
  `* Re: In-order vs Out-of-order  (Michael S)
   `* Re: In-order vs Out-of-order  (Ivan Godard)
    `* Re: In-order vs Out-of-order  (Michael S)
     `* Re: In-order vs Out-of-order  (Ivan Godard)
      `- Re: In-order vs Out-of-order  (Michael S)

In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17790&group=comp.arch#17790

Newsgroups: comp.arch
Date: Tue, 15 Jun 2021 03:10:40 -0700 (PDT)
Message-ID: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
Subject: In-order vs Out-of-order
From: ee19m...@smail.iitm.ac.in (Chingu)
 by: Chingu - Tue, 15 Jun 2021 10:10 UTC

Which type of core requires a higher frequency of operation, in-order or out-of-order? Is it good to have an in-order core and an out-of-order core share the same cache level (L2 or L3)?

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17796&group=comp.arch#17796

Newsgroups: comp.arch
Date: Tue, 15 Jun 2021 09:23:35 -0700 (PDT)
In-Reply-To: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
Message-ID: <ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 15 Jun 2021 16:23 UTC

On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> which type of core requires high frequency of operation ,in-order or out-of-order?Is it good to have in-order and out-of-order core to share the same cache level(L2 or L3)?
<
Neither has an inherent frequency advantage or an "effective" pipeline-length advantage.
<
Pure In Order is immune to Spectre-like attacks.
<
Out of Order can be made immune to Spectre-like attacks.
<
Out of Order enables (but does not require) more instructions "in flight" and can better utilize sparse resources, thereby finding more Instruction Level Parallelism.

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17803&group=comp.arch#17803

Newsgroups: comp.arch
Date: Tue, 15 Jun 2021 09:43:42 -0700 (PDT)
In-Reply-To: <ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
Message-ID: <844924c0-b2b3-4b59-b37a-a9e5d7d27ee6n@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: ee19m...@smail.iitm.ac.in (Chingu)
 by: Chingu - Tue, 15 Jun 2021 16:43 UTC

On Tuesday, June 15, 2021 at 9:53:37 PM UTC+5:30, MitchAlsup wrote:
> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> > which type of core requires high frequency of operation ,in-order or out-of-order?Is it good to have in-order and out-of-order core to share the same cache level(L2 or L3)?
> <
> Neither has an inherent frequency advantage or an "effective" length of pipeline advantage.
> <
> Pure In Order is immune to Spectré-like attacks.
> <
> Out of Order can be made immune to Spectré-like attacks.
> <
> Out of Order enables (not requires) more instructions "in flight" and can better utilize sparse resources thereby finding Instruction Level Parallelism.
Are there any issues when an in-order core and an out-of-order core share the same last-level cache, like improper cache partitioning between the cores?

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17808&group=comp.arch#17808

Newsgroups: comp.arch
Date: Tue, 15 Jun 2021 10:28:19 -0700 (PDT)
In-Reply-To: <844924c0-b2b3-4b59-b37a-a9e5d7d27ee6n@googlegroups.com>
Message-ID: <ed404eb9-140f-4ed9-9099-d46a47cc0322n@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 15 Jun 2021 17:28 UTC

On Tuesday, June 15, 2021 at 11:43:43 AM UTC-5, Chingu wrote:
> On Tuesday, June 15, 2021 at 9:53:37 PM UTC+5:30, MitchAlsup wrote:
> > On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> > > which type of core requires high frequency of operation ,in-order or out-of-order?Is it good to have in-order and out-of-order core to share the same cache level(L2 or L3)?
> > <
> > Neither has an inherent frequency advantage or an "effective" length of pipeline advantage.
> > <
> > Pure In Order is immune to Spectré-like attacks.
> > <
> > Out of Order can be made immune to Spectré-like attacks.
> > <
> > Out of Order enables (not requires) more instructions "in flight" and can better utilize sparse resources thereby finding Instruction Level Parallelism.
> Are there any issues when in-order and out-of-order cores share same last level cache?like improper cache partitioning between cores etc.
<
Not if the cache coherence protocol is correct.

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17813&group=comp.arch#17813

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Tue, 15 Jun 2021 13:38:08 -0500
Message-ID: <saas4t$e83$1@dont-email.me>
In-Reply-To: <ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
 by: BGB - Tue, 15 Jun 2021 18:38 UTC

On 6/15/2021 11:23 AM, MitchAlsup wrote:
> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
>> which type of core requires high frequency of operation ,in-order or out-of-order?Is it good to have in-order and out-of-order core to share the same cache level(L2 or L3)?
> <
> Neither has an inherent frequency advantage or an "effective" length of pipeline advantage.
> <
> Pure In Order is immune to Spectré-like attacks.
> <
> Out of Order can be made immune to Spectré-like attacks.
> <
> Out of Order enables (not requires) more instructions "in flight" and can better utilize sparse resources thereby finding Instruction Level Parallelism.
>

Yeah.

Some typical OoO features:
* Able to reorganize instructions around cache misses;
* Able to shuffle instructions around to increase effective ILP;
* ...

For an in-order core, any shuffling needs to be done by the compiler, which
may in turn need to account for things like instruction latency, ...

The premise is that for OoO one can avoid this (though, in my experience, one
can still see some benefit from reorganizing instructions to minimize
dependencies).

For in-order, one can potentially divide things further:
* Scalar / Single-Issue: Only one instruction at a time.
* Multi-Issue: More than one instruction at a time.
** Superscalar: The CPU figures out what it can execute in parallel.
** VLIW: The encoding makes it explicit what it can execute in parallel.

Scalar:
* Non-pipelined:
** Every instruction takes multiple clock cycles;
** Basically, all stages complete for every instruction;
** The processor does not move on until an instruction completes.
* Pipelined: May approach 1 instruction per clock.
** The instructions overlap in terms of execution;
** After each stage, the instruction moves to the next stage;
** Each unit does one piece of work, and immediately moves to the next.

Pretty much everything of note is pipelined, excluding some very old or
very small processors (almost all are "8-bit" architectures), with a few
edge cases for people trying to do RV32E cores while also minimizing LUT
cost and similar.

In theory, a pipelined CPU could run at a solid 1.0 IPC; however, a few
factors may limit this:
* Things like memory may stall on a cache miss, forcing the CPU to wait;
* Some instructions may trigger an "interlock"
** One instruction may need the result of a prior instruction;
** However, the result isn't ready yet at the time it is requested;
** So, one part of the pipeline stalls while the rest continues;
** This allows the depended-on instruction to complete first;
** During this time, the pipeline may execute "Bubbles" / NOPs.

If one did not have an interlock mechanism in the processor, the compiler
would need to insert NOPs anywhere an unavoidable dependency may occur.
This is undesirable.

Things like pipeline length can also be a tradeoff (longer pipeline
makes it easier to go faster, but also increases potential for hazards
and increases the relative cost of unpredictable branch instructions).

Many examples present things in terms of a 5-stage pipeline, but
admittedly I am running an 8-stage pipeline and am still only typically
running at 50 MHz on an Artix-7 FPGA.

For a Multi-Issue CPU, things get a little wider:
* Instruction Fetch: Returns enough data for the maximum number of
instructions;
* Decode: Runs multiple decoders in parallel;
* Register file: has enough read and write ports for all the
instructions which may execute;
* Execute: One has multiple lanes, each with their own ALU, ...;
* ...

For a superscalar, one needs to figure out whether or not the fetched
and decoded instructions can be executed in parallel. The difficulty,
however, is that one doesn't really have the context to know this until
after Instruction Decode has finished, which means one has to get funky
with the decode-and-fetch logic.

For example, fetch may initially advance as if only a single instruction
had executed, while the decoder keeps track of when superscalar execution
has happened (shifting everything over). This may end up being added to PC
after the fact.

The result of this is, say, for a maximum of 3 instructions, a
superscalar core using this approach may need to fetch enough data for 6
instructions, and then decode with an offset of 0-3 instructions for
every cycle.

Similarly, the decoded instructions may need to provide additional metadata:
* Which lanes they are allowed to execute in;
* Whether they depend on or modify status flags or other resources;
* ...

This metadata, along with checking register IDs, etc, would need to be
factored in towards the "can we do this as superscalar" decision.

Another option is VLIW, which can be further subdivided:
* Using a larger fixed-size bundle which is then subdivided into
multiple instructions;
* Using smaller instructions with a tagging scheme to combine them.

Large fixed-size bundles is conceptually simpler, but has a few obvious
drawbacks:
* You either get the maximum possible ILP, or else...;
** Failure to achieve max ILP means lots of wasted space;
** Any unused lanes need to be padded with NOPs.
* Not usually particularly flexible regarding instruction encodings.

The use of tagging makes things more flexible:
* The ISA can facilitate variable-length instructions and bundles;
* Can use less space when less ILP is available;
* ...

The tagging approach is more like superscalar in presentation, but
allows the bundle length to be determined earlier (such as during the
Instruction Fetch stage), meaning there is no ambiguity in terms of how
much can be executed. If we see a bundle, we execute a bundle.

Comparably, it also allows for a wider core to be done more cheaply than
a comparable superscalar.

However:
It does require the compiler to be aware of the specific properties of
the target machine, such as maximum bundle length, which instructions
are allowed in which lanes, ...

In some sense, this favors superscalar as a more "in general" solution.

With a tag-based VLIW, it may be possible for mismatched code for the
CPU core to fall back to executing it one instruction at a time
(provided the ISA makes the provision that all valid bundles are also
required to be valid if the instructions are executed sequentially).

This case also potentially allows a superscalar (or maybe even OoO)
implementation of the same ISA to ignore the VLIW tag bits and execute
the code as if it were a more conventional RISC style ISA.

Potentially, the compiler may need to be capable of some fancy
transformations to be able to utilize it effectively.

So, for 2 or 3 lanes, one may find that cases which can utilize it may
"fall out" of code without doing too much. However, trying to "saturate"
these lanes, or to make effective use of a wider core, would require a
fair bit more heavy lifting from the compiler.

For example, one could have a 5 or 6 lane core, but absent a
particularly smart compiler, these lanes would almost always end up
being "wasted".

Similarly, absent these sorts of optimizations, OoO is about the only
other way one could hope to use this potential ILP (where we basically
try to get the hardware to do a bunch of stuff which might have
otherwise been done by the compiler).

So, the "most bang for the buck" for an in-order machine seems to be for
a width of around 2 or 3 lanes.

There also tends to be a local optimum of around 32 registers, though
cases may still occur where 32 isn't enough. For small functions, there
might not be enough going on to justify more.

For example, something like "strcmp()" might fit entirely within 6 or so
scratch registers, and one may find they can do little to optimize it
significantly (within the realm of conventional optimizations).

If rewriting it in ASM, one can make it process 8 characters at a time.
It may be possible to get a little more ILP by processing 16 characters
at a time, but then one runs into another problem:
The 16-at-a-time version may actually end up being slower, the main
reason being that it only reaches peak efficiency for strings larger
than 32 or 48 characters, and this falls into "hardly ever happens"
territory (whereas, say, the 8 characters at a time version reaches its
peak at around 16 characters, which is a little more reasonable).

OoO gets a bit more complicated, but I am not going to say as much here.

Now as for caches, these are their own issue.

Generally, the L2 or L3 cache need not care what is happening in the CPU
core. The L1 may care, since it is generally tied in with the pipeline
machinery. However, some "more generic" mechanisms are likely to be used
for the larger cache levels, which are typically more concerned with
moving blocks of data around than what exactly happens with them.

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17819&group=comp.arch#17819

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Tue, 15 Jun 2021 21:17:43 -0400
Message-ID: <jwveed2pqbb.fsf-monnier+comp.arch@gnu.org>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
 by: Stefan Monnier - Wed, 16 Jun 2021 01:17 UTC

MitchAlsup [2021-06-15 09:23:35] wrote:
> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
>> which type of core requires high frequency of operation ,in-order or
>> out-of-order?Is it good to have in-order and out-of-order core to share
>> the same cache level(L2 or L3)?
> [...]
> Pure In Order is immune to Spectré-like attacks.

Really? I mean, maybe it happens to be true in practice for existing
in-order designs, but is it true in theory?

I mean, given a deep enough in-order pipeline, it seems it should be
possible to get the branch predictor to cause speculative execution of
a load followed by speculative execution of another load that depends on
the value of the first load before the mispredicted branch reaches the
retire stage. It does require a "large" number of stages between the
load execution and the "branch retire", though.

Or am I missing something?

Stefan

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17820&group=comp.arch#17820

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Tue, 15 Jun 2021 21:10:49 -0500
Message-ID: <sabmlo$jkt$1@dont-email.me>
In-Reply-To: <jwveed2pqbb.fsf-monnier+comp.arch@gnu.org>
 by: BGB - Wed, 16 Jun 2021 02:10 UTC

On 6/15/2021 8:17 PM, Stefan Monnier wrote:
> MitchAlsup [2021-06-15 09:23:35] wrote:
>> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
>>> which type of core requires high frequency of operation ,in-order or
>>> out-of-order?Is it good to have in-order and out-of-order core to share
>>> the same cache level(L2 or L3)?
>> [...]
>> Pure In Order is immune to Spectré-like attacks.
>
> Really? I mean, maybe it happens to be true in practice for existing
> in-order designs, but is it true in theory?
>
> I mean, given a deep enough in-order pipeline, it seems it should be
> possible to get the branch predictor to cause speculative execution of
> a load followed by speculative execution of another load that depends on
> the value of the first load before the mispredicted branch reaches the
> retire stage. It does require a "large" number of stages between the
> load execution and the "branch retire", tho.
>
> Or am I missing something?
>

This only seems like it could be true if the branch doesn't initiate
(and thus trigger a pipeline flush) until the WB stage or some other
late stage in a very long pipeline.

If one has a branch mechanism which, say:
Starts branch initiation in EX1;
Branch takes effect in EX2 (thus triggering a flush).

Then, the ability to perform a load during this time is effectively
impossible no matter how many stages there are in total.

And, you wouldn't necessarily want to wait until WB or similar to
initiate a branch (while also allowing things like memory access), since
then it would be much harder to avoid visible state changes due to the
branch misprediction.

To allow for the state of memory to be rolled back during a flush, one
would also effectively need to keep in-flight memory loads and stores in
a queue, ..., which is already starting to go down the OoO path.

Then again, there are apparently some large in-order VLIW machines, such
as the Elbrus, so I guess the question is whether something like this
could be subject to such an exploit.

Though, from what I can gather, they are still using an 8-stage
pipeline (with 2 extra stages for FPU ops).

It also appears they are using variable-length bundles, but accomplish
this via an explicit bundle-length field rather than via a daisy-chain
encoding.

....

>
> Stefan
>

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17822&group=comp.arch#17822

Newsgroups: comp.arch
Date: Wed, 16 Jun 2021 08:55:06 -0700 (PDT)
In-Reply-To: <jwveed2pqbb.fsf-monnier+comp.arch@gnu.org>
Message-ID: <4d964948-f1a6-4360-892a-4c7420b9533en@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Jun 2021 15:55 UTC

On Tuesday, June 15, 2021 at 8:17:45 PM UTC-5, Stefan Monnier wrote:
> MitchAlsup [2021-06-15 09:23:35] wrote:
> > On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> >> which type of core requires high frequency of operation ,in-order or
> >> out-of-order?Is it good to have in-order and out-of-order core to share
> >> the same cache level(L2 or L3)?
> > [...]
> > Pure In Order is immune to Spectré-like attacks.
> Really? I mean, maybe it happens to be true in practice for existing
> in-order designs, but is it true in theory?
<
Well, let's see:: the vast majority of in-order architectures do not have the
various predictors that Spectre and Meltdown exploit, leaving only the cache
footprint, and seldom does one architect an in-order machine to start a
subsequent memory ref before verifying via the TLB that the access is
supposed to transpire. So, while not inconceivable, the probabilities are
low and easily suppressed (assuming one has Spectre and Meltdown on one's
hit list).
>
> I mean, given a deep enough in-order pipeline, it seems it should be
> possible to get the branch predictor to cause speculative execution of
> a load followed by speculative execution of another load that depends on
> the value of the first load before the mispredicted branch reaches the
> retire stage. It does require a "large" number of stages between the
> load execution and the "branch retire", tho.
>
> Or am I missing something?
>
>
> Stefan

Re: In-order vs Out-of-order

https://www.novabbs.com/devel/article-flat.php?id=17823&group=comp.arch#17823

Newsgroups: comp.arch
Date: Wed, 16 Jun 2021 08:56:27 -0700 (PDT)
In-Reply-To: <sabmlo$jkt$1@dont-email.me>
Message-ID: <5d522ec2-1448-47a6-87df-6483f52a94afn@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Wed, 16 Jun 2021 15:56 UTC

On Tuesday, June 15, 2021 at 9:12:11 PM UTC-5, BGB wrote:
> On 6/15/2021 8:17 PM, Stefan Monnier wrote:
> > MitchAlsup [2021-06-15 09:23:35] wrote:
> >> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> >>> which type of core requires high frequency of operation ,in-order or
> >>> out-of-order?Is it good to have in-order and out-of-order core to share
> >>> the same cache level(L2 or L3)?
> >> [...]
> >> Pure In Order is immune to Spectré-like attacks.
> >
> > Really? I mean, maybe it happens to be true in practice for existing
> > in-order designs, but is it true in theory?
> >
> > I mean, given a deep enough in-order pipeline, it seems it should be
> > possible to get the branch predictor to cause speculative execution of
> > a load followed by speculative execution of another load that depends on
> > the value of the first load before the mispredicted branch reaches the
> > retire stage. It does require a "large" number of stages between the
> > load execution and the "branch retire", tho.
> >
> > Or am I missing something?
> >
> This only seems like it could be true if the branch doesn't initiate
> (and thus trigger a pipeline flush) until the WB stage or some other
> late-stage in a very long pipeline.
>
> If one has a branch mechanism which, say:
> Starts branch initiation in EX1;
> Branch takes effect in EX2 (thus triggering a flush).
>
> Then, the ability to perform a load during this time is effectively
> impossible no matter how many stages there are in total.
<
The only hole is whether the LD under the branch can cause a cache
miss which alters cache state after the branch got flushed.
>
>
> And, you wouldn't necessarily want to wait until WB or similar to
> initiate a branch (while also allowing things like memory access), since
> then it would be much harder to avoid visible state changes due to the
> branch misprediction.
>
> To allow for the state of memory to be rolled back during a flush, one
> would also effectively need to keep in-flight memory loads and stores in
> a queue, ..., which is already starting to go down the OoO path.
>
>
>
> Then again, there are apparently some large in-order VLIW machines, such
> as the Elbrus, so I guess the question is whether something like this
> could be subject to such an exploit.
>
> Though, what I can gather implies they are still using an 8-stage
> pipeline (with 2 extra stages for FPU ops).
>
> It also appears they are using variable-length bundles, but accomplish
> this via an explicit bundle-length field rather than via a daisy-chain
> encoding.
>
> ...
>
>
> >
> > Stefan
> >

Re: In-order vs Out-of-order

<905aa5cc-5391-4828-90b9-adb0317cb0b0n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17831&group=comp.arch#17831

 by: Michael S - Wed, 16 Jun 2021 21:05 UTC

On Wednesday, June 16, 2021 at 6:56:29 PM UTC+3, MitchAlsup wrote:
> On Tuesday, June 15, 2021 at 9:12:11 PM UTC-5, BGB wrote:
> > On 6/15/2021 8:17 PM, Stefan Monnier wrote:
> > > MitchAlsup [2021-06-15 09:23:35] wrote:
> > >> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> > >>> which type of core requires high frequency of operation ,in-order or
> > >>> out-of-order?Is it good to have in-order and out-of-order core to share
> > >>> the same cache level(L2 or L3)?
> > >> [...]
> > >> Pure In Order is immune to Spectré-like attacks.
> > >
> > > Really? I mean, maybe it happens to be true in practice for existing
> > > in-order designs, but is it true in theory?
> > >
> > > I mean, given a deep enough in-order pipeline, it seems it should be
> > > possible to get the branch predictor to cause speculative execution of
> > > a load followed by speculative execution of another load that depends on
> > > the value of the first load before the mispredicted branch reaches the
> > > retire stage. It does require a "large" number of stages between the
> > > load execution and the "branch retire", tho.
> > >
> > > Or am I missing something?
> > >
> > This only seems like it could be true if the branch doesn't initiate
> > (and thus trigger a pipeline flush) until the WB stage or some other
> > late-stage in a very long pipeline.
> >
> > If one has a branch mechanism which, say:
> > Starts branch initiation in EX1;
> > Branch takes effect in EX2 (thus triggering a flush).
> >
> > Then, the ability to perform a load during this time is effectively
> > impossible no matter how many stages there are in total.
> <
> The only hole is whether the LD under the branch can cause a cache
> miss which alters cache state after the branch got flushed.

And that's exactly what they call Spectre Variant 1.

> >
> >
> > And, you wouldn't necessarily want to wait until WB or similar to
> > initiate a branch (while also allowing things like memory access), since
> > then it would be much harder to avoid visible state changes due to the
> > branch misprediction.
> >
> > To allow for the state of memory to be rolled back during a flush, one
> > would also effectively need to keep in-flight memory loads and stores in
> > a queue, ..., which is already starting to go down the OoO path.
> >
> >
> >
> > Then again, there are apparently some large in-order VLIW machines, such
> > as the Elbrus, so I guess the question is whether something like this
> > could be subject to such an exploit.
> >
> > Though, what I can gather implies they are still using an 8-stage
> > pipeline (with 2 extra stages for FPU ops).
> >
> > It also appears they are using variable-length bundles, but accomplish
> > this via an explicit bundle-length field rather than via a daisy-chain
> > encoding.
> >
> > ...
> >
> >
> > >
> > > Stefan
> > >

Re: In-order vs Out-of-order

<ff861de4-197b-4e9b-b3f5-79ff7e627c0en@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17832&group=comp.arch#17832

 by: nedbrek - Wed, 16 Jun 2021 22:31 UTC

On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> which type of core requires high frequency of operation ,in-order or out-of-order?Is it good to have in-order and out-of-order core to share the same cache level(L2 or L3)?

One of the interesting results of our research was that out-of-order machines are easier to ramp to high frequency. This is because an in-order machine suffers more as L1D latency increases (you need to get the compiler to extract more and more ILP).

Re: In-order vs Out-of-order

<sae8kg$r7n$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17837&group=comp.arch#17837

 by: BGB - Thu, 17 Jun 2021 01:29 UTC

On 6/16/2021 10:56 AM, MitchAlsup wrote:
> On Tuesday, June 15, 2021 at 9:12:11 PM UTC-5, BGB wrote:
>> On 6/15/2021 8:17 PM, Stefan Monnier wrote:
>>> MitchAlsup [2021-06-15 09:23:35] wrote:
>>>> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
>>>>> which type of core requires high frequency of operation ,in-order or
>>>>> out-of-order?Is it good to have in-order and out-of-order core to share
>>>>> the same cache level(L2 or L3)?
>>>> [...]
>>>> Pure In Order is immune to Spectré-like attacks.
>>>
>>> Really? I mean, maybe it happens to be true in practice for existing
>>> in-order designs, but is it true in theory?
>>>
>>> I mean, given a deep enough in-order pipeline, it seems it should be
>>> possible to get the branch predictor to cause speculative execution of
>>> a load followed by speculative execution of another load that depends on
>>> the value of the first load before the mispredicted branch reaches the
>>> retire stage. It does require a "large" number of stages between the
>>> load execution and the "branch retire", tho.
>>>
>>> Or am I missing something?
>>>
>> This only seems like it could be true if the branch doesn't initiate
>> (and thus trigger a pipeline flush) until the WB stage or some other
>> late-stage in a very long pipeline.
>>
>> If one has a branch mechanism which, say:
>> Starts branch initiation in EX1;
>> Branch takes effect in EX2 (thus triggering a flush).
>>
>> Then, the ability to perform a load during this time is effectively
>> impossible no matter how many stages there are in total.
> <
> The only hole is whether the LD under the branch can cause a cache
> miss which alters cache state after the branch got flushed.

It is possible, if this is the case.

In my case, with the branch mechanism initiating in EX2, it should also
flush EX1 (turning the load into a NOP) before it has a chance to send
the request to the L1 cache.

Though, if the load request does make it to the L1, it stands a chance
of triggering a cache miss.

In other news, recently have gotten a few speed boosts in my project:
Dhrystone (2.1) score went from ~ 37000 to ~ 49000 by rewriting
strcmp/strlen/strncmp/strcat in ASM, and using a few new helper ops (*);
Got another boost up to ~ 53000 (~ 30.2 DMIPS) by adding a compiler
feature to skip creating stack frames for small leaf functions (where it
can fit everything into scratch registers).

This latter change also resulted in an ~9% reduction in the size of the
generated binaries (for both Doom and Quake). Not much visible impact on
Doom performance though. GLQuake does now seem to be a little faster
(went from ~ 2-5 fps to ~ 3-8; in-game, GLQuake now seems to be a little
faster than Software Quake).

*1: In particular, added a "Packed Byte Search" op which can be used to
(quickly) detect the presence of a NUL byte (not really sure if
something like this counts as "cheating" though).

At the moment, it seems to be getting similar speeds to a 486DX2-66
according to this metric. Still a mystery how the 486 did so well given
how much x86 sucks, but then again for my compiler, much of the
benchmark seems to turn into a giant mess of memory loads and stores
(once one shaves off the big chunk of time that was previously going
into "strcmp()" and similar).

>>
>>
>> And, you wouldn't necessarily want to wait until WB or similar to
>> initiate a branch (while also allowing things like memory access), since
>> then it would be much harder to avoid visible state changes due to the
>> branch misprediction.
>>
>> To allow for the state of memory to be rolled back during a flush, one
>> would also effectively need to keep in-flight memory loads and stores in
>> a queue, ..., which is already starting to go down the OoO path.
>>
>>
>>
>> Then again, there are apparently some large in-order VLIW machines, such
>> as the Elbrus, so I guess the question is whether something like this
>> could be subject to such an exploit.
>>
>> Though, what I can gather implies they are still using an 8-stage
>> pipeline (with 2 extra stages for FPU ops).
>>
>> It also appears they are using variable-length bundles, but accomplish
>> this via an explicit bundle-length field rather than via a daisy-chain
>> encoding.
>>
>> ...
>>
>>
>>>
>>> Stefan
>>>

Re: In-order vs Out-of-order

<2021Jun17.085414@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=17838&group=comp.arch#17838

 by: Anton Ertl - Thu, 17 Jun 2021 06:54 UTC

Michael S <already5chosen@yahoo.com> writes:
>On Wednesday, June 16, 2021 at 6:56:29 PM UTC+3, MitchAlsup wrote:
>> The only hole is whether the LD under the branch can cause a cache
>> miss which alters cache state after the branch got flushed.
>
>
>And that's exactly what they call Spectre Variant 1.

That's not any variant of Spectre, because it does not reveal any
speculatively accessed data.

Spectre consists of a speculative load of the secret followed by some
mechanism that reveals the secret through a side channel. The first
Spectre variants used cache side channels.
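To make that concrete, the bounds-check-bypass gadget published as Spectre
variant 1 looks roughly like this in C (the names `victim`, `array1`,
`array2` follow the published example; the timing measurement that actually
recovers the secret through the cache side channel is not shown):

```c
#include <stddef.h>
#include <stdint.h>

/* Spectre v1 gadget (after the published example). When the branch is
   mispredicted for an out-of-bounds x, the first load speculatively
   fetches a secret byte, and the second load pulls in a cache line whose
   index depends on that byte -- leaving a measurable cache footprint
   even after the misprediction is squashed. */
size_t array1_size = 16;
uint8_t array1[16];
uint8_t array2[256 * 512];

uint8_t victim(size_t x) {
    uint8_t t = 0;
    if (x < array1_size)              /* attacker trains this branch "taken" */
        t = array2[array1[x] * 512];  /* secret-dependent cache access */
    return t;
}
```

Architecturally the function is harmless; the leak is purely
microarchitectural, which is why it applies to any machine that can execute
both loads under a mispredicted branch, in-order or not.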

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: In-order vs Out-of-order

<02ec0039-1a23-4130-8059-b5d1358eac78n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17852&group=comp.arch#17852

 by: MitchAlsup - Thu, 17 Jun 2021 15:08 UTC

On Wednesday, June 16, 2021 at 8:30:59 PM UTC-5, BGB wrote:
> On 6/16/2021 10:56 AM, MitchAlsup wrote:
> > On Tuesday, June 15, 2021 at 9:12:11 PM UTC-5, BGB wrote:
> >> On 6/15/2021 8:17 PM, Stefan Monnier wrote:
> >>> MitchAlsup [2021-06-15 09:23:35] wrote:
> >>>> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
> >>>>> which type of core requires high frequency of operation ,in-order or
> >>>>> out-of-order?Is it good to have in-order and out-of-order core to share
> >>>>> the same cache level(L2 or L3)?
> >>>> [...]
> >>>> Pure In Order is immune to Spectré-like attacks.
> >>>
> >>> Really? I mean, maybe it happens to be true in practice for existing
> >>> in-order designs, but is it true in theory?
> >>>
> >>> I mean, given a deep enough in-order pipeline, it seems it should be
> >>> possible to get the branch predictor to cause speculative execution of
> >>> a load followed by speculative execution of another load that depends on
> >>> the value of the first load before the mispredicted branch reaches the
> >>> retire stage. It does require a "large" number of stages between the
> >>> load execution and the "branch retire", tho.
> >>>
> >>> Or am I missing something?
> >>>
> >> This only seems like it could be true if the branch doesn't initiate
> >> (and thus trigger a pipeline flush) until the WB stage or some other
> >> late-stage in a very long pipeline.
> >>
> >> If one has a branch mechanism which, say:
> >> Starts branch initiation in EX1;
> >> Branch takes effect in EX2 (thus triggering a flush).
> >>
> >> Then, the ability to perform a load during this time is effectively
> >> impossible no matter how many stages there are in total.
> > <
> > The only hole is whether the LD under the branch can cause a cache
> > miss which alters cache state after the branch got flushed.
> It is possible, if this is the case.
>
> In my case, with the branch mechanism initiating in EX2, it should also
> flush EX1 (turning the load into a NOP) before it has a chance to send
> the request to the L1 cache.
>
> Though, if the load request does make it to the L1, it stands a chance
> of triggering a cache miss.
>
>
>
> In other news, recently have gotten a few speed boosts in my project:
> Dhrystone (2.1) score went from ~ 37000 to ~ 49000 by rewriting
> strcmp/strlen/strncmp/strcat in ASM, and using a few new helper ops (*);
> Got another boost up to ~ 53000 (~ 30.2 DMIPS) by adding a compiler
> feature to skip creating stack frames for small leaf functions (where it
> can fit everything into scratch registers).
>
> This latter change also resulted in an ~9% reduction in the size of the
> generated binaries (for both Doom and Quake). Not much visible impact on
> Doom performance though. GLQuake does now seem to be a little faster
> (went from ~ 2-5 fps to ~ 3-8; in-game, GLQuake now seems to be a little
> faster than Software Quake).
>
>
> *1: In particular, added a "Packed Byte Search" op which can be used to
> (quickly) detect the presence of a NUL byte (not really sure if
> something like this counts as "cheating" though).
<
It is not cheating, it is borrowing. Mc 88110 had its CMP instruction set
bits for AnyByteZero and AnyHalfwordZero for identical purposes.
>
>
> At the moment, it seems to be getting similar speeds to a 486DX2-66
> according to this metric. Still a mystery how the 486 did so well given
> how much x86 sucks, but then again for my compiler, much of the
> benchmark seems to turn into a giant mess of memory loads and stores
> (once one shaves off the big chunk of time that was previously going
> into "strcmp()" and similar).
> >>
> >>
> >> And, you wouldn't necessarily want to wait until WB or similar to
> >> initiate a branch (while also allowing things like memory access), since
> >> then it would be much harder to avoid visible state changes due to the
> >> branch misprediction.
> >>
> >> To allow for the state of memory to be rolled back during a flush, one
> >> would also effectively need to keep in-flight memory loads and stores in
> >> a queue, ..., which is already starting to go down the OoO path.
> >>
> >>
> >>
> >> Then again, there are apparently some large in-order VLIW machines, such
> >> as the Elbrus, so I guess the question is whether something like this
> >> could be subject to such an exploit.
> >>
> >> Though, what I can gather implies they are still using an 8-stage
> >> pipeline (with 2 extra stages for FPU ops).
> >>
> >> It also appears they are using variable-length bundles, but accomplish
> >> this via an explicit bundle-length field rather than via a daisy-chain
> >> encoding.
> >>
> >> ...
> >>
> >>
> >>>
> >>> Stefan
> >>>

Re: In-order vs Out-of-order

<sag50n$o9u$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17881&group=comp.arch#17881

 by: BGB - Thu, 17 Jun 2021 18:40 UTC

On 6/17/2021 10:08 AM, MitchAlsup wrote:
> On Wednesday, June 16, 2021 at 8:30:59 PM UTC-5, BGB wrote:
>> On 6/16/2021 10:56 AM, MitchAlsup wrote:
>>> On Tuesday, June 15, 2021 at 9:12:11 PM UTC-5, BGB wrote:
>>>> On 6/15/2021 8:17 PM, Stefan Monnier wrote:
>>>>> MitchAlsup [2021-06-15 09:23:35] wrote:
>>>>>> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
>>>>>>> which type of core requires high frequency of operation ,in-order or
>>>>>>> out-of-order?Is it good to have in-order and out-of-order core to share
>>>>>>> the same cache level(L2 or L3)?
>>>>>> [...]
>>>>>> Pure In Order is immune to Spectré-like attacks.
>>>>>
>>>>> Really? I mean, maybe it happens to be true in practice for existing
>>>>> in-order designs, but is it true in theory?
>>>>>
>>>>> I mean, given a deep enough in-order pipeline, it seems it should be
>>>>> possible to get the branch predictor to cause speculative execution of
>>>>> a load followed by speculative execution of another load that depends on
>>>>> the value of the first load before the mispredicted branch reaches the
>>>>> retire stage. It does require a "large" number of stages between the
>>>>> load execution and the "branch retire", tho.
>>>>>
>>>>> Or am I missing something?
>>>>>
>>>> This only seems like it could be true if the branch doesn't initiate
>>>> (and thus trigger a pipeline flush) until the WB stage or some other
>>>> late-stage in a very long pipeline.
>>>>
>>>> If one has a branch mechanism which, say:
>>>> Starts branch initiation in EX1;
>>>> Branch takes effect in EX2 (thus triggering a flush).
>>>>
>>>> Then, the ability to perform a load during this time is effectively
>>>> impossible no matter how many stages there are in total.
>>> <
>>> The only hole is whether the LD under the branch can cause a cache
>>> miss which alters cache state after the branch got flushed.
>> It is possible, if this is the case.
>>
>> In my case, with the branch mechanism initiating in EX2, it should also
>> flush EX1 (turning the load into a NOP) before it has a chance to send
>> the request to the L1 cache.
>>
>> Though, if the load request does make it to the L1, it stands a chance
>> of triggering a cache miss.
>>
>>
>>
>> In other news, recently have gotten a few speed boosts in my project:
>> Dhrystone (2.1) score went from ~ 37000 to ~ 49000 by rewriting
>> strcmp/strlen/strncmp/strcat in ASM, and using a few new helper ops (*);
>> Got another boost up to ~ 53000 (~ 30.2 DMIPS) by adding a compiler
>> feature to skip creating stack frames for small leaf functions (where it
>> can fit everything into scratch registers).
>>
>> This latter change also resulted in an ~9% reduction in the size of the
>> generated binaries (for both Doom and Quake). Not much visible impact on
>> Doom performance though. GLQuake does now seem to be a little faster
>> (went from ~ 2-5 fps to ~ 3-8; in-game, GLQuake now seems to be a little
>> faster than Software Quake).
>>

Clarification:
Rather than demand-loading variables into registers and then spilling
them back to memory at the end of the basic block (with the top N
variables being statically assigned to registers), it uses a different
approach.

All the local variables/arguments/temporaries are statically assigned to
registers throughout the entire function, and no stack-frame is created
for them (and thus no stack spills will occur).

There are two sub-variants, differing in how constants and globals are handled:
Constants and globals may also be statically assigned to registers, loaded
at the start of the function, with any "dirty" globals written back when
the function returns (excluding 'volatile');
Or, a hybrid mode is used, with constants and globals still being
demand-loaded (and stored back / evicted at the end of the basic block).

The distinction between these is based on whether or not everything can
fit into registers using the direct approach. If there isn't sufficient
register space to fit everything in the second case, it falls back to
stack-frame creation and demand-loading as usual.

Similarly, it excludes leaf functions which access global variables
which are also referenced via a function pointer, since this is a
problem case for the C ABI (no way to make this work reliably absent
creating a stack frame).

This approach does seem to result in a notable performance improvement for
the functions it applies to, but appears to be limited by the size of the
register space (with 32 registers), with most of the potentially eligible
leaf functions still running out of registers.

It looks like if I extend the C compiler's register allocator to be able
to deal with the XGPR space, then it will be possible to handle a lot
more functions with this approach.

....

The limited gains with Doom performance seems to be mostly due to leaf
functions being relatively uncommon (and generally not in
performance-sensitive areas).

Meanwhile, Quake seems to have a significantly higher number of leaf
functions relative to Doom.

An intermediate option, where it creates a stack-frame but just uses it
to save callee preserved registers, is another possibility (this could
then prefer scratch registers and then use the callee-preserved
registers as an overflow space; falling back to the more traditional
approach if it exceeds the limit of 27 or 59 usable GPRs).

Or potentially a similar sort of static assignment strategy could also
be applied to a subset of non-leaf functions?...

Granted, I don't really expect all this is particularly novel.
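For illustration, a made-up example of the kind of small leaf function such
an optimization targets (the function itself is hypothetical, not from the
project): no calls and only a handful of locals, so everything can be
statically assigned to scratch registers and no stack frame is needed.

```c
#include <stdint.h>

/* Hypothetical leaf function: no calls, three parameters, one local.
   A compiler using the static-assignment scheme described above can keep
   every value in scratch registers and skip frame setup/teardown. */
uint32_t clamp_add(uint32_t a, uint32_t b, uint32_t limit) {
    uint32_t sum = a + b;
    if (sum > limit || sum < a)   /* 'sum < a' catches 32-bit wraparound */
        sum = limit;
    return sum;
}
```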

>>
>> *1: In particular, added a "Packed Byte Search" op which can be used to
>> (quickly) detect the presence of a NUL byte (not really sure if
>> something like this counts as "cheating" though).
> <
> It is not cheating, it is borrowing. Mc 88110 had its CMP instruction set
> bits for AnyByteZero and AnyHalfwordZero for identical purposes.

Yeah. This is at least slightly more generic:
* PSCHEQ.B Rm, Rt, Rn
Compare if bytes in Rm match Rt within a 64-bit QWord.
Sets Rn to the index of the first match and sets SR.T if a match is found.

* PSCHNE.B Rm, Rt, Rn
Compare if bytes in Rm differ from Rt.
Sets Rn to the index of the first mismatch and sets SR.T if a mismatch
is found.

These can be used to accelerate C string functions (detecting a NUL byte
or finding the first mismatch), but also potentially some other things
(eg: LZ77 compression).

Relatedly, there is also a Word variant:
* PSCHEQ.W Rm, Rt, Rn
* PSCHNE.W Rm, Rt, Rn

Which does similar, except using packed 16-bit words.

This is more useful for UCS2 / UTF16 strings, but was also intended for
things like dictionary objects (where, say, one looks up a value in an
array based on a 16-bit key index).
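On a machine without such an op, the byte search can be approximated in
plain C with the well-known SWAR zero-byte trick; a sketch (the function
name `pscheq_b` is mine, and it assumes little-endian byte numbering like
the ops above):

```c
#include <stdint.h>

/* Software analogue of PSCHEQ.B: return the index (0..7) of the first
   byte of 'word' equal to 'target', or 8 if none matches. Bytes above
   the first true match can yield spurious hit bits (borrow effects),
   but the lowest set hit byte is always a real match. */
int pscheq_b(uint64_t word, uint8_t target) {
    uint64_t t = target * 0x0101010101010101ull;  /* replicate target byte */
    uint64_t x = word ^ t;                        /* matching bytes -> 0x00 */
    uint64_t hit = (x - 0x0101010101010101ull) & ~x & 0x8080808080808080ull;
    int i = 0;
    if (!hit)
        return 8;                                 /* no match in this word */
    while (!(hit & 0xFF)) { hit >>= 8; i++; }     /* first match, LSB first */
    return i;
}
```

A strlen-style loop then just calls this with target 0 on each aligned QWord.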

But, it is possible that some might try to argue that strcmp is necessarily:
while(*s1 && (*s1==*t1))
{ s1++; t1++; }
if(*s1>*t1)return(1);
if(*s1<*t1)return(-1);
return(0);
...

And that, potentially, using instructions that allow running the loop 64
bits at a time and using ASM is cheating.

Though, potentially similarly for implementing memcpy and memset using
QWord or XWord/OWord ops rather than byte ops, say:
while(n--)*t++=*s++;
Or:
while(n--)*t++=c;
....

Functions like strcpy/strcat need to fall back to byte-copying at the
end though, since AFAIK they aren't allowed to stomp memory past the end
of the destination string (even if this would be a little faster).

Similarly, trying to splice the destination (via bit-masking) wouldn't
necessarily be much (if any) faster than falling back to byte-copying
for the final bytes.
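A sketch of the QWord-at-a-time idea for memset (illustrative only, not the
project's actual routine; note the byte-wise head for alignment and the
byte-wise tail):

```c
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time memset: byte stores up to 8-byte alignment, one 64-bit
   store per 8 bytes for the bulk, byte stores for the 0..7 byte tail.
   The aliased 64-bit store is technically non-portable C, but is typical
   of how such library routines are written. */
void *memset64(void *dst, int c, size_t n) {
    unsigned char *p = dst;
    uint64_t fill = (uint8_t)c * 0x0101010101010101ull;  /* replicate byte */

    while (n && ((uintptr_t)p & 7)) { *p++ = (uint8_t)c; n--; }
    for (; n >= 8; n -= 8, p += 8)
        *(uint64_t *)p = fill;
    while (n--) *p++ = (uint8_t)c;
    return dst;
}
```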

Similarly, I also replaced some amount of the "math.h" functions to be
"less terrible", mostly using things like unrolled Taylor-series
expansions and similar...

Namely, if one calculates the factorials, calls "pow()", does a
division, ..., for every loop iteration, this *sucks* in terms of
performance.
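A sketch of what "less terrible" can look like for sine: fixed
reciprocal-factorial coefficients evaluated in Horner form, so there are no
factorials, divisions, or pow() calls per term (argument reduction omitted,
so this is only accurate for small |x|):

```c
/* Taylor sine through the x^9 term, Horner form, precomputed 1/n!
   coefficients. Truncation error is roughly x^11/11!, i.e. a few
   parts in 1e8 at |x| = 1. */
double sin_taylor(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0
               + x2 * ( 1.0 / 120.0
               + x2 * (-1.0 / 5040.0
               + x2 * ( 1.0 / 362880.0)))));
}
```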

The malloc implementation effectively got replaced as well, with the C
library's "malloc()" effectively being a thin wrapper over a function
pointer, which is then redirected to the "tk_malloc()" implementation.

Then added a few extensions, like "_msize()" / "malloc_usable_size()"
and similar. The actual memory manager also supports a few extra
features (type tags and reference counts), but these are not terribly
relevant to normal C code.
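That arrangement can be sketched as follows (`tk_malloc` follows the post;
`my_malloc` stands in for the C library's wrapper, and the allocator body
here is just a placeholder):

```c
#include <stddef.h>
#include <stdlib.h>

/* malloc() as a thin wrapper over a function pointer: the runtime
   installs the real allocator (tk_malloc here) behind the pointer,
   and could later swap in a different implementation. */
typedef void *(*malloc_fn)(size_t);

void *tk_malloc(size_t n) {         /* placeholder for the real allocator */
    return calloc(1, n);
}

malloc_fn malloc_impl = tk_malloc;  /* the redirection point */

void *my_malloc(size_t n) {         /* the libc-facing entry */
    return malloc_impl(n);
}
```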


Re: In-order vs Out-of-order

<1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17892&group=comp.arch#17892

 by: MitchAlsup - Thu, 17 Jun 2021 23:27 UTC

On Thursday, June 17, 2021 at 1:41:29 PM UTC-5, BGB wrote:
> On 6/17/2021 10:08 AM, MitchAlsup wrote:

> >> *1: In particular, added a "Packed Byte Search" op which can be used to
> >> (quickly) detect the presence of a NUL byte (not really sure if
> >> something like this counts as "cheating" though).
> > <
> > It is not cheating, it is borrowing. Mc 88110 had its CMP instruction set
> > bits for AnyByteZero and AnyHalfwordZero for identical purposes.
<
> Yeah. This is at least slightly more generic:
<
> * PSCHEQ.B Rm, Rt, Rn
> Compare if bytes in Rm match Rt within a 64-bit QWord.
> Sets Rn to the index of the first match and sets SR.T if a match is found.
<
The Mc 88110 CMP instruction performed word, half, byte compares
along with the AnyByteZero and AnyHalfZero. All in one instruction.
>
> * PSCHNE.B Rm, Rt, Rn
> Compare if bytes in Rm differ from Rt.
> Sets Rn to the index of the first mismatch and sets SR.T if a mismatch
> is found.

>
> This is more useful for UCS2 / UTF16 strings, but was also intended for
> things like dictionary objects (where, say, one looks up a value in an
> array based on a 16-bit key index).
>
>
> But, it is possible that some might try to argue that strcmp is necessarily:
> while(*s1 && (*s1==*t1))
> { s1++; t1++; }
> if(*s1>*t1)return(1);
> if(*s1<*t1)return(-1);
> return(0);
<
int strcmp
{ while(*s1 && (*s1==*t1))
{ s1++; t1++; }
return s1-t1;
} <
It does not have to return +1 and -1; it can return positive or negative.
> ...

Re: In-order vs Out-of-order

<0addf547-ad10-4efd-88f8-82fe1c5dcb3an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17897&group=comp.arch#17897

Newsgroups: comp.arch
X-Received: by 2002:a37:8407:: with SMTP id g7mr7525274qkd.123.1623990420558;
Thu, 17 Jun 2021 21:27:00 -0700 (PDT)
X-Received: by 2002:a05:6808:24a:: with SMTP id m10mr481551oie.110.1623990420283;
Thu, 17 Jun 2021 21:27:00 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 17 Jun 2021 21:27:00 -0700 (PDT)
In-Reply-To: <1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:d400:2cd8:3c91:a0f4:5ee3;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:d400:2cd8:3c91:a0f4:5ee3
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com> <jwveed2pqbb.fsf-monnier+comp.arch@gnu.org>
<sabmlo$jkt$1@dont-email.me> <5d522ec2-1448-47a6-87df-6483f52a94afn@googlegroups.com>
<sae8kg$r7n$1@dont-email.me> <02ec0039-1a23-4130-8059-b5d1358eac78n@googlegroups.com>
<sag50n$o9u$1@dont-email.me> <1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0addf547-ad10-4efd-88f8-82fe1c5dcb3an@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Fri, 18 Jun 2021 04:27:00 +0000
Content-Type: text/plain; charset="UTF-8"
 by: robf...@gmail.com - Fri, 18 Jun 2021 04:27 UTC

On Thursday, June 17, 2021 at 7:27:05 PM UTC-4, MitchAlsup wrote:
> On Thursday, June 17, 2021 at 1:41:29 PM UTC-5, BGB wrote:
> > On 6/17/2021 10:08 AM, MitchAlsup wrote:
>
> > >> *1: In particular, added a "Packed Byte Search" op which can be used to
> > >> (quickly) detect the presence of a NUL byte (not really sure if
> > >> something like this counts as "cheating" though).
> > > <
> > > It is not cheating, it is borrowing. Mc 88110 had its CMP instruction set
> > > bits for AnyByteZero and AnyHalfwordZero for identical purposes.
> <
> > Yeah. This is at least slightly more generic:
> <
> > * PSCHEQ.B Rm, Rt, Rn
> > Compare if bytes in Rm match Rt within a 64-bit QWord.
> > Sets Rn to the index of the first match and sets SR.T if a match is found.
> <
> The Mc 88110 CMP instruction performed word, half, byte compares
> along with the AnyByteZero and AnyHalfZero. All in one instruction.
> >
> > * PSCHNE.B Rm, Rt, Rn
> > Compare if bytes in Rm differ from Rt.
> > Sets Rn to the index of the first mismatch and sets SR.T if a mismatch
> > is found.
>
> >
> > This is more useful for UCS2 / UTF16 strings, but was also intended for
> > things like dictionary objects (where, say, one looks up a value in an
> > array based on a 16-bit key index).
> >
> >
> > But, it is possible that some might try to argue that strcmp is necessarily:
> > while(*s1 && (*s1==*t1))
> > { s1++; t1++; }
> > if(*s1>*t1)return(1);
> > if(*s1<*t1)return(-1);
> > return(0);
> <
> int strcmp
> {
> while(*s1 && (*s1==*t1))
> { s1++; t1++; }
> return s1-t1;
> }
> <
> It does not have to return +1 and -1 it can return positive or negative.
> > ...

>Yeah. This is at least slightly more generic:
>* PSCHEQ.B Rm, Rt, Rn
>Compare if bytes in Rm match Rt within a 64-bit QWord.
>Sets Rn to the index of the first match and sets SR.T if a match is found.

The same thing is available in ANY1 as the BYTNDX, WYDNDX, and U21NDX instructions,
with UTF21 being used for Unicode strings. The instructions have both register and immediate forms.
BYTNDX can also be applied to vector registers to search all the elements for the first occurrence
of a byte.

Re: In-order vs Out-of-order

<sahajr$d97$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17898&group=comp.arch#17898

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Fri, 18 Jun 2021 00:21:47 -0500
Organization: A noiseless patient Spider
Lines: 98
Message-ID: <sahajr$d97$1@dont-email.me>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
<jwveed2pqbb.fsf-monnier+comp.arch@gnu.org> <sabmlo$jkt$1@dont-email.me>
<5d522ec2-1448-47a6-87df-6483f52a94afn@googlegroups.com>
<sae8kg$r7n$1@dont-email.me>
<02ec0039-1a23-4130-8059-b5d1358eac78n@googlegroups.com>
<sag50n$o9u$1@dont-email.me>
<1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Jun 2021 05:23:07 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="87462cafb506b47ba2cf53f49b461068";
logging-data="13607"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+ZA4kC2WH0X5dvWl/yiRBt"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:FVvo2PGn52QUw3EDI4rQOHBD8uQ=
In-Reply-To: <1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
Content-Language: en-US
 by: BGB - Fri, 18 Jun 2021 05:21 UTC

On 6/17/2021 6:27 PM, MitchAlsup wrote:
> On Thursday, June 17, 2021 at 1:41:29 PM UTC-5, BGB wrote:
>> On 6/17/2021 10:08 AM, MitchAlsup wrote:
>
>>>> *1: In particular, added a "Packed Byte Search" op which can be used to
>>>> (quickly) detect the presence of a NUL byte (not really sure if
>>>> something like this counts as "cheating" though).
>>> <
>>> It is not cheating, it is borrowing. Mc 88110 had its CMP instruction set
>>> bits for AnyByteZero and AnyHalfwordZero for identical purposes.
> <
>> Yeah. This is at least slightly more generic:
> <
>> * PSCHEQ.B Rm, Rt, Rn
>> Compare if bytes in Rm match Rt within a 64-bit QWord.
>> Sets Rn to the index of the first match and sets SR.T if a match is found.
> <
> The Mc 88110 CMP instruction performed word, half, byte compares
> along with the AnyByteZero and AnyHalfZero. All in one instruction.

OK. The instruction I have can do something similar, except that one
needs to load zero into a register, typically before entering the loop.

>>
>> * PSCHNE.B Rm, Rt, Rn
>> Compare if bytes in Rm differ from Rt.
>> Sets Rn to the index of the first mismatch and sets SR.T if a mismatch
>> is found.
>

This one is useful for strcmp in that it eliminates the need for a
byte-compare loop over the last several characters of the string;
comparing the location of the first mismatched character against the
location of the NUL terminator indicates whether the string ended
before it mismatched.

>>
>> This is more useful for UCS2 / UTF16 strings, but was also intended for
>> things like dictionary objects (where, say, one looks up a value in an
>> array based on a 16-bit key index).
>>
>>
>> But, it is possible that some might try to argue that strcmp is necessarily:
>> while(*s1 && (*s1==*t1))
>> { s1++; t1++; }
>> if(*s1>*t1)return(1);
>> if(*s1<*t1)return(-1);
>> return(0);
> <
> int strcmp
> {
> while(*s1 && (*s1==*t1))
> { s1++; t1++; }
> return s1-t1;
> }
> <
> It does not have to return +1 and -1 it can return positive or negative.

Good to know I guess, could maybe save a few cycles...

The existing logic does something more like:
CMPGT R6, R7
MOV?T 1, R2 | MOV?F -1, R2
But, eg:
SUB R6, R7, R2
Is a little cheaper...

Otherwise...

After a little more fiddling with the compiler (mostly with the
register allocation logic), I am now up to ~55000 in Dhrystone (and
binaries also shrank by a few more percent).

The relative amount of clock cycles spent on memory load/store ops has
also shrunk slightly, and some other operations (namely branches and ALU
ops) have become a bigger part of the total (to the point of starting to
become a little more relevant).

Those memory ops which remain, though, seem to have a higher rate of
cache misses, and also seem more prone to getting caught up in interlock
penalties.

Did go and add logic so now RTS will do the branch-prediction thing as
long as no other ops have modified LR within the pipeline. This is
nearly always true, but is harder to verify statically within the
compiler (more so with the new leaf function logic). For a sufficiently
small leaf function, it is possible to hit the final RTS instruction
before the original BSR instruction has cleared the WB stage (due mostly
to the branch prediction mechanism).

The new logic allows getting a speed advantage similar to the RTSU op
(namely, the RTS will be branch-predicted), but with the relative safety
of RTS (albeit at a slight cost increase vs not doing the pipeline check).

Re: In-order vs Out-of-order

<sahdjm$2f5$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=17899&group=comp.arch#17899

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!udcs8n8gSpKJzYN2P2E9tQ.user.gioia.aioe.org.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Fri, 18 Jun 2021 08:14:13 +0200
Organization: Aioe.org NNTP Server
Lines: 58
Message-ID: <sahdjm$2f5$1@gioia.aioe.org>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
<jwveed2pqbb.fsf-monnier+comp.arch@gnu.org> <sabmlo$jkt$1@dont-email.me>
<5d522ec2-1448-47a6-87df-6483f52a94afn@googlegroups.com>
<sae8kg$r7n$1@dont-email.me>
<02ec0039-1a23-4130-8059-b5d1358eac78n@googlegroups.com>
<sag50n$o9u$1@dont-email.me>
<1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
NNTP-Posting-Host: udcs8n8gSpKJzYN2P2E9tQ.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.7.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Fri, 18 Jun 2021 06:14 UTC

MitchAlsup wrote:
> On Thursday, June 17, 2021 at 1:41:29 PM UTC-5, BGB wrote:
>> On 6/17/2021 10:08 AM, MitchAlsup wrote:
>
>>>> *1: In particular, added a "Packed Byte Search" op which can be used to
>>>> (quickly) detect the presence of a NUL byte (not really sure if
>>>> something like this counts as "cheating" though).
>>> <
>>> It is not cheating, it is borrowing. Mc 88110 had its CMP instruction set
>>> bits for AnyByteZero and AnyHalfwordZero for identical purposes.
> <
>> Yeah. This is at least slightly more generic:
> <
>> * PSCHEQ.B Rm, Rt, Rn
>> Compare if bytes in Rm match Rt within a 64-bit QWord.
>> Sets Rn to the index of the first match and sets SR.T if a match is found.
> <
> The Mc 88110 CMP instruction performed word, half, byte compares
> along with the AnyByteZero and AnyHalfZero. All in one instruction.
>>
>> * PSCHNE.B Rm, Rt, Rn
>> Compare if bytes in Rm differ from Rt.
>> Sets Rn to the index of the first mismatch and sets SR.T if a mismatch
>> is found.
>
>>
>> This is more useful for UCS2 / UTF16 strings, but was also intended for
>> things like dictionary objects (where, say, one looks up a value in an
>> array based on a 16-bit key index).
>>
>>
>> But, it is possible that some might try to argue that strcmp is necessarily:
>> while(*s1 && (*s1==*t1))
>> { s1++; t1++; }
>> if(*s1>*t1)return(1);
>> if(*s1<*t1)return(-1);
>> return(0);
> <
> int strcmp
> {
> while(*s1 && (*s1==*t1))
> { s1++; t1++; }
> return s1-t1;
> }
> <
> It does not have to return +1 and -1 it can return positive or negative.

I have used that pattern; it fails when you can have a string which is
larger than half your int range, but is otherwise OK.

For a 64-bit int I'd be perfectly OK using it, but 32-bit is iffy.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: In-order vs Out-of-order

<sahi8q$kej$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17904&group=comp.arch#17904

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Fri, 18 Jun 2021 09:33:46 +0200
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <sahi8q$kej$1@dont-email.me>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ff861de4-197b-4e9b-b3f5-79ff7e627c0en@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Jun 2021 07:33:46 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8702032c9a04ee9776e62605e24a438a";
logging-data="20947"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19jEvryoPNgL+1jMsqZ3Q2gEBBcGpSE1ZY="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.8.1
Cancel-Lock: sha1:a4EC8+L5yd6syz3q5T+1IWiTMZE=
In-Reply-To: <ff861de4-197b-4e9b-b3f5-79ff7e627c0en@googlegroups.com>
Content-Language: en-US
 by: Marcus - Fri, 18 Jun 2021 07:33 UTC

On 2021-06-17, nedbrek wrote:
> On Tuesday, June 15, 2021 at 5:10:42 AM UTC-5, Chingu wrote:
>> which type of core requires a high frequency of operation, in-order or out-of-order? Is it good to have an in-order and an out-of-order core share the same cache level (L2 or L3)?
>
> One of the interesting results of our research was that out-of-order machines are easier to ramp to high frequency. This is because an in-order machine suffers more as L1D latency increases (you need to get the compiler to extract more and more ILP).
>

I've had this gut feeling for some time: In order to reach higher
frequencies you typically need to reduce the gate depth of each
pipeline stage, and hence introduce more pipeline stages. That
means that you increase the latency (in clock cycles) before a
result becomes ready to use (and this becomes especially noticeable
for data load operations, as well as for some long-pipeline floating-
point operations I'd suppose).

For an in-order machine this means that you get more pipeline stalls,
thus eating some of the benefits from the increased clock frequency,
but for an out-of-order machine it's possible to "hide" these stalls
by running other instructions in the mean time.

/Marcus

Re: In-order vs Out-of-order

<sahsc2$7bb$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=17912&group=comp.arch#17912

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!news.uzoreto.com!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a544-189e-0-752c-b4f3-795b-ea43.ipv6dyn.netcologne.de!not-for-mail
From: bl1-remo...@gmx.com (Bernd Linsel)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Fri, 18 Jun 2021 12:26:10 +0200
Organization: news.netcologne.de
Distribution: world
Message-ID: <sahsc2$7bb$1@newsreader4.netcologne.de>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
<jwveed2pqbb.fsf-monnier+comp.arch@gnu.org> <sabmlo$jkt$1@dont-email.me>
<5d522ec2-1448-47a6-87df-6483f52a94afn@googlegroups.com>
<sae8kg$r7n$1@dont-email.me>
<02ec0039-1a23-4130-8059-b5d1358eac78n@googlegroups.com>
<sag50n$o9u$1@dont-email.me>
<1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
<sahdjm$2f5$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Jun 2021 10:26:10 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a544-189e-0-752c-b4f3-795b-ea43.ipv6dyn.netcologne.de:2a0a:a544:189e:0:752c:b4f3:795b:ea43";
logging-data="7531"; mail-complaints-to="abuse@netcologne.de"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
In-Reply-To: <sahdjm$2f5$1@gioia.aioe.org>
 by: Bernd Linsel - Fri, 18 Jun 2021 10:26 UTC

On 18.06.2021 08:14, Terje Mathisen wrote:
> MitchAlsup wrote:
>>
>> int strcmp
>> {
>> while(*s1 && (*s1==*t1))
>> { s1++; t1++; }
>> return s1-t1;
>> }
>> <
>> It does not have to return +1 and -1 it can return positive or negative.
>
> I have used that pattern, it fails when you can have a string which is larger than half your int range, but is otherwise OK.
>
> For a 64-bit int I'd be prefectly OK using it, but 32-bit is iffy.

In fact, this pattern is simply wrong. It returns the difference between
the two string pointers, not between the first differing characters. The
failure you observed is simply a consequence of this programming error.

Irrespective of the result of the comparison, the return value is the
distance between the start address of the first and the start address of
the second string (a ptrdiff_t cast to int). The compiler is even
entitled to remove the complete while loop, since it does not
contribute to the returned result at all.

A correct version would be (C99):

int strcmp(char const* s1, char const* t1)
{ while (*s1 && (*s1 == *t1)) ++s1, ++t1;
return *s1 - *t1; // [sic] dereference pointers before subtraction
}

This still lacks verification for non-null arguments, but we can rely
on GPFs and the runtime environment. However, it'd be helpful to
declare strcmp with the GCC nonnull attribute, so the compiler will
prevent any calls with NULL pointers where it can deduce their values.

--
Regards,
Bernd

Re: In-order vs Out-of-order

<sai9vf$pch$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17918&group=comp.arch#17918

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Fri, 18 Jun 2021 09:17:03 -0500
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <sai9vf$pch$1@dont-email.me>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ef15af1d-e122-4fec-acea-46aa6af45671n@googlegroups.com>
<jwveed2pqbb.fsf-monnier+comp.arch@gnu.org> <sabmlo$jkt$1@dont-email.me>
<5d522ec2-1448-47a6-87df-6483f52a94afn@googlegroups.com>
<sae8kg$r7n$1@dont-email.me>
<02ec0039-1a23-4130-8059-b5d1358eac78n@googlegroups.com>
<sag50n$o9u$1@dont-email.me>
<1a6c3d8b-42d4-497a-9029-3a0f583c1fb0n@googlegroups.com>
<sahdjm$2f5$1@gioia.aioe.org> <sahsc2$7bb$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 18 Jun 2021 14:18:23 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="87462cafb506b47ba2cf53f49b461068";
logging-data="26001"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ue847zfsoBY/RDx3l7GZf"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
Cancel-Lock: sha1:c0jrnrRE7LNo+fE9VdTCegOTjPc=
In-Reply-To: <sahsc2$7bb$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Fri, 18 Jun 2021 14:17 UTC

On 6/18/2021 5:26 AM, Bernd Linsel wrote:
> On 18.06.2021 08:14, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>>
>>> int strcmp
>>> {
>>>       while(*s1 && (*s1==*t1))
>>>       { s1++; t1++; }
>>>       return s1-t1;
>>> }
>>> <
>>> It does not have to return +1 and -1 it can return positive or negative.
>>
>> I have used that pattern, it fails when you can have a string which is
>> larger than half your int range, but is otherwise OK.
>>
>> For a 64-bit int I'd be prefectly OK using it, but 32-bit is iffy.
>
> In fact, this pattern is simply wrong. It returns the difference between
> the two string pointers, not between the first differing characters. The
> failure you observed is only based on this programming error.
>
> Irrespective of the result of the comparison, the return value is the
> distance between the start address of the first and the start address of
> the second string (ptrdiff_t casted to int). The compiler is even
> entitled to remove the the complete while loop, since it does not
> contribute to the returned result at all.
>
> A correct version would be (C99):
>
> int strcmp(char const* s1, char const* t1)
> {
>     while (*s1 && (*s1 == *t1)) ++s1, ++t1;
>     return *s1 - *t1;    // [sic] dereference pointers before subtraction
> }
>

In my case, I assumed this was what was meant...

> This still lacks verification for non-null arguments, but we could rely
> on GPFs and runtime environment. However, it'd be helpful for the
> compiler to declare strcmp with GCC attribute nonnull, so it will
> prevent any calls with NULL pointers where it can deduce their values.
>

In my case, I am using "BGBCC" for this, which is my own compiler for my
own ISA, though it sadly kinda fails at generating good code (what it
generates is primarily scalar code, with little ability to avoid
interlocks).

Meanwhile, an in-order superscalar wouldn't do much better with this
code, as the compiler can manage the "easy part" (putting independent
ops in parallel), but not really the harder parts. It has some logic to
try to shuffle ops around, but this is fairly constrained in that many
operations are immovable, and it can't change the relative order of
memory operations.

The reordering is also limited by false dependencies between registers
(eg, if the same register is immediately reused), but this is another issue.

Some recent changes to the register allocation logic do greatly reduce
the number of register spills in leaf functions, so this is something...

....

Re: In-order vs Out-of-order

<2021Jun19.184356@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17943&group=comp.arch#17943

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: In-order vs Out-of-order
Date: Sat, 19 Jun 2021 16:43:56 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 21
Message-ID: <2021Jun19.184356@mips.complang.tuwien.ac.at>
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com> <ff861de4-197b-4e9b-b3f5-79ff7e627c0en@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="614c9c916565d861914e7c9f21286520";
logging-data="6362"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX191Xmh0UWUjDX8PuQDwt/i3"
Cancel-Lock: sha1:0Z6zDA/l5xQgJa3etrbqCAjcXlY=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sat, 19 Jun 2021 16:43 UTC

nedbrek <nedbrek@yahoo.com> writes:
>One of the interesting results of our research was that out-of-order machines are easier to ramp to high frequency. This is because an in-order machine suffers more as L1D latency increases (you need to get the compiler to extract more and more ILP).

The numbers I have seen posted here (in an IA-64 discussion) are that
one extra L1D cycle costs 10% on in-order and 5% on OoO. But I don't
see why this should make it hard to ramp up clock rates. Sure, you
don't see as much benefit from the higher clock rate because of that
effect, but you should still see a benefit, because ALU ops go faster
and because the compiler is able to fill some of the load latency with
useful work.

My impression is that in-order designs are harder to clock fast
because they have less localized control recurrences than OoO; and
last time I wrote that, the people in the know did not really
contradict me, but discussed some techniques to work around that, such
as having a backup of the pipeline state and rolling back to that.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: In-order vs Out-of-order

<4168ff16-858f-4878-aba8-5175cc161a8dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17946&group=comp.arch#17946

Newsgroups: comp.arch
X-Received: by 2002:a37:bc04:: with SMTP id m4mr15301468qkf.100.1624123732669;
Sat, 19 Jun 2021 10:28:52 -0700 (PDT)
X-Received: by 2002:a4a:e9b1:: with SMTP id t17mr14211223ood.0.1624123732456;
Sat, 19 Jun 2021 10:28:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 19 Jun 2021 10:28:52 -0700 (PDT)
In-Reply-To: <2021Jun19.184356@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2c5a:5a95:84f4:1003;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2c5a:5a95:84f4:1003
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ff861de4-197b-4e9b-b3f5-79ff7e627c0en@googlegroups.com> <2021Jun19.184356@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4168ff16-858f-4878-aba8-5175cc161a8dn@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Jun 2021 17:28:52 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 19 Jun 2021 17:28 UTC

On Saturday, June 19, 2021 at 11:57:56 AM UTC-5, Anton Ertl wrote:
> nedbrek <ned...@yahoo.com> writes:
> >One of the interesting results of our research was that out-of-order machines are easier to ramp to high frequency. This is because an in-order machine suffers more as L1D latency increases (you need to get the compiler to extract more and more ILP).
<
> The numbers I have seen posted here (in an IA-64 discussion) are that
> one extra L1D cycles costs 10% on in-order and 5% on OoO. But I don't
<
Back in the R2000 and Mc 88100 days it was measured around 13%.
So we are not in any real disagreement. But this was 2-cycle versus
3-cycle, both in order. It would have been significantly worse 2-cycle
versus 4-cycle.
<
Conversely, I don't see much degradation in GBOoO 3-cycle versus 4-cycle.
But we can both agree it is rather small, and that frequency gain can easily
outweigh the pipeline disadvantage.
<
> see why this should make it hard to ramp up clock rates. Sure, you
> don't see as much benefit from the higher clock rate because of that
> effect, but you should still see a benefit, because ALU ops go faster
> and because the compiler is able to fill some of the load latency with
> useful work.
<
Since in-order (IO) is so dependent on actual latency, many IO designs are
built around direct-mapped caches, so one can be aligning cache data
into register format even while the hit is being detected.
<
Conversely, OoO designs, being less sensitive in the first place, can
afford set associative L1 caches and take the hit rate advantage to
ameliorate any pipeline disadvantage--often getting more than break
even.
>
> My impression is that in-order designs are harder to clock fast
> because they have less localized control recurrences than OoO; and
> last time I wrote that, the people in the know did not really
> contradict me, but discussed some techniques to work around that, such
> as having a backup of the pipeline state and rolling back to that.
<
My impression is that it is time to jump over into OoO (reservation
stations and all that) about the time set associative L1 caches are
called for. With today's transistor counts, there is little to prevent this
jump.
<
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: In-order vs Out-of-order

<8dd6f635-56c3-4d21-94f6-0af3ce1dff18n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17950&group=comp.arch#17950

Newsgroups: comp.arch
X-Received: by 2002:a0c:e18c:: with SMTP id p12mr3458636qvl.54.1624129064913;
Sat, 19 Jun 2021 11:57:44 -0700 (PDT)
X-Received: by 2002:a4a:d781:: with SMTP id c1mr14371798oou.23.1624129064710;
Sat, 19 Jun 2021 11:57:44 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 19 Jun 2021 11:57:44 -0700 (PDT)
In-Reply-To: <2021Jun19.184356@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.182.191; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.182.191
References: <9d8fb369-e27f-4794-92b8-24686d874ae4n@googlegroups.com>
<ff861de4-197b-4e9b-b3f5-79ff7e627c0en@googlegroups.com> <2021Jun19.184356@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8dd6f635-56c3-4d21-94f6-0af3ce1dff18n@googlegroups.com>
Subject: Re: In-order vs Out-of-order
From: already5...@yahoo.com (Michael S)
Injection-Date: Sat, 19 Jun 2021 18:57:44 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3145
 by: Michael S - Sat, 19 Jun 2021 18:57 UTC

On Saturday, June 19, 2021 at 7:57:56 PM UTC+3, Anton Ertl wrote:
> nedbrek <ned...@yahoo.com> writes:
> >One of the interesting results of our research was that out-of-order machines are easier to ramp to high frequency. This is because an in-order machine suffers more as L1D latency increases (you need to get the compiler to extract more and more ILP).
> The numbers I have seen posted here (in an IA-64 discussion) are that
> one extra L1D cycles costs 10% on in-order and 5% on OoO. But I don't
> see why this should make it hard to ramp up clock rates. Sure, you
> don't see as much benefit from the higher clock rate because of that
> effect, but you should still see a benefit, because ALU ops go faster
> and because the compiler is able to fill some of the load latency with
> useful work.
>
> My impression is that in-order designs are harder to clock fast
> because they have less localized control recurrences than OoO; and
> last time I wrote that, the people in the know did not really
> contradict me, but discussed some techniques to work around that, such
> as having a backup of the pipeline state and rolling back to that.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

I never tried to think hard about it, but my uneducated impression always was that it applies to "normal" in-order, the kind that does interlocks in hardware.
I don't see why it should apply to in-order designs with an exposed pipeline, especially to those that also expose a banked register-file structure to software.
Of course, even designs like those have to have a way to deal with the variable latency of memory instructions... Maybe, if it's done by replay, the control could remain distributed?

Please don't think for a second that I advocate that sort of design. I'm just saying that it is possible.
