Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23471&group=comp.arch#23471

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 06:25:04 -0000 (UTC)
Organization: news.netcologne.de
Message-ID: <sufh00$24j$1@newsreader4.netcologne.de>

MitchAlsup <MitchAlsup@aol.com> wrote:
> We actually god faster code most
> of the time with -O1 and sometimes -O2 than -O3.

Nice typo :-)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23472&group=comp.arch#23472

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 06:54:13 -0000 (UTC)
Organization: news.netcologne.de
Message-ID: <sufiml$2jn$1@newsreader4.netcologne.de>

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> Will
> software developers finally get around to designing software that
> makes the maximum of the hardware, because inefficiency can no longer
> be papered over with faster hardware?

I have my doubts.

If one trusts a random statistic grabbed off the internet, there
are more than 25 million software developers at the moment (however
that is defined; I probably would not qualify). Many of them are
not very well qualified, and with the advent of trends like "low
code", this will only get worse. Firing "dinobabies", as IBM is
doing, will also reduce the average qualification of programmers.

Looking at the numerous ransomware attacks and zero-day exploits,
the sheer amount and complexity of software already overwhelm
our capability to write it. We are already in the middle of a
software crisis, and it will become worse when computers stop
getting (much) faster.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23473&group=comp.arch#23473

From: yogaman...@yahoo.com (Scott Smader)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Mon, 14 Feb 2022 23:19:51 -0800 (PST)
Message-ID: <70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>

On Monday, February 14, 2022 at 7:10:21 PM UTC-8, Stefan Monnier wrote:
> > Not sure if the assumption of OoO superiority is justified at current
> > transistor budget, but for the same transistor budget at the same clock
> > rate, statically scheduled should win because it more efficiently uses the
> > available transistors to do useful work.
> What makes you think so? AFAIK OoO cores are better at keeping their
> FUs busy, and they're probably also better at keeping the memory
> hierarchy busy. Maybe they do that at the cost of more transistors
> "wasted" doing "administrative overhead", but it's far from obvious that
> this overhead is the only thing that matters.

I would appreciate links about OoO's superior use of FUs.

A statically scheduled design really seems to require VLIW for the compilers to optimally utilize FUs, unrolling loops and vectorizing as convenient. As you note, there haven't been many popular VLIW machines, although Mill stands on the cusp. Comparing the relatively mature performance of iterated OoO designs against the state of existing statically scheduled machines doesn't seem to answer questions about performance in the limit when transistor budgets are frozen, which is how I understood your question.
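
As a concrete aside, here is a minimal C sketch of the loop unrolling meant here, with an invented saxpy-style function as the example: unrolling exposes independent operations within an iteration that a VLIW/static scheduler can pack into one wide bundle.

  /* 4x-unrolled loop; assumes n is a multiple of 4 for brevity */
  void saxpy4(float *y, const float *x, float a, int n)
  {
      for (int i = 0; i < n; i += 4) {
          /* the four statements are independent of one another, so a
             static scheduler can issue them to four FUs in one cycle */
          y[i]     += a * x[i];
          y[i + 1] += a * x[i + 1];
          y[i + 2] += a * x[i + 2];
          y[i + 3] += a * x[i + 3];
      }
  }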

I made a cheating assumption that a statically scheduled machine would include dynamic branch prediction hardware. I think that goes a long way toward equalizing cache/memory access times. I believe Mill even offers the compiler a way to initialize its DBP HW, and IIRC, HW can even update the executable image with a new hint value.

We've seen what OoO hath wrought, but we haven't yet seen a machine like Mill put through its paces. My impression is that Mill's innovations will dramatically advance the path to improved statically scheduled machines. It may not be fair to include all of the Mill's clever inventions into what I'm pointing at for static scheduling, but I am.

OoO definitely works better for programs which depend on conditions that can't be known (or accurately guessed) at compile time, provided the metric is speed. But OoO lags when "better" is measured by die area or power per instruction, or when the program's sequencing is predictable.

So should the power and die area be dedicated to speculation about what to do next? Or to getting more done at once? For a truly general purpose machine, a big.LITTLE approach probably makes the most sense, but for getting the most work done, I think static scheduling wins when datasets are huge, and I expect problem datasets to be really, really huge when transistor budgets stagnate.

<Insert pun about my speculation and asking if I'm out of order.>

Disclosure: I am so smitten by Mill that I have made an investment in Mill Computing. I hope that isn't swaying my analysis. If I'm wildly wrong, I'd really like to know. Please be the friend who breaks it to me.

>
>
> Stefan

Statically scheduled plus run ahead.

https://www.novabbs.com/devel/article-flat.php?id=23475&group=comp.arch#23475

From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Statically scheduled plus run ahead.
Date: Tue, 15 Feb 2022 08:41:11 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <sufov7$l3m$1@dont-email.me>

Scott Smader <yogaman101@yahoo.com> wrote:
> On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
>> BGB <cr8...@gmail.com> writes:
>>> Hitting this limit is not likely to bode well for the long term
>>> "superiority" of OoO,
>> Why? If OoO is superior now at the current transistor budget, why
>> would it stop being superior if the transistor budget does not
>> increase?
>
> Not sure if the assumption of OoO superiority is justified at current
> transistor budget, but for the same transistor budget at the same clock
> rate, statically scheduled should win because it more efficiently uses
> the available transistors to do useful work.
> Every time a program runs on an OoO machine, it wastes power to speculate
> about un-taken paths that can be avoided when source code compiles to a
> statically scheduled target instruction set.
> Once the speculation transistors are removed, a statically scheduled chip
> can replace them with additional FUs to get more done each clock cycle.

Memory is 150 cycles away and CPUs are 6 wide, so that is a
900-instruction advantage (150 cycles times 6 instructions per cycle)
for an OoO engine that guesses right with its branch predictions.

An in-order design needs to have a separate run-ahead engine predicting
branches and prefetching data, basically an OoO engine on the front. That
OoO engine is doing 90% of the work, making the back in-order engine
pointless, UNLESS that back engine is wider than the front engine. Note
that the front engine can throw away most work instructions, focusing on
branches and loads, so it does not need to be wide.

I would look at a 16-wide in-order back design with a 3-wide OoO front
engine.

When I say 3 wide I am partially lying, as it needs 8-wide fetch and
decode, and then throws away most of those instructions.

When I say in-order I am lying, as even primitive in-order designs do OoO
loads, and the OoO front engine is also prefetching loads.

Note that the front engine would mispredict backwards branches as not
taken, as it does not care about loops. Instead it would do a bulk preload
off the base pointers in the loop if those and the loop count are known.
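
As an illustrative aside, here is a minimal C sketch of the filtering idea described above; the instruction classification and all names are invented, and a real scout would also have to keep the address and condition computations that the surviving branches and loads depend on.

  enum kind { K_ALU, K_LOAD, K_STORE, K_BRANCH };

  struct insn { enum kind kind; /* ...decoded fields... */ };

  /* From a wide fetch group, keep only branches and loads for the
     narrow OoO front engine; the wide in-order back engine later
     executes the full stream. */
  int scout_filter(const struct insn *grp, int n, struct insn *out)
  {
      int m = 0;
      for (int i = 0; i < n; i++)
          if (grp[i].kind == K_BRANCH || grp[i].kind == K_LOAD)
              out[m++] = grp[i];
      return m;
  }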

There are so many gotchas in this approach that you are facing doom if you
don’t have Intel or Apple levels of cubic dollars invested.

Intel bought a company that claimed it could do something like this a
decade ago; people were expecting products that would crush AMD several
years ago. There is no sign of this on roadmaps, and I don’t know whether
Intel has shut that unit down yet. Instead Intel makes huge empty
promises of breakthroughs and ships tiny irrelevant tweaks to existing
designs.

Intel has been run by incompetents for decades, and it has not mattered;
that is the power of a monopoly.

It looks like Intel is instead betting on on-die memory, which requires
lots of fabs making huge dies. Apple will be first, and if AMD keeps its
act together they will be second, ahead of Intel.

My prediction is 8 cores with 8 gigabytes of RAM on a huge die, the
perfect gaming system. Yes, you actually need 16 gigabytes, but off-die
RAM will take care of the overflow. Once density hits 32 gigabytes,
off-die RAM dies.

>>> and may well lead to push-back towards
>>> simpler/cheaper cores.
>> If/When the increase in transistors/$ finally stops, it will be
>> interesting to see what happens then. Will we see more specialized
>> hardware, because now you can amortize the development (and mask
>> costs) over a longer time, even if you only serve a niche? Will
>> software developers finally get around to designing software that
>> makes the maximum of the hardware, because inefficiency can no longer
>> be papered over with faster hardware?
>>
>> I have my doubts. Network effects will neuter the advantages of
>> specialized hardware and of carefully designed software that is late
>> to the market.
>> - anton
>
> Statically scheduled isn't specialized. Optimizing transistor budgets in
> a general purpose chip is, to a great degree, separable from optimizing
> the applications that run on it, and it should be.
>
> Imo, that's what makes Mill so keen, but as you say, it will be interesting to see.
>
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23476&group=comp.arch#23476

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 00:46:32 -0800
Organization: A noiseless patient Spider
Message-ID: <sufp98$lbm$1@dont-email.me>

On 2/14/2022 7:38 PM, BGB wrote:
> On 2/14/2022 8:49 PM, Quadibloc wrote:
>> On Monday, February 14, 2022 at 5:22:45 PM UTC-7, BGB wrote:
>>> On 2/14/2022 4:17 PM, Anton Ertl wrote:
>>>> BGB <cr8...@gmail.com> writes:
>>>>> Hitting this limit is not likely to bode well for the long term
>>>>> "superiority" of OoO,
>>
>>>> Why? If OoO is superior now at the current transistor budget, why
>>>> would it stop being superior if the transistor budget does not
>>>> increase?
>>
>>> Advantages of OoO were mostly:
>>> "Ye Olde" scalar code goes keeps getting faster;
>>> Not requiring as much from the compiler;
>>> Effective at getting maximizing single thread performance;
>>
>>> It achieved this at the expense of spending considerable transistor
>>> budget and energy on the problem.
>>
>>> The demand for higher performance will continue past the point where
>>> continuously increasing transistor budget is no longer viable.
>>
>> You - and Mitch Alsup - are such *optimists*!
>>
>> Given that out-of-order execution requires a _lot_ of transistors
>> for the extra performance it provides, then it logically follows
>> that one could get more throughput by putting a large number
>> of in-order processors on the same die, does it not?
>>
>> If so, why did Intel change the Atom over from in-order to
>> out-of-order, and why did the Xeon Phi fail?
>>
>
> One issue with x86 (and x86-64) is that the design of the ISA kinda
> sucks for in-order cores as well.
>
> What I am imagining here is not x86.
>
>
>
> I am not aware of any (particularly mainstream) attempts at VLIW
> processors, at least much beyond the (ill fated) Itanium.
>

TI C64x. Qualcomm Hexagon.

Re: instruction set binding time, was Encoding 20 and 40 bit

https://www.novabbs.com/devel/article-flat.php?id=23477&group=comp.arch#23477

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Tue, 15 Feb 2022 00:52:07 -0800
Organization: A noiseless patient Spider
Message-ID: <sufpjo$oij$1@dont-email.me>

On 2/15/2022 12:22 AM, John Dallman wrote:
> In article <70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>,
> yogaman101@yahoo.com (Scott Smader) wrote:
>
>> Statically scheduled design really seems to require VLIW for the
>> compilers to optimally utilize FUs, unrolling loops and vectorizing
>> as convenient.
>
> There is - or was - a pretty widely used architecture based on VLIW and
> static scheduling. Unfortunately, it was Itanium, which was a massive
> failure performance-wise, and only had the limited commercial success
> that it managed due to vendor lock-in on HP-UX, Non-Stop and Open VMS.

Itanium was not a VLIW; it was an EPIC architecture. Granted, both are
wide-issue, and wide issue is commonly confused with VLIW.

The Mill is not a VLIW (or EPIC) either, although it is closer than the
Itanium.

You probably have a VLIW in your pocket: Qualcomm Hexagon.

Re: Statically scheduled plus run ahead.

https://www.novabbs.com/devel/article-flat.php?id=23478&group=comp.arch#23478

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Statically scheduled plus run ahead.
Date: Tue, 15 Feb 2022 01:04:35 -0800
Organization: A noiseless patient Spider
Message-ID: <sufqb5$t1l$1@dont-email.me>

On 2/15/2022 12:41 AM, Brett wrote:
> Scott Smader <yogaman101@yahoo.com> wrote:
>> On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> Hitting this limit is not likely to bode well for the long term
>>>> "superiority" of OoO,
>>> Why? If OoO is superior now at the current transistor budget, why
>>> would it stop being superior if the transistor budget does not
>>> increase?
>>
>> Not sure if the assumption of OoO superiority is justified at current
>> transistor budget, but for the same transistor budget at the same clock
>> rate, statically scheduled should win because it more efficiently uses
>> the available transistors to do useful work.
>> Every time a program runs on an OoO machine, it wastes power to speculate
>> about un-taken paths that can be avoided when source code compiles to a
>> statically scheduled target instruction set.
>> Once the speculation transistors are removed, a statically scheduled chip
>> can replace them with additional FUs to get more done each clock cycle.
>
> Memory is 150 cycles away and CPU’s are 6 wide, so that is a 900
> instruction advantage for an OoO engine that guesses right with its branch
> predictions.

Are you assuming that in-order precludes branch prediction? That would
markedly beg the question.

> An in-order design needs to have a separate run ahead engine predicting
> branches and prefetching data, basically an OoO engine on front. That OoO
> engine is doing 90% of the work making the back in-order engine pointless,
> UNLESS that back engine is wider than the front engine. Note that the front
> engine can throw away most work instructions focusing on branches and
> loads, so it does not need to be wide.

You are right that run-ahead prediction using run-ahead execution needs
OOO to be effective - very early Mill had a scout processor (the
"precore") for that; it didn't work, and we abandoned it.

Mill has a patent that does runahead branch prediction without
execution; ours is not the only possible approach. Any of these (RA
speculation without RA execution) can run rings around using an OOO as a
scout instead of branch prediction.

> I would look at a 16 wide in-order back design with a 3 wide OoO front
> engine.
>
> When I say 3 wide I am partially lying as it needs 8 wide fetch and decode,
> and then throws away most of those instructions.
>
> When I say in-order I am lying as even primitive in-order designs do OoO
> loads, the and OoO front engine is also prefetching loads.
>
> Note that the front engine would mispredict backwards branches as not
> taken, as it does not care about loops. Instead it would do a bulk preload
> off the base pointers in the loop if those and the loop count are known.
>
> There are so many gotcha’s in this approach that you are facing doom if you
> don’t have Intel or Apple levels of cubic dollars invested.
>
> Intel bought a company that claimed it could do something like this, a
> decade ago, people were expecting products that would crush AMD several
> years ago. No sign of this on roadmaps and don’t know if Intel has shut
> this unit down yet. Instead Intel makes huge empty promises of
> breakthroughs and ships tiny irrelevant tweaks to existing designs.

It is not unknown for market dominators to buy potential market
disruptors and quietly kill them to let the golden goose keep its margins.

> Intel has been run by incompetents for decades, and it has not mattered,
> that is the power of a monopoly.
>
> Looks like Intel is betting on on-die memory requiring lots of fabs making
> huge dies instead. Apple will be first, and if AMD keeps it’s act together
> they will be second ahead of Intel.
>
> My prediction is 8 cores with 8 gigabytes of ram on a huge die, the perfect
> gaming system. Yes you actually need 16 gigabytes, but off die ram will
> take care of the overflow. Once density hits 32 gigabytes off die ram dies.
>
>>>> and may well lead to push-back towards
>>>> simpler/cheaper cores.
>>> If/When the increase in transistors/$ finally stops, it will be
>>> interesting to see what happens then. Will we see more specialized
>>> hardware, because now you can amortize the development (and mask
>>> costs) over a longer time, even if you only serve a niche? Will
>>> software developers finally get around to designing software that
>>> makes the maximum of the hardware, because inefficiency can no longer
>>> be papered over with faster hardware?
>>>
>>> I have my doubts. Network effects will neuter the advantages of
>>> specialized hardware and of carefully designed software that is late
>>> to the market.
>>> - anton
>>
>> Statically scheduled isn't specialized. Optimizing transistor budgets in
>> a general purpose chip is, to a great degree, separable from optimizing
>> the applications that run on it, and it should be.
>>
>> Imo, that's what makes Mill so keen, but as you say, it will be interesting to see.
>>
>>> --
>>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
>>
>
>
>

Re: Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23479&group=comp.arch#23479

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 09:36:04 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2022Feb15.103604@mips.complang.tuwien.ac.at>

Ivan Godard <ivan@millcomputing.com> writes:
>On 2/13/2022 9:41 AM, Anton Ertl wrote:
>> Ivan Godard <ivan@millcomputing.com> writes:
>>> In Mill the translation happens once, at install time, via a transparent
>>> invocation of the specializer (not a recompile, and the source code is
>>> not needed). In a microcoded machine (or a packing/cracking scheme) the
>>> translation takes place at execution time, every time. Either way, it
>>> gets done, and either way permits ISA extension freely. They differ in
>>> power, area, and time; the choice is an engineering design dimension.
>>>
>>> I assert without proof that Mill-style software translation in fact
>>> permits ISA extension that is *more* flexible and powerful than what can
>>> be achieved by hardware approaches. YMMV.
>>
>> My mileage is that we have RAID 1 on two SSDs with a Debian
>> installation on a machine with a Ryzen 5800X. When we got a machine
>> with a Xeon-W 1370P to work, we took one of the SSDs, put it in the
>> Xeon box, and the system worked. We also put empty SSDs in both
>> machines and told the system to use the new SSD for the RAID 1.
>>
>> The way you describe the Mill way, we could not do that with two
>> different Mill models, even from the same company.
>>
>> - anton
>
>With the caution that Mill is a core ISA, not a board architecture, you
>should be able to do as you describe on Mill cores.

Not sure what you mean by "core ISA" and "board architecture"; there
certainly is a software part to making this work, with Linux detecting
the available I/O hardware and abstracting away most of the
differences. My Windows experience is that switching boards on
Windows is not as smooth.

>If you try to run a LM on a
>core that does not have a conAsm version in the cache then the system
>invokes the current specializer and builds, and caches, a new conAsm
>that will run on the new host.

That's good and should be able to deal with the situation I describe.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23481&group=comp.arch#23481

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 03:54:50 -0600
Organization: A noiseless patient Spider
Message-ID: <suft9b$e57$1@dont-email.me>

On 2/14/2022 8:23 PM, MitchAlsup wrote:
> On Monday, February 14, 2022 at 6:22:45 PM UTC-6, BGB wrote:
>> On 2/14/2022 4:17 PM, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> Hitting this limit is not likely to bode well for the long term
>>>> "superiority" of OoO,
>>>
>>> Why? If OoO is superior now at the current transistor budget, why
>>> would it stop being superior if the transistor budget does not
>>> increase?
>>>
>> Advantages of OoO were mostly:
>> "Ye Olde" scalar code goes keeps getting faster;
>> Not requiring as much from the compiler;
> <
> A good OoO implementation wants the compiler to do LESS--encode
> the subroutine with as few instructions as possible, don't bother to
> schedule them, ... in the early 1990s Mc88120 was underperforming
> on code optimized for Mc88100, where the code was heavily scheduled
> loops software pipelined, instructions put in "inconsiderate" places
> for the GBOoO machine we had. We actually god faster code most
> of the time with -O1 and sometimes -O2 than -O3.
> <

I had a different experience, at least with Bulldozer:
Writing ASM code like I was working with an in-order superscalar
generally performed better than more free-form code.

Though, a lot of stuff changed around when I got a Ryzen, and some of my
previous "extra fast" code saw its performance advantages suddenly
evaporate.

>> Effective at getting maximizing single thread performance;
>> ...
>>
>> It achieved this at the expense of spending considerable transistor
>> budget and energy on the problem.
>>
>> The demand for higher performance will continue past the point where
>> continuously increasing transistor budget is no longer viable.
>>
>>
>> However, the push-back may favor designs which can give more performance
>> relative to transistor budget and energy use. Namely, such as VLIW cores
>> with dynamic translation from some higher-level IR (could be a
>> stack-based IR, but also likely would be an existing ISA being
>> repurposed as an IR).
>>
>>
>>
>> Ideally, one needs an ISA that "doesn't suck" as an IR, where I suspect
>> both x86-64 and ARM64 making use of condition-codes does not exactly
>> work in their favor in this use-case.
> <
> x86 basically has 3 condition codes C, O, and ZAPS and HW trance each
> independently.

It is not ideal for an IR since the condition codes require state that
needs to be updated and maintained, and probably 99% of the time is
effectively irrelevant.

However, one heuristic I did add once when I had written an x86
emulator was that, if the currently decoded instruction was followed by
an instruction which would stomp the flags, then the current instruction
would skip updating the flags.
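
(A minimal C sketch of that lookahead heuristic; the structure and field names are invented for illustration.)

  #include <stdint.h>

  struct decoded {
      uint32_t flags_written;   /* flag bits this insn would set */
      uint32_t flags_read;      /* flag bits this insn consumes  */
      int      skip_flags;      /* set by the heuristic          */
  };

  /* Skip computing flags when the next instruction overwrites every
     flag this one would set, without reading any of them first. */
  static void lazy_flags(struct decoded *cur, const struct decoded *next)
  {
      uint32_t live = cur->flags_written &
                      (next->flags_read | ~next->flags_written);
      cur->skip_flags = (live == 0);
  }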

>>
>> While RISC-V doesn't use condition codes (good in this sense), it also
>> kinda sucks in many other areas.
> <
> Academic quality.
>>
>> I have yet to figure out what the ideal IR would look like exactly (or
>> even if it would necessarily be singular).
> <
> LLVM intermediate is actually quite good.

One drawback is that LLVM bitcode is a relatively opaque format, but at
least the ASCII format mostly makes sense, I guess.

Though, pros/cons, LLVM IR would require a semi-complex backend to make
use of effectively: it would need to be parsed into an internal format,
use a register allocator, ...

So, the "barrier to entry" would likely be higher than it would be, say,
for an IR format with a design more like that of JVM Bytecode.

There is a similar issue with BGBCC's RPNIL format, where its use of
symbolic references and implicit operation types precludes a "simple"
backend.

Say, one example idea for an IR could look like:
File is read in as an image;
Header has a magic (FOURCC) and a pointer to an index table;
Index table holds file offsets for various other tables of interest;
...

Then, say, one has a stack bytecode with operations like:
ADDI (Add top two stack elements, push result, Type=Int32)
ADDL (Add top two stack elements, push result, Type=Int64)
SUBI/SUBL
...
LOADI (Load Int32 local variable, push to stack)
STOREI (Pop Int32 value and store to local variable)
...
LABEL (End of basic block / Branch Target Location)
EOFUNC (End-Of-Function)
RETI (Return Int32, popped from stack)
RETZ (Return Void)
...
GOTO (Branch to a given label)
BEQI (Compare top two Int32 values and branch to label if equal)
...

Say, function structure:
SOFUNC, Start of Function, defines things like the number of elements in
the local variable array, etc. (this array is also where function
arguments arrive). It may also encode the initialization of local
automatic storage (arrays and objects).

Multiple blocks, separated by LABEL instructions. Branches are only
allowed to target LABEL instructions. Label instructions are otherwise
effectively NOP. Nothing is allowed on the stack during a LABEL (so,
say, if we get to a LABEL and the stack offset is non-zero, the bytecode
is invalid).

Function ends with an EOFUNC instruction, which indicates that no more
blocks remain. Control flow is not allowed to reach this instruction.

One can also do a 3AC IR, say:
ADDI Rn, Rs, Rt
which pulls values from, and stores directly to, local variables
(represented conceptually as being within an abstract array).

However, from a design standpoint, 3AC IRs tend to end up with more
up-front complexity than a stack IR, although for a simplistic compiler
or interpreter the 3AC IR would need significantly fewer instructions
(and thus give better performance).

Intermediate options are also possible, though some of them would
interfere with an efficient direct interpreter.
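
To make the "efficient direct interpreter" point concrete, here is a minimal C sketch of an interpreter for a stack bytecode along the lines above; the opcode numbering and operand encoding (one-byte local-variable indexes, two-byte little-endian absolute branch targets) are invented for illustration.

  #include <stddef.h>
  #include <stdint.h>

  enum { OP_ADDI, OP_LOADI, OP_STOREI, OP_GOTO, OP_BEQI, OP_LABEL, OP_RETI };

  int32_t run(const uint8_t *code, int32_t *locals)
  {
      int32_t stack[64];
      int sp = 0;               /* stack must be empty at every LABEL */
      size_t pc = 0;
      for (;;) {
          switch (code[pc++]) {
          case OP_ADDI: {       /* add top two Int32, push result */
              int32_t b = stack[--sp];
              stack[sp - 1] += b;
              break;
          }
          case OP_LOADI:        /* push local variable onto stack */
              stack[sp++] = locals[code[pc++]];
              break;
          case OP_STOREI:       /* pop Int32 into local variable */
              locals[code[pc++]] = stack[--sp];
              break;
          case OP_GOTO:         /* unconditional branch to a LABEL */
              pc = (size_t)(code[pc] | (code[pc + 1] << 8));
              break;
          case OP_BEQI: {       /* pop two Int32, branch if equal */
              int32_t b = stack[--sp], a = stack[--sp];
              if (a == b)
                  pc = (size_t)(code[pc] | (code[pc + 1] << 8));
              else
                  pc += 2;
              break;
          }
          case OP_LABEL:        /* branch target; otherwise a NOP */
              break;
          case OP_RETI:         /* return Int32 popped from stack */
              return stack[--sp];
          }
      }
  }

A 3AC flavor would replace the pushes and pops with direct reads and writes of the locals array, trading a more complex front end for fewer dispatched instructions.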

>>
>> Though, there were a few considered (but mostly stalled) sub-efforts
>> towards trying to run x86-64 code on top of BJX2.
>>>> and may well lead to push-back towards
>>>> simpler/cheaper cores.
>>>
>>> If/When the increase in transistors/$ finally stops, it will be
>>> interesting to see what happens then. Will we see more specialized
>>> hardware, because now you can amortize the development (and mask
>>> costs) over a longer time, even if you only serve a niche? Will
>>> software developers finally get around to designing software that
>>> makes the maximum of the hardware, because inefficiency can no longer
>>> be papered over with faster hardware?
>>>
>>> I have my doubts. Network effects will neuter the advantages of
>>> specialized hardware and of carefully designed software that is late
>>> to the market.
>>>
>> My prediction is more in the form of VLIW based manycore systems likely
>> running software on top of a dynamic translation layer.
>>
>> Unlike traditional VLIW compilers, the dynamic translation later could
>> have access to live / real-time profiler data, so could make better
>> guesses about things like when and how to go about modulo scheduling
>> loops and similar.
>>
>>
>> Sadly, what I am imagining here, isn't all that much like my BJX2
>> project, but BJX2 is limited more to what I can fit on an FPGA, and
>> ended up where it is because I kept finding ways I could push it
>> "forwards" with mostly only small/incremental increases in resource cost.
>>
>>
>> Though, yes, a 1-wide pipelined RISC core is a little cheaper than a
>> 3-wide VLIW, but if limited to the same clock-speed, a 3-wide core can
>> outperform a 1-wide core.
>>
>>
>> The situation might be different if the 1-wide RISC were running at
>> 100MHz, but I have found that "reliably"/"easily" passing timing at
>> 100MHz (on the Spartan-7 and Artix-7) seems to require a using a 32-bit
>> ISA design (such as RV32I).
>>
>> But, personally, I don't find RV32I all that inspiring.
>> Like, RV32I is like some sort of weird phantom that does pretty good on
>> Dhrystone but seemingly kinda sucks at nearly everything else.
>>
>> ...
> <
> Also note: I now have over 100 wed sites where JavaSxcript is turned off
> just to get rid of annoying advertising.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23482&group=comp.arch#23482

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 09:46:39 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2022Feb15.104639@mips.complang.tuwien.ac.at>

Ivan Godard <ivan@millcomputing.com> writes:
>The
>specializer does quite extensive optimization - bundle-packing,
>scheduling, CFG collapse, software pipelining, in- and out-lining, etc.
>- which is too expensive for a JIT, but the passes that do the
>optimizations are optionally bypassed.

By contrast, an OoO microarchitecture does not care whether the native
code has been generated by a JIT compiler or not, it will
branch-predict and schedule the instructions all the same.

>As it seems unlikely that you would want to JIT for multiple targets
>simultaneously, we expect that most JITs will generate machine code,
>which is no more difficult to do than for any conventional-ISA machine
>code target. Standard JIT methods - gluing code snippets together, for
>example - will in effect emulate an in-order one-wide machine just as
>they do on any ISA, and at no greater effort.

On a traditional architecture, the native code does not specify
execution order or width, which allows a Zen3 or Alder Lake to execute
the same code as the 386 did, but with much higher IPC.

Many JIT compilers do much more than gluing code snippets together,
but Gforth pretty much does that, and the result is a typical IPC of
2-3 on a Skylake (usually seen as 4-wide).

>Even with a specializer designed for
>> specialization speed (or a JIT compiler directly generating native
>> code), features like the compressed instruction encoding will cause
>> slowness in native-code generation.
>
>Why? It's just an encoding. Every code generator does bit stuffing. You
>pull the bits from a table and copy them over.

Gforth's code generator just glues code snippets together, and does
not change bits in the native code. More sophisticated code
generators change literals and registers in the native code, but I
think that's cheaper than general instruction compression.
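
(For illustration, here is a minimal sketch of what such snippet-gluing
plus literal-patching can look like, assuming x86-64 SysV on Linux and
ignoring W^X hardening; the byte sequences, stack convention, and entry
point are invented for the example and are not Gforth's actual code.)

/* Minimal snippet-gluing code generator: memcpy precompiled
   machine-code fragments into an executable buffer, patching only
   the 64-bit literal.  Error handling omitted. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* The VM data stack lives in memory; rdi is the VM stack pointer. */
static const uint8_t LIT[] = {
    0x48,0xB8,0,0,0,0,0,0,0,0, /* mov rax, imm64 (patched below) */
    0x48,0x83,0xEF,0x08,       /* sub rdi, 8                     */
    0x48,0x89,0x07 };          /* mov [rdi], rax                 */
static const uint8_t PLUS[] = {
    0x48,0x8B,0x07,            /* mov rax, [rdi]                 */
    0x48,0x83,0xC7,0x08,       /* add rdi, 8                     */
    0x48,0x01,0x07 };          /* add [rdi], rax                 */
static const uint8_t EXIT[] = {
    0x48,0x89,0xF8,            /* mov rax, rdi                   */
    0xC3 };                    /* ret                            */

int main(void)
{
    uint8_t *buf = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
                        MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    uint8_t *p = buf;
    int64_t three = 3, four = 4;

    /* "Compile" the program  3 4 +  by gluing snippets together. */
    memcpy(p, LIT, sizeof LIT); memcpy(p + 2, &three, 8); p += sizeof LIT;
    memcpy(p, LIT, sizeof LIT); memcpy(p + 2, &four, 8);  p += sizeof LIT;
    memcpy(p, PLUS, sizeof PLUS);                          p += sizeof PLUS;
    memcpy(p, EXIT, sizeof EXIT);

    int64_t stack[16];
    int64_t *(*run)(int64_t *) = (int64_t *(*)(int64_t *))buf;
    printf("%ld\n", (long)*run(stack + 16)); /* prints 7 */
    return 0;
}

Note that no scheduling, width-packing, or per-instruction encoding
work happens anywhere in this path; that is the point of the technique.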

>> Doing it as a two-stage process
>> (JIT+specializer) is also going to slow JIT translation down, while a
>> one-stage approach means that m JIT compilers combined with n Mill
>> native hardware implementations cause m*n effort (ideally, you just
>> need to change parameters, but someone certainly would have to invest
>> more effort in the JITs to make them parameterizable in that way).
>
>Every JIT for a new target binary has to deal with the vagaries of that
>binary, so a Mill JIT is as much work as a foobar JIT. However, it seems
>you are thinking that a Mill member is as different from some other
>member as (say) x86 is from ARM. That is flat false.

I am rather thinking of Zen3 microinstructions vs. Alder Lake
microinstructions; or, for a start, Zen2 vs. Zen (very appropriate,
because Zen splits AVX256 instructions into two 128-bit parts, while
Zen2 does not split, just like the difference between some Mill
models).

It's not clear what you mean above. If Gold is as much work as, say,
AMD64, and Silver is also as much work, then supporting, say, 4 Mill
models would mean m*4 effort. As mentioned, I expect somewhat less
effort, but more than for one traditional architecture.

>The specializer puts a *lot* of work into stuffing as many instructions
>as possible into the bundle format - Mill is a (very) wide architecture.
>But for quick and dirty, i.e. JIT, code you can just put one instruction
>in each bundle and be done with it, and ignore the fact that you are
>wasting all that parallelism. The result will be no worse than code for
>any other one-instruction-at-a-time code.

Except that Mill will actually execute this code slowly, while even
the in-order Pentium or Cortex-A53 can execute 2 independent
instructions in parallel (but not all pairs are independent), and, as
mentioned, on OoO implementations we see good IPC even from very
simple JITs.

>Inter-binary migration is possible at the function granularity; it is
>not clear we will choose to support it. After all, there doesn't seem to
>be much call for migrating a running x86 process to an ARM.

There is no x86. I have read about the practice of moving virtual
machines between AMD64 machines in the cloud, but I don't know if they
are moved between microarchitectures there (it seems to me that the
customer pays for a certain microarchitecture, so there may be little
call for that). In any case, I expect no problems when moving virtual
machines from one microarchitecture to another, as long as the second
microarchitecture supports the same instruction set extensions as the
first one.

>> The single address space is also a problem in that
>> context.
>
>It seems adequate for anything you would want to do on a single chip,
>and is expandable if that turns out to be short-sighted. We made the
>deliberate business decision to not support off-chip shared address
>spaces; the box/room/building scales will have to do message passing,
>which seems to be the trend in supercomputers anyway.

That's beside the point. If a virtual machine uses certain pages,
in an SAS system these pages need to be free on the target system when
you migrate the virtual machine.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.112417@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23483&group=comp.arch#23483

 by: Anton Ertl - Tue, 15 Feb 2022 10:24 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>On Monday, February 14, 2022 at 2:15:21 AM UTC-7, Anton Ertl wrote:
>
>> My provisional verdict: Mill has been designed with the supercomputer
>> mindset, and suffers from the usual problems of such designs.
>
>Hmm. While I suspect _my_ designs have that flaw, my perception was
>that the focus in Mill was low power consumption rather than maximum
>performance.
>
>Of course, a focus on throughput, rather than single-thread performance,
>may be exactly what you are referring to.

It seems to me that the extremely wide designs have a focus on
single-thread performance for applications with high instruction-level
parallelism.

But that's not what I meant with "supercomputer mindset". What I mean
is: design the hardware for maximum theoretical performance (or
efficiency) at the cost of making the programmer's jobs harder (not
necessarily the application programmer, but certainly some software
guy or gal).

Examples are abundant; prominent ones are IA-64, Transmeta (which BTW
was first promising performance, later efficiency), the Cell
architecture.

A successful example is SIMD instructions. Why is this successful?
Because it's optional. You don't have to use it, and most software
does not (so one may dispute that SIMD is successful); but if you have
an application that can benefit from SIMD, sometimes you spend the
manpower to vectorize it and it gives you an advantage over the
non-vectorized competition. Nevertheless, Mitch Alsup's VVM is
intended to make this complication of programmers' lives unnecessary,
and if it is implemented in a mainstream architecture, we may see the
use of SIMD for new code wither away on that architecture.
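
(To make the "extra love" concrete, here is the kind of manual
vectorization effort involved, as a hedged sketch in C with SSE
intrinsics; the function names are invented, and the remainder loop
for n not divisible by 4 is deliberately omitted.)

#include <xmmintrin.h>  /* SSE intrinsics */

/* The plain scalar loop: what most code looks like. */
void add_scalar(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* Hand-vectorized: the programmer now owns the 4-at-a-time stride,
   the 16-byte alignment assumption, and the leftover elements. */
void add_sse(float *a, const float *b, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);  /* assumes aligned input */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(a + i, _mm_add_ps(va, vb));
    }
}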

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.114353@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23485&group=comp.arch#23485

 by: Anton Ertl - Tue, 15 Feb 2022 10:43 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>I wonder tho: is it the case that there's a "supercomputer mindset"?
>That doesn't seem to match your background, at least.
>I also got the impression that the Mill might target more
>embedded-style markets (maybe because of your mention of DSPs in some
>of your talks), but I can't remember mentions of Mills targeting HPC.

DSPs might be seen as a niche success of the supercomputer mindset:
Add features like dual address spaces and wide accumulators that
require extra love by programmers to make use of. Because DSPs don't
experience the software crisis* (certainly not for the kernels in
which much of the time is spent), the manufacturer can pay the programmer
to invest that extra love.

* Before people make assumptions about what I mean by "software
crisis": When the software cost is higher than the hardware cost,
the software crisis reigns. This has been the case for much of the
software for several decades.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sug2j8$bid$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23486&group=comp.arch#23486

 by: Thomas Koenig - Tue, 15 Feb 2022 11:25 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

> * Before people make assumptions about what I mean by "software
> crisis": When the software cost is higher than the hardware cost,
> the software crisis reigns. This has been the case for much of the
> software for several decades.

Two reasonable dates for that: 1957 (the first Fortran compiler) or
1964, when the /360 demonstrated for all to see that software (especially
compatibility) was more important than any particular hardware.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sug2p7$bid$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23487&group=comp.arch#23487

 by: Thomas Koenig - Tue, 15 Feb 2022 11:28 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> L1s are stable because if they get bigger than 64 KB they take more than
> one cycle to "access", where SRAM and wire routing are both considered.

One cycle latency or throughput?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.120729@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23488&group=comp.arch#23488

 by: Anton Ertl - Tue, 15 Feb 2022 11:07 UTC

Scott Smader <yogaman101@yahoo.com> writes:
>On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
>> Why? If OoO is superior now at the current transistor budget, why=20
>> would it stop being superior if the transistor budget does not=20
>> increase?
>
>Not sure if the assumption of OoO superiority is justified at current
>transistor budget,

Definitely. Here are two benchmarks, all numbers are times in
seconds:

LaTeX:

- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
- AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
- Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
- Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224

Gforth:

sieve bubble matrix fib fft
0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0

All these cores are 2-wide. The Atom 330 is in-order, the E-450 is
OoO (both AMD64); the Cortex-A53 is in-order, the Cortex-A73 is OoO
(both ARM A64).

Intel switched from in-order to OoO for its little/efficiency cores
with Silvermont, Apple uses OoO cores for their efficiency cores as
well as their performance cores. Intel even switched from in-order to
OoO for the Xeon Phi, which targets HPC, the area where in-order is
strongest.

>but for the same transistor budget at the same clock rate, statically
>scheduled should win because it more efficiently uses the available
>transistors to do useful work.

That's the fallacy that caused Intel and HP to waste billions on
IA-64, and Transmeta investors to invest $969M, most of which was
lost.

>Every time a program runs on an OoO machine, it wastes power to
>speculate about un-taken paths that can be avoided when source code
>compiles to a statically scheduled target instruction set.

OoO is superior in both performance and in efficiency at competitive
performance for smartphones, desktops and servers, as evidenced by the
use of OoO cores as efficiency cores by Apple, Intel and AMD. The
only ones who still stick with in-order efficiency cores are ARM, but
their efficiency does not look great compared to their own efficient
OoO cores, and compared to Apple's efficiency cores.

Concerning speculation, yes, it does waste power, but rarely, because
branch mispredictions are rare.

>Once the speculation transistors are removed, a statically scheduled
>chip can replace them with additional FUs to get more done each clock
>cycle.

"Speculation transistors"? Ok, you can remove the branch predictor
and spend the transistors on an additional FU. The result for much
of the software will be that much less gets done in each
cycle, because there are far fewer instructions ready for execution on
average without speculative execution, and therefore less work will be
done each cycle. If you also remove OoO execution, even fewer
instructions will be executed each cycle. So you have more FUs
available, but they will be idle.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.124937@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23489&group=comp.arch#23489

 by: Anton Ertl - Tue, 15 Feb 2022 11:49 UTC

BGB <cr88192@gmail.com> writes:
>On 2/14/2022 4:17 PM, Anton Ertl wrote:
>> BGB <cr88192@gmail.com> writes:
>>> Hitting this limit is not likely to bode well for the long term
>>> "superiority" of OoO,
>>
>> Why? If OoO is superior now at the current transistor budget, why
>> would it stop being superior if the transistor budget does not
>> increase?
>>
>
>Advantages of OoO were mostly:
> "Ye Olde" scalar code goes keeps getting faster;

That, too.

> Not requiring as much from the compiler;

That's an understatement. Basically, IA-64 tried to succeed by
combining architectural features and smart compilers. The result
suggests that it's too hard to make the envisioned smart compilers.

> Effective at maximizing single thread performance;

Yes, especially that.

Also: OoO allows scheduling across many hundreds of instructions (512
in Alder Lake) rather than typically a few or a few dozen for static
scheduling, and allows making good use of dynamic branch predictors,
which are much more accurate than static branch predictors (someone
here suggested a way to make dynamic branch prediction available to
statically scheduled code, but it's unclear whether it would be
practical to use that with compiled code).
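
(A small made-up C fragment may help show why the static window is so
much smaller: the compiler must leave the load where it is, while an
OoO core can still start it early at run time.)

void opaque(void);  /* the compiler cannot see inside this call  */

long f(long *p, long *q)
{
    opaque();       /* static code motion cannot hoist past here */
    *p = 1;         /* may alias *q, so the compiler also ...    */
    long x = *q;    /* ... cannot move this load any earlier     */
    /* An OoO core can nevertheless issue the load while opaque()
       is still in flight, speculating that no earlier store
       writes *q. */
    return x * 3;
}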

A frequent claim is that OoO wins because cache misses cause varying
latencies, but in an earlier discussion we established that in-order
suffers more from long latencies than from varying latencies (so we
tend to see smaller lower-latency, higher miss-rate L1 caches with
in-order); essentially, in-order is already bad at dealing with
long-latency D-cache hits; D-cache misses are worse, but larger
longer-latency lower-miss D-caches still don't pay off.

>It achieved this at the expense of spending considerable transistor
>budget and energy on the problem.

And yet the Cortex-A55 is less efficient than the Cortex-A75 for
nearly all of the A55 performance range.
<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>.

We have also discussed the area (not transistors, but area is more
relevant anyway), and performance per area does not look beneficial
for IIRC the Cortex-A55 vs. the Cortex-A75, either; i.e., the A55 is
smaller, but it delivers so much less performance that even for
parallel code the performance per area is not much better, while for
single-threaded work, performance per area (assuming you put
more cores in the same area as an A75) is much worse.

>However, the push-back may favor designs which can give more performance
>relative to transistor budget and energy use. Namely, such as VLIW cores
>with dynamic translation from some higher-level IR (could be a
>stack-based IR, but also likely would be an existing ISA being
>repurposed as an IR).

I.e., Transmeta. I don't see any development that would allow that
approach to become competitive. On the contrary, since the heyday of
Transmeta OoO has made significant advances, while I am not aware of
such advances in in-order and VLIW designs.

>Ideally, one needs an ISA that "doesn't suck" as an IR, where I suspect
>both x86-64 and ARM64 making use of condition-codes does not exactly
>work in their favor in this use-case.

Given that the OoO architectures seem to have no insurmountable
problem with the condition codes (even the nasty AMD64 ones), and that
Transmeta was not put off by that problem, I expect that it is
manageable (although not ideal) for this use-case.

>My prediction is more in the form of VLIW based manycore systems likely
>running software on top of a dynamic translation layer.

What has changed in favour of VLIW since Transmeta?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sug7bd$b9l$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23490&group=comp.arch#23490

 by: Ivan Godard - Tue, 15 Feb 2022 12:46 UTC

On 2/15/2022 1:46 AM, Anton Ertl wrote:
> Ivan Godard <ivan@millcomputing.com> writes:
>> The
>> specializer does quite extensive optimization - bundle-packing,
>> scheduling, CFG collapse, software pipelining, in- and out-lining, etc.
>> - which is too expensive for a JIT, but the passes that do the
>> optimizations are optionally bypassed.
>
> By contrast, an OoO microarchitecture does not care whether the native
> code has been generated by a JIT compiler or not, it will
> branch-predict and schedule the instructions all the same.

So will a Mill.

The specializer passes that a JIT would skip are target independent, and
give better code for any target: OOO, IO, or sideways. The specializer
is a multi-target compiler back end, and it does the kind of things that
any highly optimizing back end does. Or doesn't do, if you or your JIT
decide that generation time is more important than to-the-wall
execution speed.

I don't know of any JIT that does PGO or LTO or other analysis-heavy
optimization. Perhaps you could enlighten me.

Once you ignore the target-independent stuff, the only difference
between a OOO target and an IO target is instruction scheduling, which
is a small part of the cost of a back end for either kind of target.

It's a common misapprehension that OOO targets don't need to be
scheduled. They do. A simplistic scheduler needs to schedule producers
before consumers, and so must track dependency relations regardless of
target. A more sophisticated scheduler tracks machine resources such as
registers, and reorders instructions (statically) to lower resource
pressure. Anyone who has done a few compilers knows that the interaction
between register allocation/spill and instruction schedule is the most
annoying part :-)
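
(For concreteness, a toy list scheduler in C that does exactly the
producer-before-consumer tracking described above; the instructions,
latencies, and 1-wide issue are invented for the example.)

#include <stdio.h>

#define N 5
/* dep[i][j] != 0 means instruction i consumes the result of j. */
static const int dep[N][N] = {
    {0,0,0,0,0},            /* 0: load a              */
    {0,0,0,0,0},            /* 1: load b              */
    {1,1,0,0,0},            /* 2: add  = a + b        */
    {1,0,0,0,0},            /* 3: mul  = a * 2        */
    {0,0,1,1,0},            /* 4: store needs add,mul */
};
static const int latency[N] = { 3, 3, 1, 2, 1 };
static const char *name[N] = { "load a", "load b", "add", "mul", "store" };

int main(void)
{
    int done[N] = {0}, finish[N] = {0}, cycle = 0, scheduled = 0;

    while (scheduled < N) {
        for (int i = 0; i < N; i++) {
            if (done[i]) continue;
            int ready = 1;
            for (int j = 0; j < N; j++)   /* all producers finished? */
                if (dep[i][j] && (!done[j] || finish[j] > cycle))
                    ready = 0;
            if (ready) {                  /* issue i in this cycle   */
                done[i] = 1;
                finish[i] = cycle + latency[i];
                printf("cycle %d: %s\n", cycle, name[i]);
                scheduled++;
                break;                    /* 1-wide: one issue/cycle */
            }
        }
        cycle++;                          /* empty cycles = stalls   */
    }
    return 0;
}

A real back end layers register pressure and machine-resource modelling
on top of this skeleton, which is where the annoying interactions with
allocation and spill show up.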

>> As it seems unlikely that you would want to JIT for multiple targets
>> simultaneously, we expect that most JITs will generate machine code,
>> which is no more difficult to do than for any conventional-ISA machine
>> code target. Standard JIT methods - gluing code snippets together, for
>> example - will in effect emulate an in-order one-wide machine just as
>> they do on any ISA, and at no greater effort.
>
> On a traditional architecture, the native code does not specify
> execution order or width, which allows a Zen3 or Alder Lake to execute
> the same code as the 386 did, but with much higher IPC.
>
> Many JIT compilers do much more than gluing code snippets together,
> but Gforth pretty much does that, and the result is a typical IPC of
> 2-3 on a Skylake (usually seen as 4-wide).
>
>> Even with a specializer designed for
>>> specialization speed (or a JIT compiler directly generating native
>>> code), features like the compressed instruction encoding will cause
>>> slowness in native-code generation.
>>
>> Why? It's just an encoding. Every code generator does bit stuffing. You
>> pull the bits from a table and copy them over.
>
> Gforth's code generator just glues code snippets together, and does
> not change bits in the native code. More sophisticated code
> generators change literals and registers in the native code, but I
> think that's cheaper than general instruction compression.
>
>>> Doing it as a two-stage process
>>> (JIT+specializer) is also going to slow JIT translation down, while a
>>> one-stage approach means that m JIT compilers combined with n Mill
>>> native hardware implementations cause m*n effort (ideally, you just
>>> need to change parameters, but someone certainly would have to invest
>>> more effort in the JITs to make them parameterizable in that way).
>>
>> Every JIT for a new target binary has to deal with the vagaries of that
>> binary, so a Mill JIT is as much work as a foobar JIT. However, it seems
>> you are thinking that a Mill member is as different from some other
>> member as (say) x86 is from ARM. That is flat false.
>
> I am rather thinking of Zen3 microinstructions vs. Alder Lake
> microinstructions; or, for a start, Zen2 vs. Zen (very appropriate,
> because Zen splits AVX256 instructions into two 128-bit parts, while
> Zen2 does not split, just like the difference between some Mill
> models).
>
> It's not clear what you mean above. If Gold is as much work as, say,
> AMD64, and Silver is also as much work, then supporting, say, 4 Mill
> models would mean m*4 effort. As mentioned, I expect somewhat less
> effort, but more than for one traditional architecture.

Nope. Mill (any and all members) is as much work as x86 or ARM. Given a
Mill for any member, another member is no added work.

One of our talks was a live demo of the Mill compiler tech. On the spot
we created a new instruction, defined a new member by copying the spec
of an existing member and adding the spec for the new instruction,
rebuilt the tool chain using the new member as a target, wrote an asm
program that used the new instruction, compiled it, and ran it single
stepping in the debugger to show the execution of the new instruction,
correctly scheduled in parallel with the surrounding instructions. All
that in an hour, including slides, talk, and questions.

Mill is a compatible family just as much as the x86 family is. It just
puts the split between family-wide and family-member-specific at a
different point in the hierarchy than an OOO does.

>> The specializer puts a *lot* of work into stuffing as many instructions
>> as possible into the bundle format - Mill is a (very) wide architecture.
>> But for quick and dirty, i.e. JIT, code you can just put one instruction
>> in each bundle and be done with it, and ignore the fact that you are
>> wasting all that parallelism. The result will be no worse than code for
>> any other one-instruction-at-a-time code.
>
> Except that Mill will actually execute this code slowly, while even
> the in-order Pentium or Cortex-A53 can execute 2 independent
> instructions in parallel (but not all pairs are independent), and, as
> mentioned, on OoO implementations we see good IPC even from very
> simple JITs.

You and your straw men. If you want to compare with a two-wide x86 then
you should compare with a two-wide Mill. Actually, there is no such
thing as a two-wide Mill - the minimal width is four - so I suppose that
you should compare against code that has been deliberately restricted to
two instructions per bundle and ignore the rest of the width.

>> Inter-binary migration is possible at the function granularity; it is
>> not clear we will choose to support it. After all, there doesn't seem to
>> be much call for migrating a running x86 process to an ARM.
>
> There is no x86. I have read about the practice of moving virtual
> machines between AMD64 machines in the cloud, but I don't know if they
> are moved between microarchitectures there (it seems to me that the
> customer pays for a certain microarchitecture, so there may be little
> call for that). In any case, I expect no problems when moving virtual
> machines from one microarchitecture to another, as long as the second
> microarchitecture supports the same instruction set extensions as the
> first one.
>
>>> The single address space is also a problem in that
>>> context.
>>
>> It seems adequate for anything you would want to do on a single chip,
>> and is expandable if that turns out to be short-sighted. We made the
>> deliberate business decision to not support off-chip shared address
>> spaces; the box/room/building scales will have to do message passing,
>> which seems to be the trend in supercomputers anyway.
>
> That's beside the point. If a virtual machine uses certain pages,
> in an SAS system these pages need to be free on the target system when
> you migrate the virtual machine.

You are unacquainted with ASIDs used to isolate VMs?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.133946@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23491&group=comp.arch#23491

 by: Anton Ertl - Tue, 15 Feb 2022 12:39 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>Given that out-of-order execution requires a _lot_ of transistors
>for the extra performance it provides, then it logically follows
>that one could get more throughput by putting a large number
>of in-order processors on the same die, does it not?

Performance per area does not look great for the Cortex A55 compared
to the A75, simply because the performance is so much worse.

But assuming we could replace, say, one Zen3 with 8 single-issue
in-order cores in the same area, that would mean 64 cores in a Zen3
CCD, and you would need the infrastructure to coordinate cache and
memory accesses by these 64 cores, and that costs area, too. We have
seen Xeon Phis with 72 cores, but that has apparently been replaced
with the mainline CPUs (with currently up to 40 cores) by Intel.
There have been some many-core CPUs (e.g., from Tilera) quite a while
ago, but they did not take off. Now, with EPYC and some ARM server
CPUs we get in the same region of cores, but with relatively wide
cores. So it appears that even in areas where there's enough parallel
work, the sweet spot is in having relatively wide OoO cores rather
than single-issue in-order.

>If so, why did Intel change the Atom over from in-order to
>out-of-order, and why did the Xeon Phi fail?
>
>I have two answers to offer:
>
>1) We don't know how to write parallel code well, and we
>never will. At least for the values of "well" which would
>allow speed to scale as the number of processors applied
>over a much larger range than is the case at present for
>a given application.

For the Xeon Phi I doubt it. It has HPC as target area, and there is
lots of parallelism in the application there. It seems that the
mainline CPUs were close enough in performance that having another
iteration of Xeon Phi would not have paid off.

>2) The potential memory bandwidth feeding a single die
>is a serious constraint. Thus, it will not be possible to
>make use of the maximum throughput that could be achieved
>on a die through a very large number of in-order processors.

Depends on the application. There certainly are some where this
matters. But there are also others that are not as
bandwidth-sensitive. Intel certainly did not put 72 cores on the die
for bandwidth-limited applications.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<3f9a142c-5b1c-4333-bc9f-6b8f14fe13ccn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23492&group=comp.arch#23492

 by: Quadibloc - Tue, 15 Feb 2022 13:40 UTC

On Tuesday, February 15, 2022 at 5:57:43 AM UTC-7, Anton Ertl wrote:
> Intel certainly did not put 72 cores on the die
> for bandwidth-limited applications.

That's good. I'm glad to hear that Intel wasn't
stupid. But my point was that *we* shouldn't
be stupid in the way that Intel wasn't. We shouldn't
assume that putting an enormous number of cores
on a chip is going to be useful for nearly all kinds
of application.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23493&group=comp.arch#23493

 by: Stefan Monnier - Tue, 15 Feb 2022 13:43 UTC

> Concerning speculation, yes, it does waste power, but rarely, because
> branch mispredictions are rare.

Of course in-order cores also speculate, so this is only
tangentially related. But as a side-note I'll point out that statically
scheduled processors encourage the compiler to move long-latency
instructions such as loads to "as soon as it's safe to do it" rather
than "as soon as we know we will need it", and that sometimes
requires adding yet more "compensation code" in other branches.

Of course, such compilation strategies can also be done for OoO
processors, but they're less beneficial to overall performance there.

So a given program optimized for in-order execution may end up
executing more instructions than if it had been optimized for an OoO CPU.
That can be another efficiency advantage of OoO.
I don't actually know of any study that demonstrates this effect, tho.
If someone knows of one, I'd be interested to hear about it.

Stefan
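
(To make the effect concrete: a made-up C sketch, not from the post
above, of hoisting a long-latency load "as soon as it's safe".)

/* As written: the long-latency load sits on one path only. */
long f_original(long *p, int c)
{
    if (c)
        return *p + 1;
    return 0;
}

/* Scheduled for an in-order target: the load is hoisted above the
   branch (legal only if *p is known readable here) so its latency
   overlaps the compare-and-branch ... */
long f_hoisted(long *p, int c)
{
    long x = *p;      /* ... at the price of executing the load on  */
    if (c)            /* the c == 0 path too; hoisting past a store */
        return x + 1; /* or out of a loop can additionally force    */
    return 0;         /* explicit compensation code on other paths. */
}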

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<RsPOJ.2790$3Pje.2504@fx09.iad>

https://www.novabbs.com/devel/article-flat.php?id=23495&group=comp.arch#23495

 by: EricP - Tue, 15 Feb 2022 15:15 UTC

Anton Ertl wrote:
> Scott Smader <yogaman101@yahoo.com> writes:
>
>> Every time a program runs on an OoO machine, it wastes power to
>> speculate about un-taken paths that can be avoided when source code
>> compiles to a statically scheduled target instruction set.
>
> OoO is superior in both performance and in efficiency at competitive
> performance for smartphones, desktops and servers, as evidenced by the
> use of OoO cores as efficiency cores by Apple, Intel and AMD. The
> only ones who still stick with in-order efficiency cores are ARM, but
> their efficiency does not look great compared to their own efficient
> OoO cores, and compared to Apple's efficiency cores.
>
> Concerning speculation, yes, it does waste power, but rarely, because
> branch mispredictions are rare.
>
>> Once the speculation transistors are removed, a statically scheduled
>> chip can replace them with additional FUs to get more done each clock
>> cycle.
>
> "Speculation transistors"? Ok, you can remove the branch predictor
> and spend the transistors on an additional FU. The result for much
> of the software will be that much less gets done in each
> cycle, because there are far fewer instructions ready for execution on
> average without speculative execution, and therefore less work will be
> done each cycle. If you also remove OoO execution, even fewer
> instructions will be executed each cycle. So you have more FUs
> available, but they will be idle.
>
> - anton

The static and dynamic leakage current matters here:
an in-order core sitting idle waiting for memory but still leaking
draws some percentage of the power those transistors would have been
doing useful work with in an OoO, while getting no benefit.

The gate voltage affects leakage but also delay:
lower voltage lowers leakage but increases delay,
and gate delay interacts with stage delay, which interacts with frequency.
If the frequency of the mem stage is faster than the cache, it can wind up
wasting a partial extra cycle to round the access up to a whole cycle,
which wastes power by leaking while sitting idle.

There is probably some great whacking differential equation
that tells us where the optimum of all the parameters lies.

One could program that into an analog computer with patch cords
and see what it says.
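
(In that spirit, a toy numerical model with invented constants but the
shape described above: leakage and dynamic energy rise with voltage,
delay falls, and a sweep finds an interior optimum.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double C = 1.0, Vt = 0.3, k = 0.05;  /* made-up constants */
    double best_v = 0.0, best_e = 1e30;

    for (double v = 0.40; v <= 1.20; v += 0.01) {
        double delay  = v / ((v - Vt) * (v - Vt)); /* alpha-power-ish */
        double e_dyn  = C * v * v;                 /* switching CV^2  */
        double e_leak = k * exp(4.0 * v) * delay;  /* leak power*time */
        double e = e_dyn + e_leak;
        if (e < best_e) { best_e = e; best_v = v; }
    }
    printf("toy optimum near V = %.2f\n", best_v);  /* ~0.64 with
                                                       these constants */
    return 0;
}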

Re: Statically scheduled plus run ahead.

<7c9f3d71-68f4-49d4-a38b-3f339d67b271n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23498&group=comp.arch#23498

 by: MitchAlsup - Tue, 15 Feb 2022 16:03 UTC

On Tuesday, February 15, 2022 at 3:04:40 AM UTC-6, Ivan Godard wrote:
> On 2/15/2022 12:41 AM, Brett wrote:
> > Scott Smader <yogam...@yahoo.com> wrote:
> >> On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
> >>> BGB <cr8...@gmail.com> writes:
> >>>> Hitting this limit is not likely to bode well for the long term
> >>>> "superiority" of OoO,
> >>> Why? If OoO is superior now at the current transistor budget, why
> >>> would it stop being superior if the transistor budget does not
> >>> increase?
> >>
> >> Not sure if the assumption of OoO superiority is justified at current
> >> transistor budget, but for the same transistor budget at the same clock
> >> rate, statically scheduled should win because it more efficiently uses
> >> the available transistors to do useful work.
> >> Every time a program runs on an OoO machine, it wastes power to speculate
> >> about un-taken paths that can be avoided when source code compiles to a
> >> statically scheduled target instruction set.
> >> Once the speculation transistors are removed, a statically scheduled chip
> >> can replace them with additional FUs to get more done each clock cycle.
> >
> > Memory is 150 cycles away and CPUs are 6 wide, so that is a 900
> > instruction advantage for an OoO engine that guesses right with its branch
> > predictions.
<
> Are you assuming that in-order precludes branch prediction? That would
> markedly beg the question.
<
The My 66130 design point is 1-wide, but CoIssues instructions to different
function units as long as 3R1W register traffic is obeyed.
<
It has no branch prediction, but scans ahead in the instruction queue looking
for branches. The original RISC designs needed a delay slot after the branch
to make the pipeline work. I call this 2.0 cycles. M66130 averages a branch
taken cost of 0.108 cycles. a large amount of the time the branch is CoIssued
with a LD or with a calculation or with a compare. So, the amount of time
branching adds to the execution is very low. AND there is no predictor !!
Performance essentially equal to a really good predictor without the predictor.
<
Wider issue machines need a predictor, as the scan-ahead problem increases
in complexity with the square to cube of issue width.
<
> > An in-order design needs to have a separate run ahead engine predicting
> > branches and prefetching data, basically an OoO engine on front. That OoO
> > engine is doing 90% of the work making the back in-order engine pointless,
> > UNLESS that back engine is wider than the front engine. Note that the front
> > engine can throw away most work instructions focusing on branches and
> > loads, so it does not need to be wide.
<
> You are right that run-ahead prediction using run-ahead execution needs
> OOO to be effective - very early Mill had a scout processor (the
> "precore") for that, and it didn't work and we abandoned it.
<
Neither do many of the data prediction schemes.......
>
> Mill has a patent that does runahead branch prediction without
> execution; ours is not the only possible approach. Any of these (RA
> speculation without RA execution) can run rings around using an OOO as a
> scout instead of branch prediction.
> > I would look at a 16 wide in-order back design with a 3 wide OoO front
> > engine.
> >
> > When I say 3 wide I am partially lying as it needs 8 wide fetch and decode,
> > and then throws away most of those instructions.
> >
> > When I say in-order I am lying as even primitive in-order designs do OoO
> > loads, the and OoO front engine is also prefetching loads.
> >
> > Note that the front engine would mispredict backwards branches as not
> > taken, as it does not care about loops. Instead it would do a bulk preload
> > off the base pointers in the loop if those and the loop count are known..
> >
> > There are so many gotcha’s in this approach that you are facing doom if you
> > don’t have Intel or Apple levels of cubic dollars invested.
> >
> > Intel bought a company that claimed it could do something like this, a
> > decade ago, people were expecting products that would crush AMD several
> > years ago. No sign of this on roadmaps and don’t know if Intel has shut
> > this unit down yet. Instead Intel makes huge empty promises of
> > breakthroughs and ships tiny irrelevant tweaks to existing designs.
<
> It is not unknown for market dominators to buy potential market
> disruptors and quietly kill them to let the golden goose keep its margins..
<
This was outlined in Machiavelli's "The Prince", and was the MO of pirates
and thieves for millennia.
<
> > Intel has been run by incompetents for decades, and it has not mattered,
> > that is the power of a monopoly.
<
IBM went through decades of similar incompetence..
> >
> > Looks like Intel is betting on on-die memory requiring lots of fabs making
> > huge dies instead. Apple will be first, and if AMD keeps it’s act together
> > they will be second ahead of Intel.
> >
> > My prediction is 8 cores with 8 gigabytes of ram on a huge die, the perfect
> > gaming system. Yes you actually need 16 gigabytes, but off die ram will
> > take care of the overflow. Once density hits 32 gigabytes off die ram dies.
> >
> >>>> and may well lead to push-back towards
> >>>> simpler/cheaper cores.
> >>> If/When the increase in transistors/$ finally stops, it will be
> >>> interesting to see what happens then. Will we see more specialized
> >>> hardware, because now you can amortize the development (and mask
> >>> costs) over a longer time, even if you only serve a niche? Will
> >>> software developers finally get around to designing software that
> >>> makes the maximum of the hardware, because inefficiency can no longer
> >>> be papered over with faster hardware?
> >>>
> >>> I have my doubts. Network effects will neuter the advantages of
> >>> specialized hardware and of carefully designed software that is late
> >>> to the market.
> >>> - anton
> >>
> >> Statically scheduled isn't specialized. Optimizing transistor budgets in
> >> a general purpose chip is, to a great degree, separable from optimizing
> >> the applications that run on it, and it should be.
> >>
> >> Imo, that's what makes Mill so keen, but as you say, it will be interesting to see.
> >>
> >>> --
> >>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
> >>
> >
> >
> >

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugjhv$v6u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23500&group=comp.arch#23500

 by: Stephen Fuld - Tue, 15 Feb 2022 16:14 UTC

On 2/15/2022 3:49 AM, Anton Ertl wrote:

snip

> Also: OoO allows scheduling across many hundreds of instructions (512
> in Alder Lake) rather than typically a few or a few dozen for static
> scheduling,

I am not suggesting that isn't true, but I do question why it is true.
That is, if it is beneficial, I presume a compiler could do its
scheduling across a window at least as big as HW. The compiler can use
more memory and time than is available to the HW. As a minimum, it
could emulate what the HW does so should be equal (excepting for
variable length delays).

So is such a big window not beneficial to the compiler?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23501&group=comp.arch#23501
 by: Scott Smader - Tue, 15 Feb 2022 16:21 UTC

On Tuesday, February 15, 2022 at 3:47:39 AM UTC-8, Anton Ertl wrote:
> Scott Smader <yogam...@yahoo.com> writes:
> >On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
> >> Why? If OoO is superior now at the current transistor budget, why
> >> would it stop being superior if the transistor budget does not
> >> increase?
> >
> >Not sure if the assumption of OoO superiority is justified at current
> >transistor budget,
>
> Definitely. Here are two benchmarks, all numbers are times in
> seconds:
>
> LaTeX:
>
> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
>
> Gforth:
>
> sieve bubble matrix fib fft
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
>
> All these cores are 2-wide. The Atom 330 is in-order, the E-450 is
> OoO (both AMD64); the Cortex-A53 is in-order, the Cortex-A73 is OoO
> (both ARM A64).
>

Fine results. Thank you. But at least the A53/A73 numbers don't prove your claim. The A73 cores are about twice as big as the A53 cores, according to https://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled/3, so a 2x performance improvement (2.488 s vs. 1.224 s on your LaTeX run) indicates roughly equal efficiency per unit area, not superior efficiency.

> Intel switched from in-order to OoO for its little/efficiency cores
> with Silvermont, Apple uses OoO cores for their efficiency cores as
> well as their performance cores. Intel even switched from in-order to
> OoO for the Xeon Phi, which targets HPC, the area where in-order is
> strongest.
>
> >but for the same transistor budget at the same clock rate,
> >statically scheduled should win because it more efficiently uses the
> >available transistors to do useful work.
> That's the fallacy that caused Intel and HP to waste billions on
> IA-64, and Transmeta investors to invest $969M, most of which was
> lost.

Agreed that Transmeta and IA-64 failed. Disagree that two failures prove static scheduling's efficiency to be a fallacy.

>
> >Every time a program runs on an OoO machine, it wastes power to
> >speculate about un-taken paths that can be avoided when source code
> >compiles to a statically scheduled target instruction set.
>
> OoO is superior in both performance and efficiency at competitive
> performance for smartphones, desktops and servers, as evidenced by the
> use of OoO cores as efficiency cores by Apple, Intel and AMD. The
> only ones who still stick with in-order efficiency cores are ARM, but
> their efficiency does not look great compared to their own efficient
> OoO cores, and compared to Apple's efficiency cores.
>
> Concerning speculation, yes, it does waste power, but rarely, because
> branch mispredictions are rare.
>
> >Once the speculation transistors are removed, a statically scheduled
> >chip can replace them with additional FUs to get more done each
> >clock cycle.
> "Speculation transistors"? Ok, you can remove the branch predictor
> and spend the transistors for an additional FU. The result for much
> of the software will be that there will be done much less in each
> cycle, because there are far fewer instructions ready for execution on
> average without speculative execution, and therefore less work will be
> done each cycle. If you also remove OoO execution, even fewer
> instructions will be executed each cycle. So you have more FUs
> available, but they will be idle.

I've said that statically scheduled needs VLIW, which you've ignored by comparing cores of equal issue width.

Whether fewer instructions are executed per cycle matters less than the number of results the FUs produce per cycle. VLIW lets more FUs be active per cycle, and VLIW doesn't necessarily see execution delays for NOP placeholders. OoO complexity grows approximately as O(n^2) in the number of FUs, right?
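
As a back-of-the-envelope illustration of that scaling, here is a rough cost model in C. It is my own sketch, assuming one tag comparator per source operand per result bus and a window that grows with the FU count; real designs differ, but the trend is the point:

/* Rough wakeup-logic cost model for an OoO issue window: each of
   W window entries compares 2 source tags against each of B result
   buses every cycle, so comparators ~= 2*W*B.  With B equal to the
   FU count and W scaled to keep the FUs fed, cost grows roughly
   quadratically in the number of FUs. */
#include <stdio.h>

int main(void) {
    for (int fus = 2; fus <= 16; fus *= 2) {
        int window = 16 * fus;     /* assume window scales with FUs */
        long comparators = 2L * window * fus;
        printf("%2d FUs, %3d-entry window: ~%ld tag comparators\n",
               fus, window, comparators);
    }
    return 0;
}

Each doubling of the FU count roughly quadruples the comparator count, which is the O(n^2) growth I mean.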

I've agreed that dynamic branch prediction deserves a place in a statically scheduled architecture, so the correct instruction will be available just as soon as on an OoO machine. But OoO execution unavoidably consumes power and chip area that static scheduling avoids.

But I'm repeating myself.

Let's build a Mill and see what happens.

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23502&group=comp.arch#23502
 by: Stefan Monnier - Tue, 15 Feb 2022 16:22 UTC

> I am not suggesting that isn't true, but I do question why it is
> true. That is, if it is beneficial, I presume a compiler could do
> its scheduling across a window at least as big as the HW's. The
> compiler can use more memory and time than are available to the HW.
> At a minimum, it could emulate what the HW does, so it should be at
> least equal (except for variable-length delays).

To make good use of a 1000-instruction window, you need extremely good
branch prediction. We know how to build such predictors for CPUs
(i.e. where they have access to the actual run-time behavior), but we
have no clue how to do that in a compiler (which works on the static
code without knowledge of the actual run-time values manipulated).
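
For example, the classic two-bit saturating counter is trivial in
hardware precisely because it observes each branch's actual outcome
stream. A sketch in C (the table size and indexing are arbitrary):

/* Minimal 2-bit saturating-counter branch predictor.  It works only
   because it observes each branch's actual run-time outcomes --
   exactly the feedback a compiler never has. */
#include <stdio.h>
#include <stdint.h>

#define TABLE_BITS 10
static uint8_t counter[1 << TABLE_BITS];  /* 0..3, starts at 0 */

static int predict(uint32_t pc) {        /* 1 = predict taken */
    return counter[pc & ((1 << TABLE_BITS) - 1)] >= 2;
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counter[pc & ((1 << TABLE_BITS) - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void) {
    /* A loop branch: taken 9 times, then not taken, repeatedly. */
    uint32_t pc = 0x400123;
    int hits = 0, total = 0;
    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < 10; i++) {
            int taken = (i != 9);
            hits += (predict(pc) == taken);
            update(pc, taken);
            total++;
        }
    printf("accuracy: %d/%d\n", hits, total);
    return 0;
}

On the 9-taken/1-not-taken loop branch above it converges to roughly
90% accuracy; a compiler looking only at the static code has no
equivalent feedback to learn from.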

Stefan
