
Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugkji$6vi$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23503&group=comp.arch#23503

Newsgroups: comp.arch
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 08:32:47 -0800
Message-ID: <sugkji$6vi$1@dont-email.me>
In-Reply-To: <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
 by: Stephen Fuld - Tue, 15 Feb 2022 16:32 UTC

On 2/15/2022 8:22 AM, Stefan Monnier wrote:
>> I am not suggesting that isn't true, but I do question why it is true. That
>> is, if it is beneficial, I presume a compiler could do its scheduling across
>> a window at least as big as HW. The compiler can use more memory and time
>> than is available to the HW. As a minimum, it could emulate what the HW
>> does so should be equal (excepting for variable length delays).
>
> To make good use of a 1000-instruction window, you need extremely good
> branch prediction. We know how to build such predictors for CPUs
> (i.e. where they have access to the actual run-time behavior) but we
> have no clue how to do that in a compiler (which works on the static
> code without knowledge of the actual run time values manipulated).

Thank you. That makes perfect sense. BTW, does that present an
opportunity for some sort of profile driven optimization where the run
time branch history is fed back to a future compilation for better
optimization? Probably not as useful for an OoO machine.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)
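Monnier's claim that a 1000-instruction window needs extremely good prediction can be put in rough numbers. The sketch below is illustrative only; the one-branch-per-five-instructions density is an assumed rule of thumb, not a figure from the thread:

```python
# Rough model: with one branch every d instructions and per-branch prediction
# accuracy p, the expected number of instructions fetched before the first
# mispredict (a geometric distribution over branches) is about d / (1 - p).

def expected_window(d, p):
    """Expected instructions fetched until the first mispredicted branch."""
    return d / (1.0 - p)

def accuracy_needed(d, window):
    """Per-branch accuracy needed to keep mispredicts ~window instructions apart."""
    return 1.0 - d / window

print(expected_window(5, 0.95))   # ~100 instructions: 95% accuracy is far too weak
print(accuracy_needed(5, 1000))   # 0.995: a 1000-deep window wants ~99.5% per branch
```

With a branch every five instructions, even 95% per-branch accuracy stalls a 1000-entry window long before it fills, which is why run-time predictors rather than static compiler heuristics are the enabling technology.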

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugkul$996$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23504&group=comp.arch#23504

Newsgroups: comp.arch
From: not_va...@comcast.net (James Van Buskirk)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 09:38:38 -0700
Message-ID: <sugkul$996$1@dont-email.me>
In-Reply-To: <70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>
 by: James Van Buskirk - Tue, 15 Feb 2022 16:38 UTC

"Scott Smader" wrote in message
news:70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com...

> Disclosure: I am so smitten by Mill, that I have made an investment
> in Mill Computing. I hope that isn't swaying my analysis. If I'm wildly
> wrong, I'd really like to know. Please be the friend who breaks it to me.

I'm really cheering for Mill, but I worry about the Magic Compiler
issue, and that it perhaps doesn't offer enough support for pizza-faced
kids in their parents' garages trying to implement a fun IoT device
on a no-income budget, with no formal training, oscilloscope, or the like.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=23505&group=comp.arch#23505

Newsgroups: comp.arch
From: monn...@iro.umontreal.ca (Stefan Monnier)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 11:58:11 -0500
Message-ID: <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
 by: Stefan Monnier - Tue, 15 Feb 2022 16:58 UTC

> Thank you. That makes perfect sense. BTW, does that present an opportunity
> for some sort of profile driven optimization where the run time branch
> history is fed back to a future compilation for better optimization?
> Probably not as useful for an OoO machine.

Profile-driven optimization is used, of course, but the problem remains:
the compiler needs to generate code that works for all possible
situations, and it can't freely duplicate code all over the place.
A bit of code duplication (to specialize a code path for a few different
scenarios) can be done, but only within fairly strict limits; otherwise
code size explodes and performance goes down the drain again.

In contrast, an OoO is free to use a different schedule each time
a chunk of code is run (and similarly the branch predictor is free to
provide wildly different predictions each time that chunk of code is
run) without any downside.

The OoO works on the trace of the actual execution, where the main limit
is the size of the window it can consider (linked to the accuracy of the
branch predictor), whereas the compiler is not limited to such a window,
but is instead limited to working on the non-unrolled code.

Stefan
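The run-time/compile-time asymmetry described above can be made concrete with the simplest hardware scheme there is: a 2-bit saturating counter per branch. This is a textbook sketch, not any particular CPU's predictor:

```python
# A 2-bit saturating counter per branch address: states 0-1 predict not-taken,
# states 2-3 predict taken. Each resolved branch nudges its counter, so the
# prediction tracks whatever this particular run actually does -- information
# a static compiler never sees.

from collections import defaultdict

class TwoBitPredictor:
    def __init__(self):
        self.counters = defaultdict(lambda: 1)  # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc] >= 2           # True means "predict taken"

    def update(self, pc, taken):
        c = self.counters[pc]
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop back-edge: taken 99 times, then falls through once at loop exit.
p = TwoBitPredictor()
correct = 0
for taken in [True] * 99 + [False]:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(correct)  # 98 of 100: one miss while warming up, one at loop exit
```

Real predictors (gshare, TAGE, perceptrons) add history and tagging, but the principle is the same: the table is rebuilt from the actual trace on every run, at no cost in code size.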

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugmke$2sfo$1@gal.iecc.com>


https://www.novabbs.com/devel/article-flat.php?id=23507&group=comp.arch#23507

Newsgroups: comp.arch
From: joh...@taugh.com (John Levine)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 17:07:26 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sugmke$2sfo$1@gal.iecc.com>
 by: John Levine - Tue, 15 Feb 2022 17:07 UTC

According to Thomas Koenig <tkoenig@netcologne.de>:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
>> * Before people make assumptions about what I mean with "software
>> crisis": When the software cost is higher than the hardware cost,
>> the software crisis reigns. This has been the case for much of the
>> software for several decades.
>
>Two reasonable dates for that: 1957 (the first Fortran compiler) or
>1964, when the /360 demonstrated for all to see that software (especially
>compatibility) was more important than any particular hardware.

It was earlier than that. IBM announced the transistorized 7070 in
1958 and shipped it in 1960, intending it to replace the vacuum-tube
650 and 705. But it was a flop because it couldn't run 705
programs. IBM rushed out the 7080, which was essentially a
transistorized 705 built out of 7090 parts.

The 360 was a huge gamble since they knew customers would not be happy
to have to reprogram. That's why every 360 up to the /65 had optional
emulation microcode so customers could run their 70xx and 14xx software.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugnq4$s6j$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23508&group=comp.arch#23508

Newsgroups: comp.arch
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 09:27:32 -0800
Message-ID: <sugnq4$s6j$1@dont-email.me>
In-Reply-To: <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
 by: Stephen Fuld - Tue, 15 Feb 2022 17:27 UTC

On 2/15/2022 8:58 AM, Stefan Monnier wrote:
>> Thank you. That makes perfect sense. BTW, does that present an opportunity
>> for some sort of profile driven optimization where the run time branch
>> history is fed back to a future compilation for better optimization?
>> Probably not as useful for an OoO machine.
>
> Profile-driven optimization is used, of course, but the problem remains:
> the compiler needs to generate code that works for all possible
> situations and it can't freely duplicate code all over the place.
> A bit of code duplication (to specialize a code path to a few different
> scenarios) can be done, but only within fairly strict limits otherwise
> code size will explode and performance goes down the drain again.
>
> In contrast, an OoO is free to use a different schedule each time
> a chunk of code is run (and similarly the branch predictor is free to
> provide wildly different predictions each time that chunk of code is
> run) without any downside.
>
> The OoO works on the trace of the actual execution, where the main limit
> is the size of the window it can consider (linked to the accuracy of the
> branch predictor), whereas the compiler is not limited to such a window
> but instead it's limited to work on the non-unrolled code.

That all makes sense. Thank you again.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<6eac4285-5bb0-4e89-a692-2e7a91009987n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23511&group=comp.arch#23511

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 10:07:54 -0800 (PST)
Message-ID: <6eac4285-5bb0-4e89-a692-2e7a91009987n@googlegroups.com>
In-Reply-To: <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
 by: MitchAlsup - Tue, 15 Feb 2022 18:07 UTC

On Tuesday, February 15, 2022 at 10:22:17 AM UTC-6, Stefan Monnier wrote:
> > I am not suggesting that isn't true, but I do question why it is true. That
> > is, if it is beneficial, I presume a compiler could do its scheduling across
> > a window at least as big as HW. The compiler can use more memory and time
> > than is available to the HW. As a minimum, it could emulate what the HW
> > does so should be equal (excepting for variable length delays).
<
> To make good use of a 1000-instruction window, you need extremely good
> branch prediction. We know how to build such predictors for CPUs
> (i.e. where they have access to the actual run-time behavior) but we
> have no clue how to do that in a compiler (which works on the static
> code without knowledge of the actual run time values manipulated).
<
Back in the days of Opteron: up to 50%* of the work inserted into execution
never made it to retirement. This was an execution window of depth ~64,
3-wide issue, and 16 Kbytes of branch predictor. Opteron would range in
bursts from 3 IPC down to 0 IPC, averaging just over 1 x86-64 instruction
per clock. A lot of this was the window sitting full, waiting for memory.
<
So, we lost ~50% to branch prediction (3.0->1.5); then we lost another ~33%
to execution window size (latency).
<
A 1000-instruction window cannot be done "flat"*; one would have to
have a layered window with 100-ish instructions using FUs, 600 waiting to
get in, and 200 waiting to retire--or something on those orders. Notice that
modern IBM z-Series machines have a 10-ish stage retirement pipeline.
<
(*) where flat means every instruction is in reservation stations from issue
to retirement.
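The loss accounting above reduces to simple arithmetic; the percentages are Mitch's, and chaining them this way is just one reading of his numbers:

```python
# Chain the quoted losses on a 3-wide machine: ~50% of issued work dies to
# branch misprediction, then ~1/3 of the remainder to window/latency limits.
issue_width = 3.0
after_bp = issue_width * (1 - 0.50)      # 3.0 -> 1.5 IPC after mispredict losses
after_window = after_bp * (1 - 1 / 3)    # 1.5 -> ~1.0 IPC, matching the observed average
print(after_bp, round(after_window, 2))
```

The chained result lands at the "just over 1 IPC" Opteron average quoted above, which is what makes the two loss factors a plausible decomposition.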
>
>
> Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.191558@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23515&group=comp.arch#23515

Newsgroups: comp.arch
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 18:15:58 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2022Feb15.191558@mips.complang.tuwien.ac.at>
 by: Anton Ertl - Tue, 15 Feb 2022 18:15 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>
>> * Before people make assumptions about what I mean with "software
>> crisis": When the software cost is higher than the hardware cost,
>> the software crisis reigns. This has been the case for much of the
>> software for several decades.
>
>Two reasonable dates for that: 1957 (the first Fortran compiler) or
>1964, when the /360 demonstrated for all to see that software (especially
>compatibility) was more important than any particular hardware.

Yes, the term "software crisis" is from 1968, but programming language
implementations had demonstrated even before Fortran (and actually more
so, because pre-Fortran languages had less implementation investment
and did not utilize the hardware as well) that there is a world where
software cost is more relevant than hardware cost. But of course at
the time, that software segment was considered a negligible niche.
Similar sentiments survive to this day, as demonstrated by some recent
comments about JITs.

And likewise, the building of a family of compatible machines
demonstrates that software cost played an important role in hardware
design (and trumped the clean-slate approach that hardware designers
would prefer) as early as the IBM 7094, and certainly in the S/360.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb15.194310@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23517&group=comp.arch#23517

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 18:43:10 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 26
Message-ID: <2022Feb15.194310@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
Injection-Info: reader02.eternal-september.org; posting-host="5b45c8e4e010c1ee414315010f84ca5d";
logging-data="22089"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hjn6rZxqDyNyUAbkz4pBB"
Cancel-Lock: sha1:lrHEFqvOLlMtUpCefMwL2Z9tlzU=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 15 Feb 2022 18:43 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> Concerning speculation, yes, it does waste power, but rarely, because
>> branch mispredictions are rare.
>
>Of course in-order cores also speculate, so this is only
>tangentially related. But as a side-note I'll point out that statically
>scheduled processors encourage the compiler to move long-latency
>instructions such as loads to "as soon as it's safe to do it" rather
>than "as soon as we know we will need it", and that sometimes
>requires adding yet more "compensation code" in other branches.

Loads are a bad example, because you typically don't move them above a
branch they control-depend on, because loads can trap. Instead one
tends to use prefetches. IA-64 had a mechanism that allowed moving
loads up, however. Non-trapping instructions such as multiplications
can be moved up speculatively by the compiler. And because compiler
branch prediction (~10% miss rate) is much worse than dynamic branch
prediction (~1% miss rate, both numbers vary strongly with the
application, so take them with a grain of salt), a static scheduling
speculating compiler will tend to waste more energy for the same
degree of speculation.
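Anton's energy point can be put in a toy formula (my own sketch; only
the ~10% and ~1% miss rates come from the post): if every branch lets
some fixed number of instructions issue speculatively, and a
misprediction discards them all, the expected waste per branch is just
miss_rate * depth.

```c
/* Toy model (an illustrative assumption, not from the post): every
 * branch lets 'depth' instructions issue speculatively, and a
 * misprediction discards all of them. */
double wasted_per_branch(double miss_rate, int depth)
{
    return miss_rate * (double)depth;   /* expected discarded insns */
}

/* Waste ratio of static (~10% miss) vs. dynamic (~1% miss) speculation
 * at equal depth; the two rates are the post's rough figures. */
double static_vs_dynamic_waste(int depth)
{
    return wasted_per_branch(0.10, depth) / wasted_per_branch(0.01, depth);
}
```

Under this model the ratio is 10x at any depth: for the same degree of
speculation, the statically scheduled machine discards about ten times
as much issued work.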

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugu4m$9au$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23519&group=comp.arch#23519

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 13:15:32 -0600
Organization: A noiseless patient Spider
Lines: 188
Message-ID: <sugu4m$9au$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Feb 2022 19:15:34 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b15bfce4c7a8c2fd9d7d9df155af1db1";
logging-data="9566"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+F9jUFF8mVrBCrcjlgM8Jg"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:NcKosO34DQ3e0zxcMudywdCgFp0=
In-Reply-To: <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 15 Feb 2022 19:15 UTC

On 2/15/2022 10:21 AM, Scott Smader wrote:
> On Tuesday, February 15, 2022 at 3:47:39 AM UTC-8, Anton Ertl wrote:
>> Scott Smader <yogam...@yahoo.com> writes:
>>> On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
>>>> Why? If OoO is superior now at the current transistor budget, why=20
>>>> would it stop being superior if the transistor budget does not=20
>>>> increase?
>>>
>>> Not sure if the assumption of OoO superiority is justified at current
>>> transistor budget,
>>
>> Definitely. Here are two benchmarks, all numbers are times in
>> seconds:
>>
>> LaTeX:
>>
>> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
>> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
>> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
>> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
>>
>> Gforth:
>>
>> sieve bubble matrix fib fft
>> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
>> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
>> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
>> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
>>
>> All these cores are 2-wide. The Atom 330 is in-order, the E-450 is
>> OoO (both AMD64); the Cortex-A53 is in-order, the Cortex-A73 is OoO
>> (both ARM A64).
>>
>
> Fine results. Thank you. But at least the A53/A73 numbers don't prove your claim. The A73 cores are twice as big as the A53 cores, according to https://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled/3. So a 2x performance improvement indicates equal efficiency per unit area, not superior.
>

Also the thing I was getting at:
Assume a future which is transistor-budget limited.

Unless the OoO cores deliver superior performance relative to their
transistor budget, they have a problem in this case.

OoO has tended to be faster, but at the cost of a higher transistor budget.

>> Intel switched from in-order to OoO for its little/efficiency cores
>> with Silvermont, Apple uses OoO cores for their efficiency cores as
>> well as their performance cores. Intel even switched from in-order to
>> OoO for the Xeon Phi, which targets HPC, the area where in-order is
>> strongest.
>>
>>> but for the same transistor budget at the same clock rate, statically
>>> scheduled should win because it more efficiently uses the available
>>> transistors to do useful work.
>> That's the fallacy that caused Intel and HP to waste billions on
>> IA-64, and Transmeta investors to invest $969M, most of which was
>> lost.
>
> Agreed that Transmeta and IA-64 failed. Disagree that two failures prove static scheduling's efficiency to be a fallacy.
>

Likewise.

Though, one can't ignore one big drawback of in-order designs:
They do not deal well with cache misses.

One can use prefetching to good effect, but this may require involvement
from the programmer, and C doesn't really have a "good" way to express a
prefetch operation.

Also, the prefetch has to be timed correctly: far enough in advance to
absorb the cache miss, but recent enough that the prefetched data
doesn't get evicted between the prefetch and the time the data is
used.
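As an aside on the "no good way in C" point: GCC and Clang do offer the
nonstandard __builtin_prefetch extension, which makes the timing
problem concrete. A sketch, where the prefetch distance is purely an
assumed tuning value:

```c
#include <stddef.h>

/* Sketch of the prefetch-distance problem in C.  ISO C has no prefetch
 * operation; this relies on the GCC/Clang extension __builtin_prefetch.
 * PF_DIST is a machine-dependent guess: far enough ahead to cover the
 * miss latency, but not so far that the line is evicted before use. */
#define PF_DIST 16  /* elements ahead; a tuning assumption, not a rule */

double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PF_DIST < n)                       /* stay in bounds */
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 1);
#endif
        s += a[i];
    }
    return s;
}
```

On a non-GNU compiler the builtin simply disappears and the loop is
unchanged, which is also why timing tuned this way does not port across
microarchitectures.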

If one could put another (watered-down) set of fetch/decode stages
just ahead of the current position, it could be possible to use them as
a prefetcher. Less clear is how to do this "effectively".

One option would be to partly duplicate and delay the execute stages, say:
IF ID1 (V_ID2 V_EX1 V_EX2 V_EX3) ID2 EX1 EX2 EX3 WB

With the V_* stages mostly serving as "what if" stages, their results
not being saved to the register file, but any loads/stores resulting in
an L1 prefetch (stores are not written, and the L1 does not stall on a
miss).

But, this would pay some in terms of resource budget, and would increase
branch latency.

Though, the alternative is figuring out how to make the compiler insert
prefetches without needing this sort of trickery. Granted, this is
assuming that (ideally) the L1 can perform prefetches without them
resulting in a pipeline stall (which would mostly defeat the point).

One could also possibly try to work around some things if the compiler
were able to figure out when and where cache misses will occur (vs
current compilers which have no real way of knowing this).

This doesn't mean that in-order is basically dead in the water, but one
can't expect it to be at a performance parity with an OoO core within
similar constraints (issue width, number of cores, ...).

The question is, rather: "Could the cores one could fit in the same
transistor budget as an OoO core deliver better performance?"

Or, even if each core performs a little worse, perhaps one can fit
enough of them into the same area that performance relative to area
works out as a net win.

Thus far, it hasn't really come to this, mostly because the transistor
budget has been steadily increasing, but that will not necessarily hold
true once it stops.
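The area trade-off above can be put into a toy calculation (all numbers
invented for illustration):

```c
/* Toy throughput model: fill a fixed chip area with identical cores.
 * All area and performance numbers fed to it are invented examples. */
double chip_throughput(double chip_area, double core_area,
                       double core_perf)
{
    int cores = (int)(chip_area / core_area);  /* whole cores that fit */
    return cores * core_perf;                  /* aggregate throughput */
}
```

With made-up numbers: a 100-area-unit chip of 4-unit OoO cores at
relative performance 1.0 gives 25 units of aggregate throughput, while
1-unit in-order cores at 0.6 give 60 -- the per-area "net win" case,
but only on the assumption that the workload actually scales across
that many cores.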

It is like the issue of all the ridiculously inefficient coding
practices during the "MHz keeps getting bigger" era: single-threaded
performance hit a wall, and suddenly people found that writing
efficient code matters again.

We have had around a decade without much improvement in single-thread
performance.

If one is in a world more than a decade past the last "die shrink",
with no more die shrinks in the foreseeable future, one may start
thinking about things differently.

People will also have a lot more time to try to work out the "magic
compiler" issues, ...

>>
>>> Every time a program runs on an OoO machine, it wastes power to
>>> speculate about un-taken paths that can be avoided when source code
>>> compiles to a statically scheduled target instruction set.
>>
>> OoO is superior in both performance and in efficiency at competitive
>> performance for smartphones, desktops and servers, as evidenced by the
>> use of OoO cores as efficiency cores by Apple, Intel and AMD. The
>> only ones who still stick with in-order efficiency cores are ARM, but
>> their efficiency does not look great compared to their own efficient
>> OoO cores, and compared to Apple's efficiency cores.
>>
>> Concerning speculation, yes, it does waste power, but rarely, because
>> branch mispredictions are rare.
>>
>>> Once the speculation transistors are removed, a statically scheduled
>>> chip can replace them with additional FUs to get more done each clock
>>> cycle.
>> "Speculation transistors"? Ok, you can remove the branch predictor
>> and spend the transistors for an additional FU. The result for much
>> of the software will be that there will be done much less in each
>> cycle, because there are far fewer instructions ready for execution on
>> average without speculative execution, and therefore less work will be
>> done each cycle. If you also remove OoO execution, even fewer
>> instructions will be executed each cycle. So you have more FUs
>> available, but they will be idle.
>
> I've said statically scheduled needs VLIW which you've ignored by comparing equal issue width cores.
>
> Whether fewer instructions are executed per cycle is less important than the number of results produced by the FUs per cycle. VLIW lets more FUs be active per cycle. And VLIW doesn't necessarily see execution delays for NOP placeholders. OoO complexity goes up approximately O(n^2) with the number of FUs, right?
>
> I've agreed that dynamic branch prediction deserves a place in a statically scheduled architecture, so the correct instruction will be available just as soon as in an OoO. OoO execution unavoidably consumes power and chip area that static scheduling avoids.
>

Yeah.

I am also not thinking of "OoO core vs in-order core running same ISA",
but rather, OoO core vs a VLIW core at a similar transistor budget.

If the latter runs the same machine code as the former, it would likely
be via an emulation layer.

> But I'm repeating myself.
>
> Let's build a Mill and see what happens.
>
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sugurh$h1$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23520&group=comp.arch#23520

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 19:27:45 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sugurh$h1$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me>
<2022Feb15.104639@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 15 Feb 2022 19:27:45 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="545"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 15 Feb 2022 19:27 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

[...]

> In any case, I expect no problems when moving virtual
> machines from one microarchitecture to another, as long as the second
> microarchitecture supports the same instruction set extensions as the
> first one.

I remember reading that ARM has a problem with SVE when moving virtual
machines between microarchitectures with different vector lengths, in
effect restricting the vector length to the minimum, 128 bits.

Duh.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suguvo$es7$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23521&group=comp.arch#23521

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 13:29:58 -0600
Organization: A noiseless patient Spider
Lines: 51
Message-ID: <suguvo$es7$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.194310@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Feb 2022 19:30:00 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b15bfce4c7a8c2fd9d7d9df155af1db1";
logging-data="15239"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+96piGbz6OY1PYMMrTWUVI"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:r4WtQ/nSP9vuEr/dxpK8e5Vzpyc=
In-Reply-To: <2022Feb15.194310@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: BGB - Tue, 15 Feb 2022 19:29 UTC

On 2/15/2022 12:43 PM, Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> Concerning speculation, yes, it does waste power, but rarely, because
>>> branch mispredictions are rare.
>>
>> Of course in-order cores also speculate, so this is only
>> tangentially related. But as a side-note I'll point out that statically
>> scheduled processors encourage the compiler to move long-latency
>> instructions such as loads to "as soon as it's safe to do it" rather
>> than "as soon as we know we will need it", and that sometimes
>> requires adding yet more "compensation code" in other branches.
>
> Loads are a bad example, because you typically don't move them above a
> branch they control-depend on, because loads can trap. Instead one
> tends to use prefetches. IA-64 had a mechanism that allowed moving
> loads up, however. Non-trapping instructions such as multiplications
> can be moved up speculatively by the compiler. And because compiler
> branch prediction (~10% miss rate) is much worse than dynamic branch
> prediction (~1% miss rate, both numbers vary strongly with the
> application, so take them with a grain of salt), a static scheduling
> speculating compiler will tend to waste more energy for the same
> degree of speculation.
>

Whether to rely on predicted branches, or on the compiler (such as in a
modulo loop), would ideally need some way for the compiler to know
which branches are predictable.

The same goes for things like predicated instructions:
If a branch is highly predictable, it is better to branch.
If a branch is highly unpredictable, it may be better to predicate.

In a traditional compiler, one has no knowledge of the run-time state.

One could potentially have a compiler which incorporates profile
information from profile runs of the program (sometimes done).
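A minimal existing way to hand the compiler such branch-bias knowledge
by hand is GCC/Clang's nonstandard __builtin_expect hint; profile-guided
builds (gcc -fprofile-generate / -fprofile-use) automate the same
feedback. The function and its thresholds below are invented for
illustration:

```c
/* Conveying branch bias to the compiler by hand, via the GCC/Clang
 * extension __builtin_expect.  The parser below is a made-up example. */
#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

int parse_byte(int c)
{
    if (UNLIKELY(c < 0))    /* rare error path: keep it off the hot path */
        return -1;
    if (LIKELY(c < 128))    /* common ASCII case: lay out as fall-through */
        return c;
    return c & 0x7f;        /* strip the high bit otherwise */
}
```

The hint only tells the compiler which way to lay out the code; unlike
a dynamic predictor it cannot adapt when the bias assumption is wrong.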

Another option could be if the compiler were integrated with an
emulator, and so could "dummy profile" a lot of the code it is running.
Though, this would also require a bit of cooperation from the programmer
to make this work (such as some way to get mock-up data to use to
emulate/benchmark various parts of the program).

Say for example, for code for encoding and decoding JPEGs, one would
need a "runs at build time" function which proceeds to load and decode a
bunch of JPEGs, then re-encodes them with various options, ... But, this
is not emitted in the final program (and serves mostly to give the
compiler some idea how the code is likely to behave at run-time).

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<c63ae120-0478-4530-80f5-c045d2edd5c8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23522&group=comp.arch#23522

X-Received: by 2002:a05:6214:5181:: with SMTP id kl1mr608063qvb.26.1644954893274;
Tue, 15 Feb 2022 11:54:53 -0800 (PST)
X-Received: by 2002:a05:6808:2003:: with SMTP id q3mr170635oiw.133.1644954892985;
Tue, 15 Feb 2022 11:54:52 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 15 Feb 2022 11:54:52 -0800 (PST)
In-Reply-To: <sugu4m$9au$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=162.229.185.59; posting-account=Gm3E_woAAACkDRJFCvfChVjhgA24PTsb
NNTP-Posting-Host: 162.229.185.59
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me>
<suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c63ae120-0478-4530-80f5-c045d2edd5c8n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: yogaman...@yahoo.com (Scott Smader)
Injection-Date: Tue, 15 Feb 2022 19:54:53 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 76
 by: Scott Smader - Tue, 15 Feb 2022 19:54 UTC

On Tuesday, February 15, 2022 at 11:15:38 AM UTC-8, BGB wrote:

<snip>

> Also the thing I was getting at:
> Assume a future which is transistor-budget limited.
>
> Unless the OoO cores deliver superior performance relative to their
> transistor budget, they have a problem in this case.
>
> OoO has tended to be faster, but at the cost of a higher transistor budget.

<snip>

> This doesn't mean that in-order is basically dead in the water, but one
> can't expect it to be at a performance parity with an OoO core within
> similar constraints (issue width, number of cores, ...).
>
> But, rather, "Could the core one could fit in the same transistor budget
> as an OoO core deliver better performance?".
>
> Or, even if each core performs a little worse, but you can fit enough of
> them into the same area that performance relative to area works out as a
> net-win.
>
> Thus far, it hasn't really come to this mostly because the transistor
> budget has been steadily increasing, but will not necessarily hold true
> once it stops.
>
> It is like the issue of all the ridiculously inefficient coding
> practices during the "MHz keeps getting bigger" era, and then single
> threaded performance hits a wall, and suddenly people are finding that
> writing efficient code matters again.
>
> We have had around a decade without much improvement in single-thread
> performance.
>
>
> If one is in a world where it has been over a decade past the last "die
> shrink", with no more "die shrinks" in the foreseeable future, they may
> start thinking about things differently.
>
> People will also have a lot more time to try to work out the "magic
> compiler" issues, ...

> I am also not thinking of "OoO core vs in-order core running same ISA",
> but rather, OoO core vs a VLIW core at a similar transistor budget.
>
> If the latter runs the same machine code as the former, it would likely
> be via an emulation layer.

Right. If you've built as large a DBP window as gives significant BP improvement, and if your FU issue width per instruction is as wide as you can usefully make it, then if you still have idle FUs across too many of your target applications, it's time to prune back a little and use the freed silicon area for the next core.

For the actual OoO v. Statically Scheduled Super Bowl, we can argue about what architecture the emulated ISA should assume, too.

> > Let's build a Mill and see what happens.
> >
> >> - anton
> >> --
> >> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh34c$3pd$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23526&group=comp.arch#23526

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 20:40:44 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suh34c$3pd$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.194310@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 15 Feb 2022 20:40:44 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="3885"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 15 Feb 2022 20:40 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> And because compiler
> branch prediction (~10% miss rate)

That seems optimistic.

>is much worse than dynamic branch
> prediction (~1% miss rate, both numbers vary strongly with the
> application, so take them with a grain of salt),

What is the branch miss rate on a binary search, or a sort?
Should be close to 50%, correct?
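One can check that figure empirically. A small sketch (my own, with
arbitrary table size, trial count, and seed) counts how often the
"go right" comparison in a binary search is taken over random lookups:

```c
#include <stdlib.h>

/* Count branch outcomes of the 'a[mid] < key' comparison in a plain
 * binary search, to estimate how predictable it is. */
typedef struct { long taken, total; } branch_stats;

static int search(const int *a, int n, int key, branch_stats *bs)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        bs->total++;
        if (a[mid] < key) { bs->taken++; lo = mid + 1; }
        else if (a[mid] > key) hi = mid - 1;
        else return mid;
    }
    return -1;
}

/* Run 'trials' random lookups over a sorted table of n even keys and
 * return the fraction of iterations in which the '<' branch was taken. */
double taken_ratio(int n, int trials, unsigned seed)
{
    branch_stats bs = {0, 0};
    int *a = malloc(n * sizeof *a);
    for (int i = 0; i < n; i++) a[i] = 2 * i;   /* sorted table */
    srand(seed);
    for (int t = 0; t < trials; t++)
        search(a, n, rand() % (2 * n), &bs);
    free(a);
    return (double)bs.taken / (double)bs.total;
}
```

For uniformly random keys the ratio lands near 50%, i.e. essentially
unpredictable by any predictor, supporting the point above; comparison
branches in sorts behave similarly on random input.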

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh3bq$3pd$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23527&group=comp.arch#23527

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 20:44:42 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suh3bq$3pd$2@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<RsPOJ.2790$3Pje.2504@fx09.iad>
Injection-Date: Tue, 15 Feb 2022 20:44:42 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="3885"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 15 Feb 2022 20:44 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

> The gate voltage affects leakage but also delay:
> lower voltage lowers leakage but increases delay,
> and gate delay interacts with stage delay, which interacts with frequency.
> If the frequency of the mem stage is faster than cache then it can wind up
> wasting a partial extra cycle to round the access up to a whole cycle,
> which wastes power by leaking while sitting idle.
>
> There is probably some great whacking differential equation
> that tells us where the optimum of all the parameters lies.

Unfortunately, designing a CPU involves both integer variables (number
of gates) and real variables (lengths of wires, diameters, transistor
sizes), and mixed optimization problems are generally much harder
than purely continuous ones.

Oh, and just mix in a few stochastic variables as well, for all
the tolerances :-)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh3kq$3pd$3@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23529&group=comp.arch#23529

Path: i2pn2.org!i2pn.org!news.swapon.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 20:49:30 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suh3kq$3pd$3@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me>
<2022Feb15.104639@mips.complang.tuwien.ac.at> <sug7bd$b9l$1@dont-email.me>
Injection-Date: Tue, 15 Feb 2022 20:49:30 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="3885"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 15 Feb 2022 20:49 UTC

Ivan Godard <ivan@millcomputing.com> schrieb:
> On 2/15/2022 1:46 AM, Anton Ertl wrote:
>> Ivan Godard <ivan@millcomputing.com> writes:
>>> The
>>> specializer does quite extensive optimization - bundle-packing,
>>> scheduling, CFG collapse, software pipelining, in- and out-lining, etc.
>>> - which is too expensive for a JIT, but the passes that do the
>>> optimizations are optionally bypassed.
>>
>> By contrast, an OoO microarchitecture does not care whether the native
>> code has been generated by a JIT compiler or not, it will
>> branch-predict and schedule the instructions all the same.
>
> So will a Mill.
>
> The specializer passes that a JIT would skip are target independent, and
> give better code for any target: OOO, IO, or sideways.

If it is target independent, why is it in the specializer?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<7fbd38c5-55a7-4417-b833-8778017a6da0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23530&group=comp.arch#23530

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 13:07:54 -0800 (PST)
 by: MitchAlsup - Tue, 15 Feb 2022 21:07 UTC

On Tuesday, February 15, 2022 at 1:15:38 PM UTC-6, BGB wrote:
> On 2/15/2022 10:21 AM, Scott Smader wrote:
> > On Tuesday, February 15, 2022 at 3:47:39 AM UTC-8, Anton Ertl wrote:
> >> Scott Smader <yogam...@yahoo.com> writes:
> >>> On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
> >>>> Why? If OoO is superior now at the current transistor budget, why
> >>>> would it stop being superior if the transistor budget does not
> >>>> increase?
> >>>
> >>> Not sure if the assumption of OoO superiority is justified at
> >>> current transistor budget,
> >>
> >> Definitely. Here are two benchmarks, all numbers are times in
> >> seconds:
> >>
> >> LaTeX:
> >>
> >> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
> >> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
> >> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
> >> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
> >>
> >> Gforth:
> >>
> >> sieve bubble matrix fib fft
> >> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
> >> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
> >> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
> >> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
> >>
> >> All these cores are 2-wide. The Atom 330 is in-order, the E-450 is
> >> OoO (both AMD64); the Cortex-A53 is in-order, the Cortex-A73 is OoO
> >> (both ARM A64).
> >>
> >
> > Fine results. Thank you. But at least the A53/A73 numbers don't prove your claim. The A73 cores are twice as big as the A53 cores, according to https://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled/3. So a 2x performance improvement indicates equal efficiency per unit area, not superior.
> >
> Also the thing I was getting at:
> Assume a future which is transistor-budget limited.
>
> Unless the OoO cores deliver superior performance relative to their
> transistor budget, they have a problem in this case.
>
> OoO has tended to be faster, but at the cost of a higher transistor budget.
<
Late in my AMD career, I did a study design of a 1-wide in order x86-64
to see how much the OoO-ness was costing.
<
We discussed this back when Nick McLaren was still posting.
<
The bottom line is that the GBOoO design is 12× bigger than the 1-wide
I-O design (including L1 cache and predictors but excluding L2 cache).
The GBOoO core ran 2× faster than the LBIO core from the same L2 outwards.
The LBIO cores also operated at 1/12 the power--logic power scales with
area, [SD]RAM and pin power scales with utilization.
<
Until SW figures out how to utilize "craploads" of cores, the market will
continue to demand GBOoO--------it really is that simple.
<
> >> Intel switched from in-order to OoO for its little/efficiency cores
> >> with Silvermont, Apple uses OoO cores for their efficiency cores as
> >> well as their performance cores. Intel even switched from in-order to
> >> OoO for the Xeon Phi, which targets HPC, the area where in-order is
> >> strongest.
> >>
> >>> but for the same transistor budget at the same clock rate,
> >>> statically scheduled should win because it more efficiently uses
> >>> the available transistors to do useful work.
> >> That's the fallacy that caused Intel and HP to waste billions on
> >> IA-64, and Transmeta investors to invest $969M, most of which was
> >> lost.
> >
> > Agreed that Transmeta and IA-64 failed. Disagree that two failures prove static scheduling's efficiency to be a fallacy.
> >
> Likewise.
>
> Though, one can't ignore one big drawback of in-order designs:
> They do not deal well with cache misses.
<
But cores are enough smaller, that for equal die size, you can double the biggest cache.
>
> One can use prefetching to good effect, but this may require involvement
> from the programmer, and C doesn't really have a "good" way to express a
> prefetch operation.
<
Still have not seen a prefetcher that performs well on linked-list code...
>
> Also, the prefetch has to be timed correctly as well, far enough back to
> absorb the cache miss, but recent enough that the prefetched data
> doesn't end up evicted in the time between the prefetch and the time the
> data is used.
>
>
> If one could potentially put another (watered down) set of fetch/decode
> stages just ahead of the current position, it could be possible to use
> them as a prefetcher. Less clear is how to do this "effectively".
>
> One option would be to partly duplicate and delay the execute stages, say:
> IF ID1 (V_ID2 V_EX1 V_EX2 V_EX3) ID2 EX1 EX2 EX3 WB
>
> With the V_* stages mostly serving as "what if" stages, their results
> not being saved to the register file, but any loads/stores resulting in
> an L1 prefetch (stores are not written, and the L1 does not stall on a
> miss).
>
>
> But, this would pay some in terms of resource budget, and would increase
> branch latency.
>
> Though, the alternative is figuring out how to make the compiler insert
> prefetches without needing this sort of trickery. Granted, this is
> assuming that (ideally) the L1 can perform prefetches without them
> resulting in a pipeline stall (which would mostly defeat the point).
>
> One could also possibly try to work around some things if the compiler
> were able to figure out when and where cache misses will occur (vs
> current compilers which have no real way of knowing this).
>
>
>
> This doesn't mean that in-order is basically dead in the water, but one
> can't expect it to be at a performance parity with an OoO core within
> similar constraints (issue width, number of cores, ...).
<
BTW the intent of my study was to make an x86-64 so small that we could
put a core on each HT far end and handle MMIO accesses in a few nanoseconds
rather than hundreds of nanoseconds. Such an x86-64 would be vanilla verilog
compiled logic--but it would run the same binary as the OS. Interrupts were
handled topologically adjacent to the device. Like a channel but using the same ISA.
>
> But, rather, "Could the core one could fit in the same transistor budget
> as an OoO core deliver better performance?".
>
> Or, even if each core performs a little worse, but you can fit enough of
> them into the same area that performance relative to area works out as a
> net-win.
<
12× is the number you want. 2.6× is the number SW can use.
>
> Thus far, it hasn't really come to this mostly because the transistor
> budget has been steadily increasing, but will not necessarily hold true
> once it stops.
>
It gets hard, at a fundamental level, once gate oxide becomes 1 atom.
>
> It is like the issue of all the ridiculously inefficient coding
> practices during the "MHz keeps getting bigger" era, and then single
> threaded performance hits a wall, and suddenly people are finding that
> writing efficient code matters again.
>
> We have had around a decade without much improvement in single-thread
> performance.
>
IPC-wise: yes; frequency-wise we are growing at 1/3rd the rate we were
growing at in the 1990s.
>
> If one is in a world where it has been over a decade past the last "die
> shrink", with no more "die shrinks" in the foreseeable future, they may
> start thinking about things differently.
>
> People will also have a lot more time to try to work out the "magic
> compiler" issues, ...
> >>
> >>> Every time a program runs on an OoO machine, it wastes power to
> >>> speculate about un-taken paths that can be avoided when source code
> >>> compiles to a statically scheduled target instruction set.
> >>
> >> OoO is superior in both performance and in efficiency at competitive
> >> performance for smartphones, desktops and servers, as evidenced by the
> >> use of OoO cores as efficiency cores by Apple, Intel and AMD. The
> >> only ones who still stick with in-order efficiency cores are ARM, but
> >> their efficiency does not look great compared to their own efficient
> >> OoO cores, and compared to Apple's efficiency cores.
> >>
> >> Concerning speculation, yes, it does waste power, but rarely, because
> >> branch mispredictions are rare.
> >>
> >>> Once the speculation transistors are removed, a statically scheduled
> >>> chip can replace them with additional FUs to get more done each clock cycle.
> >> "Speculation transistors"? Ok, you can remove the branch predictor
> >> and spend the transistors for an additional FU. The result for much
> >> of the software will be that there will be done much less in each
> >> cycle, because there are far fewer instructions ready for execution on
> >> average without speculative execution, and therefore less work will be
> >> done each cycle. If you also remove OoO execution, even fewer
> >> instructions will be executed each cycle. So you have more FUs
> >> available, but they will be idle.
> >
> > I've said statically scheduled needs VLIW which you've ignored by comparing equal issue width cores.
> >
> > Whether fewer instructions are executed per cycle is less important than the number of results produced by the FUs per cycle. VLIW lets more FUs be active per cycle. And VLIW doesn't necessarily see execution delays for NOP placeholders. OoO complexity goes up approximately O(n^2) with the number of FUs, right?
> >
> > I've agreed that dynamic branch prediction deserves a place in a statically scheduled architecture, so the correct instruction will be available just as soon as in an OoO. OoO execution unavoidably consumes power and chip area that static scheduling avoids.
> >
> Yeah.
>
> I am also not thinking of "OoO core vs in-order core running same ISA",
> but rather, OoO core vs a VLIW core at a similar transistor budget.
>
> If the latter runs the same machine code as the former, it would likely
> be via an emulation layer.
> > But I'm repeating myself.
> >
> > Let's build a Mill and see what happens.
> >
> >> - anton
> >> --
> >> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> >> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>


Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh5d3$po5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23532&group=comp.arch#23532

Newsgroups: comp.arch
From: iva...@millcomputing.com (Ivan Godard)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 13:19:29 -0800
Organization: A noiseless patient Spider
 by: Ivan Godard - Tue, 15 Feb 2022 21:19 UTC

On 2/15/2022 12:49 PM, Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>> On 2/15/2022 1:46 AM, Anton Ertl wrote:
>>> Ivan Godard <ivan@millcomputing.com> writes:
>>>> The
>>>> specializer does quite extensive optimization - bundle-packing,
>>>> scheduling, CFG collapse, software pipelining, in- and out-lining, etc.
>>>> - which is too expensive for a JIT, but the passes that do the
>>>> optimizations are optionally bypassed.
>>>
>>> By contrast, an OoO microarchitecture does not care whether the native
>>> code has been generated by a JIT compiler or not, it will
>>> branch-predict and schedule the instructions all the same.
>>
>> So will a Mill.
>>
>> The specializer passes that a JIT would skip are target independent, and
>> give better code for any target: OOO, IO, or sideways.
>
> If it is target independent, why is it in the specializer?

The specializer is the back-end of the compiler, and does back-end stuff
in addition to member-targeting stuff. The functionality could be split,
but that would require a communication format across the split.

Actually there are three kinds of work done: family independent (done in
both Mill and x86, like inlining); member independent (done in all Mill
members but not other ISAs, like ganging op/compare instructions, or
injecting pseudo-ops for load retires), and member dependent (done
uniquely for each member, like bit stuffing into the binary).

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh6sg$37r$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23534&group=comp.arch#23534

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 15:44:46 -0600
Organization: A noiseless patient Spider
 by: BGB - Tue, 15 Feb 2022 21:44 UTC

On 2/15/2022 5:28 AM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>
>> L1s are stable because if they get bigger than 64 KB they take more than
>> one cycle to "access", where SRAM and wire routing are both considered.
>
> One cycle latency or throughput?

Probably latency.

It could be argued that, since I have an FPGA, it wouldn't be hard to
throw 64K L1's at the problem and then skip the L2.

However, timing doesn't really like big L1s...

For the L2, one may need to do things like:
tBlkDataA1 <= tBlkArrDataA[tReqIdx];
tBlkAddrA1 <= tBlkArrAddrA[tReqIdx];
tBlkDataB1 <= tBlkArrDataB[tReqIdx];
tBlkAddrB1 <= tBlkArrAddrB[tReqIdx];
//forward to next cycle
tBlkDataA <= tBlkDataA1;
tBlkAddrA <= tBlkAddrA1;
tBlkDataB <= tBlkDataB1;
tBlkAddrB <= tBlkAddrB1;
//the above represent what we check.

This would not be so ideal for L1.

For L2, I can have an ~ 4 cycle latency between the request arriving on
the ring-bus, and the time the request exits the L2 via the ring-bus (we
need to act on whether or not it was a hit or miss before it leaves).
The latency could be made longer if needed.

Multiple requests can arrive and be handled one-after-another if there
is no miss (with any missed L2 requests needing to circle around the bus
until they can be handled).

For L1, I have EX1/EX2/EX3. It gets the address in EX1, needs to do its
work in EX2, and have a result by EX3. One can't really stick an extra
clock cycle in there.

For L1's, I can get away with 16K or 32K at 50MHz. These fit OK in
BRAMs, and pass timing. They also have upwards of a 95% hit rate.

More so, much past 16K or 32K, the L1 hit-rate seems to mostly hit a
plateau (hits ~ 95 or 96% and then seemingly stops improving).
Increasing associativity helps slightly here, but only by ~ 1.5%.

Seems to work out better in this case to throw any additional memory at
the L2, which sees more benefit from the larger size (also L2 seems to
benefit more from associative caching when using the ringbus design,
hence the 2-way L2).

The L2 can offer around a 70% or so chance of hitting, which is a lot
better than needing to go to DRAM.

However, timing is a lot harder at 100MHz, so one may need to drop the
L1's to 2K or similar (allowing the use of LUTRAM).

However, if one has an L1 that only hits around 60% of the time or
similar, performance is not happy. Similarly, associative caching does
not fix this issue, so a 2K L1 cache still sucks.

It may work out faster overall to run at half the clock speed, but with
a better L1 cache, and also a wider execute pipeline (3-wide vs 1-wide),
even if the code is mostly limited to ~ 1.5 lanes or so on-average
(average case ILP being relatively weak).

The actual LUT cost difference between a 1-wide and 3-wide core is
smaller than one might expect, mostly because non-duplicated features
(L1 caches, FPU, TLB, ...) tend to eat up the majority of the budget
(the extra decoders, ALUs, ... don't effect cost all that much).

There is a cost due to the larger number of ports in the register file,
but this cost is shared by the ability to use it for sake of things like
SIMD operations and 128-bit Load/Store, ... which might not be possible
otherwise.

However, 3 wide seems to be the local optima. Going much wider than 3,
and "the crap hits the fan"...

So, partly in effect, it is 3-wide for 64-bit operations, but 1-wide for
128-bit operations.

The extra width is also being used for things like being able to load a
64-bit immediate within a single clock cycle, ... (Internally, the
immediate is spread across multiple lanes and then glued together when
the value is fetched, with the immediate fields in the pipeline not
actually being wide enough to pass the value directly).

A similar sort of issue is why I am still (mostly) stuck with a
double-precision FPU (64 to 96 bits is a bigger cost jump than it may
seem on the surface). I can enable it experimentally, but then usually
end up disabling it again due to having to keep fighting with timing
and similar over it. And, at present, it doesn't offer much advantage
over software-emulated Binary128, which can be helped out some by the
(relatively cheaper) 128-bit ALUX instructions.

....

One could almost make a strong case for doing a 32-bit machine and being
able to run it at twice the clock-speed, but then one still has the L1
cache size issue. Code written to make effective use of a 64-bit machine
also benefits more from the 64b nature, than optimizing for (or being
limited to) a 32-bit machine.

I have also noted that "better numbers in Dhrystone but worse at pretty
much everything else" isn't really a win...

Despite it also seeming like Doom would prefer a high-clocked 32-bit
machine, Doom's performance tends to be more affected by the performance
of the memory subsystem. On a 50MHz core with a 100% L2 hit rate, it
would also run pegged at the 32 fps frame-rate limiter.

Doom also remains surprisingly playable at 16 or 25 MHz, if one assumes
that DRAM access remains fast.

But, one other thing Doom does also notice with a fairly obvious effect,
is a high L1 miss rate (this having a bigger impact on performance than
the actual clock speed).

....

Re: instruction set binding time, was Encoding 20 and 40 bit

<2022Feb15.230425@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23536&group=comp.arch#23536

Newsgroups: comp.arch
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Tue, 15 Feb 2022 22:04:25 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Tue, 15 Feb 2022 22:04 UTC

jgd@cix.co.uk (John Dallman) writes:
>Itanium had fixed-size bundles that sometimes needed to be padded with
>no-ops, and stop bits for indicating inter-bundle dependencies.

On IA-64, bundles are a matter of encoding. Groups (separated by stop
bits) comprise a set of instructions that can be executed in parallel
as far as register dependencies are concerned; you can have memory
dependencies within a group, and I don't know how the hardware handles
that.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit

<suh8bp$ce8$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23537&group=comp.arch#23537

Newsgroups: comp.arch
From: iva...@millcomputing.com (Ivan Godard)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Tue, 15 Feb 2022 14:10:01 -0800
Organization: A noiseless patient Spider
 by: Ivan Godard - Tue, 15 Feb 2022 22:10 UTC

On 2/15/2022 8:59 AM, John Dallman wrote:
>> Itanium was not a VLIW, it was an EPIC architecture; granted, both
>> were wide issue, which is commonly confused with VLIW.
>>
>> The Mill is not a VLIW (or EPIC) either, although it is closer than
>> the Itanium.
>
> Can you enlarge on the differences? The basic concept of presenting
> several instructions to the processor at once, with the compiler ensuring
> that there are no data hazards between those instructions, seems to be
> much the same.
>
> Itanium had fixed-size bundles that sometimes needed to be padded with
> no-ops, and stop bits for indicating inter-bundle dependencies. I know
> Mill has variable-size bundles. What else is different in this aspect?
>
>> You probably have a VLIW in your pocket: Qualcomm Hexagon.
>
> On the table, but yes. Not that I use it very hard.
>
> John

These differ in which hazards are shifted from the hardware to the
compiler when dealing with a wide bundle.

EPIC hardware assumed that there are no intra-bundle hazards, so the
issue queue and its associated hazard check can be omitted. However,
there were no such guarantees for inter-bundle hazards, so either the
entire bundle had to run to completion, or, in later Itaniums, the
hardware did OOO-like retire hazard checking to deal with varying
instruction latency.

A classic VLIW assumes that there are neither issue nor retire hazards
in the schedule, and hence the latency of everything is statically
known. Neither issue nor retire hardware checking is needed. These days
the fixed latency requirement restricts VLIWs to applications which do
not present variable instruction latencies: DSPs, mini-engines like
crypto blocks, and the like. Usually there is no latency variability
physically possible, but if the core is connected to external devices
with unpredictable latency (such as DRAM) the hardware stalls if
external data is not available.

Mill is VLIW-like for all instructions but loads: everything has static
fixed latency, and neither issue nor retire hazards need hardware hazard
checking or scheduling queues. Like a VLIW, it does not check for or
take advantage of early-out such as multiply-by-one. For loads, which
are the only source of significant latency variability, Mill splits the
issue and retire into two different instructions which are independently
static scheduled (patented). A load-retire instruction stalls the core
if it has no data yet.

The distinction among these categories is not a matter of encoding - all
are wide issue, and each has its own scheme. The Itanium used fixed size
bundles; classic VLIW uses variable-size bundles; and Mill uses a
bundle-of-six-bundles encoding. However, any of the categories could use
any of the encoding schemes. It's a matter of what is done by the
compiler vs. hardware, not a matter of the bitsy representation.

This makes the Mill suitable for general-purpose work that would cause a
classic VLIW to spend too much time in stall, yet still take advantage
of any schedule variability to do other work while (possibly) waiting
for external data, and without needing OOO retire hazard hardware.

The gain from Mill's split load is limited by the maximal gap (in time)
between the load-issue and load-retire instructions, which is in turn
determined by how much work is available to do that neither depends on
the load result nor is depended on by the load issue. That's easily
determined by conventional dataflow analysis in the compiler, and is
exactly the same as the amount of work that an OOO can do while waiting
for a load to retire.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh8l5$7e8$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23538&group=comp.arch#23538

Newsgroups: comp.arch
From: tkoe...@netcologne.de (Thomas Koenig)
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Tue, 15 Feb 2022 22:15:01 -0000 (UTC)
Organization: news.netcologne.de
 by: Thomas Koenig - Tue, 15 Feb 2022 22:15 UTC

Ivan Godard <ivan@millcomputing.com> schrieb:
> On 2/15/2022 12:49 PM, Thomas Koenig wrote:
>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>> On 2/15/2022 1:46 AM, Anton Ertl wrote:
>>>> Ivan Godard <ivan@millcomputing.com> writes:
>>>>> The
>>>>> specializer does quite extensive optimization - bundle-packing,
>>>>> scheduling, CFG collapse, software pipelining, in- and out-lining, etc.
>>>>> - which is too expensive for a JIT, but the passes that do the
>>>>> optimizations are optionally bypassed.
>>>>
>>>> By contrast, an OoO microarchitecture does not care whether the native
>>>> code has been generated by a JIT compiler or not, it will
>>>> branch-predict and schedule the instructions all the same.
>>>
>>> So will a Mill.
>>>
>>> The specializer passes that a JIT would skip are target independent, and
>>> give better code for any target: OOO, IO, or sideways.
>>
>> If it is target independent, why is it in the specializer?
>
> The specializer is the back-end of the compiler, and does back-end stuff
> in addition to member-targeting stuff. The functionality could be split,
> but that would require a communication format across the split.

That really depends on how you defined your model-independent
language, GenAsm. I would have expected this to be defined
in such a way that the specializer really only had to do the
target-dependent stuff, so most of the heavy lifting is done in
the normal compilation step.

You must have had your reasons, I just don't understand them.

>
> Actually there are three kinds of work done: family independent (done in
> both Mill and x86, like inlining);

In the specializer? That sounds a lot like what a compiler middle end
should do (maybe guided by some information about the back end,
like the number of registers - a thorny issue).

>member independent (done in all Mill
> members but not other ISAs,

That sounds like the task of a more or less traditional back
end to me (like the nvptx "back end", which also generates
intermediate code).

>like ganging op/compare instructions, or
> injecting pseudo-ops for load retires), and member dependent (done
> uniquely for each member, like bit stuffing into the binary).

This is the part that I would probably like to keep as small
as possible, if it were my project, which it isn't :-)

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh8vh$g7r$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23539&group=comp.arch#23539

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 16:20:32 -0600
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <suh8vh$g7r$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Feb 2022 22:20:33 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="b15bfce4c7a8c2fd9d7d9df155af1db1";
logging-data="16635"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19JYA+0S+WbTtikszSe1SKS"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:JYokxt70R6EhBNjGpb9e8FGORw8=
In-Reply-To: <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: BGB - Tue, 15 Feb 2022 22:20 UTC

On 2/15/2022 10:58 AM, Stefan Monnier wrote:
>> Thank you. That makes perfect sense. BTW, does that present an opportunity
>> for some sort of profile driven optimization where the run time branch
>> history is fed back to a future compilation for better optimization?
>> Probably not as useful for an OoO machine.
>
> Profile-driven optimization is used, of course, but the problem remains:
> the compiler needs to generate code that works for all possible
> situations and it can't freely duplicate code all over the place.
> A bit of code duplication (to specialize a code path to a few different
> scenarios) can be done, but only within fairly strict limits otherwise
> code size will explode and performance goes down the drain again.
>

But, with a profiler, you can know *where* it likely matters, so it
could be done without exploding the code size. The compiler would
realize a rarely-used path is rarely used, and thus not bother unrolling
or modulo-scheduling its loops, ...

Part of the issue is pulling off profile-driven optimization in a way
that is not so annoying that programmers won't bother with it.

Most ideas I can imagine ATM would likely require running an emulator
inside the compiler with some amount of programmer-supplied test data.

Say:
[[auto_profile_hint]] void SomeBenchmarkFunction()
{ ... run some simulated benchmarks ...
}

Compiler sees the attribute, and tries running the function in an
emulator to see what it does and where its "hot-spots" are located.

> In contrast, an OoO is free to use a different schedule each time
> a chunk of code is run (and similarly the branch predictor is free to
> provide wildly different predictions each time that chunk of code is
> run) without any downside.
>
> The OoO works on the trace of the actual execution, where the main limit
> is the size of the window it can consider (linked to the accuracy of the
> branch predictor), whereas the compiler is not limited to such a window
> but instead it's limited to work on the non-unrolled code.
>

While this is true, I suspect it is a case of "best" vs "good enough".
If one can get the in-order scheduling to a "good enough" stage, it may
be possible to work around this deficiency by being able to throw
(slightly) more cores at the problem.

Though, I will note that I am assuming that core counts remain within
the limits of plausible memory bandwidth, but consider this to be a weak
argument because, if additional cores would be limited by bandwidth, a
single faster core would also be limited by bandwidth.

One is more limited by how effectively one can parallelize the codebase,
but if one is assuming an extended timeframe where no other
hardware-level performance improvements are viable, programmers will
"make it work" (IOW: if one assumes that the only other option is
multiple decades of stagnation).

Also I can note that some people have done things like sticking a whole
bunch of RV32I cores or similar onto an FPGA. I don't exactly consider
this sort of thing to be a viable approach either; it is not really
what I am talking about here...

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh97n$hng$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23540&group=comp.arch#23540

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 14:24:54 -0800
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <suh97n$hng$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.194310@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Feb 2022 22:24:55 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="93b1554788c317f6fd2ff3a26e7468cb";
logging-data="18160"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18SEVLDRqLPK0tffkIwp5m+"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:CUmyS4T7xmMwshpxiyxD/OQm/PU=
In-Reply-To: <2022Feb15.194310@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Ivan Godard - Tue, 15 Feb 2022 22:24 UTC

On 2/15/2022 10:43 AM, Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> Concerning speculation, yes, it does waste power, but rarely, because
>>> branch mispredictions are rare.
>>
>> Of course in-order cores also speculate, so this is only
>> tangentially related. But as a side-note I'll point out that statically
>> scheduled processors encourage the compiler to move long-latency
>> instructions such as loads to "as soon as it's safe to do it" rather
>> than "as soon as we know we will need it", and that sometimes
>> requires adding yet more "compensation code" in other branches.
>
> Loads are a bad example, because you typically don't move them above a
> branch they control-depend on, because loads can trap. Instead one
> tends to use prefetches. IA-64 had a mechanism that allowed moving
> loads up, however. Non-trapping instructions such as multiplications
> can be moved up speculatively by the compiler. And because compiler
> branch prediction (~10% miss rate) is much worse than dynamic branch
> prediction (~1% miss rate, both numbers vary strongly with the
> application, so take them with a grain of salt), a static scheduling
> speculating compiler will tend to waste more energy for the same
> degree of speculation.

Not in modern processes, or so I'm told by the hardware guys. Leakage is
of the same order of cost as execution these days, so an idle ALU might
as well do something potentially useful. Consequently it is worthwhile
for the compiler to if-convert everything until it runs out of FUs.

Incidentally, in-order != static-prediction. No in-order core much above
a Z80 will use static branch prediction, for the reason you give. Well,
no competent core, anyway.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suh9lp$ka3$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23541&group=comp.arch#23541

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 14:32:24 -0800
Organization: A noiseless patient Spider
Lines: 113
Message-ID: <suh9lp$ka3$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 15 Feb 2022 22:32:25 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="93b1554788c317f6fd2ff3a26e7468cb";
logging-data="20803"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18oYdVb3AOBIhowk1g5Z6kT"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:T6EHVDg2n1XEorDAqhn6F3IfrIg=
In-Reply-To: <sugu4m$9au$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Tue, 15 Feb 2022 22:32 UTC

On 2/15/2022 11:15 AM, BGB wrote:
> On 2/15/2022 10:21 AM, Scott Smader wrote:
>> On Tuesday, February 15, 2022 at 3:47:39 AM UTC-8, Anton Ertl wrote:
>>> Scott Smader <yogam...@yahoo.com> writes:
>>>> On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
>>>>> Why? If OoO is superior now at the current transistor budget, why
>>>>> would it stop being superior if the transistor budget does not
>>>>> increase?
>>>>
>>>> Not sure if the assumption of OoO superiority is justified at
>>>> current transistor budget,
>>>
>>> Definitely. Here are two benchmarks, all numbers are times in
>>> seconds:
>>>
>>> LaTeX:
>>>
>>> - Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Debian 9 64bit 2.368
>>> - AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit 1.216
>>> - Odroid N2 (1896MHz Cortex A53) Ubuntu 18.04 2.488
>>> - Odroid N2 (1800MHz Cortex A73) Ubuntu 18.04 1.224
>>>
>>> Gforth:
>>>
>>> sieve bubble matrix fib fft
>>> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz; gcc-4.9
>>> 0.321 0.479 0.219 0.594 0.229 AMD E-350 1.6GHz; gcc version 4.7.1
>>> 0.350 0.390 0.240 0.470 0.280 Odroid C2 (1536MHz Cortex-A53), gcc-6.3.0
>>> 0.180 0.224 0.108 0.208 0.100 Odroid N2 (1800MHz Cortex-A73), gcc-6.3.0
>>>
>>> All these cores are 2-wide. The Atom 330 is in-order, the E-450 is
>>> OoO (both AMD64); the Cortex-A53 is in-order, the Cortex-A73 is OoO
>>> (both ARM A64).
>>>
>>
>> Fine results. Thank you. But at least the A53/A73 numbers don't prove
>> your claim. The A73 cores are twice as big as the A53 cores, according
>> to
>> https://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled/3. So
>> a 2x performance improvement indicates equal efficiency per unit area,
>> not superior.
>>
>
> Also the thing I was getting at:
>   Assume a future which is transistor-budget limited.
>
> Unless the OoO cores deliver superior performance relative to their
> transistor budget, they have a problem in this case.
>
> OoO has tended to be faster, but at the cost of a higher transistor budget.
>
>
>>> Intel switched from in-order to OoO for its little/efficiency cores
>>> with Silvermont, Apple uses OoO cores for their efficiency cores as
>>> well as their performance cores. Intel even switched from in-order to
>>> OoO for the Xeon Phi, which targets HPC, the area where in-order is
>>> strongest.
>>>
>>>> but for the same transistor budget at the same clock rate,
>>>> statically scheduled should win because it more efficiently uses the
>>>> available transistors to do useful work.
>>> That's the fallacy that caused Intel and HP to waste billions on
>>> IA-64, and Transmeta investors to invest $969M, most of which was
>>> lost.
>>
>> Agreed that Transmeta and IA-64 failed. Disagree that two failures
>> prove static scheduling's efficiency to be a fallacy.
>>
>
> Likewise.
>
> Though, one can't ignore one big drawback of in-order designs:
>   They do not deal well with cache misses.
>
> One can use prefetching to good effect, but this may require involvement
> from the programmer, and C doesn't really have a "good" way to express a
> prefetch operation.
>
> Also, the prefetch has to be timed correctly as well, far enough back to
> absorb the cache miss, but recent enough that the prefetched data
> doesn't end up evicted in the time between the prefetch and the time the
> data is used.
>
>
> If one could potentially put another (watered down) set of fetch/decode
> stages just ahead of the current position, it could be possible to use
> them as a prefetcher. Less clear is how to do this "effectively".
>
> One option would be to partly duplicate and delay the execute stages, say:
>   IF ID1 (V_ID2 V_EX1 V_EX2 V_EX3) ID2 EX1 EX2 EX3 WB
>
> With the V_* stages mostly serving as "what if" stages, their results
> not being saved to the register file, but any loads/stores resulting in
> an L1 prefetch (stores are not written, and the L1 does not stall on a
> miss).
>
>
> But, this would pay some in terms of resource budget, and would increase
> branch latency.
>
> Though, the alternative is figuring out how to make the compiler insert
> prefetches without needing this sort of trickery. Granted, this is
> assuming that (ideally) the L1 can perform prefetches without them
> resulting in a pipeline stall (which would mostly defeat the point).
>
> One could also possibly try to work around some things if the compiler
> were able to figure out when and where cache misses will occur (vs
> current compilers which have no real way of knowing this).

Congratulations: you have invented the Mill load instruction, which is
really a "prefetch this if it doesn't get in trouble" instruction.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suhafj$our$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23542&group=comp.arch#23542

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 15 Feb 2022 14:46:10 -0800
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <suhafj$our$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Feb 2022 22:46:11 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="93b1554788c317f6fd2ff3a26e7468cb";
logging-data="25563"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18pcXJGp0NO+IXAqI3hUu6S"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.1
Cancel-Lock: sha1:ojdKP5/YmpVIjAjASirqYPj1qQ0=
In-Reply-To: <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Ivan Godard - Tue, 15 Feb 2022 22:46 UTC

On 2/15/2022 8:58 AM, Stefan Monnier wrote:
>> Thank you. That makes perfect sense. BTW, does that present an opportunity
>> for some sort of profile driven optimization where the run time branch
>> history is fed back to a future compilation for better optimization?
>> Probably not as useful for an OoO machine.
>
> Profile-driven optimization is used, of course, but the problem remains:
> the compiler needs to generate code that works for all possible
> situations and it can't freely duplicate code all over the place.
> A bit of code duplication (to specialize a code path to a few different
> scenarios) can be done, but only within fairly strict limits otherwise
> code size will explode and performance goes down the drain again.
>
> In contrast, an OoO is free to use a different schedule each time
> a chunk of code is run (and similarly the branch predictor is free to
> provide wildly different predictions each time that chunk of code is
> run) without any downside.
>
> The OoO works on the trace of the actual execution, where the main limit
> is the size of the window it can consider (linked to the accuracy of the
> branch predictor), whereas the compiler is not limited to such a window
> but instead it's limited to work on the non-unrolled code.

Actually the limit to useful window size in GP code is dataflow
dependencies. All the fancy numbers bandied about for big windows are
for embarrassingly parallel apps - walk a huge array doing the same
thing for every element for example. Those (typically HPC) are
important, but engender a supercomputer bias in design.

GP code - the classic payroll app, or the great bulk of real work once
the embarrassingly parallel code has been moved off to special purpose
engines - hits a dataflow dependence within a few tens of instructions.
The rest of the window can be filled with instructions awaiting issue
resolution - but you might as well have left them in the icache.

"The dirty little secret about OOO is how little OOO there really is."
- Andy Glew
