
devel / comp.arch / Re: Hoisting load issue out of functions; was: instruction set

SubjectAuthor
* Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
+* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
|`* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| +* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |+- Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| | `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | +* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | |+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | |`* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | | +* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |+* Re: Encoding 20 and 40 bit instructions in 128 bitsJimBrakefield
| |  | | ||+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |||`- Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |`* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |   `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    +* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |`* Re: Encoding 20 and 40 bit instructions in 128 bitsBrett
| |  | | |    | `* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsBrett
| |  | | |    |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |`* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |`- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   | `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bitsBernd Linsel
| |  | | |    |   |   |  ||+- Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |+* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  ||`- Re: Encoding 20 and 40 bit instructions in 128 bitsBrian G. Lucas
| |  | | |    |   |   |  |`- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  | `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsEricP
| |  | | |    |   |   |  |`* Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   |   |  | `* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  |  `* Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |   `* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  |    `* Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | | |    |   |   |  |     |`* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  |     | +- Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | |    |   |   |  |     | `- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: Encoding 20 and 40 bit instructions in 128 bitsStefan Monnier
| |  | | |    |   |   |  |     |`- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
| |  | | |    |   |   |  |     +* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128John Levine
| |  | | |    |   |   |  |     |+* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     |||+- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     ||| ||+- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| ||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| || `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| ||  +- Re: instruction set binding time, was Encoding 20 and 40 bitJohn Levine
| |  | | |    |   |   |  |     ||| ||  `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| ||   `* Re: instruction set binding time, was Encoding 20 and 40 bitTerje Mathisen
| |  | | |    |   |   |  |     ||| ||    `* Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| ||     +* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     ||| ||     |`- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| ||     `- Re: instruction set binding time, was Encoding 20 and 40 bitTerje Mathisen
| |  | | |    |   |   |  |     ||| |`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| | +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | |+* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | || `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||  `* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | ||   `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| | ||    +- Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     ||| | ||    `- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     ||| | |+- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| | |`- Re: instruction set binding time, was Encoding 20 and 40 bitJohn Levine
| |  | | |    |   |   |  |     ||| | `* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     ||| |  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||| `* Re: instruction set binding time, was Encoding 20 and 40 bitQuadibloc
| |  | | |    |   |   |  |     |||  +* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     |||  |+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bitScott Smader
| |  | | |    |   |   |  |     |||  |||+* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Stefan Monnier
| |  | | |    |   |   |  |     |||  ||||`* Re: instruction set binding time, was Encoding 20 and 40 bitScott Smader
| |  | | |    |   |   |  |     |||  |||| +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| |+- Re: instruction set binding time, was Encoding 20 and 40 bitAnton Ertl
| |  | | |    |   |   |  |     |||  |||| |`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| | +- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  |||| | +* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |||  |||| | `* Re: instruction set binding time, was Encoding 20 and 40 bitAnton Ertl
| |  | | |    |   |   |  |     |||  |||| +- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128James Van Buskirk
| |  | | |    |   |   |  |     |||  |||| `* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  |||+* Statically scheduled plus run ahead.Brett
| |  | | |    |   |   |  |     |||  |||`* Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     |||  ||+* Re: instruction set binding time, was Encoding 20 and 40 bitBGB
| |  | | |    |   |   |  |     |||  ||+- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  ||`* Re: instruction set binding time, was Encoding 20 and 40 bitThomas Koenig
| |  | | |    |   |   |  |     |||  |`* Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  +- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |||  `- Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128Anton Ertl
| |  | | |    |   |   |  |     ||`* Re: instruction set binding time, was Encoding 20 and 40 bitIvan Godard
| |  | | |    |   |   |  |     |+- Re: instruction set binding time, was Encoding 20 and 40 bitMitchAlsup
| |  | | |    |   |   |  |     |`* Re: instruction set binding time, was Encoding 20 and 40 bitStephen Fuld
| |  | | |    |   |   |  |     `* Re: Encoding 20 and 40 bit instructions in 128 bitsAnton Ertl
| |  | | |    |   |   |  +* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   |  +- Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   |  `- Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  | | |    |   |   +* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| |  | | |    |   |   `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    |   `- Re: Encoding 20 and 40 bit instructions in 128 bitsBGB
| |  | | |    `* Re: Encoding 20 and 40 bit instructions in 128 bitsStephen Fuld
| |  | | `- Re: Encoding 20 and 40 bit instructions in 128 bitsThomas Koenig
| |  | `* Re: Encoding 20 and 40 bit instructions in 128 bitsQuadibloc
| |  `- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
| `- Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
+- Re: Encoding 20 and 40 bit instructions in 128 bitsIvan Godard
+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
+* Re: Encoding 20 and 40 bit instructions in 128 bitsMitchAlsup
`- Re: Encoding 20 and 40 bit instructions in 128 bitsPaul A. Clayton

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv21k5$fsg$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23753&group=comp.arch#23753

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 22 Feb 2022 06:59:17 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sv21k5$fsg$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suhafj$our$1@dont-email.me> <2022Feb21.115543@mips.complang.tuwien.ac.at>
<sv0k1f$ju$1@dont-email.me> <sv13ah$sor$1@newsreader4.netcologne.de>
<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
Injection-Date: Tue, 22 Feb 2022 06:59:17 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:622:0:7285:c2ff:fe6c:992d";
logging-data="16272"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 22 Feb 2022 06:59 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

>> You have to watch out for one thing - for code like
>>
>> void foo (double *a, double *b, double *c)
>> {
>> *a = 42.;
>> *b = 42.;
>> *c = 42.;
>> }
> Compiler is not restricted from doing::
><
> MOV Rt,#42
> STD Rt,[Ra]
> STD Rt,[Rb]
> STD Rt,[Rc]
><
> but it does not HAVE to.
><
> Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
> STD #42,[Ra]
> STD #42,[Rb]
> STD #42,[Rc]
> may be less than the above example.

(Note: the above examples are for 8-byte floating-point constants,
not four-byte integers, so the difference is 6 vs. 12 words,
but the same thing currently also happens for 10 identical
constants).

You're saying "does not HAVE to" and "may be less", so it is
not clear.

This is actually indicative of a problem with offering a richer
instruction set: The compiler has to make more choices, which are
at first obvious to the compiler writer and which may also change
with model number and time.
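
As a minimal sketch of what such a choice looks like inside a code
generator (my illustration only; the structure and word counts below
are assumptions, not actual My 66000 encoding facts), even the simple
constant-store case above turns into a per-model cost comparison:

/* Illustrative only: assumed per-model code-size costs, in 32-bit words. */
struct model_costs {
    int words_store_imm;   /* STD #imm,[R]  - store with 64-bit immediate */
    int words_store_reg;   /* STD Rt,[R]    - store from a register */
    int words_mov_imm;     /* MOV Rt,#imm   - materialize the constant once */
};

/* Returns 1 if repeating the immediate gives code no larger than
   materializing the constant once in a temporary register. */
static int prefer_store_immediate(const struct model_costs *m, int n_stores)
{
    int imm_total = n_stores * m->words_store_imm;
    int reg_total = m->words_mov_imm + n_stores * m->words_store_reg;
    /* A real backend would also weigh fetch width, register pressure
       and store latency, all of which can differ between models. */
    return imm_total <= reg_total;
}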

Another point is https://github.com/bagel99/llvm-my66000/issues/1,
which contains the code

vec r24,{r17}
ldd r20,[r12,r17<<3,-40]
ldd r1,[r12,r17<<3,-48]
ldd r27,[r12,r17<<3,-24]
ldd r18,[r12,r17<<3,-32]

Eight words for four loads, where

vec r24,{r17}
la ra,[r12,r17<<3]
ldd r20,[ra,-40]
ldd r1,[ra,-48]
ldd r27,[ra,-24]
ldd r18,[ra,-32]

would be five. Better? Worse? How should a compiler writer know?

Or it would have been possible to use "load multiple", loading
the base address [r12,r17<<3,-48] into a register and adjusting
the register allocation so the registers are consecutive.
Better? Worse? Better or worse in 10 years when the third
generation of My66000 chips hits the market (hopefully)?
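
To make the addressing pattern concrete, here is a hypothetical C
fragment of roughly the shape that produces four such loads (this is
not the code from the GitHub issue, only an illustration of why the
scaled base r12 + r17*8 is common to all four accesses):

/* Hypothetical source shape: four neighbouring doubles relative to the
   same scaled base r12 + r17*8, at byte offsets -48, -40, -32 and -24. */
double gather4(const double *r12, long r17)
{
    double a = r12[r17 - 6];   /* byte offset -48 */
    double b = r12[r17 - 5];   /* byte offset -40 */
    double c = r12[r17 - 4];   /* byte offset -32 */
    double d = r12[r17 - 3];   /* byte offset -24 */
    return a + b + c + d;
}

Whether the backend repeats the full [r12,r17<<3,disp] form, hoists
the base into a scratch register, or arranges the destination registers
so that a load-multiple applies is exactly the choice in question.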

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv2bei$9id$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23754&group=comp.arch#23754

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 22 Feb 2022 01:46:58 -0800
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <sv2bei$9id$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suhafj$our$1@dont-email.me> <2022Feb21.115543@mips.complang.tuwien.ac.at>
<sv0k1f$ju$1@dont-email.me> <sv13ah$sor$1@newsreader4.netcologne.de>
<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Feb 2022 09:46:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="aa681367c2296b0b3369e867e926f680";
logging-data="9805"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX197R2ZrNbpVhdxYbDCAUK05"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:7m3rtY6IvNP1RnsemqfMRBAHB6o=
In-Reply-To: <sv21k5$fsg$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Ivan Godard - Tue, 22 Feb 2022 09:46 UTC

On 2/21/2022 10:59 PM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>
>>> You have to watch out for one thing - for code like
>>>
>>> void foo (double *a, double *b, double *c)
>>> {
>>> *a = 42.;
>>> *b = 42.;
>>> *c = 42.;
>>> }
>> Compiler is not restricted from doing::
>> <
>> MOV Rt,#42
>> STD Rt,[Ra]
>> STD Rt,[Rb]
>> STD Rt,[Rc]
>> <
>> but it does not HAVE to.
>> <
>> Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
>> STD #42,[Ra]
>> STD #42,[Rb]
>> STD #42,[Rc]
>> may be less than the above example.
>
> (Not the above example are for 8-byte floating point constants,
> not four-byte integers, so the difference is 6 vs. 12 words,
> but the same thing currently also happens for 10 identical
> constants).
>
> You're saying "does not HAVE to" and "may be less", so it is
> not clear.
>
> This is actually indicative of a problem with offering a richer
> instruction set: The compiler has to make more choices, which are
> at first obvious to the compiler writer and which may also change
> with model number and time.
>
> Another point is https://github.com/bagel99/llvm-my66000/issues/1,
> which contains the code
>
> vec r24,{r17}
> ldd r20,[r12,r17<<3,-40]
> ldd r1,[r12,r17<<3,-48]
> ldd r27,[r12,r17<<3,-24]
> ldd r18,[r12,r17<<3,-32]
>
> Eight words for four loads, where
>
> vec r24,{r17}
> la ra,[r12,17<<3]
> ldd r20,[ra,-40]
> ldd r1,[ra,-48]
> ldd r27,[ra,-24]
> ldd r18,[ra,-40]
>
> would be five. Better? Worse? How should a compiler writer know?
>
> Or it would have been possible to use "load multiple", loading
> the base address [r12,r14<<3,-48] into a register and adjusting
> the register allocation so the registers are consecutive.
> Better? Worse? Better or worse in 10 years when the third
> generation of My66000 chips hits the market (hopefully)?

That's why compiler people get the big bucks.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv2g1e$nna$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23755&group=comp.arch#23755

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 22 Feb 2022 11:05:18 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sv2g1e$nna$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suhafj$our$1@dont-email.me> <2022Feb21.115543@mips.complang.tuwien.ac.at>
<sv0k1f$ju$1@dont-email.me> <sv13ah$sor$1@newsreader4.netcologne.de>
<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de> <sv2bei$9id$1@dont-email.me>
Injection-Date: Tue, 22 Feb 2022 11:05:18 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:622:0:7285:c2ff:fe6c:992d";
logging-data="24298"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 22 Feb 2022 11:05 UTC

Ivan Godard <ivan@millcomputing.com> schrieb:

> That's why compiler people get the big bucks.

Or nothing, if they are volunteers :-)

(Counting travel costs to a GNU cauldron, my actual money balance
for gfortran contributions is negative, but I certainly got to
meet some very interesting people there, and hobbies are allowed
to cost money :-)

compiler vs. hardware scheduling (was: instruction set binding ...)

<2022Feb22.112450@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23756&group=comp.arch#23756

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: compiler vs. hardware scheduling (was: instruction set binding ...)
Date: Tue, 22 Feb 2022 10:24:50 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 105
Message-ID: <2022Feb22.112450@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suh8vh$g7r$1@dont-email.me> <jwvr18392lk.fsf-monnier+comp.arch@gnu.org>
Injection-Info: reader02.eternal-september.org; posting-host="58b8e5fe98465b859372797aad413050";
logging-data="23410"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX197itZ7+h8jX1V/0R+A6quR"
Cancel-Lock: sha1:YOPDshwD/gcZU39Esr/gvOibjlk=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 22 Feb 2022 10:24 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>In theory, a compiler could do wonders.

Not only in theory. You can demonstrate wonders for cherry-picked
pieces of code.

>That's why the failure of Itanium was easy to foresee for people in
>the compiler business

I was not in compiler business at the time, only in compiler research,
but I expected IA-64 to succeed. The main reasons were that designing
architectures to work with compilers seemed to work well for RISCs
(These days it seems to me that the major advance of RISCs was
pipelining and the resulting increase in IPC, or reduction in CPI, as
the favourite metric was at the time), so hardware scheduling (with
hardware register renaming and hardware branch prediction; the whole
package is nowadays called OoO) did not look like a winner at the
time.

There were the same arguments that Scott Smader presented here
recently why IA-64 was expected to be superior: The additional
hardware of OoO would require more area that a compiler-scheduled CPU
could use for better purposes, and many people also believed that it
would result in slow clock rates (the R2000 and R3000 with their lack
of interlocks provided some fodder for that idea, as did the low clock
of OoO instances like the R10000 and the K5; and while the Pentium Pro
had a faster clock than the fastest Pentium at the time of its
release, its clock was slower than the in-order 21164; and the OoO
21264 had a slower clock than the in-order 21164a).

As for IPC, in-order implementations of RISCs had very limited
speculation potential and therefore IPC, but IA-64 has architectural
solutions for all of that: advanced loads for speculatively executing
load instructions, a mechanism for dealing with memory aliases,
predicated execution for if-conversion.
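
As a simplified source-level illustration of if-conversion (my sketch,
not taken from any IA-64 compiler), the data-dependent branch in the
loop body is replaced by a select, which predication lets the compiler
schedule unconditionally:

/* Before if-conversion: a data-dependent branch in the loop body. */
double sum_pos_branchy(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        if (a[i] > 0.0)
            s += a[i];
    return s;
}

/* After if-conversion: a branch-free body; on IA-64 the addition would
   be predicated, here it is written as a select. */
double sum_pos_converted(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (a[i] > 0.0) ? a[i] : 0.0;
    return s;
}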

And the idea was that hardware scheduling would be relatively limited,
because it costs hardware and large schedulers would have even lower
clock rates than we expected for smaller ones; and that they would be
dumb, because it has to be quick, while compiler speculation could be
smart and only speculatively execute instructions from the critical
path, and it could schedule across larger regions.

Also, trace scheduling and software pipelining existed and
demonstrated compiler possibilities, although limited in the kind of
control structures they supported; but I expected that compilers would
be able to schedule across arbitrary control flow.
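
For reference, a hand-written sketch of what software pipelining of a
simple loop looks like at the source level (illustrative only; a real
compiler does this on machine code, on IA-64 typically with the help
of rotating registers):

/* Plain loop: each iteration loads, then computes. */
void scale(double *dst, const double *src, double k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Software-pipelined by one stage (assumes n >= 1): the load for
   iteration i+1 overlaps the multiply for iteration i, hiding load
   latency without any hardware scheduling. */
void scale_swp(double *dst, const double *src, double k, int n)
{
    double cur = src[0];              /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        double next = src[i + 1];     /* load for iteration i+1 */
        dst[i] = cur * k;             /* compute for iteration i */
        cur = next;
    }
    dst[n - 1] = cur * k;             /* epilogue: last compute */
}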

How did it turn out?

First and foremost, when IA-64 implementations appeared, they had low
clock rates compared to the competing OoO CPUs. And this stayed this
way; other in-order cores also seem to have limited clock rates.
Anyway, if my expectations for IA-64 clock rates relative to OoO clock
rates had come true, IA-64 would have been quite competitive for quite
some time, and maybe OoO would never have reached its current state of
maturity.

Performance as evidenced in SPEC CPU results was initially competitive
for SPEC CFP, but not for SPEC CINT. For the latest iterations
(Poulson and Kittson), which had a significantly changed
microarchitecture compared to the earlier McKinley-based designs, they did not
even present SPEC numbers last I looked. The message they (and other
manufacturers who do not present such numbers for their architectures)
send me is that their numbers are so weak that they prefer not to show
them.

So, I wondered why my expectations did not come true. My explanations
are:

Concerning the clock rate, the reason seems to be that OoO hardware
uses more local feedback loops for learning whether processing can
advance or not, while these feedbacks are more global in in-order
processors. This became an issue during the 1990s, with wires taking
more and more time compared to transistors.

Also, dynamic (hardware) branch prediction made great advances since
the early 1990s, while compiler branch prediction did not. This
allowed the instruction fetching and decoding to run far ahead of
retirement (i.e., architectural execution) without incurring a
too-extreme mis-speculation penalty. As a consequence, OoO could go
much deeper than I had expected in the 90s (obviously without clock
rate penalty), and the lack of smartness in scheduling could be
compensated by that depth.

On the compiler side, I did not see the advances I expected.
Scheduling for arbitrary control flow stuck at the conceptual level
and did not advance into usable research compilers, much less
production compilers AFAIK. Basically the only advance beyond what
was available in the 80s was if-conversion and reverse if-conversion,
which allowed software pipelining of loops containing ifs, but the
result is much less efficient (in terms of machine utilization) than
software pipelining of simple loops. In any case, the compiler
techniques limit scheduling to simple or if-converted loops, or to
traces/superblocks, with scheduling boundaries and their accompanying
ramp-up and ramp-down effects. Of course, even with arbitrary control
flow like I have in mind, function/method boundaries are scheduling
boundaries; that can be mitigated somewhat with inlining, but that
results in code expansion, and for polymorphic methods that poses
problems.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<c7e68c07-24e7-4eb4-8142-6d3b67331edan@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23758&group=comp.arch#23758

X-Received: by 2002:adf:facd:0:b0:1e7:ceff:7695 with SMTP id a13-20020adffacd000000b001e7ceff7695mr21117581wrs.656.1645548102850;
Tue, 22 Feb 2022 08:41:42 -0800 (PST)
X-Received: by 2002:aca:a9c5:0:b0:2d4:373d:98c8 with SMTP id
s188-20020acaa9c5000000b002d4373d98c8mr2458104oie.272.1645548102215; Tue, 22
Feb 2022 08:41:42 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Feb 2022 08:41:42 -0800 (PST)
In-Reply-To: <sv21k5$fsg$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:58f1:1d2:7050:984a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:58f1:1d2:7050:984a
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me>
<2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me>
<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me>
<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suhafj$our$1@dont-email.me>
<2022Feb21.115543@mips.complang.tuwien.ac.at> <sv0k1f$ju$1@dont-email.me>
<sv13ah$sor$1@newsreader4.netcologne.de> <f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c7e68c07-24e7-4eb4-8142-6d3b67331edan@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Feb 2022 16:41:42 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 86
 by: MitchAlsup - Tue, 22 Feb 2022 16:41 UTC

On Tuesday, February 22, 2022 at 12:59:20 AM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> >> You have to watch out for one thing - for code like
> >>
> >> void foo (double *a, double *b, double *c)
> >> {
> >> *a = 42.;
> >> *b = 42.;
> >> *c = 42.;
> >> }
> > Compiler is not restricted from doing::
> ><
> > MOV Rt,#42
> > STD Rt,[Ra]
> > STD Rt,[Rb]
> > STD Rt,[Rc]
> ><
> > but it does not HAVE to.
> ><
> > Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
> > STD #42,[Ra]
> > STD #42,[Rb]
> > STD #42,[Rc]
> > may be less than the above example.
<
> (Not the above example are for 8-byte floating point constants,
> not four-byte integers, so the difference is 6 vs. 12 words,
> but the same thing currently also happens for 10 identical
> constants).
<
I accept the blame for using longs not floats. But the size of the
constant is the same:
STD #42,[Rs]
is a 3-word instruction
STD #42.0E+0,[Rs]
is also a 3-word instruction
And the size chosen was doubleword nonetheless.
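
A quick host-side check (my addition, nothing to do with the My 66000
encoder) that the two immediates really are the same size: 42 as a
signed doubleword and 42.0 as an IEEE-754 double each occupy 8 bytes,
i.e. two 32-bit words of immediate:

#include <stdio.h>
#include <inttypes.h>
#include <string.h>

int main(void)
{
    int64_t  i = 42;
    double   d = 42.0;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);   /* view the double's bit pattern */
    printf("int64  42   : %zu bytes\n", sizeof i);
    printf("double 42.0 : %zu bytes, 0x%016" PRIx64 "\n", sizeof d, bits);
    /* prints 8 bytes for both; the double's pattern is 0x4045000000000000 */
    return 0;
}
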
>
> You're saying "does not HAVE to" and "may be less", so it is
> not clear.
<
On a lesser implementation, the former might be faster;
on a great big implementation it is likely a wash. Yes,
you might not know in a compiler targeting the whole
range. In any event, you are unlikely to notice the change
in performance this optimization makes.
>
> This is actually indicative of a problem with offering a richer
> instruction set: The compiler has to make more choices, which are
> at first obvious to the compiler writer and which may also change
> with model number and time.
>
> Another point is https://github.com/bagel99/llvm-my66000/issues/1,
> which contains the code
>
> vec r24,{r17}
> ldd r20,[r12,r17<<3,-40]
> ldd r1,[r12,r17<<3,-48]
> ldd r27,[r12,r17<<3,-24]
> ldd r18,[r12,r17<<3,-32]
>
> Eight words for four loads, where
>
> vec r24,{r17}
> la ra,[r12,17<<3]
> ldd r20,[ra,-40]
> ldd r1,[ra,-48]
> ldd r27,[ra,-24]
> ldd r18,[ra,-40]
>
> would be five. Better? Worse? How should a compiler writer know?
<
Yes, the compiler writer (Brian) should know, and once again:
on a 1-wide machine fetching 4 words per cycle, you would not
see the difference in the pipeline, and would be unlikely to see
any change in I$ performance.
>
> Or it would have been possible to use "load multiple", loading
> the base address [r12,r14<<3,-48] into a register and adjusting
> the register allocation so the registers are consecutive.
<
You are complaining about nuance-level stuff in a back end
Brian wrote "for fun". In order to use LDM, the registers have
to be in sequence.
<
> Better? Worse? Better or worse in 10 years when the third
> generation of My66000 chips hits the market (hopefully)?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv348s$sj4$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23759&group=comp.arch#23759

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 22 Feb 2022 10:50:35 -0600
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <sv348s$sj4$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suhafj$our$1@dont-email.me> <2022Feb21.115543@mips.complang.tuwien.ac.at>
<sv0k1f$ju$1@dont-email.me> <sv13ah$sor$1@newsreader4.netcologne.de>
<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de> <sv2bei$9id$1@dont-email.me>
<sv2g1e$nna$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 22 Feb 2022 16:50:36 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="3c9640f024530e8f700b91ad353ec73d";
logging-data="29284"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18qJxSk5Rp9YpOkoKmUAYH6"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:eDyITBXrqEZO8eMqs9bEY9BZSWc=
In-Reply-To: <sv2g1e$nna$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Tue, 22 Feb 2022 16:50 UTC

On 2/22/2022 5:05 AM, Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>
>> That's why compiler people get the big bucks.
>
> Or nothing, if they are volunteers :-)
>
> (Counting travel costs to a GNU cauldron, my actual money balance
> for gfortran contributions is negative, but I certainly got to
> meet some very interesting people there, and hobbies are allowed
> to cost money :-)

I wrote a C compiler for my project, my balance is still a 0...

Overall, it would probably be negative when one counts money spent on
FPGA boards and similar, or the (hypothetical) losses due to opportunity
cost (though, this assumes the existence of some other greater-or-equal
path that would have led to some form of "profit").

Though, if not for all the time thrown at my BJX2 project, my 3D engine
projects might not have stalled. Question is if I could have made any
money there, but it seems that there is a "make it not suck" issue in my
case that is separate from the amount of effort I put into something.

Otherwise, I have thoughts that I may need to come up with a new name
for my ISA at some point, as I had become aware that the name I am using
for it has some unfortunate implications.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<c177c78c-85aa-4d4f-9115-f140d00ff16en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23760&group=comp.arch#23760

X-Received: by 2002:a05:6000:22f:b0:1e3:3415:4078 with SMTP id l15-20020a056000022f00b001e334154078mr20152145wrz.69.1645549621895;
Tue, 22 Feb 2022 09:07:01 -0800 (PST)
X-Received: by 2002:a05:6830:1b6f:b0:5af:d2f:eed9 with SMTP id
d15-20020a0568301b6f00b005af0d2feed9mr4943796ote.331.1645549621359; Tue, 22
Feb 2022 09:07:01 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!nntp.club.cc.cmu.edu!5.161.45.24.MISMATCH!2.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Feb 2022 09:07:01 -0800 (PST)
In-Reply-To: <c7e68c07-24e7-4eb4-8142-6d3b67331edan@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:58f1:1d2:7050:984a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:58f1:1d2:7050:984a
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me>
<2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me>
<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me>
<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suhafj$our$1@dont-email.me>
<2022Feb21.115543@mips.complang.tuwien.ac.at> <sv0k1f$ju$1@dont-email.me>
<sv13ah$sor$1@newsreader4.netcologne.de> <f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de> <c7e68c07-24e7-4eb4-8142-6d3b67331edan@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c177c78c-85aa-4d4f-9115-f140d00ff16en@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Feb 2022 17:07:01 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 97
 by: MitchAlsup - Tue, 22 Feb 2022 17:07 UTC

On Tuesday, February 22, 2022 at 10:41:45 AM UTC-6, MitchAlsup wrote:
> On Tuesday, February 22, 2022 at 12:59:20 AM UTC-6, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> > >> You have to watch out for one thing - for code like
> > >>
> > >> void foo (double *a, double *b, double *c)
> > >> {
> > >> *a = 42.;
> > >> *b = 42.;
> > >> *c = 42.;
> > >> }
> > > Compiler is not restricted from doing::
> > ><
> > > MOV Rt,#42
> > > STD Rt,[Ra]
> > > STD Rt,[Rb]
> > > STD Rt,[Rc]
> > ><
> > > but it does not HAVE to.
> > ><
> > > Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
> > > STD #42,[Ra]
> > > STD #42,[Rb]
> > > STD #42,[Rc]
> > > may be less than the above example.
> <
> > (Not the above example are for 8-byte floating point constants,
> > not four-byte integers, so the difference is 6 vs. 12 words,
> > but the same thing currently also happens for 10 identical
> > constants).
> <
> I accept the blame for using longs not floats. But the size of the
> constant is the same:
> STD #42,[Rs]
> is a 3-word instruction
> STD #42.0E+0,[Rs]
> is also a 3-word instruction
> And the size chosen was doubleword nonetheless.
<
I must also point out that if the data type is integer and
the value is in the range {-16..15}, the constant can be found
in the Rd position of the instruction; so:
<
STW #7,[Ra]
<
is a 1-word instruction.
Thanks to Brian for finding this.
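
As an aside, {-16..15} is just the signed range of a 5-bit field,
consistent with Rd being the usual 5-bit register specifier reused to
hold the small immediate; the check below is purely illustrative, not
part of any real encoder:

#include <stdint.h>

/* 1 if v fits a 5-bit two's-complement field, i.e. -16..15. */
static int fits_rd_immediate(int64_t v)
{
    return v >= -16 && v <= 15;
}
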
<
> >
> > You're saying "does not HAVE to" and "may be less", so it is
> > not clear.
> <
> on a lesser implementation, the former might be faster
> on a great big implementation it is likely a wash. Yes,
> you might not know in the compiler targeting the whole
> range. In any event you are unlikely to notice the change
> in performance this optimization makes.
> >
> > This is actually indicative of a problem with offering a richer
> > instruction set: The compiler has to make more choices, which are
> > at first obvious to the compiler writer and which may also change
> > with model number and time.
> >
> > Another point is https://github.com/bagel99/llvm-my66000/issues/1,
> > which contains the code
> >
> > vec r24,{r17}
> > ldd r20,[r12,r17<<3,-40]
> > ldd r1,[r12,r17<<3,-48]
> > ldd r27,[r12,r17<<3,-24]
> > ldd r18,[r12,r17<<3,-32]
> >
> > Eight words for four loads, where
> >
> > vec r24,{r17}
> > la ra,[r12,17<<3]
> > ldd r20,[ra,-40]
> > ldd r1,[ra,-48]
> > ldd r27,[ra,-24]
> > ldd r18,[ra,-40]
> >
> > would be five. Better? Worse? How should a compiler writer know?
> <
> Yes, the compiler writer (Brian) should know, and once again you
> on a 1-wide macine fetching 4-words per cycle, you would not
> see the difference in the pipeline, and would be unlikely to see
> any change in I$ performance.
> >
> > Or it would have been possible to use "load multiple", loading
> > the base address [r12,r14<<3,-48] into a register and adjusting
> > the register allocation so the registers are consecutive.
> <
> You are complaining about nuance level stuff on a back end
> Brian wrote "for fun". In order to use LDM the registers have
> to be in sequence.
> <
> > Better? Worse? Better or worse in 10 years when the third
> > generation of My66000 chips hits the market (hopefully)?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<5d9bc622-4286-4cbe-a9f4-1181b23bda6an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23761&group=comp.arch#23761

X-Received: by 2002:a5d:47a8:0:b0:1ea:85d5:3cd9 with SMTP id 8-20020a5d47a8000000b001ea85d53cd9mr3526879wrb.349.1645549717006;
Tue, 22 Feb 2022 09:08:37 -0800 (PST)
X-Received: by 2002:a05:6808:f8b:b0:2d7:b8e:49f with SMTP id
o11-20020a0568080f8b00b002d70b8e049fmr277836oiw.120.1645549715437; Tue, 22
Feb 2022 09:08:35 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!2.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Feb 2022 09:08:35 -0800 (PST)
In-Reply-To: <sv348s$sj4$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:58f1:1d2:7050:984a;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:58f1:1d2:7050:984a
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me>
<2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me>
<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me>
<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suhafj$our$1@dont-email.me>
<2022Feb21.115543@mips.complang.tuwien.ac.at> <sv0k1f$ju$1@dont-email.me>
<sv13ah$sor$1@newsreader4.netcologne.de> <f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de> <sv2bei$9id$1@dont-email.me>
<sv2g1e$nna$1@newsreader4.netcologne.de> <sv348s$sj4$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5d9bc622-4286-4cbe-a9f4-1181b23bda6an@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Feb 2022 17:08:37 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 31
 by: MitchAlsup - Tue, 22 Feb 2022 17:08 UTC

On Tuesday, February 22, 2022 at 10:50:39 AM UTC-6, BGB wrote:
> On 2/22/2022 5:05 AM, Thomas Koenig wrote:
> > Ivan Godard <iv...@millcomputing.com> schrieb:
> >
> >> That's why compiler people get the big bucks.
> >
> > Or nothing, if they are volunteers :-)
> >
> > (Counting travel costs to a GNU cauldron, my actual money balance
> > for gfortran contributions is negative, but I certainly got to
> > meet some very interesting people there, and hobbies are allowed
> > to cost money :-)
> I wrote a C compiler for my project, my balance is still a 0...
>
>
> Overall, it would probably be negative when one counts money spent on
> FPGA boards and similar, or the (hypothetical) losses due to opportunity
> cost (though, this assumes the existence of some other greater-or-equal
> path that would have led to some form of "profit").
>
>
> Though, if not for all the time thrown at my BJX2 project, my 3D engine
> projects might not have stalled. Question is if I could have made any
> money there, but it seems that there is a "make it not suck" issue in my
> case that is separate from the amount of effort I put into something.
>
> Otherwise, I have thoughts that I may need to come up with a new name
> for my ISA at some point, as I had become aware that the name I am using
> for it has some unfortunate implications.
<
You could call it "By Golly <some sequence of digits>",
as in BG 360 or BG 6600,...

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv37vo$83t$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=23762&group=comp.arch#23762

Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 22 Feb 2022 18:54:02 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sv37vo$83t$1@gioia.aioe.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suhafj$our$1@dont-email.me> <2022Feb21.115543@mips.complang.tuwien.ac.at>
<sv0k1f$ju$1@dont-email.me> <sv13ah$sor$1@newsreader4.netcologne.de>
<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de> <sv2bei$9id$1@dont-email.me>
<sv2g1e$nna$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="8317"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10.2
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 22 Feb 2022 17:54 UTC

Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>
>> That's why compiler people get the big bucks.
>
> Or nothing, if they are volunteers :-)
>
> (Counting travel costs to a GNU cauldron, my actual money balance
> for gfortran contributions is negative, but I certainly got to
> meet some very interesting people there, and hobbies are allowed
> to cost money :-)

By total number of code lines, a large majority of everything I've ever
written must be considered hobby programming, i.e. not directly
generating any income.

OTOH, all that programming has been fun, and it did lead to a few very
interesting fringe benefits, e.g. a free top-end 1U NTP server from
Meinberg, which I wrote the optimized ntpd daemon for, as well as Pentium
Pro and Larrabee engineering samples.

The terminal-emulator/file-transfer DOS program I started writing in
1982 did generate quite a few sales, so that even after 83% total
taxation, the profit paid for our first mountain cabin.

Later on, Intel paid me enough for showing them how to run maximally
hard Blu-ray decoding in software that it paid for a very nice 50-year
party. :-)

However, the most important consideration is that hobby programming ==
fun (by definition). :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sv3iop$htd$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23764&group=comp.arch#23764

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Tue, 22 Feb 2022 20:58:01 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sv3iop$htd$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org>
<sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org>
<suhafj$our$1@dont-email.me> <2022Feb21.115543@mips.complang.tuwien.ac.at>
<sv0k1f$ju$1@dont-email.me> <sv13ah$sor$1@newsreader4.netcologne.de>
<f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de>
<c7e68c07-24e7-4eb4-8142-6d3b67331edan@googlegroups.com>
Injection-Date: Tue, 22 Feb 2022 20:58:01 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:622:0:7285:c2ff:fe6c:992d";
logging-data="18349"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 22 Feb 2022 20:58 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Tuesday, February 22, 2022 at 12:59:20 AM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>> >> You have to watch out for one thing - for code like
>> >>
>> >> void foo (double *a, double *b, double *c)
>> >> {
>> >> *a = 42.;
>> >> *b = 42.;
>> >> *c = 42.;
>> >> }
>> > Compiler is not restricted from doing::
>> ><
>> > MOV Rt,#42
>> > STD Rt,[Ra]
>> > STD Rt,[Rb]
>> > STD Rt,[Rc]
>> ><
>> > but it does not HAVE to.
>> ><
>> > Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
>> > STD #42,[Ra]
>> > STD #42,[Rb]
>> > STD #42,[Rc]
>> > may be less than the above example.
><
>> (Not the above example are for 8-byte floating point constants,
>> not four-byte integers, so the difference is 6 vs. 12 words,
>> but the same thing currently also happens for 10 identical
>> constants).
><
> I accept the blame for using longs not floats. But the size of the
> constant is the same:
> STD #42,[Rs]
> is a 3-word instruction
> STD #42.0E+0,[Rs]
> is also a 3-word instruction
> And the size chosen was doubleword nonetheless.
>>
>> You're saying "does not HAVE to" and "may be less", so it is
>> not clear.
><
> on a lesser implementation, the former might be faster
> on a great big implementation it is likely a wash. Yes,
> you might not know in the compiler targeting the whole
> range. In any event you are unlikely to notice the change
> in performance this optimization makes.

Unless it starts hitting an icache limit, or somebody wants
to save code size.

I've submitted more than 400 gcc bug reports, and around 80 of
them have the keyword missed-optimization. If I had stumbled
across this with any supported gcc target, I would certainly
have reported it.

>> This is actually indicative of a problem with offering a richer
>> instruction set: The compiler has to make more choices, which are
>> at first obvious to the compiler writer and which may also change
>> with model number and time.
>>
>> Another point is https://github.com/bagel99/llvm-my66000/issues/1,
>> which contains the code
>>
>> vec r24,{r17}
>> ldd r20,[r12,r17<<3,-40]
>> ldd r1,[r12,r17<<3,-48]
>> ldd r27,[r12,r17<<3,-24]
>> ldd r18,[r12,r17<<3,-32]
>>
>> Eight words for four loads, where
>>
>> vec r24,{r17}
>> la ra,[r12,17<<3]
>> ldd r20,[ra,-40]
>> ldd r1,[ra,-48]
>> ldd r27,[ra,-24]
>> ldd r18,[ra,-40]
>>
>> would be five. Better? Worse? How should a compiler writer know?
><
> Yes, the compiler writer (Brian) should know, and once again you
> on a 1-wide macine fetching 4-words per cycle, you would not
> see the difference in the pipeline, and would be unlikely to see
> any change in I$ performance.

Same reasoning as above.

(Actually, the code in question unrolls stuff within vec/loop,
which I suppose is suboptimal, but that could just be the
compiler front end already doing this, so I didn't put this
into the github issue.)

>>
>> Or it would have been possible to use "load multiple", loading
>> the base address [r12,r14<<3,-48] into a register and adjusting
>> the register allocation so the registers are consecutive.
><
> You are complaining about nuance level stuff on a back end
> Brian wrote "for fun". In order to use LDM the registers have
> to be in sequence.

... which is what I meant when I wrote "consecutive".

I am not completely sold on load/store multiple. Case in point:
POWER has lmw (load multiple word) and stmw (store multiple word).
These instructions have been deprecated as far as possible, in
two ways: there is no 64-bit version, and they do not exist in
little-endian mode. They are also microcoded, which is generally
bad news, at least for POWER9 (for example, it costs a two-cycle
decode startup penalty). AArch64 also dropped the LDM and STM
instructions, but it has "load pair" and "store pair".

Did IBM and ARM make a mistake in dropping load/store multiple
for their 64-bit architectures?
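
For what it is worth, the pair instructions cover the adjacent accesses
that load/store multiple was most often used for; a sketch (the struct
and function below are made up, and the codegen note reflects common
compiler behaviour rather than a guarantee):

#include <stdint.h>

/* Two adjacent 64-bit fields: the pattern that "load pair" targets.
   An AArch64 compiler will normally fetch p->a and p->b with a single
   ldp, where 32-bit ARM code might have used LDM. */
struct pair { int64_t a, b; };

int64_t sum_pair(const struct pair *p)
{
    return p->a + p->b;
}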

Regarding "complaining": I certainly have the highest respect for
anybody who singlehandedly puts a back-end on either LLVM or GCC.
If I had thought every bug submitter for gfortran was complaining,
I would certainly have dropped out of gfortran a long time ago
(and gfortran was, until quite recently, a 100% volunteer
effort).

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<7f35c541-e6fd-43e8-97a3-077d8dd76839n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23765&group=comp.arch#23765

X-Received: by 2002:a5d:6d8d:0:b0:1e3:3de4:e0e6 with SMTP id l13-20020a5d6d8d000000b001e33de4e0e6mr21176565wrs.159.1645569777128;
Tue, 22 Feb 2022 14:42:57 -0800 (PST)
X-Received: by 2002:a05:6870:aa8d:b0:c6:db43:22db with SMTP id
gr13-20020a056870aa8d00b000c6db4322dbmr2705841oab.314.1645569776584; Tue, 22
Feb 2022 14:42:56 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 22 Feb 2022 14:42:56 -0800 (PST)
In-Reply-To: <sv3iop$htd$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:d156:187c:560d:8fbd;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:d156:187c:560d:8fbd
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me>
<2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me>
<jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me>
<jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suhafj$our$1@dont-email.me>
<2022Feb21.115543@mips.complang.tuwien.ac.at> <sv0k1f$ju$1@dont-email.me>
<sv13ah$sor$1@newsreader4.netcologne.de> <f85b718a-a9de-4de7-8154-aecf34c0207fn@googlegroups.com>
<sv21k5$fsg$1@newsreader4.netcologne.de> <c7e68c07-24e7-4eb4-8142-6d3b67331edan@googlegroups.com>
<sv3iop$htd$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7f35c541-e6fd-43e8-97a3-077d8dd76839n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 22 Feb 2022 22:42:57 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Tue, 22 Feb 2022 22:42 UTC

On Tuesday, February 22, 2022 at 2:58:04 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Tuesday, February 22, 2022 at 12:59:20 AM UTC-6, Thomas Koenig wrote:
> >> MitchAlsup <Mitch...@aol.com> schrieb:
> >> >> You have to watch out for one thing - for code like
> >> >>
> >> >> void foo (double *a, double *b, double *c)
> >> >> {
> >> >> *a = 42.;
> >> >> *b = 42.;
> >> >> *c = 42.;
> >> >> }
> >> > Compiler is not restricted from doing::
> >> ><
> >> > MOV Rt,#42
> >> > STD Rt,[Ra]
> >> > STD Rt,[Rb]
> >> > STD Rt,[Rc]
> >> ><
> >> > but it does not HAVE to.
> >> ><
> >> > Also consider that you are fetching 4-8 words wide, so the amount of time it takes to
> >> > STD #42,[Ra]
> >> > STD #42,[Rb]
> >> > STD #42,[Rc]
> >> > may be less than the above example.
> ><
> >> (Note the above examples are for 8-byte floating point constants,
> >> not four-byte integers, so the difference is 6 vs. 12 words,
> >> but the same thing currently also happens for 10 identical
> >> constants).
> ><
> > I accept the blame for using longs not floats. But the size of the
> > constant is the same:
> > STD #42,[Rs]
> > is a 3-word instruction
> > STD #42.0E+0,[Rs]
> > is also a 3-word instruction
> > And the size chosen was doubleword nonetheless.
> >>
> >> You're saying "does not HAVE to" and "may be less", so it is
> >> not clear.
> ><
> > on a lesser implementation, the former might be faster
> > on a great big implementation it is likely a wash. Yes,
> > you might not know in the compiler targeting the whole
> > range. In any event you are unlikely to notice the change
> > in performance this optimization makes.
> Unless it starts hitting an icache limit, or somebody wants
> to save code size.
>
> I've submitted more than 400 gcc bug reports, and around 80 of
> them have the keyword missed-optimization. If I had stumbled
> across this with any supported gcc target, I would certainly
> have reported it.
> >> This is actually indicative of a problem with offering a richer
> >> instruction set: The compiler has to make more choices, which are
> >> at first obvious to the compiler writer and which may also change
> >> with model number and time.
> >>
> >> Another point is https://github.com/bagel99/llvm-my66000/issues/1,
> >> which contains the code
> >>
> >> vec r24,{r17}
> >> ldd r20,[r12,r17<<3,-40]
> >> ldd r1,[r12,r17<<3,-48]
> >> ldd r27,[r12,r17<<3,-24]
> >> ldd r18,[r12,r17<<3,-32]
> >>
> >> Eight words for four loads, where
> >>
> >> vec r24,{r17}
> >> la ra,[r12,r17<<3]
> >> ldd r20,[ra,-40]
> >> ldd r1,[ra,-48]
> >> ldd r27,[ra,-24]
> >> ldd r18,[ra,-32]
> >>
> >> would be five. Better? Worse? How should a compiler writer know?
> ><
> > Yes, the compiler writer (Brian) should know, and once again:
> > on a 1-wide machine fetching 4 words per cycle, you would not
> > see the difference in the pipeline, and would be unlikely to see
> > any change in I$ performance.
> Same reasoning as above.
>
> (Actually, the code in question unrolls stuff within vec/loop,
> which I suppose is suboptimal, but that could just be the
> compiler front end already doing this, so I didn't put this
> into the github issue.)
> >>
> >> Or it would have been possible to use "load multiple", loading
> >> the base address [r12,r14<<3,-48] into a register and adjusting
> >> the register allocation so the registers are consecutive.
> ><
> > You are complaining about nuance level stuff on a back end
> > Brian wrote "for fun". In order to use LDM the registers have
> > to be in sequence.
> ... which is what I meant when I wrote "consecutive".
>
> I am not completely sold on load/store multiple. Case in point:
many are
> POWER has lmw (load multiple word) and stmw (store multiple word).
> These instructions have been deprecated as far as possible, in
> two ways: There is no 64-bit version, and they do not exist in
> little-endian mode. They are also microcoded, which is generally
> bad news; on POWER9, for example, they incur a two-cycle
indeed.
> decode startup penalty. AArch64 also dropped the LDM and STM
> instructions, but it does have "load pair" and "store pair".
>
> Did IBM and ARM make a mistake in dropping load/store multiple
> for their 64-bit architectures?
<
It depends on what your goals are !
<
For My 66000, I wanted to be able to perform a complete context switch
in 10-ish cycles. The amount of data <currently> needed to be processed
is 9 DoubleWords of <essentially> PSW+MMU, and 32 DoubleWords of
register file.
<
Given a nearly nano-meter process and 16 cores per die, it makes sense
for the interconnect to be a cache line in width. So this context switch
can run over the interconnect in 5-cycles (borrowing some of the
control portion of Transport.) {I don't really care if data is 512 wires
or 256 wires double-data-rate.}
<
Since a core has to be able to absorb this at interconnect speeds,
I am writing 8 registers per cycle. {for your amusement: the cycle
before I write this data, I read out the current state of whatever
the core is currently doing and ship this back over the interconnect
where somebody logs it into its long terms resting place.}
<
Given a 1-cache-line-per-cycle interconnect, a complete context
switch from a thread in one Guest OS to a different thread in a
different Guest OS (either one can be the actual HyperVisor)
takes 5 cycles.
<
Given ½ cache line per cycle, a complete context switch takes 10
cycles.
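<
As a sanity check on those numbers (just arithmetic; the 512-wire line
and the 9+32 DoubleWords are the figures above, nothing else is assumed):
<
#include <stdio.h>

int main(void)
{
    const int state_dw = 9 + 32;     /* PSW+MMU plus register file, in DoubleWords */
    const int line_dw  = 512 / 64;   /* a 512-wire (64-byte) line = 8 DoubleWords  */

    /* full width: 41/8 = 5.125, i.e. 5-ish cycles with the overflow
       riding on the control portion of Transport                    */
    printf("full width: %.3f cycles\n", (double)state_dw / line_dw);

    /* half width (256 wires, double-data-rate): 10-ish cycles       */
    printf("half width: %.3f cycles\n", (double)state_dw / (line_dw / 2));
    return 0;
}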
<
Because My 66000 implementations change contexts so rapidly,
I inherently have the ability to write or read contiguous registers
4-8 at a time. LDM and STM are not microcoded.
<
I have variants of LDM and STM, called ENTER and EXIT, which push
registers and allocate stack frames, improving code density around
the prologue and epilogue.
<
All of these make use of the wide register paths.
<
So, my goals cause an alteration of the non-CPU side of the architecture,
which puts the CPU side in a position where most of the infrastructure
needed to support high-performance LDM and STM is already mandated.
<
Oh, and BTW there is no concept of VMenter or VMexit. Every context
switch can go from any thread to any thread, within or across multiple
virtual machines--the same 10-cycles.
<
Context switches from a thread in one Guest OS to a thread in a different
Guest OS are currently measured in the 10,000-cycle range.
<
Interrupts are measured in the 1,000-cycle range; these also come down
to 10 cycles.
<
Scheduling of deferred work from an ISR is 5-ish cycles (soft_IRQs
in Linux and DPCs in Windows) without any locking being necessary
or change in priority level.
<
Different goals.
>
> Regarding "complaining": I certainly have the highest respect for
> anybody who singlehandedly puts a back-end on either LLVM or GCC.
> If I had thought every bug submitter for gfortran was complaining,
> I would certainly have dropped out of gfortran a long time ago
> (and gfortran was, until quite recently, a 100% volunteer
> effort).

Re: instruction set binding time, was Encoding 20 and 40 bit

<sv8el3$s2o$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23774&group=comp.arch#23774

 by: Ivan Godard - Thu, 24 Feb 2022 17:18 UTC

On 2/24/2022 8:02 AM, John Dallman wrote:
> In article <sv0i51$gs3$1@dont-email.me>, ivan@millcomputing.com (Ivan
> Godard) wrote:
>
>> The cycle count between the as-scheduled issue's bundle and the
>> retire's bundle, biased to zero, is the deferral count put in the
>> issue instruction. If that count is too big for the encoding then
>> the instruction is changed from using countdown to one using
>> explicit pickup, which has no encoding limit on the gap between
>> issue and retire.
>
> Presumably that's the minimum cycle count? One has to cope with code
> like:
>
> Start load.
> Loop with data-dependent number of iterations.
> Retire load.
>
> John

In general, deferral is only used when it does not cross a CFG join.
Your example is actually the use case which caused the creation of
pickup load forms in the ISA.

In principle a deferred load could be used across a loop with a statically
known number of iterations, but we don't bother in the current
specializer because such a loop will tend to run over the maximal
deferral anyway, just from counting the iterations. Loops so small and
iterated so little that the deferral would fit in the encoding tend to
get unrolled, too, and then deferral gets used if it fits.
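
In source terms, the shape you describe is roughly this (a hedged C
sketch; the names are made up, and the comments only mark where issue
and pickup would sit with pickup-form loads):

long sum_until(const long *p, const long *q, long limit)
{
    /* issue: the load of *p could be started here, well before the
       value is needed                                              */
    long acc = 0;
    while (*q < limit)       /* data-dependent number of iterations */
        acc += *q++;
    /* pickup: the value of *p is first consumed here               */
    return acc + *p;
}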

Re: Hoisting load issue out of functions; was: instruction set binding time

<t1v0ov$mdj$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=24517&group=comp.arch#24517

 by: Paul A. Clayton - Tue, 29 Mar 2022 13:15 UTC

Thank you for reading, and especially for responding to my
post. I apologize for the delay in responding.

I am not entirely sure this response is entirely coherent, as
it was composed at multiple times (with some versions having
been abandoned) and my final reading before posting was not
especially careful — I do feel I have delayed too long
already.

Ivan Godard wrote:
> On 2/20/2022 6:05 PM, Paul A. Clayton wrote:
[snip proposal of caller initiated/callee realized loads]

> Your proposal is one we have explored, though perhaps not far
> enough. I'll explain the difficulties we found, in hopes that you
> might see ways around them.
>
> First, there are presently two ways for a Mill load to retire,
> implicitly via timeout and explicitly via the pickup instruction.
> Each presents its own problems.
>
> An explicit load encodes a tag# argument, an arbitrary small
> constant drawn from a set the size of the set of hardware retire
> stations. It is not a physical RS#, but is mapped to one at issue
> in a way essentially equivalent to a rename register in OOO. The
> corresponding pickup instruction carries the same tag, and the
> mapping says which RS to retire. Each call frame has its own tag
> namespace. The specializer schedule and hardware checks assure
> that loads don't issue an in-use tag, nor pickups try to retire
> one that is not in flight, and so on.

Side comment: I am a little surprised that hardware checks
for trying to use an in-use slot. This seems like something
that could be done straightforwardly in software once. (The
software overhead might be undesirable for JIT compilation.)
Trusting that the compilation system handles this correctly
seems reasonable, though some have proposed invalid markers
> for registers to catch a similar problem of uninitialized
register storage. (A pick-up attempt on an inactive station
would have to error anyway — or perhaps return a not-a-value
result like a permission violating load can though it is not
clear when that would be useful. A pick-up attempt on an
"ordinary", deferred load would be easily recognized anyway.)

[Thinking about returning a NaV, such *might* be useful for
delinquent loads to support runahead or speculating that the
value is not used. In the latter use, resuming execution at
the load acceptance point would only be necessary on wrong
speculation, but both uses would require a means for such
resumption. If NaV state can be tested by software, this
would also provide an "informing load", i.e., a load miss can
be observed by software and an alternative work path used,
but I suspect such is not the best way to architect informing
loads.]

> To use your suggestion with explicit retire would require that
> there be a tag namespace that crossed frame boundaries so a load
> could be loaded in one and retired in a different one. Such a
> notion is very un-Mill because we define interrupts, faults and
> traps as being mere ordinary functions that are involuntarily
> called. As these can occur at any time, a function cannot know
> whose frame is adjacent to its own. This makes for a very easy
> call interface - you can only see your own stuff and your explicit
> arguments - but makes sharing namespaces across frames problematic.

Could the following work? A load marks the load retire station
entry via the load initializing operation/instruction in the
caller as owned by the callee but *held* by the caller. On an
interrupt or exception, held stations would be spilled just
like owned stations. Hardware knows when a 'call' is an
ordinary call and when it is an exception handler or external
interrupt.

This would increase the metadata associated with the load retire
stations and increase the size of the load operation encoding
(especially if the load is allowed to target calls other than
the next call — next call might get most of the benefit, if
there is any, especially since intermediate calls would lose the
use of that station), but it would seem to handle 'involuntary
calls' and call operation bloat (the load retire station
argument is implicit, encoded in the state of the retire
station, more similar to registers).
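
For concreteness, the extra per-station metadata might look roughly
like this (a hedged sketch; the field names and the frame-id
representation are assumptions):

#include <stdbool.h>

typedef struct {
    bool     in_flight;     /* a load is pending in this station              */
    unsigned owner_frame;   /* frame that will pick the value up (the callee) */
    unsigned holder_frame;  /* frame that issued the load (the caller)        */
    /* On an involuntary call (interrupt or exception), stations whose
       holder is the interrupted frame are spilled along with the
       stations that frame owns; value reception is checked against
       the owner, not the holder.                                     */
} retire_station;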

Passing loads to calls other than the next one would also
mean re-issuing the load after each intervening return until
the targeted call or "wasting" a retire station by reserving
it through all intermediate calls. (It is not clear that the
filler engine could be smart enough and be given enough
information to insert the load just-in-time, as if the
initial load instruction was a prefetch into L1 — or perhaps
a small pending-load cache with more capacity than retire
stations but less latency than L1 — and a second issuing of
the load would complete the load. Getting the second issue
to complete such that the first cycle of the load-targeted
function could contain the pick-up seems challenging.)

Since the value is spilled into the caller's storage, which is
accessible to the caller's debug system, this would restrict
the loads to those the caller is allowed to perform.
Effectively, the caller is loading the value and then passing
it as an implicit argument.

The load value reception would be checked against the *owner*
rather than the *holder*, so if the load completed before the
call it could still receive its value.

[Side thought: I wonder if distinguishing "unaliased" loads,
possibly aliased loads, and shared memory loads would be
worthwhile. Loads that are known at compile time not to have
aliases (or possibly even be guarded by a lock for shared
memory) would not need to be checked against store addresses
nor against cache coherence probes. Aliased but "unshared"
loads would only have to check against local stores. I suspect
the complexity and resource utilization imbalance would
argue against such, but it seems wrong to check for an
impossibility (though always checking can be more efficient
than extracting special cases). Itanium's ALAT mechanism seemed
clunky because all advanced loads were cache coherent even if
the storage was thread local. An unaliased load could, in
theory, retire early, though such would only seem to help the
Mill when a load returns and the retire station is spilled —
currently all loads are presumed aliasable and, if I
understand correctly, the spiller spills the address and the
filler reinstalls the load — the load would not have to be
retried after fill as the value was saved and could not have
been altered. This seems likely to be a rare case.]

Presumably a return from the function targeted by the load
without a pick-up operation in that function would free the
reservation (though one might alternatively require such
loads to be 'acknowledged' by the function — not necessarily
dropped on the belt — to clear the reservation).

If I understand the Mill's load mechanism adequately, such
should support cross-call loads without preventing involuntary
calls from properly preserving state or bloating the call
operation.

(Supporting loads that cross permission domains seems likely
to be too challenging. The hardware presumably would not be
able to easily determine what the permission domain will be
at the time of the load — in theory, hardware could scan
ahead and prefetch the permission domain of the callee, but
that seems a bit much for a likely marginal performance
benefit.)

I am conceptually aware that complexity adds to design time
and effort, increases the likelihood of errors — really bad
for hardware and compilers —, and tends toward fragility of
performance (e.g., providing a 10% benefit in a 1% case most
of the time but sometimes causing a 20x slowdown).

> The alternative to sharing is of course argument passing, as you
> suggest: the call instruction could contain load tags from its
> caller tag set, to be picked up by some kind of callee signature
> instruction that maps the argument tags to the callee tag
> namespace, whence a normal pickup can retire them. The problem is
> one of encoding: the call instruction already has a variable
> length list of belt arguments, and this would require a variable
> length list of tag arguments as well. We never found a
> satisfactory way to encode two lists in one instruction.

It is not clear why the load operation could not encode a larger
tag space to include the callee. The retire stations could
be implicit arguments relative to the call operation. This seems
to be what you suggest elsewhere.

> The implicit load method has its own issues. For a load to timeout
> in a callee it must be counting in that callee. However there may
> be other calls to irrelevant functions between the load issue and
> the intended target call, and the load should not count during
> those. Note that the intervening non-counting calls may include
> interrupts and exceptions that are not statically schedulable.
> Consequently, a call must be able to indicate which in-flight
> loads should continue counting across the call - and we are back
> in the encode-two-lists problem again.


Re: Hoisting load issue out of functions; was: instruction set

<memo.20220329171528.1928L@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=24518&group=comp.arch#24518

 by: John Dallman - Tue, 29 Mar 2022 16:15 UTC

In article <t1v0ov$mdj$1@dont-email.me>, paaronclayton@gmail.com (Paul A.
Clayton) wrote:

> Side comment: I am a little surprised that hardware checks
> for trying to use an in-use slot. This seems like something
> that could be done straightforwardly in software once. (The
> software overhead might be undesirable for JIT compilation.)
> Trusting that the compilation system handles this correctly
> seems reasonable . . .

Think about the difficulty of trying to figure out what's wrong when the
software that does that malfunctions. This looks harder with Mill than a
conventional register architecture, because everything's moving. It isn't
easy with conventional architectures.

As someone who's done a fair bit of bringing up application software on
immature toolchains, hardware checks are very welcome.

John

Re: Hoisting load issue out of functions; was: instruction set binding time

<t1vd2g$u0u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=24522&group=comp.arch#24522

 by: Ivan Godard - Tue, 29 Mar 2022 16:45 UTC

On 3/29/2022 6:15 AM, Paul A. Clayton wrote:
> Thank you for reading, and especially for responding to my
> post. I apologize for the delay in responding.
>
> I am not entirely sure this response is entirely coherent, as
> it was composed at multiple times (with some versions having
> been abandoned) and my final reading before posting was not
> especially careful — I do feel I have delayed too long
> already.

Always welcome whenever :-)

After some thought I think that "load for him" (LFH) loads using the
deferral mechanism are impractical; they would tie the compiler's
instruction scheduler into knots and the resulting bad code would
obviate any gains.

LFH using the pickup form looks more likely, though as noted there are
encoding issues that would cause some bloat. However, it looks like a
fairly straightforward implementation can load for the immediate callee
and also for nested callees and across domain boundaries.

> Ivan Godard wrote:
> > On 2/20/2022 6:05 PM, Paul A. Clayton wrote:
> [snip proposal of caller initiated/callee realized loads]
>
>> Your proposal is one we have explored, though perhaps not far enough.
>> I'll explain the difficulties we found, in hopes that you might see
>> ways around them.
>>
>> First, there are presently two ways for a Mill load to retire,
>> implicitly via timeout and explicitly via the pickup instruction. Each
>> presents its own problems.
>>
>> An explicit load encodes a tag# argument, an arbitrary small constant
>> drawn from a set the size of the set of hardware retire stations. It
>> is not a physical RS#, but is mapped to one at issue in a way
>> essentially equivalent to a rename register in OOO. The corresponding
>> pickup instruction carries the same tag, and the mapping says which RS
>> to retire. Each call frame has its own tag namespace. The specializer
>> schedule and hardware checks assure that loads don't issue an in-use
>> tag, nor pickups try to retire one that is not in flight, and so on.
>
> Side comment: I am a little surprised that hardware checks
> for trying to use an in-use slot. This seems like something
> that could be done straightforwardly in software once. (The
> software overhead might be undesirable for JIT compilation.)
> Trusting that the compilation system handles this correctly
> seems reasonable, though some have proposed invalid markers
> for registers to catch a similar problem of uninitialized
> register storage. (A pick-up attempt on an inactive station
> would have to error anyway — or perhaps return a not-a-value
> result like a permission violating load can though it is not
> clear when that would be useful. A pick-up attempt on an
> "ordinary", deferred load would be easily recognized anyway.)

By policy the Mill micro-architecture checks for and faults every
nonsensical thing the program does. Mill is extremely focused on RAS
issues, and every bit of uncaught nonsense multiplies the attack
surface. In addition, as different Mill versions might display different
behavior in response to uncaught nonsense, leaving something unchecked
invites customer howls of "but my program worked on your prior chip!",
and we have no wish to impose the curse of bug compatibility on our
corporate future.

> [Thinking about returning a NaV, such *might* be useful for
> delinquent loads to support runahead or speculating that the
> value is not used. In the latter use, resuming execution at
> the load acceptance point would only be necessary on wrong
> speculation, but both uses would require a means for such
> resumption. If NaV state can be tested by software, this
> would also provide an "informing load", i.e., a load miss can
> be observed by software and an alternative work path used,
> but I suspect such is not the best way to architect informing
> loads.]

That's actually what happens when a load hits a protection violation:
you get a NaR at load-retire time, whether timeout or pickup. But
screwing up the station addressing gets you an immediate fault. The
difference is dynamic vs. static: an untaken speculative load could be
to a bad address, but speculative or no the code should never issue two
loads to the same station, nor pickup without a preceding load.

A NaR'd load that is on the taken path will get a NaR fault when it is
used non-speculatively, but one on an untaken speculative path will
never get used (untaken, after all) and so will harmlessly fall off the
belt.
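
As a toy model of that distinction (a hedged sketch, not the actual
Mill semantics; the names are made up):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { bool nar; long v; } belt_val;   /* value plus NaR metadata */

/* a speculative load that hits a protection violation does not fault;
   it just yields a NaR-tagged value                                   */
static belt_val spec_load(const long *p, bool prot_ok)
{
    belt_val r = { !prot_ok, prot_ok ? *p : 0 };
    return r;
}

/* the fault is taken only when a NaR is consumed non-speculatively;
   a NaR on an untaken path is simply never passed here and falls
   off the belt harmlessly                                            */
static long consume(belt_val x)
{
    if (x.nar) {
        fprintf(stderr, "NaR fault on non-speculative use\n");
        exit(EXIT_FAILURE);
    }
    return x.v;
}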

>> To use your suggestion with explicit retire would require that there
>> be a tag namespace that crossed frame boundaries so a load could be
>> loaded in one and retired in a different one. Such a notion is very
>> un-Mill because we define interrupts, faults and traps as being mere
>> ordinary functions that are involuntarily called. As these can occur
>> at any time, a function cannot know whose frame is adjacent to its
>> own. This makes for a very easy call interface - you can only see your
>> own stuff and your explicit arguments - but makes sharing namespaces
>> across frames problematic.
>
> Could the following work? A load marks the load retire station
> entry via the load initializing operation/instruction in the
> caller as owned by the callee but *held* by the caller. On an
> interrupt or exception, held stations would be spilled just
> like owned stations. Hardware knows when a 'call' is an
> ordinary call and when it is an exception handler or external
> interrupt.
>
> This would increase the metadata associated with the load retire
> stations and increase the size of the load operation encoding
> (especially if the load is allowed to target calls other than
> the next call — next call might get most of the benefit, if
> there is any, especially since intermediate calls would lose the
> use of that station), but it would seem to handle 'involuntary
> calls' and call operation bloat (the load retire station
> argument is implicit, encoded in the state of the retire
> station, more similar to registers).
>
> Passing loads to calls other than the next one would also
> mean re-issuing the load after each intervening return until
> the targeted call or "wasting" a retire station by reserving
> it through all intermediate calls. (It is not clear that the
> filler engine could be smart enough and be given enough
> information to insert the load just-in-time, as if the
> initial load instruction was a prefetch into L1 — or perhaps
> a small pending-load cache with more capacity than retire
> stations but less latency than L1 — and a second issuing of
> the load would complete the load. Getting the second issue
> to complete such that the first cycle of the load-targeted
> function could contain the pick-up seems challenging.)
>
> Since the value is spilled into the caller's storage, which is
> accessible to the caller's debug system, this would restrict
> the loads to those the caller is allowed to perform.
> Effectively, the caller is loading the value and then passing
> it as an implicit argument.
>
> The load value reception would be checked against the *owner*
> rather than the *holder*, so if the load completed before the
> call it could still receive its value.

I don't think that an owner marker is necessary if tags are explicitly
passed as part of the signature. And if they are not then I don't see a
way to match load issue (in caller) with load retire (in callee).

> [Side thought: I wonder if distinguishing "unaliased" loads,
> possibly aliased loads, and shared memory loads would be
> worthwhile. Loads that are known at compile time not to have
> aliases (or possibly even be guarded by a lock for shared
> memory) would not need to be checked against store addresses
> nor against cache coherence probes. Aliased but "unshared"
> loads would only have to check against local stores. I suspect
> the complexity and resource utilization imbalance would
> argue against such, but it seems wrong to check for an
> impossibility (though always checking can be more efficient
> than extracting special cases). Itanium's ALAT mechanism seemed
> clunky because all advanced loads were cache coherent even if
> the storage was thread local. An unaliased load could, in
> theory, retire early, though such would only seem to help the
> Mill when a load returns and the retire station is spilled —
> currently all loads are presumed aliasable and, if I
> understand correctly, the spiller spills the address and the
> filler reinstalls the load — the load would not have to be
> retried after fill as the value was saved and could not have
> been altered. This seems likely to be a rare case.]

Aliasing is an issue even in monocores where there is no cache coherency
necessary. It's simplest just to have the stations snoop.


Re: Hoisting load issue out of functions; was: instruction set

<t1vg7v$p9n$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=24523&group=comp.arch#24523

 by: Paul A. Clayton - Tue, 29 Mar 2022 17:39 UTC

John Dallman wrote:
> In article <t1v0ov$mdj$1@dont-email.me>, paaronclayton@gmail.com (Paul A.
> Clayton) wrote:
>
>> Side comment: I am a little surprised that hardware checks
>> for trying to use an in-use slot. This seems like something
>> that could be done straightforwardly in software once. (The
>> software overhead might be undesirable for JIT compilation.)
>> Trusting that the compilation system handles this correctly
>> seems reasonable . . .
>
> Think about the difficulty of trying to figure out what's wrong when the
> software that does that malfunctions. This looks harder with Mill than a
> conventional register architecture, because everything's moving. It isn't
> easy with conventional architectures.
>
> As someone who's done a fair bit of bringing up application software on
> immature toolchains, hardware checks are very welcome.

I can appreciate a "belt and suspenders" approach as an
engineering choice. In this case, it is probably extremely
inexpensive.

I also see your point that such would truly help in
system bring up. (Your Itanium bug comments are a little
disheartening, but appreciated as observations of the
real world.)
