Re: instruction set binding time, was Encoding 20 and 40 bit

<suoiqc$8h4$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23672&group=comp.arch#23672

 by: Ivan Godard - Fri, 18 Feb 2022 16:51 UTC

On 2/18/2022 8:42 AM, Paul A. Clayton wrote:
> On Friday, February 18, 2022 at 9:27:14 AM UTC-5, Ivan Godard wrote:
> [snip]
>> Still,
>> in principle an OOO can issue a load from inside a called body before
>> the call itself has been issued, and a Mill can't do that.
>
> In theory a caller could start a pick up load that is picked up by the callee. This would be similar to function specific interfaces rather than using a generic ABI. It is not obvious how useful this would be.

The architecture does not permit that - each frame has a distinct set of
load retire stations. While that could be changed, it's not clear that,
given the limited applicability to calls that are statically
resolvable (i.e. not polymorphic), the game would be worth the
candle. That case can be addressed without architecture change using
out-lining, too.

>
>> Put all this together and the Mill ISA has eliminated all the OOO
>> advantages except: OOO will excel on 1) frequently called 2) polymorphic
>> functions that 3) immediately execute a load whose address arguments are
>> 4) ready and that 5) misses when 6) the hardware is not bandwidth
>> limited. For that case a Mill will stall in the callee and an OOO likely
>> will not.
>
> I am not convinced this is the case, but writing on Google groups from a tablet is painful.
>
>> We decided that this case was not worth 12x.
>
> I doubt the Mill will get a 12x PPA advantage over an OoO implementation of a conventional ISA like AArch64.

12x is not my number (see Mitch), and in any case we have given up some
of whatever it is to pay for width.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23673&group=comp.arch#23673

 by: MitchAlsup - Fri, 18 Feb 2022 17:18 UTC

On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
> On Friday, February 18, 2022 at 12:32:05 AM UTC-7, Anton Ertl wrote:
> > BGB <cr8...@gmail.com> writes:
>
> > >People will also have a lot more time to try to work out the "magic
> > >compiler" issues, ...
>
> > We have had from the appearance of microcode in the 1960s through
> > Metaflow and Cydrome in the 1980s until people lost faith in IA-64
> > around 2010 to work them out. It's unlikely that a breakthrough will
> > be found later.
> This is an issue that continues to confuse me.
>
> There is no way that a compiler can take the place of out-of-order
> circuitry for allowing a computer to deal with cache misses. These aren't
> predictable.
<
Actually, they are predictable in HW.
<
Say we have a loop that does 2 LDDs and one STD: and we run the loop
"lots" of times;
LDD[1] takes a miss every 8 cycles in iteration MOD 3
LDD[2] takes a miss every 8 cycles in iteration MOD 6
ST....... takes a miss every 8 cycles in iteration MOD 1
<
HW can handle this perfectly after a couple of 16+ loops, and it does not
have to be OoO HW to do this. It could be detected between L1<->L2
and just schedule 3 prefetches every 8×latency-of-loop.
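
A minimal sketch of the kind of periodic-miss detection described above, reading "iteration MOD 3" as one miss every three loop iterations (an assumption). The class name `MissPeriodPredictor`, its confirmation threshold, and its interface are hypothetical and only illustrate the idea; they do not describe any real prefetcher.

```python
class MissPeriodPredictor:
    """Toy model: learn the gap (in loop iterations) between misses of one load."""

    def __init__(self, confirmations=2):
        self.confirmations = confirmations  # repeats needed before trusting a gap
        self.last_miss = None               # iteration of the most recent miss
        self.period = None                  # candidate gap between misses
        self.seen = 0                       # times the candidate gap has repeated

    def record_miss(self, iteration):
        if self.last_miss is not None:
            gap = iteration - self.last_miss
            if gap == self.period:
                self.seen += 1
            else:
                # New or changed gap: start counting again.
                self.period, self.seen = gap, 1
        self.last_miss = iteration

    def next_miss(self):
        """Predicted iteration of the next miss, or None if not yet confident."""
        if self.period is None or self.seen < self.confirmations:
            return None
        return self.last_miss + self.period


# A load that misses once every 3 iterations (hypothetical trace).
p = MissPeriodPredictor()
for it in (0, 3, 6):
    p.record_miss(it)
print(p.next_miss())  # 9
```

Once the gap is confirmed, the prefetch for the predicted iteration could be issued ahead of time without any out-of-order machinery in the core.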
>
> But when it comes to register hazards, _that_ part of what OoO does,
> I would have thought, can of course be fully matched by changing the
> rename registers to explicit registers, and having the compiler allocate
> them and schedule the instructions accordingly. So I would have thought
> this was already a solved problem.
>
> If the burden of circuitry of OoO is very heavy, therefore, I would have
> thought that the burden of making instructions wider, of having more
> registers to save and restore, and so on, would be well worth it. But
<
History indicates a lot of people put a lot of investment here and failed.
<
> maybe it isn't, since memory bandwidth is a big issue in its own right
> as well.
>
> What I _suspect_, though, is that the main reason the choice is usually
> made to go with OoO rather than to go with big register files... is because
> of what happened with RISC. Going from 8 or 16 registers to 32 registers
> was supposed to be an example of this approach. And at one level of
> hardware complexity and performance, it was.
<
The ancient statistics showed going from 8->16 registers was fairly big,
going from 16->32 was modest, and going from 32->64 was negligible, on compiled
code using the pre-MIPS-company architecture (Stanford, very early in the RISC era).
>
> But quickly, as more transistors became available, and more performance
> was desired, we got OoO implementations of RISC architectures. Along
> came Itanium - and even I quickly saw that this design was built around
> one particular generation of microelectronics, and it would soon have to
> be replaced with something even bigger and hence incompatible.
>
> Whereas, if you just take a popular CISC architecture, and implement it
> with the aid of OoO execution, nobody has to change all their programs, and
> everyone is happy.
>
> So the issue isn't that doing away with OoO by explicitly allocating registers
> is technically infeasible, or technically inferior. You might be able to get
> better performance at lower cost that way. But as long as transistor counts
> keep going up, doing it that way means you keep having to change your ISA,
> whereas using OoO means you don't.
>
> In that case, the end of Moore's Law doesn't give us "time to solve the
> compiler problem" - that problem is already solved. Instead, it means we
> can finally stop at 256 registers or however many, as the next generation of
> silicon isn't going to come along shortly and make the ISA obsolete.
>
> So that's how I see the end of Moore's Law making VLIW designs possible
> once again.
>
> John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit

<a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23674&group=comp.arch#23674

 by: MitchAlsup - Fri, 18 Feb 2022 17:32 UTC

On Friday, February 18, 2022 at 3:57:35 AM UTC-6, Anton Ertl wrote:
> Ivan Godard <iv...@millcomputing.com> writes:
> >EPIC hardware assumed that there are no intra-bundle hazards,
>

> >This makes the Mill suitable for general-purpose work that would cause a
> >classic VLIW to spend too much time in stall, yet still take advantage
> >of any schedule variability to do other work while (possibly) waiting
> >for external data, and without needing OOO retire hazard hardware.
>
> I assume you mean in-order readiness checking (absent in the MIPS
> R2000 load, which gave the architecture its name), sometimes called a
> scoreboard (although Mitch Alsup tells us that that name is
> inappropriate because the original CDC-6600 scoreboard was a more
> sophisticated feature with some OoO potential). The question is how
> expensive this is. They could afford this feature in the earliest
> RISCs, such as ARM and SPARC, so the cost cannot be that high.
<
A Scoreboard does not start a calculation until all RAW hazards disappear
{that is, its operands are ready}.
A Scoreboard does not finish a calculation until all WAR hazards disappear
{that is, all consumers of the previous value have gotten that value}.
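
The two rules above can be sketched as a pair of checks. This is only an illustration under stated assumptions: the dictionary layout for an operation, the `pending_writes` set, and the `pending_reads` counter are hypothetical bookkeeping, not the CDC 6600's actual structures.

```python
def can_start(op, pending_writes):
    """Start rule: every source register must have no outstanding producer."""
    return all(src not in pending_writes for src in op["srcs"])


def can_finish(op, pending_reads):
    """Finish rule: nobody may still be waiting to read the old value of dst
    (a write-after-read check)."""
    return pending_reads.get(op["dst"], 0) == 0


# Hypothetical state: r1 is still being produced; two older ops have yet to read r3.
pending_writes = {"r1"}
pending_reads = {"r3": 2}

add = {"dst": "r3", "srcs": ["r1", "r2"]}
print(can_start(add, pending_writes))   # False: r1 is not ready yet
pending_writes.discard("r1")
print(can_start(add, pending_writes))   # True: operands are ready
print(can_finish(add, pending_reads))   # False: old r3 still has readers
```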
<
Expense:
<
The entire CDC 6600 scoreboard can be implemented in about 2½K gates,
and it managed 10 function units. Had the scoreboard had to manage a flat
32-entry register file, this number would have been about 15K gates. A scoreboard
is quadratic in register management; 3 8-entry files, with the A file only being
used by increment, make its control tiny. One 32-entry file has 32×32 hazard
points; the 6600 only had 7×7 + 7×7 + 8×8 = 192 compared to 1024.
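
To make the quadratic scaling concrete, here is a small sketch that assumes hazard points grow as the square of the entries in each independently tracked file. The function name `hazard_points` is hypothetical, and the split case is modeled simply as three 8-entry files.

```python
def hazard_points(file_sizes):
    """Hazard points when each register file is tracked on its own."""
    return sum(n * n for n in file_sizes)


flat = hazard_points([32])        # one flat 32-entry file
split = hazard_points([8, 8, 8])  # three independent 8-entry files
print(flat, split)  # 1024 192
```

Splitting the registers into small files keeps the comparison matrix several times smaller than a flat file holding more registers in total would need.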
<
One Tomasulo station costs 1.5K gates. 360/91 had ~12 of them.
>
> >The gain from Mill's split load is limited by the maximal gap (in time)
> >between the load-issue and load-retire instructions, which is in turn
> >determined by how much work is available to do that neither depends on
> >the load result nor is depended on by the load issue. That's easily
> >determined by conventional dataflow analysis in the compiler, and is
> >exactly the same as the amount of work that an OOO can do while waiting
> >for a load to retire.
>
> Do you mean an in-order superscalar? An OoO superscalar often has
> fetched and decoded hundreds of instructions ahead, many of which have
> already been processed and advanced to the retire queue, many of which
> are blocked by dependencies on the load or other non-ready
> instructions, but there are also often a number of instructions that
> have just become ready (e.g., because they depended on the same
> instruction as the load, or because they depended on an instruction
> that produced its result in the same cycle as the last parent of the
> load), and which may be from places that a compiler cannot easily
> schedule (e.g., from behind a polymorphic method call).
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suoplm$np5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23677&group=comp.arch#23677

 by: BGB - Fri, 18 Feb 2022 18:48 UTC

On 2/18/2022 11:18 AM, MitchAlsup wrote:
> On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
>> On Friday, February 18, 2022 at 12:32:05 AM UTC-7, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>
>>>> People will also have a lot more time to try to work out the "magic
>>>> compiler" issues, ...
>>
>>> We have had from the appearance of microcode in the 1960s through
>>> Metaflow and Cydrome in the 1980s until people lost faith in IA-64
>>> around 2010 to work them out. It's unlikely that a breakthrough will
>>> be found later.
>> This is an issue that continues to confuse me.
>>
>> There is no way that a compiler can take the place of out-of-order
>> circuitry for allowing a computer to deal with cache misses. These aren't
>> predictable.
> <
> Actually, they are predictable in HW.
> <
> Say we have a loop that does 2 LDDs and one STD: and we run the loop
> "lots" of times;
> LDD[1] takes a miss every 8 cycles in iteration MOD 3
> LDD[2] takes a miss every 8 cycles in iteration MOD 6
> ST....... takes a miss every 8 cycles in iteration MOD 1
> <
> HW can handle this perfectly after a couple of 16+ loops, and it does not
> have to be OoO HW to do this. It could be detected between L1<->L2
> and just schedule 3 prefetches every 8×latency-of-loop.
>>
>> But when it comes to register hazards, _that_ part of what OoO does,
>> I would have thought, can of course be fully matched by changing the
>> rename registers to explicit registers, and having the compiler allocate
>> them and schedule the instructions accordingly. So I would have thought
>> this was already a solved problem.
>>
>> If the burden of circuitry of OoO is very heavy, therefore, I would have
>> thought that the burden of making instructions wider, of having more
>> registers to save and restore, and so on, would be well worth it. But
> <
> History indicates a lot of people put a lot of investment here and failed.
> <
>> maybe it isn't, since memory bandwidth is a big issue in its own right
>> as well.
>>
>> What I _suspect_, though, is that the main reason the choice is usually
>> made to go with OoO rather than to go with big register files... is because
>> of what happened with RISC. Going from 8 or 16 registers to 32 registers
>> was supposed to be an example of this approach. And at one level of
>> hardware complexity and performance, it was.
> <
> The ancient statistics showed going from 8->16 registers was fairly big,
> going from 16->32 was modest, and going from 32->64 was negligible, on compiled
> code using the pre-MIPS-company architecture (Stanford, very early in the RISC era).

My testing would mostly agree, but I have found a partial edge case:
This assumes that one is not allocating the registers as even-pairs;
Even-pair allocation nearly doubles the register pressure, so now 32
registers is back to behaving like 16, adding incentive to move to 64 to
compensate.
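
A rough way to see the effect, as a sketch only: assume each value needs either one register or an even pair, and ignore alignment fragmentation. The function `effective_registers` and the `pair_fraction` parameter are hypothetical names for illustration.

```python
def effective_registers(num_regs, pair_fraction):
    """Approximate number of values that fit at once, given the share needing pairs."""
    avg_regs_per_value = 1 + pair_fraction  # 1 for singles, 2 for pairs
    return num_regs / avg_regs_per_value


print(effective_registers(32, 0.0))  # 32.0: every value uses one register
print(effective_registers(32, 1.0))  # 16.0: every value uses a pair
print(effective_registers(64, 1.0))  # 32.0: 64 registers restore the headroom
```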

But, yeah, if one value uses one register, then 32 seems to be mostly
sufficient.

Granted, one could argue that pair-allocating registers so that they can
use a 64-bit ISA for working with 128-bit pointers is "kinda stupid" but
alas...

I am not sure of the long-term effect of this experiment. I kinda expect
it will suck badly enough that I can throw it in a closet and forget
about it, but this is also what I thought about the "expand the address
space to 96 bits" idea...

Though, maybe piling on random crazy ideas over time isn't great in the
sense that it results in the design getting a bit messy.

....

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<4b98d22b-eba8-4201-99af-f459b99dc350n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23678&group=comp.arch#23678

Newsgroups: comp.arch
X-Received: by 2002:a7b:c4d2:0:b0:37b:b47c:4a5f with SMTP id g18-20020a7bc4d2000000b0037bb47c4a5fmr12067511wmk.102.1645211387606;
Fri, 18 Feb 2022 11:09:47 -0800 (PST)
X-Received: by 2002:a05:6808:13cb:b0:2d3:7f26:1c52 with SMTP id
d11-20020a05680813cb00b002d37f261c52mr4126645oiw.309.1645211387076; Fri, 18
Feb 2022 11:09:47 -0800 (PST)
Path: i2pn2.org!i2pn.org!news.swapon.de!news.uzoreto.com!feeder1.cambriumusenet.nl!feed.tweak.nl!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 11:09:46 -0800 (PST)
In-Reply-To: <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:fce0:6dc:f860:f3de;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:fce0:6dc:f860:f3de
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com> <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4b98d22b-eba8-4201-99af-f459b99dc350n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Fri, 18 Feb 2022 19:09:47 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: Quadibloc - Fri, 18 Feb 2022 19:09 UTC

On Friday, February 18, 2022 at 10:18:16 AM UTC-7, MitchAlsup wrote:
> On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:

> > There is no way that a compiler can take the place of out-of-order
> > circuitry for allowing a computer to deal with cache misses. These aren't
> > predictable.
> <
> Actually, they are predictable in HW.
> <
> Say we have a loop that does 2 LDDs and one STD: and we run the loop
> "lots" of times;
> LDD[1] takes a miss every 8 cycles in iteration MOD 3
> LDD[2] takes a miss every 8 cycles in iteration MOD 6
> ST....... takes a miss every 8 cycles in iteration MOD 1
> <
> HW can handle this perfectly after a couple of 16+ loops, and it does not
> have to be OoO HW to do this. It could be detected between L1<->L2
> and just schedule a 3 prefetches every 8×latency-of-loop.

Of course, being able to predict it using techniques
similar to those used in branch prediction when the program
is running... doesn't change my point that it isn't predictable
long before, when the compiler is generating machine code
from source code.
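
As a rough illustration of run-time prediction of this kind, here is a
minimal sketch of a per-load stride detector in C. It is not the mechanism
described in the quoted text (which sits between L1 and L2 and tracks how
often each access misses); the StrideEntry layout, the stride_observe name,
and the confidence threshold are all assumptions made for the example. The
table is trained on miss addresses, which only exist once the program runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical stride table, e.g. one per load instruction. */
typedef struct {
    uint64_t last_addr;   /* address of the previous miss            */
    int64_t  stride;      /* last observed distance between misses   */
    int      confidence;  /* how many times that distance repeated   */
} StrideEntry;

/* Record a miss. Returns true and sets *prefetch_addr once the same
   non-zero stride has repeated at least `threshold` times. */
bool stride_observe(StrideEntry *e, uint64_t miss_addr, int threshold,
                    uint64_t *prefetch_addr)
{
    assert(e != NULL && prefetch_addr != NULL);

    int64_t delta = (int64_t)(miss_addr - e->last_addr);
    if (delta != 0 && delta == e->stride) {
        if (e->confidence < threshold)
            e->confidence++;          /* same stride again: gain confidence */
    } else {
        e->stride = delta;            /* new pattern: start over */
        e->confidence = 0;
    }
    e->last_addr = miss_addr;

    if (e->confidence >= threshold) {
        *prefetch_addr = miss_addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}
```

A caller would keep one StrideEntry per tracked load and issue a prefetch
whenever stride_observe returns true.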

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit

<jwvo834vvpx.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23679&group=comp.arch#23679

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Fri, 18 Feb 2022 14:16:20 -0500
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <jwvo834vvpx.fsf-monnier+comp.arch@gnu.org>
References: <sufpjo$oij$1@dont-email.me>
<memo.20220215170019.7708M@jgd.cix.co.uk> <suh8bp$ce8$1@dont-email.me>
<2022Feb18.100825@mips.complang.tuwien.ac.at>
<suoabu$rfk$1@dont-email.me>
<5276f69f-07bc-4082-bb5b-c371d059d403n@googlegroups.com>
<suoiqc$8h4$2@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="74cd02b4954a3d78e51042887bce9f0e";
logging-data="3373"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19SNsj5I+5781JM9T7WB5L+"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:ksnIN/MaP1fGOrxsvsqtlX2IZsI=
sha1:FkICZzAVNDkOLYq6BfqT06JoVOg=
 by: Stefan Monnier - Fri, 18 Feb 2022 19:16 UTC

>> In theory a caller could start a pick up load that is picked up by the
>> callee. This would be similar to function specific interfaces rather than
>> using a generic ABI. It is not obvious how useful this would be.
> The architecture does not permit that - each frame has a distinct set of
> load retire stations. While that could be changed, it's not clear that,
> given the limited applicability to calls which were statically resolvable
> (i.e. no polymorphic), the the game would be worth the candle.

Such calls are very common in some languages, and it seems like it would
be fairly easy to add a simple analysis to the compiler to annotate
functions with ad-hoc "in-flight loads" that the callers need
to provide.

Intuitively I'd expect it could be quite useful in languages with many
small functions, since it would allow accesses to an argument's fields
to start before entering the function, where there might otherwise not
be much to do for the first few cycles.
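
A source-level analogy of the idea, as a sketch only: C cannot express
load retire stations or in-flight loads, so the example below approximates
a specialised per-function convention by having the caller perform the
field load and pass the loaded value. Node, is_match, is_match_preloaded,
and count_matches are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int key;
    struct Node *next;
} Node;

/* Generic convention: the callee issues the load of n->key itself,
   so the load cannot start until the call has been entered. */
int is_match(const Node *n, int wanted)
{
    assert(n != NULL);
    return n->key == wanted;
}

/* Specialised convention: the caller has already loaded the field. */
int is_match_preloaded(int key, int wanted)
{
    return key == wanted;
}

int count_matches(const Node *head, int wanted)
{
    int count = 0;
    for (const Node *n = head; n != NULL; n = n->next) {
        int key = n->key;   /* load issued in the caller, ahead of the call */
        count += is_match_preloaded(key, wanted);
    }
    return count;
}
```

In real hardware the benefit would come from the load overlapping the call
sequence; in C the two forms are only semantically equivalent.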

> That case can be addressed without architecture change using
> out-lining, too.

At least on the compiler side, it's much easier to accommodate
specialized per-function calling conventions than to use outlining
or inlining.

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit

<428ab85d-1a1a-4e58-a582-e9501d9435c5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23680&group=comp.arch#23680

Newsgroups: comp.arch
X-Received: by 2002:adf:facd:0:b0:1e7:ceff:7695 with SMTP id a13-20020adffacd000000b001e7ceff7695mr7460297wrs.656.1645211822546;
Fri, 18 Feb 2022 11:17:02 -0800 (PST)
X-Received: by 2002:a9d:7981:0:b0:5ad:146b:a416 with SMTP id
h1-20020a9d7981000000b005ad146ba416mr2925914otm.142.1645211821875; Fri, 18
Feb 2022 11:17:01 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 11:17:01 -0800 (PST)
In-Reply-To: <a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:fce0:6dc:f860:f3de;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:fce0:6dc:f860:f3de
References: <sufpjo$oij$1@dont-email.me> <memo.20220215170019.7708M@jgd.cix.co.uk>
<suh8bp$ce8$1@dont-email.me> <2022Feb18.100825@mips.complang.tuwien.ac.at> <a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <428ab85d-1a1a-4e58-a582-e9501d9435c5n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Fri, 18 Feb 2022 19:17:02 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: Quadibloc - Fri, 18 Feb 2022 19:17 UTC

On Friday, February 18, 2022 at 10:32:18 AM UTC-7, MitchAlsup wrote:

> Expense:
> <
> The entire CDC 6600 scoreboard can be implemented in about 2½K gates
> managed 10 function units. Had the scoreboard have to manage 32 flat
> register file this number would have been about 15K gates. A scoreboard
> is quadratic in register management, 3 8-entry files, the A file only being
> used by increment, makes its control tiny. 1-32 entry file has 32×32 hazard
> points, the 6600 only had 7×7 + 7×7 + 8×8 = 192 compared to 1024.
> <
> One Tomasulo station costs 1.5K gates. 360/91 had ~12 of them.

Useful information!

While the CDC 6600 had 2,500 gates of overhead, and the 360/91
had 18,000 gates of overhead, then, and a modern OoO computer would
need more (the 360 architecture just had four floating-point registers,
and modern OoO processors, unlike the Pentium II, wish to apply OoO
techniques to integer processing as well)...

Instead of being an argument for using the CDC 6600 scoreboard, this
seems to be an argument for OoO, since its costs are so modest!
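
The arithmetic behind the quoted figures can be checked with a few lines
of C. This is only a sketch of the counting rule stated in the quote (each
register file of size n contributes n×n hazard points); the function names
are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hazard points for a scoreboard: each register file of size n
   contributes n*n comparison points (the quadratic cost in the quote). */
int hazard_points(const int *file_sizes, size_t nfiles)
{
    assert(file_sizes != NULL);
    int total = 0;
    for (size_t i = 0; i < nfiles; i++)
        total += file_sizes[i] * file_sizes[i];
    return total;
}

/* Total gates for a set of identical reservation stations. */
int station_gates(int stations, int gates_per_station)
{
    assert(stations >= 0 && gates_per_station >= 0);
    return stations * gates_per_station;
}
```

With file sizes {7, 7, 8} the sum comes to 162, while three full 8-entry
files give 192, so the quoted "= 192" does not match the terms as written;
which of the two was meant is not clear from the post. A flat 32-entry file
gives 1024, and 12 stations at 1.5K gates each gives the 18,000 figure
used above.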

And, of course, they _are_ - depending on how ambitious one is. But
the goals of today's "great big out-of-order" processors, as you call them,
are, of course, anything but modest.

Given that they've gone well past the point of diminishing returns on the
one hand, and parallel computing just doesn't seem to work for programmers
on the other, it seems that we're stuck with no good solutions.

Of course, that's only true when one's goals are not modest. People can
still design processors for smartphones without feeling guilty, desperate,
or confused.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suorhk$9eh$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23681&group=comp.arch#23681

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Fri, 18 Feb 2022 19:20:20 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suorhk$9eh$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
Injection-Date: Fri, 18 Feb 2022 19:20:20 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="9681"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 18 Feb 2022 19:20 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> The ancient statistics showed going from 8->16 registers was fairly big
> going from 16-32 was modest, going from 32-64 was negligible:: on compiled
> code using pre-MIPS-company architecture. Stanford very early in RISC era.

I've been told that these statistics exist, but I didn't read the
original papers (and would like a pointer to them). It would be
interesting to read details such as which code they used for
benchmarks.

Most RISC architectures have 32 floating point registers in addition
to their 32 integer registers, so a total number of 64 registers
is not outlandish. Your designs are pretty unique in combining
these two. VVM eliminates the need for unrolling inner loops,
so at least My 66000 can get by with less.

Re: instruction set binding time, was Encoding 20 and 40 bit

<eGSPJ.14154$Icha.5393@fx11.iad>

https://www.novabbs.com/devel/article-flat.php?id=23682&group=comp.arch#23682

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx11.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
References: <sufpjo$oij$1@dont-email.me> <memo.20220215170019.7708M@jgd.cix.co.uk> <suh8bp$ce8$1@dont-email.me> <2022Feb18.100825@mips.complang.tuwien.ac.at> <a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com>
In-Reply-To: <a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 47
Message-ID: <eGSPJ.14154$Icha.5393@fx11.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Fri, 18 Feb 2022 19:44:10 UTC
Date: Fri, 18 Feb 2022 14:44:00 -0500
X-Received-Bytes: 3368
 by: EricP - Fri, 18 Feb 2022 19:44 UTC

MitchAlsup wrote:
> On Friday, February 18, 2022 at 3:57:35 AM UTC-6, Anton Ertl wrote:
>> Ivan Godard <iv...@millcomputing.com> writes:
>>> EPIC hardware assumed that there are no intra-bundle hazards,
>
>>> This makes the Mill suitable for general-purpose work that would cause a
>>> classic VLIW to spend too much time in stall, yet still take advantage
>>> of any schedule variability to do other work while (possibly) waiting
>>> for external data, and without needing OOO retire hazard hardware.
>> I assume you mean in-order readyness checking (absent in the MIPS
>> R2000 load, which gave the architecture its name), sometimes called a
>> scoreboard (although Mitch Alsup tells us that that name is
>> inappropriate because the original CDC-6600 scoreboard was a more
>> sophisticated feature with some OoO potential). The question is how
>> expensive this is. They could afford this feature in the earliest
>> RISCs, such as ARM and SPARC, so the cost cannot be that high.
> <
> A Scoreboard does not start a calculation until all RAW hazards disappear
> {that is its operands are ready}
> A Scoreboard does not finish a calculation until all RAW hazards disappear.
> {that is all consumers of the previous value have gotten that value}
^WAR^
> <
> Expense:
> <
> The entire CDC 6600 scoreboard can be implemented in about 2½K gates
> managed 10 function units. Had the scoreboard have to manage 32 flat
> register file this number would have been about 15K gates. A scoreboard
> is quadratic in register management, 3 8-entry files, the A file only being
> used by increment, makes its control tiny. 1-32 entry file has 32×32 hazard
> points, the 6600 only had 7×7 + 7×7 + 8×8 = 192 compared to 1024.
> <
> One Tomasulo station costs 1.5K gates. 360/91 had ~12 of them.

But the CDC6600 scoreboard allowed writeback to *different* registers
to be performed out of order, making interrupts imprecise.

To make interrupts precise, wouldn't it need an extra matrix to impose
program order on writebacks? This creates a WAW dependency chain to
*different* registers based on program order but one that stalls at
writeback, whereas normal WAW to the same register stalls at issue.

Then having multiple write ports allows multiple writebacks to
*different* registers at once, allowing it to catch up after a bubble.
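
A toy model of that ordering rule, as a sketch under assumptions: results
may finish executing in any order, but writeback only advances from the
oldest unretired instruction, and a per-cycle port count limits how many
retire at once. WritebackQueue, WINDOW, and writeback_cycle are invented
for the example and do not describe the CDC 6600 or any real design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WINDOW 8

/* done[i] is true once instruction i (in program order) has finished
   executing. head is the oldest instruction not yet written back. */
typedef struct {
    bool done[WINDOW];
    int  head;
    int  count;   /* number of instructions issued so far */
} WritebackQueue;

/* One cycle of writeback: retire up to `ports` finished instructions,
   strictly in program order. Returns how many were written back. */
int writeback_cycle(WritebackQueue *q, int ports)
{
    assert(q != NULL && ports > 0 && q->count <= WINDOW);

    int retired = 0;
    while (retired < ports && q->head < q->count && q->done[q->head]) {
        q->head++;      /* oldest instruction is complete: write it back */
        retired++;
    }
    return retired;     /* 0 means the oldest result is still pending */
}
```

If instruction 0 is slow while 1..3 have finished, nothing retires until 0
completes; with two ports the backlog then drains in two cycles instead of
four, which is the catch-up effect described above.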

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sup2sr$j53$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23683&group=comp.arch#23683

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Fri, 18 Feb 2022 15:25:45 -0600
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <sup2sr$j53$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Feb 2022 21:25:47 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="54262f9b599f1e919b67f458cfb7a066";
logging-data="19619"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+ug2G/UIeEPJBEoIOMdR9M"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:CET61V1oQpr9clUjF4FjvYXTXeA=
In-Reply-To: <suorhk$9eh$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: BGB - Fri, 18 Feb 2022 21:25 UTC

On 2/18/2022 1:20 PM, Thomas Koenig wrote:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>
>> The ancient statistics showed going from 8->16 registers was fairly big
>> going from 16-32 was modest, going from 32-64 was negligible:: on compiled
>> code using pre-MIPS-company architecture. Stanford very early in RISC era.
>
> I've been told that these statistics exist, but I didn't read the
> original papers (and would like a pointer to them). It would be
> interesting to read details such as which code they used for
> benchmarks.
>
> Most RISC architectures have 32 floating point registers in addition
> to their 32 integer registers, so a total number of 64 registers
> is not outlandish. Your designs are pretty unique in combining
> these two. VVM eliminates the need for unrolling inner loops,
> so at least My 66000 can get by with less.

I ended up merging them as well...

At first, this was for SIMD:
32 GPRs, used for ALU or SIMD
16 FPRs, for the FPU

Then I ended up eliminating the FPRs in favor of merging FPU ops into
the GPR space.

This was also the end of the dedicated FDIV instruction; early forms of
the FPU had FDIV and FSQRT, but they were dropped because they were
"not free" and offered no real speed advantage over doing them in
software. In both cases, variants of Newton-Raphson were used, and doing
it in the FPU doesn't make the N-R converge any faster (the other
overheads not significantly contributing to clock-cycle cost).
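
For reference, a minimal sketch of software division via the textbook
Newton-Raphson reciprocal iteration x <- x * (2 - d*x). This is not the
code used in the design described above; the nr_divide name, the linear
initial estimate, and the range reduction are assumptions for illustration.

```c
#include <assert.h>

/* Approximate n / d for finite d > 0 using Newton-Raphson on 1/d. */
double nr_divide(double n, double d, int iterations)
{
    assert(d > 0.0 && iterations >= 0);

    /* Range-reduce d into [0.5, 1) using exact powers of two. */
    double scale = 1.0;
    while (d >= 1.0) { d *= 0.5; scale *= 0.5; }
    while (d < 0.5)  { d *= 2.0; scale *= 2.0; }

    /* Linear initial estimate of 1/d on [0.5, 1); the two constants
       are folded at compile time, so no run-time divide is needed. */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * d;

    /* Each step roughly doubles the number of correct bits. */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);

    return n * x * scale;
}
```

The convergence rate is fixed by the iteration itself, which is consistent
with the point that moving it into hardware would not make it converge in
fewer steps.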

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<ygn1qzzzs9p.fsf@y.z>

https://www.novabbs.com/devel/article-flat.php?id=23684&group=comp.arch#23684

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!nntp.club.cc.cmu.edu!5.161.45.24.MISMATCH!2.us.feeder.erje.net!feeder.erje.net!border1.nntp.dca1.giganews.com!nntp.giganews.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx34.iad.POSTED!not-for-mail
From: x...@y.z (Josh Vanderhoof)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me>
<2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)
Reply-To: Josh Vanderhoof <jlv@mxsimulator.com>
Message-ID: <ygn1qzzzs9p.fsf@y.z>
Cancel-Lock: sha1:xMYgnMkOnHuQHCcKakh/lA2AJac=
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Lines: 41
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Fri, 18 Feb 2022 23:09:24 UTC
Date: Fri, 18 Feb 2022 18:09:22 -0500
X-Received-Bytes: 2723
X-Original-Bytes: 2672
 by: Josh Vanderhoof - Fri, 18 Feb 2022 23:09 UTC

MitchAlsup <MitchAlsup@aol.com> writes:

> On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
>> There is no way that a compiler can take the place of out-of-order
>> circuitry for allowing a computer to deal with cache misses. These aren't
>> predictable.
> <
> Actually, they are predictable in HW.
> <
> Say we have a loop that does 2 LDDs and one STD: and we run the loop
> "lots" of times;
> LDD[1] takes a miss every 8 cycles in iteration MOD 3
> LDD[2] takes a miss every 8 cycles in iteration MOD 6
> ST....... takes a miss every 8 cycles in iteration MOD 1
> <
> HW can handle this perfectly after a couple of 16+ loops, and it does not
> have to be OoO HW to do this. It could be detected between L1<->L2
> and just schedule a 3 prefetches every 8×latency-of-loop.

Say you have a loop with unpredictable cache misses like this:

    for (i = 0; i < n; i++)
        d[i] = f(s[get_addr(i)]);

If you had an "in_cache()" instruction, would a transformation like this
be a win?

// run loop for everything in cache, skip and prefetch stuff that missed
    for (i = 0, j = 0; i < n; i++) {
        p = get_addr(i);
        if (in_cache(&s[p]))
            d[i] = f(s[p]);
        else {
            prefetch(&s[p]);
            skipped[j++] = i;
        }
    }

// finish entries that missed cache
    for (i = 0; i < j; i++)
        d[skipped[i]] = f(s[get_addr(skipped[i])]);
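
To experiment with the shape of this transformation without such an
instruction, here is a self-contained sketch in which the cache probe is
passed in as a callback. Everything beyond the loop structure is an
assumption: an index array stands in for get_addr(), stub_in_cache is an
arbitrary stand-in (no real cache is queried), and __builtin_prefetch is
the GCC/Clang builtin rather than a portable call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache probe, supplied by the caller. */
typedef int (*in_cache_fn)(const void *addr);

/* Computes d[i] = f(s[idx[i]]) for all i. Entries whose source "hits"
   are processed at once; the rest are prefetched and finished in a
   second pass. `skipped` must have room for n indices. */
void two_pass_gather(int *d, const int *s, const size_t *idx, size_t n,
                     int (*f)(int), in_cache_fn in_cache, size_t *skipped)
{
    assert(d && s && idx && f && in_cache && skipped);

    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        const int *p = &s[idx[i]];
        if (in_cache(p)) {
            d[i] = f(*p);
        } else {
            __builtin_prefetch(p);   /* start the fetch, defer the work */
            skipped[j++] = i;
        }
    }

    /* Finish the entries that were deferred. */
    for (size_t k = 0; k < j; k++)
        d[skipped[k]] = f(s[idx[skipped[k]]]);
}

/* Stand-in probe: treats every other int-sized slot as "cached". */
int stub_in_cache(const void *addr)
{
    return (((uintptr_t)addr >> 2) & 1) == 0;
}

int square(int x) { return x * x; }
```

The result is the same whatever the probe answers; only the order of the
work changes, so any win would have to come from overlapping the prefetches
with the first pass.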

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23687&group=comp.arch#23687

Newsgroups: comp.arch
X-Received: by 2002:a5d:6d8d:0:b0:1e3:3de4:e0e6 with SMTP id l13-20020a5d6d8d000000b001e33de4e0e6mr7724233wrs.159.1645230097439;
Fri, 18 Feb 2022 16:21:37 -0800 (PST)
X-Received: by 2002:a05:6870:a987:b0:d3:3505:4b8e with SMTP id
ep7-20020a056870a98700b000d335054b8emr5315563oab.88.1645230096954; Fri, 18
Feb 2022 16:21:36 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 16:21:36 -0800 (PST)
In-Reply-To: <suorhk$9eh$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:807a:9c3e:6af0:c510;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:807a:9c3e:6af0:c510
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com> <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 00:21:37 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 19 Feb 2022 00:21 UTC

On Friday, February 18, 2022 at 1:20:24 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > The ancient statistics showed going from 8->16 registers was fairly big
> > going from 16-32 was modest, going from 32-64 was negligible:: on compiled
> > code using pre-MIPS-company architecture. Stanford very early in RISC era.
> I've been told that these statistics exist, but I didn't read the
> original papers (and would like a pointer to them). It would be
> interesting to read details such as which code they used for
> benchmarks.
>
> Most RISC architectures have 32 floating point registers in addition
> to their 32 integer registers, so a total number of 64 registers
> is not outlandish. Your designs are pretty unique in combining
> these two. VVM eliminates the need for unrolling inner loops,
> so at least My 66000 can get by with less.
<
Also note: The C version of Livermore loops compiles without
overflowing the 32 registers--only one program needed more than
~23 registers.
<
LLVM front-end, Brian's back-end; probably set to -O3

Re: instruction set binding time, was Encoding 20 and 40 bit

<ea314472-0547-49f3-9403-4e0ff7b3636dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23688&group=comp.arch#23688

Newsgroups: comp.arch
X-Received: by 2002:adf:f284:0:b0:1e3:2576:245 with SMTP id k4-20020adff284000000b001e325760245mr7729938wro.529.1645230355677;
Fri, 18 Feb 2022 16:25:55 -0800 (PST)
X-Received: by 2002:a05:6808:1208:b0:2d4:419d:8463 with SMTP id
a8-20020a056808120800b002d4419d8463mr4468554oil.227.1645230355148; Fri, 18
Feb 2022 16:25:55 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 16:25:54 -0800 (PST)
In-Reply-To: <eGSPJ.14154$Icha.5393@fx11.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:807a:9c3e:6af0:c510;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:807a:9c3e:6af0:c510
References: <sufpjo$oij$1@dont-email.me> <memo.20220215170019.7708M@jgd.cix.co.uk>
<suh8bp$ce8$1@dont-email.me> <2022Feb18.100825@mips.complang.tuwien.ac.at>
<a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com> <eGSPJ.14154$Icha.5393@fx11.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ea314472-0547-49f3-9403-4e0ff7b3636dn@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 00:25:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sat, 19 Feb 2022 00:25 UTC

On Friday, February 18, 2022 at 1:44:13 PM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Friday, February 18, 2022 at 3:57:35 AM UTC-6, Anton Ertl wrote:
> >> Ivan Godard <iv...@millcomputing.com> writes:
> >>> EPIC hardware assumed that there are no intra-bundle hazards,
> >
> >>> This makes the Mill suitable for general-purpose work that would cause a
> >>> classic VLIW to spend too much time in stall, yet still take advantage
> >>> of any schedule variability to do other work while (possibly) waiting
> >>> for external data, and without needing OOO retire hazard hardware.
> >> I assume you mean in-order readyness checking (absent in the MIPS
> >> R2000 load, which gave the architecture its name), sometimes called a
> >> scoreboard (although Mitch Alsup tells us that that name is
> >> inappropriate because the original CDC-6600 scoreboard was a more
> >> sophisticated feature with some OoO potential). The question is how
> >> expensive this is. They could afford this feature in the earliest
> >> RISCs, such as ARM and SPARC, so the cost cannot be that high.
> > <
> > A Scoreboard does not start a calculation until all RAW hazards disappear
> > {that is its operands are ready}
> > A Scoreboard does not finish a calculation until all RAW hazards disappear.
> > {that is all consumers of the previous value have gotten that value}
> ^WAR^
Missed that one.
> > <
> > Expense:
> > <
> > The entire CDC 6600 scoreboard can be implemented in about 2½K gates
> > managed 10 function units. Had the scoreboard have to manage 32 flat
> > register file this number would have been about 15K gates. A scoreboard
> > is quadratic in register management, 3 8-entry files, the A file only being
> > used by increment, makes its control tiny. 1-32 entry file has 32×32 hazard
> > points, the 6600 only had 7×7 + 7×7 + 8×8 = 192 compared to 1024.
> > <
> > One Tomasulo station costs 1.5K gates. 360/91 had ~12 of them.
<
> But the CDC6600 scoreboard allowed writeback to *different* registers
> to be performed out of order, making interrupts imprecise.
<
The different-registers problem can be renamed away: you track register
dependencies in rename space rather than in architectural space.
>
> To make interrupts precise, wouldn't it need an extra matrix to impose
> program order on writebacks? This creates a WAW dependency chain to
> *different* registers based on program order but one that stalls at
> writeback, whereas normal WAW to the same register stalls at issue.
<
I think Luke was doing this.
>
> Then having multiple write ports allows multiple writebacks to
> *different* registers at once, allowing it to catch up after a bubble.
<
Colloquially known as "catch up bandwidth"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<7b6e4266-27bf-4e2a-b474-609cfb801841n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23689&group=comp.arch#23689

Newsgroups: comp.arch
X-Received: by 2002:adf:fa82:0:b0:1e6:34fe:9bf with SMTP id h2-20020adffa82000000b001e634fe09bfmr7379650wrr.43.1645230555729;
Fri, 18 Feb 2022 16:29:15 -0800 (PST)
X-Received: by 2002:a05:6808:13cb:b0:2d3:7f26:1c52 with SMTP id
d11-20020a05680813cb00b002d37f261c52mr4600798oiw.309.1645230555196; Fri, 18
Feb 2022 16:29:15 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 16:29:15 -0800 (PST)
In-Reply-To: <sup2sr$j53$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:807a:9c3e:6af0:c510;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:807a:9c3e:6af0:c510
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com> <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de> <sup2sr$j53$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7b6e4266-27bf-4e2a-b474-609cfb801841n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 00:29:15 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 19 Feb 2022 00:29 UTC

On Friday, February 18, 2022 at 3:25:51 PM UTC-6, BGB wrote:
> On 2/18/2022 1:20 PM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >
> >> The ancient statistics showed going from 8->16 registers was fairly big
> >> going from 16-32 was modest, going from 32-64 was negligible:: on compiled
> >> code using pre-MIPS-company architecture. Stanford very early in RISC era.
> >
> > I've been told that these statistics exist, but I didn't read the
> > original papers (and would like a pointer to them). It would be
> > interesting to read details such as which code they used for
> > benchmarks.
> >
> > Most RISC architectures have 32 floating point registers in addition
> > to their 32 integer registers, so a total number of 64 registers
> > is not outlandish. Your designs are pretty unique in combining
> > these two. VVM eliminates the need for unrolling inner loops,
> > so at least My 66000 can get by with less.
> I ended up merging them as well...
>
> At first, this was for SIMD:
> 32 GPRs, used for ALU or SIMD
> 16 FPRs, for the FPU
>
> Then I ended up eliminating the FPRs in favor of merging FPU ops into
> the GPR space.
>
> This was also the end of the existence of a dedicated FDIV instruction
> as well; early forms of the FPU had FDIV and FSQRT, but they were
> dropped due to being both "not free" nor offering any real speed
> advantage over doing them in software. In both cases, variants of
> Newton-Raphson were used, and doing it in the FPU doesn't make the N-R
> converge any faster (the other overheads not significantly contributing
> to clock-cycle cost).
<
You can get the iterations started faster, and do the first n-1 iterations using
Goldschmidt finishing with a Newton-Raphson iteration.
<
Goldschmidt uses 2 multiplications that are not dependent on one
another. NR uses 2 multiplies that are dependent on one another.
So if your FPU is pipelined, doing it in HW is faster. Whether your particular
use of your HW gains anything is unknown.
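
As a rough illustration of the dependency difference described above, here is a minimal C sketch of a reciprocal computed both ways. It is only a software model under stated assumptions, not anyone's actual FPU design: the function names, the step counts, and the assumption that d has already been scaled into [0.5, 1) are all made up for the example. The final Newton-Raphson step uses the original divisor, which is the usual reason for finishing that way, since Goldschmidt does not correct its own accumulated rounding error.

```c
/* Goldschmidt reciprocal: each step multiplies both x and d by f = 2 - d.
 * The two multiplies in a step are independent of each other, so a
 * pipelined multiplier can overlap them.
 * Assumption: d has already been scaled into [0.5, 1). */
static double recip_goldschmidt(double d, int steps)
{
    double x = 1.0;                 /* running approximation of 1/d0 */
    for (int i = 0; i < steps; i++) {
        double f = 2.0 - d;
        x = x * f;                  /* independent of the multiply below */
        d = d * f;                  /* d converges toward 1.0 */
    }
    return x;
}

/* One Newton-Raphson step for 1/d: the second multiply needs the result
 * of the first, so the two cannot overlap. */
static double recip_newton_step(double d, double x)
{
    double t = d * x;               /* first multiply */
    return x * (2.0 - t);           /* second multiply depends on t */
}

/* Hybrid: Goldschmidt iterations first, then one Newton-Raphson step
 * against the original divisor. */
static double recip_hybrid(double d, int goldschmidt_steps)
{
    double x = recip_goldschmidt(d, goldschmidt_steps);
    return recip_newton_step(d, x);
}
```

For example, recip_hybrid(0.75, 5) should land within rounding error of 4/3. Whether the overlap buys anything in a given design still depends on the multiplier pipeline, as noted above.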

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23690&group=comp.arch#23690

Newsgroups: comp.arch
X-Received: by 2002:adf:f001:0:b0:1e4:b7b1:87c1 with SMTP id j1-20020adff001000000b001e4b7b187c1mr8057809wro.238.1645230891019;
Fri, 18 Feb 2022 16:34:51 -0800 (PST)
X-Received: by 2002:a4a:4442:0:b0:2fb:74c9:8960 with SMTP id
o63-20020a4a4442000000b002fb74c98960mr3086677ooa.61.1645230890512; Fri, 18
Feb 2022 16:34:50 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 16:34:50 -0800 (PST)
In-Reply-To: <ygn1qzzzs9p.fsf@y.z>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:807a:9c3e:6af0:c510;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:807a:9c3e:6af0:c510
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com> <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<ygn1qzzzs9p.fsf@y.z>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 00:34:51 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sat, 19 Feb 2022 00:34 UTC

On Friday, February 18, 2022 at 5:09:29 PM UTC-6, Josh Vanderhoof wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>
> > On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
> >> There is no way that a compiler can take the place of out-of-order
> >> circuitry for allowing a computer to deal with cache misses. These aren't
> >> predictable.
> > <
> > Actually, they are predictable in HW.
> > <
> > Say we have a loop that does 2 LDDs and one STD: and we run the loop
> > "lots" of times;
> > LDD[1] takes a miss every 8 cycles in iteration MOD 3
> > LDD[2] takes a miss every 8 cycles in iteration MOD 6
> > ST....... takes a miss every 8 cycles in iteration MOD 1
> > <
> > HW can handle this perfectly after a couple of 16+ loops, and it does not
> > have to be OoO HW to do this. It could be detected between L1<->L2
> > and just schedule a 3 prefetches every 8×latency-of-loop.
> Say you have a loop with unpredictable cache misses like this:
>
> for (i = 0; i < n; i++)
> d[i] = f(s[get_addr(i)]);
<
MOV Ri,#0
for_loop:
LDD R1,[Rsp+Ri<<2]
CALL f
ADD Ri,Ri,#1
CMP Rt,Ri,Rn
BLT for_loop
<
did you mean to call a function ?
>
> If you had an "in_cache()" instruction, would a transformation like this
> be a win?
>
> // run loop for everything in cache, skip and prefetch stuff that missed
> for (i = 0, j = 0; i < n; i++) {
> p = get_addr(i);
> if (in_cache(&s[p]))
> d[i] = f(s[p]);
> else {
> prefetch(&s[p]);
> skipped[j++] = i;
> }
> }
>
> // finish entries that missed cache
> for (i = 0; i < j; i++)
> d[skipped[i]] = f(s[get_addr(skipped[i])]);
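
The two-pass transformation quoted above can be tried out in plain C if the hypothetical in_cache() instruction is replaced by a software stand-in. The sketch below does that with a toy direct-mapped tag array. Everything in it beyond the loop structure is an assumption made for illustration: the 64-byte line size, NUM_SETS, the tags/valid arrays, the get_addr() hash and the body of f(). It shows only that the transformed loop produces the same results as the original. It says nothing about whether the transformation is a win on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 6            /* assumed 64-byte cache lines */
#define NUM_SETS   64           /* toy direct-mapped cache model */

static uintptr_t     tags[NUM_SETS];
static unsigned char valid[NUM_SETS];

/* Stand-in for the hypothetical in_cache() instruction: probe only, no fill. */
static int in_cache(const void *p)
{
    uintptr_t line = (uintptr_t)p >> LINE_SHIFT;
    size_t set = line % NUM_SETS;
    return valid[set] && tags[set] == line;
}

/* Stand-in for prefetch(): in this model the line arrives immediately. */
static void prefetch(const void *p)
{
    uintptr_t line = (uintptr_t)p >> LINE_SHIFT;
    size_t set = line % NUM_SETS;
    tags[set] = line;
    valid[set] = 1;
}

/* Placeholder for a hard-to-predict index into a 4096-element array. */
static size_t get_addr(size_t i) { return (i * 7919u) % 4096u; }

/* Placeholder for the per-element work. */
static int f(int x) { return x + 1; }

/* Pass 1 handles everything that hits and prefetches the rest;
 * pass 2 finishes the deferred entries. Returns the number deferred. */
static size_t two_pass(int *d, const int *s, size_t n, size_t *skipped)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        size_t p = get_addr(i);
        if (in_cache(&s[p])) {
            d[i] = f(s[p]);
        } else {
            prefetch(&s[p]);
            skipped[j++] = i;
        }
    }
    for (size_t k = 0; k < j; k++)
        d[skipped[k]] = f(s[get_addr(skipped[k])]);
    return j;
}
```

Note that the skipped[] array needs room for n entries in the worst case, and that the second pass changes the order in which d[] is written, so f() must not depend on earlier results.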

Re: instruction set binding time, was Encoding 20 and 40 bit

<supq9d$edu$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23693&group=comp.arch#23693

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Fri, 18 Feb 2022 20:05:02 -0800
Organization: A noiseless patient Spider
Lines: 58
Message-ID: <supq9d$edu$1@dont-email.me>
References: <sufpjo$oij$1@dont-email.me>
<memo.20220215170019.7708M@jgd.cix.co.uk> <suh8bp$ce8$1@dont-email.me>
<2022Feb18.100825@mips.complang.tuwien.ac.at>
<a83bf69a-6b33-48e9-9b28-aad99bc2b18fn@googlegroups.com>
<eGSPJ.14154$Icha.5393@fx11.iad>
<ea314472-0547-49f3-9403-4e0ff7b3636dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 19 Feb 2022 04:05:01 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="9d0cd1e73d36d89907f9bb4ca3240064";
logging-data="14782"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/WMgZ4kb2ErnqRkE8jmqzq"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:acHrKBWlxUICkHw39mPgshFWEJQ=
In-Reply-To: <ea314472-0547-49f3-9403-4e0ff7b3636dn@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Sat, 19 Feb 2022 04:05 UTC

On 2/18/2022 4:25 PM, MitchAlsup wrote:
> On Friday, February 18, 2022 at 1:44:13 PM UTC-6, EricP wrote:
>> MitchAlsup wrote:
>>> On Friday, February 18, 2022 at 3:57:35 AM UTC-6, Anton Ertl wrote:
>>>> Ivan Godard <iv...@millcomputing.com> writes:
>>>>> EPIC hardware assumed that there are no intra-bundle hazards,
>>>
>>>>> This makes the Mill suitable for general-purpose work that would cause a
>>>>> classic VLIW to spend too much time in stall, yet still take advantage
>>>>> of any schedule variability to do other work while (possibly) waiting
>>>>> for external data, and without needing OOO retire hazard hardware.
>>>> I assume you mean in-order readyness checking (absent in the MIPS
>>>> R2000 load, which gave the architecture its name), sometimes called a
>>>> scoreboard (although Mitch Alsup tells us that that name is
>>>> inappropriate because the original CDC-6600 scoreboard was a more
>>>> sophisticated feature with some OoO potential). The question is how
>>>> expensive this is. They could afford this feature in the earliest
>>>> RISCs, such as ARM and SPARC, so the cost cannot be that high.
>>> <
>>> A Scoreboard does not start a calculation until all RAW hazards disappear
>>> {that is its operands are ready}
>>> A Scoreboard does not finish a calculation until all RAW hazards disappear.
>>> {that is all consumers of the previous value have gotten that value}
>> ^WAR^
> Missed that one.
>>> <
>>> Expense:
>>> <
>>> The entire CDC 6600 scoreboard can be implemented in about 2½K gates
>>> managed 10 function units. Had the scoreboard have to manage 32 flat
>>> register file this number would have been about 15K gates. A scoreboard
>>> is quadratic in register management, 3 8-entry files, the A file only being
>>> used by increment, makes its control tiny. 1-32 entry file has 32×32 hazard
>>> points, the 6600 only had 7×7 + 7×7 + 8×8 = 192 compared to 1024.
>>> <
>>> One Tomasulo station costs 1.5K gates. 360/91 had ~12 of them.
> <
>> But the CDC6600 scoreboard allowed writeback to *different* registers
>> to be performed out of order, making interrupts imprecise.
> <
> Different Registers problem that can be renamed away. So you track register
> dependencies in rename space rather than in architectural space.
>>
>> To make interrupts precise, wouldn't it need an extra matrix to impose
>> program order on writebacks? This creates a WAW dependency chain to
>> *different* registers based on program order but one that stalls at
>> writeback, whereas normal WAW to the same register stalls at issue.
> <
> I think Luke was doing this.

What happened to Luke? - haven't seen him here in a donkey's age.

>>
>> Then having multiple write ports allows multiple writebacks to
>> *different* registers at once, allowing it to catch up after a bubble.
> <
> Colloquially known as "catch up bandwidth"
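
The quadratic cost argument quoted above is easy to check with a few lines of C. This is just the arithmetic, with a made-up helper name. One flat 32-entry file gives 32×32 = 1024 hazard points, and three 8-entry files give 3×64 = 192. The expression as quoted, 7×7 + 7×7 + 8×8, actually sums to 162, so either the terms or the total appear to have been mistyped in the thread.

```c
/* Hazard points for a scoreboard, assuming (as in the quoted text) that
 * each register file contributes the square of its number of entries. */
static unsigned hazard_points(const unsigned *file_sizes, unsigned num_files)
{
    unsigned total = 0;
    for (unsigned i = 0; i < num_files; i++)
        total += file_sizes[i] * file_sizes[i];
    return total;
}
```

Called with {32} this gives 1024, with {8, 8, 8} it gives 192, and with {7, 7, 8} it gives 162.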

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suqd52$8ur$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23695&group=comp.arch#23695

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Sat, 19 Feb 2022 09:26:58 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suqd52$8ur$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de>
<65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>
Injection-Date: Sat, 19 Feb 2022 09:26:58 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="9179"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sat, 19 Feb 2022 09:26 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Friday, February 18, 2022 at 1:20:24 PM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>> > The ancient statistics showed going from 8->16 registers was fairly big
>> > going from 16-32 was modest, going from 32-64 was negligible:: on compiled
>> > code using pre-MIPS-company architecture. Stanford very early in RISC era.
>> I've been told that these statistics exist, but I didn't read the
>> original papers (and would like a pointer to them). It would be
>> interesting to read details such as which code they used for
>> benchmarks.
>>
>> Most RISC architectures have 32 floating point registers in addition
>> to their 32 integer registers, so a total number of 64 registers
>> is not outlandish. Your designs are pretty unique in combining
>> these two. VVM eliminates the need for unrolling inner loops,
>> so at least My 66000 can get by with less.
><
> Also note: The C version of Livermore loops compiles without
> overflowing the 32 registers--only one program needed more
> ~23 registers.

It would be really interesting to see what your architecture makes
of more typical Fortran code like the Polyhedron benchmarks.

The Livermore loops are a few decades old and not really
representative of more modern scientific workloads.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suqu7t$jhp$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23697&group=comp.arch#23697

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Sat, 19 Feb 2022 14:18:37 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <suqu7t$jhp$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de>
<65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>
<suqd52$8ur$1@newsreader4.netcologne.de>
Injection-Date: Sat, 19 Feb 2022 14:18:37 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="20025"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sat, 19 Feb 2022 14:18 UTC

Thomas Koenig <tkoenig@netcologne.de> schrieb:
> MitchAlsup <MitchAlsup@aol.com> schrieb:
>> On Friday, February 18, 2022 at 1:20:24 PM UTC-6, Thomas Koenig wrote:
>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> > The ancient statistics showed going from 8->16 registers was fairly big
>>> > going from 16-32 was modest, going from 32-64 was negligible:: on compiled
>>> > code using pre-MIPS-company architecture. Stanford very early in RISC era.
>>> I've been told that these statistics exist, but I didn't read the
>>> original papers (and would like a pointer to them). It would be
>>> interesting to read details such as which code they used for
>>> benchmarks.
>>>
>>> Most RISC architectures have 32 floating point registers in addition
>>> to their 32 integer registers, so a total number of 64 registers
>>> is not outlandish. Your designs are pretty unique in combining
>>> these two. VVM eliminates the need for unrolling inner loops,
>>> so at least My 66000 can get by with less.
>><
>> Also note: The C version of Livermore loops compiles without
>> overflowing the 32 registers--only one program needed more
>> ~23 registers.
>
> It would be really interesting to see what your architecture makes
> of more typical Fortran code like the Polyhedron benchmarks.

To be more precise, I tried compiling a random sample of the
Polyhedron benchmarks - fatigue2.f90 with flang v7 to llvm code,
then used llc to translate it to My 66000 via a compiler built
from the source at https://github.com/bagel99/llvm-my66000.git .

Then I hit an ICE (not entirely unexpected, if this is the
first time anybody has tried flang with that :-)

VVMLoopPass: read_input_m_
Unknown f64 condition code!
UNREACHABLE executed at /home/tkoenig/llvm-my66000/llvm/lib/Target/My66000/My66000ISelLowering.cpp:343!
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: llc -O3 --march=my66000 --enable-vvm fatigue2.ll
1. Running pass 'Function Pass Manager' on module 'fatigue2.ll'.
2. Running pass 'My66000 DAG->DAG Pattern Instruction Selection' on function '@read_input_m_read_input_'
#0 0x0000000014b1b4b0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/tkoenig/llvm-my66000/llvm/lib/Support/Unix/Signals.inc:565:22

[...]

I did not find a way to raise an issue at github, or I would have
posted the information there.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<88c877c2-5674-430c-94bc-bb77f353c9c9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23699&group=comp.arch#23699

Newsgroups: comp.arch
X-Received: by 2002:a7b:cb44:0:b0:37c:4e2d:3bb2 with SMTP id v4-20020a7bcb44000000b0037c4e2d3bb2mr14963238wmj.96.1645288687296;
Sat, 19 Feb 2022 08:38:07 -0800 (PST)
X-Received: by 2002:a05:6870:b017:b0:ce:c0c9:673 with SMTP id
y23-20020a056870b01700b000cec0c90673mr5862645oae.197.1645288686796; Sat, 19
Feb 2022 08:38:06 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!3.eu.feeder.erje.net!feeder.erje.net!fdn.fr!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 19 Feb 2022 08:38:06 -0800 (PST)
In-Reply-To: <suqu7t$jhp$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:b1ac:886e:a458:f180;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:b1ac:886e:a458:f180
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com> <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de> <65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>
<suqd52$8ur$1@newsreader4.netcologne.de> <suqu7t$jhp$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <88c877c2-5674-430c-94bc-bb77f353c9c9n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 19 Feb 2022 16:38:07 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 19 Feb 2022 16:38 UTC

On Saturday, February 19, 2022 at 8:18:40 AM UTC-6, Thomas Koenig wrote:
> Thomas Koenig <tko...@netcologne.de> schrieb:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >> On Friday, February 18, 2022 at 1:20:24 PM UTC-6, Thomas Koenig wrote:
> >>> MitchAlsup <Mitch...@aol.com> schrieb:
> >>> > The ancient statistics showed going from 8->16 registers was fairly big
> >>> > going from 16-32 was modest, going from 32-64 was negligible:: on compiled
> >>> > code using pre-MIPS-company architecture. Stanford very early in RISC era.
> >>> I've been told that these statistics exist, but I didn't read the
> >>> original papers (and would like a pointer to them). It would be
> >>> interesting to read details such as which code they used for
> >>> benchmarks.
> >>>
> >>> Most RISC architectures have 32 floating point registers in addition
> >>> to their 32 integer registers, so a total number of 64 registers
> >>> is not outlandish. Your designs are pretty unique in combining
> >>> these two. VVM eliminates the need for unrolling inner loops,
> >>> so at least My 66000 can get by with less.
> >><
> >> Also note: The C version of Livermore loops compiles without
> >> overflowing the 32 registers--only one program needed more
> >> ~23 registers.
> >
> > It would be really interesting to see what your architecture makes
> > of more typical Fortran code like the Polyhedron benchmarks.
> To be more precise, I tried compiling a random sample of the
> Polyhedron benchmarks - fatigue2.f90 with flang v7 to llvm code,
> then used llc to translate it to My 66000 via a compiler built
> from the source at https://github.com/bagel99/llvm-my66000.git .
>
> When I hit an ICE (not entirely unexpected, if this is the
> first time anybody tried flang with that :-)
>
> VVMLoopPass: read_input_m_
> Unknown f64 condition code!
> UNREACHABLE executed at /home/tkoenig/llvm-my66000/llvm/lib/Target/My66000/My66000ISelLowering.cpp:343!
> PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
> Stack dump:
> 0. Program arguments: llc -O3 --march=my66000 --enable-vvm fatigue2.ll
> 1. Running pass 'Function Pass Manager' on module 'fatigue2.ll'.
> 2. Running pass 'My66000 DAG->DAG Pattern Instruction Selection' on function '@read_input_m_read_input_'
> #0 0x0000000014b1b4b0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/tkoenig/llvm-my66000/llvm/lib/Support/Unix/Signals.inc:565:22
>
> [...]
>
> I did not find a way to raise an issue at github, or I would have
> posted the information there.
<
To my knowledge, Brian has C working at a pretty good level, but
he has not done Fortran. The failure probably has to do with LLVM IR
constructs that C does not need or use.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sur7qi$p3k$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23700&group=comp.arch#23700

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: bage...@gmail.com (Brian G. Lucas)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Sat, 19 Feb 2022 11:02:09 -0600
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <sur7qi$p3k$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de>
<65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>
<suqd52$8ur$1@newsreader4.netcologne.de>
<suqu7t$jhp$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Feb 2022 17:02:10 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7de4eb8947cefe7d146b5036e2c07aa4";
logging-data="25716"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/gMtevdK3zikGG/+r/0Bkf"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:mEeHIThXyAkjEPixzkNbgbWdxhE=
In-Reply-To: <suqu7t$jhp$1@newsreader4.netcologne.de>
Content-Language: en-US
 by: Brian G. Lucas - Sat, 19 Feb 2022 17:02 UTC

On 2/19/22 08:18, Thomas Koenig wrote:
.... snip ...
> To be more precise, I tried compiling a random sample of the
> Polyhedron benchmarks - fatigue2.f90 with flang v7 to llvm code,
> then used llc to translate it to My 66000 via a compiler built
> from the source at https://github.com/bagel99/llvm-my66000.git .
>
> When I hit an ICE (not entirely unexpected, if this is the
> first time anybody tried flang with that :-)
>
> VVMLoopPass: read_input_m_
> Unknown f64 condition code!
> UNREACHABLE executed at /home/tkoenig/llvm-my66000/llvm/lib/Target/My66000/My66000ISelLowering.cpp:343!
> PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
> Stack dump:
> 0. Program arguments: llc -O3 --march=my66000 --enable-vvm fatigue2.ll
> 1. Running pass 'Function Pass Manager' on module 'fatigue2.ll'.
> 2. Running pass 'My66000 DAG->DAG Pattern Instruction Selection' on function '@read_input_m_read_input_'
> #0 0x0000000014b1b4b0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/tkoenig/llvm-my66000/llvm/lib/Support/Unix/Signals.inc:565:22
>
> [...]
>
> I did not find a way to raise an issue at github, or I would have
> posted the information there.

You should be able to raise an issue at github: go to the Issues tab and hit
"New issue". It has worked in the past. Maybe you have to have an account at
github?

Meanwhile, send the fatigue2.ll file to bagel99 AT gmail.com. It is probably,
as Mitch said, something flang generates that's new to me, or maybe some test
for NaN that I haven't implemented.

brian
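
The thread does not say which comparison produced the "Unknown f64 condition code" error, so the following is only background on why a NaN test is a plausible suspect. Floating-point comparisons need more condition codes than integer ones, because any comparison involving a NaN is "unordered". LLVM IR's fcmp has ordered and unordered variants of each predicate (olt vs. ult, and so on, plus ord and uno). The C sketch below, with made-up function names, shows source-level comparisons that differ only in how they treat NaN.

```c
#include <math.h>

/* Ordered less-than: false if either operand is NaN. */
static int ordered_less(double a, double b)
{
    return a < b;
}

/* "Unordered or less-than": true if a < b or either operand is NaN.
 * Negating an ordered >= is one way such a predicate arises in source. */
static int unordered_or_less(double a, double b)
{
    return !(a >= b);
}

/* Pure unordered test: true only if at least one operand is NaN. */
static int is_unordered(double a, double b)
{
    return isunordered(a, b) != 0;
}
```

A back end that handles only the ordered forms would be fine for much C code but could trip over the unordered ones. Whether that is what happened here is a guess.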

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sur85u$qrs$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23701&group=comp.arch#23701

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Sat, 19 Feb 2022 17:08:14 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sur85u$qrs$2@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<suorhk$9eh$1@newsreader4.netcologne.de>
<65d44a8b-62b4-4ff9-8fcc-bc5a0e3c6d69n@googlegroups.com>
<suqd52$8ur$1@newsreader4.netcologne.de>
<suqu7t$jhp$1@newsreader4.netcologne.de> <sur7qi$p3k$1@dont-email.me>
Injection-Date: Sat, 19 Feb 2022 17:08:14 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="27516"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sat, 19 Feb 2022 17:08 UTC

Brian G. Lucas <bagel99@gmail.com> schrieb:
> On 2/19/22 08:18, Thomas Koenig wrote:
> ... snip ...
>> To be more precise, I tried compiling a random sample of the
>> Polyhedron benchmarks - fatigue2.f90 with flang v7 to llvm code,
>> then used llc to translate it to My 66000 via a compiler built
>> from the source at https://github.com/bagel99/llvm-my66000.git .
>>
>> When I hit an ICE (not entirely unexpected, if this is the
>> first time anybody tried flang with that :-)
>>
>> VVMLoopPass: read_input_m_
>> Unknown f64 condition code!
>> UNREACHABLE executed at /home/tkoenig/llvm-my66000/llvm/lib/Target/My66000/My66000ISelLowering.cpp:343!
>> PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
>> Stack dump:
>> 0. Program arguments: llc -O3 --march=my66000 --enable-vvm fatigue2.ll
>> 1. Running pass 'Function Pass Manager' on module 'fatigue2.ll'.
>> 2. Running pass 'My66000 DAG->DAG Pattern Instruction Selection' on function '@read_input_m_read_input_'
>> #0 0x0000000014b1b4b0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/tkoenig/llvm-my66000/llvm/lib/Support/Unix/Signals.inc:565:22
>>
>> [...]
>>
>> I did not find a way to raise an issue at github, or I would have
>> posted the information there.
>
> You should be able to raise an issue at github, go to the issues tab and hit
> new issue. It has worked in the past. Maybe you have to have an account at
> github?

It is indeed strange. I have an account at github, and when
looking at another repository, I can see the "Issues" tab, but
not at your repository.

Maybe I got something wrong, my github-fu is as weak as my git-fu.

> Meanwhile send the fatigue2.ll file to bagel99 AT gmail.com. Probably, as
> Mitch said, something flang generates that's new to me. Or maybe some test for
> NaN that I haven't implemented.

Sent about 10 seconds ago.

Thanks for looking into this!

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<ygno832ts92.fsf@y.z>

https://www.novabbs.com/devel/article-flat.php?id=23702&group=comp.arch#23702

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!peer01.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: x...@y.z (Josh Vanderhoof)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me>
<2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<ygn1qzzzs9p.fsf@y.z>
<3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)
Reply-To: Josh Vanderhoof <jlv@mxsimulator.com>
Message-ID: <ygno832ts92.fsf@y.z>
Cancel-Lock: sha1:Z7FYvDup7gmb7AXzm7cfBwIkMIA=
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Lines: 72
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Sat, 19 Feb 2022 22:18:33 UTC
Date: Sat, 19 Feb 2022 17:18:33 -0500
X-Received-Bytes: 4334
 by: Josh Vanderhoof - Sat, 19 Feb 2022 22:18 UTC

MitchAlsup <MitchAlsup@aol.com> writes:

> On Friday, February 18, 2022 at 5:09:29 PM UTC-6, Josh Vanderhoof wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>
>> > On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
>> >> There is no way that a compiler can take the place of out-of-order
>> >> circuitry for allowing a computer to deal with cache misses. These aren't
>> >> predictable.
>> > <
>> > Actually, they are predictable in HW.
>> > <
>> > Say we have a loop that does 2 LDDs and one STD: and we run the loop
>> > "lots" of times;
>> > LDD[1] takes a miss every 8 cycles in iteration MOD 3
>> > LDD[2] takes a miss every 8 cycles in iteration MOD 6
>> > ST....... takes a miss every 8 cycles in iteration MOD 1
>> > <
>> > HW can handle this perfectly after a couple of 16+ loops, and it does not
>> > have to be OoO HW to do this. It could be detected between L1<->L2
>> > and just schedule a 3 prefetches every 8×latency-of-loop.
>> Say you have a loop with unpredictable cache misses like this:
>>
>> for (i = 0; i < n; i++)
>> d[i] = f(s[get_addr(i)]);
> <
> MOV Ri,#0
> for_loop:
> LDD R1,[Rsp+Ri<<2]
> CALL f
> ADD Ri,Ri,#1
> CMP Rt,Ri,Rn
> BLT for_loop
> <
> did you mean to call a function ?

Yes. The function itself doesn't matter, and I assume it would have to be
inlined for this to work. What I mean is that the loop creates a
hard-to-predict address, "get_addr()", loads a value from that address,
does some work, "f()", with that value and then stores the result. Any
time the load misses the cache, an in-order machine has to sit and wait
for the value, whereas an OoO machine would keep busy working on f() for
the next loop iteration that hits the cache.

Would it be worthwhile to break it into two loops: one that works on data
already in the cache and skips/prefetches the uncached data, and a
second loop that runs on the data that the first loop skipped, which
has hopefully arrived in the cache by then thanks to the prefetches
issued in the first loop?

I was thinking a compiler with access to an instruction that tests
whether a memory location is cached could do this, but it also seems
like something you might be able to do automatically in your vvm loops.

>>
>> If you had an "in_cache()" instruction, would a transformation like this
>> be a win?
>>
>> // run loop for everything in cache, skip and prefetch stuff that missed
>> for (i = 0, j = 0; i < n; i++) {
>>     p = get_addr(i);
>>     if (in_cache(&s[p]))
>>         d[i] = f(s[p]);
>>     else {
>>         prefetch(&s[p]);
>>         skipped[j++] = i;
>>     }
>> }
>>
>> // finish entries that missed cache
>> for (i = 0; i < j; i++)
>>     d[skipped[i]] = f(s[get_addr(skipped[i])]);
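
For anyone who wants to experiment with the transformation quoted above, here is a minimal C sketch. It is only an illustration under stated assumptions: no mainstream ISA exposes an "in_cache()" test, so it is faked here with a toy direct-mapped tag table; "prefetch()" updates that table and issues the GCC/Clang `__builtin_prefetch` hint; and `get_addr()`, `f()`, `MAX_N`, the 64-byte line size and the function names `loop_naive`/`loop_two_pass` are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_N       1024   /* assumed upper bound on the trip count     */
#define LINE_SHIFT  6      /* assumed 64-byte cache lines               */
#define MODEL_LINES 64     /* size of the toy cache model (assumption)  */

/* Toy direct-mapped tag store standing in for the hypothetical
   in_cache() instruction; real hardware would consult the L1 tags. */
static uintptr_t model_tag[MODEL_LINES];
static int       model_valid[MODEL_LINES];

static int in_cache(const void *p) {
    uintptr_t line = (uintptr_t)p >> LINE_SHIFT;
    size_t slot = (size_t)(line % MODEL_LINES);
    return model_valid[slot] && model_tag[slot] == line;
}

static void prefetch(const void *p) {
    uintptr_t line = (uintptr_t)p >> LINE_SHIFT;
    size_t slot = (size_t)(line % MODEL_LINES);
    model_tag[slot] = line;        /* the model treats the line as present */
    model_valid[slot] = 1;
    __builtin_prefetch(p);         /* GCC/Clang builtin; only a hint */
}

/* Hypothetical stand-ins for the hard-to-predict address and the work. */
static size_t get_addr(size_t i, size_t src_len) {
    return (i * 2654435761u) % src_len;
}
static long f(long x) { return 3 * x + 1; }

/* The original single loop. */
void loop_naive(long *d, const long *s, size_t src_len, size_t n) {
    for (size_t i = 0; i < n; i++)
        d[i] = f(s[get_addr(i, src_len)]);
}

/* The two-pass version; returns how many iterations were deferred. */
size_t loop_two_pass(long *d, const long *s, size_t src_len, size_t n) {
    static size_t skipped[MAX_N];
    size_t j = 0;
    assert(n <= MAX_N);

    /* Pass 1: do the work for data the model says is cached;
       prefetch and remember everything else. */
    for (size_t i = 0; i < n; i++) {
        size_t p = get_addr(i, src_len);
        if (in_cache(&s[p])) {
            d[i] = f(s[p]);
        } else {
            prefetch(&s[p]);
            skipped[j++] = i;
        }
    }

    /* Pass 2: finish the deferred iterations. */
    for (size_t k = 0; k < j; k++)
        d[skipped[k]] = f(s[get_addr(skipped[k], src_len)]);

    return j;
}
```

Both functions fill d[] with the same values; only the order of the loads differs. Note that the second pass recomputes get_addr(), so the sketch only pays off when address generation is cheap relative to a miss, and it is only legal when f() has no side effects that depend on iteration order.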

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sute97$8mo$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=23708&group=comp.arch#23708

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Sun, 20 Feb 2022 13:04:40 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sute97$8mo$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
<78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com>
<ygn1qzzzs9p.fsf@y.z>
<3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com>
<ygno832ts92.fsf@y.z>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 20 Feb 2022 13:04:40 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-622-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:622:0:7285:c2ff:fe6c:992d";
logging-data="8920"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 20 Feb 2022 13:04 UTC

Josh Vanderhoof <x@y.z> schrieb:
> MitchAlsup <MitchAlsup@aol.com> writes:
>
>> On Friday, February 18, 2022 at 5:09:29 PM UTC-6, Josh Vanderhoof wrote:
>>> MitchAlsup <Mitch...@aol.com> writes:
>>>
>>> > On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
>>> >> There is no way that a compiler can take the place of out-of-order
>>> >> circuitry for allowing a computer to deal with cache misses. These aren't
>>> >> predictable.
>>> > <
>>> > Actually, they are predictable in HW.
>>> > <
>>> > Say we have a loop that does 2 LDDs and one STD: and we run the loop
>>> > "lots" of times;
>>> > LDD[1] takes a miss every 8 cycles in iteration MOD 3
>>> > LDD[2] takes a miss every 8 cycles in iteration MOD 6
>>> > ST....... takes a miss every 8 cycles in iteration MOD 1
>>> > <
>>> > HW can handle this perfectly after a couple of 16+ loops, and it does not
>>> > have to be OoO HW to do this. It could be detected between L1<->L2
>>> > and just schedule a 3 prefetches every 8×latency-of-loop.
>>> Say you have a loop with unpredictable cache misses like this:
>>>
>>> for (i = 0; i < n; i++)
>>> d[i] = f(s[get_addr(i)]);
>> <
>> MOV Ri,#0
>> for_loop:
>> LDD R1,[Rsp+Ri<<2]
>> CALL f
>> ADD Ri,Ri,#1
>> CMP Rt,Ri,Rn
>> BLT for_loop
>> <
>> did you mean to call a function ?
>
> Yes, the function doesn't matter and I assume it would have to be inline
> for this to work. What I mean is the loop creates a hard to predict
> address, "get_addr()",

Generating the address of an element in a hash table would probably
be the canonical example.
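
A small sketch of what such a get_addr() might look like, assuming a power-of-two bucket count and using 64-bit FNV-1a purely as an example hash (the thread does not name one; the function names are illustrative). Consecutive keys land in unrelated buckets, which is what defeats a stride-based hardware prefetcher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over the eight bytes of the key. */
static uint64_t fnv1a64(uint64_t key) {
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    for (int b = 0; b < 8; b++) {
        h ^= (key >> (8 * b)) & 0xffu;
        h *= 0x100000001b3ULL;               /* FNV prime */
    }
    return h;
}

/* Bucket index for a key; n_buckets must be a power of two. */
size_t get_addr(uint64_t key, size_t n_buckets) {
    assert(n_buckets != 0 && (n_buckets & (n_buckets - 1)) == 0);
    return (size_t)(fnv1a64(key) & (uint64_t)(n_buckets - 1));
}
```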

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2esQJ.20734$Wwf9.7208@fx23.iad>


https://www.novabbs.com/devel/article-flat.php?id=23710&group=comp.arch#23710

Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!peer03.ams4!peer.am4.highwinds-media.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx23.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions
in 128 bits
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com> <sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at> <d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com> <78ed76bf-deb7-4798-aa4f-0f207a402ae0n@googlegroups.com> <ygn1qzzzs9p.fsf@y.z> <3113a8d9-8afe-4bf9-af75-7b9a0822b3a1n@googlegroups.com> <ygno832ts92.fsf@y.z>
In-Reply-To: <ygno832ts92.fsf@y.z>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 55
Message-ID: <2esQJ.20734$Wwf9.7208@fx23.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 20 Feb 2022 14:28:14 UTC
Date: Sun, 20 Feb 2022 09:28:16 -0500
X-Received-Bytes: 3767
 by: EricP - Sun, 20 Feb 2022 14:28 UTC

Josh Vanderhoof wrote:
> MitchAlsup <MitchAlsup@aol.com> writes:
>
>> On Friday, February 18, 2022 at 5:09:29 PM UTC-6, Josh Vanderhoof wrote:
>>> MitchAlsup <Mitch...@aol.com> writes:
>>>
>>>> On Friday, February 18, 2022 at 3:10:19 AM UTC-6, Quadibloc wrote:
>>>>> There is no way that a compiler can take the place of out-of-order
>>>>> circuitry for allowing a computer to deal with cache misses. These aren't
>>>>> predictable.
>>>> <
>>>> Actually, they are predictable in HW.
>>>> <
>>>> Say we have a loop that does 2 LDDs and one STD: and we run the loop
>>>> "lots" of times;
>>>> LDD[1] takes a miss every 8 cycles in iteration MOD 3
>>>> LDD[2] takes a miss every 8 cycles in iteration MOD 6
>>>> ST....... takes a miss every 8 cycles in iteration MOD 1
>>>> <
>>>> HW can handle this perfectly after a couple of 16+ loops, and it does not
>>>> have to be OoO HW to do this. It could be detected between L1<->L2
>>>> and just schedule a 3 prefetches every 8×latency-of-loop.
>>> Say you have a loop with unpredictable cache misses like this:
>>>
>>> for (i = 0; i < n; i++)
>>> d[i] = f(s[get_addr(i)]);
>> <
>> MOV Ri,#0
>> for_loop:
>> LDD R1,[Rsp+Ri<<2]
>> CALL f
>> ADD Ri,Ri,#1
>> CMP Rt,Ri,Rn
>> BLT for_loop
>> <
>> did you mean to call a function ?
>
> Yes, the function doesn't matter and I assume it would have to be inline
> for this to work. What I mean is the loop creates a hard to predict
> address, "get_addr()", loads a value from that address, does some work,
> "f()", with that value and then stores the result. Any time the load
> misses the cache it's going to have to sit and wait for the value, where
> an OoO machine would keep busy working on f() for the next loop
> iteration that hit the cache.
>
> Would it be worthwhile to break it into 2 loops, one that works on data
> already in the cache and skips/prefetches the uncached data, and then a
> second loop that runs on the data that the previous loop skipped, which
> has hopefully now arrived in the cache due to the prefetches in the
> previous loop.

That's basically what scout threads do.

https://en.wikipedia.org/wiki/Hardware_scout

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb20.192357@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23715&group=comp.arch#23715

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Sun, 20 Feb 2022 18:23:57 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 66
Message-ID: <2022Feb20.192357@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me> <jwv5ypgdqnl.fsf-monnier+comp.arch@gnu.org> <suh8vh$g7r$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="22c8ea6ad56a2a6d4aef99b21115316b";
logging-data="4344"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18nNBrF0Jc9ego3kvIWb4Hw"
Cancel-Lock: sha1:Dm96cUDkg2CfHv9SJ7mxWp2F9bw=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sun, 20 Feb 2022 18:23 UTC

BGB <cr88192@gmail.com> writes:
>On 2/15/2022 10:58 AM, Stefan Monnier wrote:
>> The OoO works on the trace of the actual execution, where the main limit
>> is the size of the window it can consider (linked to the accuracy of the
>> branch predictor), whereas the compiler is not limited to such a window
>> but instead it's limited to work on the non-unrolled code.
>>
>
>While this is true, I suspect it is a case of "best" vs "good enough".
>If one can get the in-order scheduling to a "good enough" stage, it may
>be possible to work around this deficiency by being able to throw
>(slightly) more cores at the problem.

If the code is not thread-limited (and a lot of code is not), throwing
more cores at the problem won't help.

In the following I'll assume a workload that is thread-limited (say,
serving many web clients).

>Though, I will note that I am assuming that core counts remain within
>the limits of plausible memory bandwidth, but consider this to be a weak
>argument because, if additional cores would be limited by bandwidth, a
>single faster core would also be limited by bandwidth.

That's an interesting aspect. Mitch Alsup likes to tell us about his
comparison of core sizes and IPC for a small in-order vs. a big OoO
core, but does not tell us about the memory bandwidth. IIRC his
numbers are a factor 16 in core size (including D-cache, but with a
smaller D-cache for the smaller core), factor 2 in IPC.

So suppose you put 256 of the small cores on a Zen CCD (assuming that a
Zen 3 core takes twice as much area as the big core he looked at earlier;
it probably takes more), and assume that the little core has twice the
D-cache miss rate, which with the lower IPC comes out as the same number
of D-cache misses per cycle per core. Now you have 32 times as
many D-cache misses hitting the L2 per cycle, with accesses from
different working sets, so a lower hit rate in the L2 cache. So the
number of L2 misses reaching the L3 cache per cycle will be more than
32 times as high as with the big cores; again, you have more working
sets and therefore a higher L3 miss rate, so the number of main memory
accesses per cycle will be more than 32 times as many as with the big
cores. You could mitigate this problem by making the L2 and L3 a lot
larger, but that means that the area advantage of the little cores
becomes smaller.
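
The arithmetic behind the factor of 32 can be written out as a short C sketch. All inputs are the rough assumptions from the paragraph above, expressed in relative units (area 16:1, IPC 2:1, little-core miss rate per instruction twice the big core's, 8 Zen 3 cores per CCD, each twice the area of the big core, and the Zen 3 core treated as having the big core's IPC and miss rate); the struct and function names are made up for illustration.

```c
#include <assert.h>

struct core_model {
    double area;       /* relative die area                         */
    double ipc;        /* relative instructions per cycle           */
    double miss_rate;  /* relative D-cache misses per instruction   */
};

/* D-cache misses per cycle produced by n identical cores. */
static double misses_per_cycle(struct core_model c, int n) {
    return n * c.ipc * c.miss_rate;
}

/* How many little cores fit in the area of the given number of Zen 3
   cores, if each Zen 3 core is zen_area_factor times the big core. */
int little_cores_per_ccd(struct core_model big, struct core_model little,
                         int zen_cores, double zen_area_factor) {
    assert(little.area > 0);
    return (int)(zen_cores * zen_area_factor * big.area / little.area);
}

/* Ratio of L1 miss traffic: little-core CCD versus big-core CCD. */
double miss_traffic_ratio(void) {
    struct core_model big    = { 16.0, 2.0, 1.0 };
    struct core_model little = {  1.0, 1.0, 2.0 };
    int zen_cores = 8;                                   /* per CCD */
    int n_little = little_cores_per_ccd(big, little, zen_cores, 2.0);
    return misses_per_cycle(little, n_little)
         / misses_per_cycle(big, zen_cores);
}
```

With these inputs the little-core count comes out as 8 * 2 * 16 = 256 and the ratio as (256 * 1 * 2) / (8 * 2 * 1) = 32. This covers only misses leaving the L1; the claim that L2 and L3 misses grow by more than 32 times rests on the working-set argument, which the sketch does not model.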

Another way to look at this is that a big core keeps the same working
set for a smaller time in the caches, and can afterwards work on
another job with the same caches, making better use of the same amount
of cache than a smaller core. Of course, there is a crossover point
somewhere, but I think Mitch Alsup's numbers paint a too-rosy picture
for the little cores.

>One is more limited by how effectively one can parallelize the codebase,
>but if one is assuming an extended timeframe where no other
>hardware-level performance improvements are viable, programmers will
>"make it work" (IOW: if one assumes that the only other option is
>multiple decades of stagnation).

We have seen ~17 years of stagnation since the introduction of
multi-cores, and I see no signs of that changing in the foreseeable
future.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
