Re: Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23444&group=comp.arch#23444

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 11:29:31 -0500
Message-ID: <jwvczjph0x7.fsf-monnier+comp.arch@gnu.org>

> Clever. I don't see what's wrong about that either, although register
> assignment in the compiler might be "interesting"

BTW, for those who like "interesting", you can go a step further and use
different groups of 8 registers depending on the "base" register so you
can avoid the "silo"ing problem.

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23445&group=comp.arch#23445

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 11:44:23 -0500
Message-ID: <jwv7d9xh0dt.fsf-monnier+comp.arch@gnu.org>

> The specializer puts a *lot* of work into stuffing as many instructions as
> possible into the bundle format - Mill is a (very) wide architecture. But
> for quick and dirty, i.e. JIT, code you can just put one instruction in each
> bundle and be done with it, and ignore the fact that you are wasting all
> that parallelism. The result will be no worse than code for any other
> one-instruction-at-a-time code.

There's a lot of money poured into making JIT'ed code (Java, Javascript,
...) run as fast as possible, so JIT should not be confused with "quick
and dirty".

Whether it will be harder to generate good JIT'ed code for Mill than for
other machines, I don't know. But the "one-instruction-at-a-time"
option will likely result in a very poor showing: the Mill will look bad
when compared to other architectures, since those will use superscalar
execution to get tolerable speed from the quick&dirty JIT code.

> BIG.little migration is straightforward, so long as both BIG and little
> accept the same binary format. We expect to make "little" configurations
> that are half- or quarter-size one of our large configurations, accepting
> the same binary, and double- or quad-pumping the FUs. After all, all our
> configs have the same instructions set. The differ in parallelism and
> binary format, but they all use the same FUs.

Indeed, another way to deal with the problem is to refrain from using
the flexibility offered by the difference between the abstract ISA and
the actual machine code.

>> My provisional verdict: Mill has been designed with the supercomputer
>> mindset, and suffers from the usual problems of such designs.
> There are answers; perhaps asking for them before verdict would be useful?

I wonder tho: is it the case that there's a "supercomputer mindset"?
That doesn't seem to match your background, at least.
I also got the impression that the Mill might target more
embedded-style markets (maybe because of your mention of DSPs in some
of your talks), but I can't remember mentions of Mills targeting HPC.

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23446&group=comp.arch#23446

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 17:11:22 -0000 (UTC)
Message-ID: <sue2fq$24m$1@newsreader4.netcologne.de>

Ivan Godard <ivan@millcomputing.com> schrieb:
> On 2/13/2022 11:20 PM, Thomas Koenig wrote:
>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>> On 2/13/2022 10:30 AM, Thomas Koenig wrote:
>>
>>>> Which begs the question - what is Mill going to do for operating
>>>> system support, are there any plans? Plan 9 with its multiple
>>>> binaries might actually be an option.
>>>
>>> In progress, albeit slowly. We will provide a microkernel, essentially
>>> at or slightly above the HAL level - this is furthest along. On that
>>> will be a basic Linux port sufficient to provide a standard syscall
>>> interface.
>>
>> I was just about to write that Linux is pretty much tied to gcc, but
>> it seems that clang is also supported now:
>> https://www.kernel.org/doc/html/latest/kbuild/llvm.html
>>
>> so your clang-only compiler infrastructure should not be a problem.
>
> We are compiler-independent, although we have only done clang/LLVM so
> far. Everything past genAsm is our own code, and any compiler can emit
> genAsm.

A bit like nvptx, I guess, which is translated in a "device driver"
into native graphics card code.

>
>>> The real issue comes from existing software that embeds
>>> hardware or structural assumptions - /proc, drivers, etc. The CHERI
>>> project suffers from the same problem.
>>
>> There is also the problem of which surrounding hardware to use.
>> Hm... are pin layouts for CPUs protected, or could you (for example)
>> lay out the connections on your CPU to be compatible with one of
>> the existing AMD or ARM chips?
>
> I understand that HW expects to adopt existing industry standards for
> such things, but IANAHWG and don't know the details, which won't be
> settled anyway until we reach the point of advancing from FPGA to a real
> chip.

This is the first time I have read about you talking about an FPGA for Mill.
This could be quite interesting as a soft core.

Do you see advantages in running a Mill of whatever metal persuasion as
a soft core vs the conventional soft cores like MicroBlaze?

Re: Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23447&group=comp.arch#23447

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 17:11:50 -0000 (UTC)
Message-ID: <sue2gm$24m$2@newsreader4.netcologne.de>

Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
>> Clever. I don't see what's wrong about that either, although register
>> assignment in the compiler might be "interesting"
>
> BTW, for those who like "interesting", you can go a step further and use
> different groups of 8 registers depending on the "base" register so you
> can avoid the "silo"ing problem.

Not sure I understand that.

Could you elaborate?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23449&group=comp.arch#23449

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 13:19:18 -0600
Message-ID: <sue9vq$que$1@dont-email.me>

On 2/14/2022 10:44 AM, Stefan Monnier wrote:
>> The specializer puts a *lot* of work into stuffing as many instructions as
>> possible into the bundle format - Mill is a (very) wide architecture. But
>> for quick and dirty, i.e. JIT, code you can just put one instruction in each
>> bundle and be done with it, and ignore the fact that you are wasting all
>> that parallelism. The result will be no worse than code for any other
>> one-instruction-at-a-time code.
>
> There's a lot of money poured into making JIT'ed code (Java, Javascript,
> ...) run as fast as possible, so JIT should not be confused with "quick
> and dirty".
>
> Whether it will be harder to generate good JIT'ed code for Mill than for
> other machines, I don't know. But the "one-instruction-at-a-time"
> option will likely result in a very poor showing: the Mill will look bad
> when compared to other architectures, since those will use superscalar
> execution to get tolerable speed from the quick&dirty JIT code.
>

FWIW: A lot of my code generators have operated "one 3AC op at a time"
with a fairly simplistic register allocator strategy (a rough sketch
follows below):
  Initially assume nothing is in registers (at the start of a basic block);
  Allocate registers and pull values in dynamically;
  Spill registers at the end of the basic block.
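
As a rough illustration, here is a minimal sketch of that style of
per-basic-block allocator. All structure and helper names here are
hypothetical (not BGBCC's actual code), and a real allocator would skip
the load for a pure write:

/* Minimal per-basic-block register allocator sketch (illustrative only). */

#include <stdio.h>

#define NUM_REGS 8

typedef struct {
    int var;      /* variable id currently held, -1 if the register is free */
    int dirty;    /* value must be stored back to its stack slot on spill   */
} RegState;

static RegState regs[NUM_REGS];

/* Stand-ins for the back-end's actual load/store emitters. */
static void emit_load(int r, int var)  { printf("  load  r%d, [var%d]\n", r, var); }
static void emit_store(int r, int var) { printf("  store r%d, [var%d]\n", r, var); }

/* Start of a basic block: assume nothing is in registers. */
static void bb_enter(void) {
    for (int i = 0; i < NUM_REGS; i++) { regs[i].var = -1; regs[i].dirty = 0; }
}

/* Get a register holding 'var', pulling it in from memory if needed. */
static int reg_for_var(int var) {
    int victim = 0;
    for (int i = 0; i < NUM_REGS; i++)
        if (regs[i].var == var) return i;            /* already in a register */
    for (int i = 0; i < NUM_REGS; i++)
        if (regs[i].var < 0) { victim = i; break; }  /* prefer a free one */
    if (regs[victim].var >= 0 && regs[victim].dirty)
        emit_store(victim, regs[victim].var);        /* evict the old value */
    emit_load(victim, var);
    regs[victim].var = var;
    regs[victim].dirty = 0;
    return victim;
}

/* Toy write: always loads first, then marks the register dirty. */
static void write_var(int var) { regs[reg_for_var(var)].dirty = 1; }

/* End of a basic block: spill anything still dirty back to memory. */
static void bb_leave(void) {
    for (int i = 0; i < NUM_REGS; i++)
        if (regs[i].var >= 0 && regs[i].dirty) emit_store(i, regs[i].var);
}

int main(void) {
    bb_enter();
    reg_for_var(3);      /* read var3 */
    write_var(5);        /* write var5 */
    reg_for_var(3);      /* var3 is still resident, no reload */
    bb_leave();          /* var5 gets stored back */
    return 0;
}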

BGBCC started out with such a strategy (for my current back-ends).

Though, it has since moved to also statically assigning a certain number
of variables at function scope. How does it pick them?
It counts how many times each variable appears in the function and ranks
them by that count.
Notable drawback: this does not account for how often those loads and
stores would actually be executed at run time.

All this seems to work OK on BJX2, and on x86 this strategy works well.

When I previously tried generating code for 32-bit ARM with this
strategy, its performance relative to GCC was terrible (and the
generated code tended to have a very large number of stack-relative
loads and stores). There also aren't really enough registers there to
statically assign variables to them.

Comparably, GCC seems to be smarter, knowing what to keep in registers
and when, and how to make the values flow smoothly from one basic-block
to another.

Well, along with other things, such as GCC being able to inline called
functions, ...

Granted, all the ARM devices I have had access to have been using
in-order superscalar.

Throwing a lot of GPRs at the problem can at least reduce the number of
stack spills (or, mostly eliminate them if the number of GPRs is larger
than the number of variables+temporaries in the function, allowing all
of them to be statically assigned).

Note actually, that BGBCC first compiles to a stack-machine (like in JVM
or .NET), and then converts this to 3AC before feeding it into the
backend code-generator.

So, it sorta converts stack-ops one at a time into 3AC ops, though some
stack operations may be "absorbed" by this translation step.
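
As a toy example of that translation step (hypothetical opcodes and
naming, not BGBCC's actual IR): track a virtual stack at compile time,
let pushes be "absorbed" into the 3AC operands, and emit one 3AC op per
arithmetic stack operation:

/* Toy stack-IR to 3AC conversion. */

#include <stdio.h>

enum { OP_PUSHVAR, OP_ADD, OP_STOREVAR };

typedef struct { int op, arg; } StackOp;

static int vstack[64], vsp = 0, next_tmp = 0;

static void pr(int id) {            /* ids >= 1000 are temporaries */
    if (id >= 1000) printf("t%d", id - 1000); else printf("v%d", id);
}

static void to_3ac(const StackOp *ir, int n) {
    for (int i = 0; i < n; i++) {
        switch (ir[i].op) {
        case OP_PUSHVAR:                 /* absorbed: just remember the source */
            vstack[vsp++] = ir[i].arg;
            break;
        case OP_ADD: {                   /* pop two sources, emit one 3AC op */
            int b = vstack[--vsp], a = vstack[--vsp];
            int t = 1000 + next_tmp++;
            pr(t); printf(" = "); pr(a); printf(" + "); pr(b); printf("\n");
            vstack[vsp++] = t;
            break;
        }
        case OP_STOREVAR:                /* pop and emit a plain move */
            printf("v%d = ", ir[i].arg); pr(vstack[--vsp]); printf("\n");
            break;
        }
    }
}

int main(void) {
    /* x = a + b + c, as stack code:
       push a; push b; add; push c; add; store x */
    StackOp ir[] = { {OP_PUSHVAR,1}, {OP_PUSHVAR,2}, {OP_ADD,0},
                     {OP_PUSHVAR,3}, {OP_ADD,0},     {OP_STOREVAR,9} };
    to_3ac(ir, 6);
    return 0;
}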

I guess one can speculate about whether (or how much) the ISA could be
faster with a smarter compiler.

Though, if I added a portability layer, it is possible that a variant of
the stack IR might be used. However, the IR still can't entirely hide
the specifics of the target machine (things like "sizeof(void*)", etc,
are rather difficult to gloss over effectively for C code).

>> BIG.little migration is straightforward, so long as both BIG and little
>> accept the same binary format. We expect to make "little" configurations
>> that are half- or quarter-size one of our large configurations, accepting
>> the same binary, and double- or quad-pumping the FUs. After all, all our
>> configs have the same instructions set. The differ in parallelism and
>> binary format, but they all use the same FUs.
>
> Indeed, another way to deal with the problem is to refrain from using
> the flexibility offered by the difference between the abstract ISA and
> the actual machine code.
>
>>> My provisional verdict: Mill has been designed with the supercomputer
>>> mindset, and suffers from the usual problems of such designs.
>> There are answers; perhaps asking for them before verdict would be useful?
>
> I wonder tho: is it the case that there's a "supercomputer mindset"?
> That doesn't seem to match your background, at least.
> I also got the impression that the Mill might target more
> embedded-style markets (maybe because of your mention of DSPs in some
> of your talks), but I can't remember mentions of Mills targeting HPC.
>

I suspect it is mostly a question of how much is required to run the
specializer...

Something like LLVM is fairly heavyweight.

A stack IR could be a little lighter, but still depends on how much it
assumes the backend to be able to pull off.

For example, a stack IR which identifies entities symbolically, assumes
value types to be passed implicitly via the stack, ... will require a
more advanced VM than one which uses direct indices to the entities it
references, allows layouts to be determined statically, and which
encodes the value-type of each stack operator in the operator itself.

Similarly, for varying approaches to running the code (a sketch of the
first two appears after this list):
  Direct Interpreter (A1)
    Uses a while loop and a big switch block.
    Fairly common approach for interpreters.
  Indirect Interpreter (A2)
    Decodes the IR to a list of structs and function pointers;
    A trampoline loop calls the function pointers.
    Faster than A1, but also more complicated.
  Translation to call-threaded code (B1)
    Generated code is a list of function calls to operator functions.
    Little or no logic is emitted directly.
  Common operators are implemented natively (B2)
    One may start using optimizations like a register allocator, etc.
    A fair number of operators are still handled as function calls.
    These calls are typically treated as implicit.
  Nearly full native operators (B3)
    Register allocation is fundamental to code generation;
    Relatively few runtime calls are used.
    Runtime calls are typically explicit, rather than implicit.
  ...
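
For what it's worth, a very small sketch of the A1 vs. A2 distinction on
a toy straight-line IR (no branches; hypothetical opcodes, not the BJX2
emulator's actual code):

/* Toy IR with two opcodes, interpreted two ways. */

#include <stdio.h>

enum { OP_ADDI, OP_PRINT, OP_HALT };
typedef struct { int op, dst, src, imm; } Insn;

static int reg[8];

/* A1: direct interpreter, a while loop around a big switch. */
static void run_a1(const Insn *code) {
    for (int pc = 0; ; pc++) {
        switch (code[pc].op) {
        case OP_ADDI:  reg[code[pc].dst] = reg[code[pc].src] + code[pc].imm; break;
        case OP_PRINT: printf("r%d = %d\n", code[pc].dst, reg[code[pc].dst]); break;
        case OP_HALT:  return;
        }
    }
}

/* A2: decode once into structs holding a function pointer, then a
   trampoline loop just calls the pointers. */
typedef struct Decoded Decoded;
struct Decoded { void (*fn)(const Decoded *); int dst, src, imm; };

static void do_addi(const Decoded *d)  { reg[d->dst] = reg[d->src] + d->imm; }
static void do_print(const Decoded *d) { printf("r%d = %d\n", d->dst, reg[d->dst]); }

static void run_a2(const Insn *code) {
    Decoded dec[64];
    int n = 0;
    for (; code[n].op != OP_HALT; n++) {           /* decode step */
        dec[n].fn  = (code[n].op == OP_ADDI) ? do_addi : do_print;
        dec[n].dst = code[n].dst; dec[n].src = code[n].src; dec[n].imm = code[n].imm;
    }
    for (int i = 0; i < n; i++)                    /* trampoline loop */
        dec[i].fn(&dec[i]);
}

int main(void) {
    Insn prog[] = { {OP_ADDI,0,0,40}, {OP_ADDI,0,0,2}, {OP_PRINT,0,0,0}, {OP_HALT,0,0,0} };
    run_a1(prog);
    reg[0] = 0;
    run_a2(prog);
    return 0;
}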

So, my emulators are typically A2 or B1.
For example, my BJX2 emulator uses the A2 approach.

Effective use of A1 or B1 requires the IR to be fairly explicit (all
types/etc need to be readily visible at the level of every IR
instruction). For B1, explicitly encoding the division points between
basic-blocks may also be helpful.

The added "indirection" in A2 and B2 better allow for things like
abstract structure layout and indirect handling of operator types. They
add costs in terms of added complexity and memory overhead (such as
needing to convert the IR into an intermediate internal format).

This would reflect a difference between, say, JVM Bytecode and .NET CIL,
where the former would be more friendly to an A1 strategy, whereas CIL
would likely require an A2, B2, or B3 approach.

Many of my past code generators were B1 or B2.

B1 code-generators are fairly simple (a rough sketch follows below).
For a loop over the list of IR instructions:
  Figure out which runtime function to call;
  Emit a call to said function.
One may also have "blobs" for the start and end of basic blocks:
  Set up initial state when called;
  Restore state and return to the caller (typically a trampoline loop).
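
As a concrete (if crude) illustration of the B1 idea, a sketch that
emits x86-64 machine code consisting only of "load argument, call
operator function" sequences. The opcode bytes, the SysV calling
convention, and the mmap'ed RWX buffer are assumptions about the target
(Linux-style x86-64); error handling is omitted, and some hardened
systems will refuse a writable+executable mapping:

/* B1-style call-threaded code generation: the "compiled" code is just a
   sequence of calls into operator functions. */

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void op_print(int x) { printf("op %d\n", x); }

int main(void) {
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uint8_t *p = buf;
    int ops[] = { 1, 2, 3 };                 /* the "IR": three operator args */
    void (*fn)(int) = op_print;

    *p++ = 0x53;                             /* push rbx (keeps rsp aligned)  */
    for (int i = 0; i < 3; i++) {
        *p++ = 0xBF;                         /* mov edi, imm32  (argument)    */
        memcpy(p, &ops[i], 4); p += 4;
        *p++ = 0x48; *p++ = 0xB8;            /* mov rax, imm64  (function)    */
        memcpy(p, &fn, 8); p += 8;
        *p++ = 0xFF; *p++ = 0xD0;            /* call rax                      */
    }
    *p++ = 0x5B;                             /* pop rbx                       */
    *p++ = 0xC3;                             /* ret                           */

    ((void (*)(void))buf)();                 /* run the generated blob        */
    return 0;
}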

A B1 strategy can be good for, say:
  Low memory overhead in compiler;
  Relatively fast.
But, drawback:
  Generated code is rather slow (on par with an A2 interpreter).

B1 vs B2:
  Code generation for B1 may assume an external trampoline loop.
    High-level control flow is typically handled externally.
  Code generation for B2 may assume function structuring.
    May require a full function prolog and epilog.
    Control flow is handled within the compiled function.
    B2 may be organized around functions rather than individual blocks.

Though, a few of my code-generators have used parts of the B2 strategy
while still otherwise built around the assumption of control flow via an
external trampoline loop.

Differences between the B2 and B3 strategy relate mostly to the handling
of scratch registers:
  In B2, they would be assumed to be purely volatile.
    They can only be used within a given operator.
    May be stomped at any time via an emitted runtime call.
  In B3, one can assume that scratch registers are stable.
    They may also be handled via the register allocator.
    They may potentially be used for holding variables.
  ...

BGBCC started out closer to B2, but is now mostly using the B3 strategy.

For some types of operations (such as working with 'variant',
'float128', ...), it falls back to the B2 strategy. This is flagged on a
per-basic-block basis, so that the register allocator knows not to use
scratch registers. Scratch registers may not hold statically assigned
variables in normal functions (and are auto-evicted between basic blocks).

Something like GCC or LLVM seems to represent a somewhat more advanced
category.


[Remainder of the article truncated in this archive view.]
Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23450&group=comp.arch#23450

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 14:02:48 -0600
Message-ID: <suechc$d2p$1@dont-email.me>

On 2/14/2022 9:57 AM, Quadibloc wrote:
> On Monday, February 14, 2022 at 2:15:21 AM UTC-7, Anton Ertl wrote:
>
>> My provisional verdict: Mill has been designed with the supercomputer
>> mindset, and suffers from the usual problems of such designs.
>
> Hmm. While I suspect _my_ designs have that flaw, my perception was
> that the focus in Mill was low power consumption rather than maximum
> performance.
>
> Of course, a focus on throughput, rather than single-thread performance,
> may be exactly what you are referring to.
>

FWIW: In my case I am mostly trying to get decent performance while
being economical with resource costs.

Much as clock speeds couldn't get faster indefinitely, transistor
budgets are likely already rapidly approaching the limits of what is
practical.

Hitting this limit is not likely to bode well for the long term
"superiority" of OoO, and may well lead to push-back towards
simpler/cheaper cores.

Though, this may be limited to the extent that L1 cache can't really be
made much cheaper (lots of cores with tiny L1 caches wouldn't exactly be
great either).

IME, the L1 caches seem to reach a stable state at ~ 8K..32K, where hit
rates are upwards of 95%, and increasing cache size only slightly
increases hit rate. Much below this point, hit rate starts to drop off
pretty rapidly.

A core with a 2K L1 cache suffers from a fairly high miss rate.

I have yet to find an "optimal lower limit" for the TLB, as it seems to
depend a lot on the program. I have noted that 256..1K TLBEs work fairly
well, but 16/32/64 TLBEs (4x/8x/16x 4-way) kinda suck.

However, for these smaller TLB sizes, I have noted that address-modulo
indexing seems to have a better hit rate than hashed indexing.
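
A small sketch of what the two indexing schemes look like for a
hypothetical 4-way TLB; the set count and the hash are made-up
parameters, not BJX2's actual ones:

/* Two ways to pick a set index in a small 4-way TLB from a virtual
   page number (illustrative only). */

#include <stdint.h>
#include <stdio.h>

#define TLB_SETS 16   /* e.g. 16 sets x 4 ways = 64 TLBEs */

/* Address-modulo indexing: just take the low bits of the page number. */
static unsigned tlb_index_modulo(uint64_t vpn) {
    return (unsigned)(vpn % TLB_SETS);
}

/* Hashed indexing: fold higher page-number bits in with XOR. */
static unsigned tlb_index_hashed(uint64_t vpn) {
    return (unsigned)((vpn ^ (vpn >> 4) ^ (vpn >> 8)) % TLB_SETS);
}

int main(void) {
    /* Sequential pages always land in distinct sets under modulo
       indexing, which is one reason it can behave better at small sizes. */
    for (uint64_t vpn = 0x1000; vpn < 0x1008; vpn++)
        printf("vpn %#llx -> modulo set %2u, hashed set %2u\n",
               (unsigned long long)vpn,
               tlb_index_modulo(vpn), tlb_index_hashed(vpn));
    return 0;
}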

Having a 16x 1-way TLB or similar within the L1 caches can also help
(address translation can be done locally within the L1 cache if it hits).

...

But, less clear are the optimal tradeoffs for a hardware implementation
(for example, 16K L1's with 16 TLBE's in each L1 might potentially be
"too expensive" for a cost-conscious core).

But, then one could also argue about the relative cost of all of the
bits in the register file (say, 32 or 64 GPRs, each 64 bits, isn't
exactly free either, ...).

...

Re: Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23451&group=comp.arch#23451

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 15:23:59 -0500
Message-ID: <jwvo839xktj.fsf-monnier+comp.arch@gnu.org>

>> BTW, for those who like "interesting", you can go a step further and use
>> different groups of 8 registers depending on the "base" register so you
>> can avoid the "silo"ing problem.
> Not sure I understand that.
> Could you elaborate?

For example:

base register   possible other registers
=============   ========================
      0         00,01,02,03,04,05,06,07
      1         01,03,05,07,09,11,13,15
      2         02,05,08,11,14,17,20,23
    ...         ...
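
A guess at the rule behind that table (inferred only from the three rows
shown, so treat it as an assumption): the group for base register b is
b + i*(b+1) for i = 0..7. A tiny sketch that regenerates and extends it:

/* Regenerate the register-group table under the assumed rule
   group(b) = { b + i*(b+1) : i = 0..7 }. Illustrative only. */

#include <stdio.h>

int main(void) {
    for (int base = 0; base < 4; base++) {
        printf("base %d:", base);
        for (int i = 0; i < 8; i++)
            printf(" %02d", base + i * (base + 1));
        printf("\n");
    }
    return 0;
}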

-- Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23452&group=comp.arch#23452

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 13:55:06 -0800
Message-ID: <suej3q$acm$1@dont-email.me>

On 2/14/2022 9:11 AM, Thomas Koenig wrote:
> Ivan Godard <ivan@millcomputing.com> schrieb:
>> On 2/13/2022 11:20 PM, Thomas Koenig wrote:
>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>> On 2/13/2022 10:30 AM, Thomas Koenig wrote:
>>>
>>>>> Which begs the question - what is Mill going to do for operating
>>>>> system support, are there any plans? Plan 9 with its multiple
>>>>> binaries might actually be an option.
>>>>
>>>> In progress, albeit slowly. We will provide a microkernel, essentially
>>>> at or slightly above the HAL level - this is furthest along. On that
>>>> will be a basic Linux port sufficient to provide a standard syscall
>>>> interface.
>>>
>>> I was just about to write that Linux is pretty much tied to gcc, but
>>> it seems that clang is also supported now:
>>> https://www.kernel.org/doc/html/latest/kbuild/llvm.html
>>>
>>> so your clang-only compiler infrastructure should not be a problem.
>>
>> We are compiler-independent, although we have only done clang/LLVM so
>> far. Everything past genAsm is our own code, and any compiler can emit
>> genAsm.
>
> A bit like nvptx, I guess, which is translated in a "device driver"
> into native graphics card code.
>
>>
>>>> The real issue comes from existing software that embeds
>>>> hardware or structural assumptions - /proc, drivers, etc. The CHERI
>>>> project suffers from the same problem.
>>>
>>> There is also the problem of which surrounding hardware to use.
>>> Hm... are pin layouts for CPUs protected, or could you (for example)
>>> lay out the connections on your CPU to be compatible with one of
>>> the existing AMD or ARM chips?
>>
>> I understand that HW expects to adopt existing industry standards for
>> such things, but IANAHWG and don't know the details, which won't be
>> settled anyway until we reach the point of advancing from FPGA to a real
>> chip.
>
> This is the first time I read about you talking about an FPGA for Mill.
> This could be quite interesting for soft core.
>
> Do you see advantages in running a Mill of whatever metal persuasion as
> a soft core vs the conventional soft cores like MicroBlaze?

The FPGA is a low-cost testbed and proof-of-principle; there's no
present plan to productize it. But that's a business issue, not a
technical one.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23453&group=comp.arch#23453

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 14:10:13 -0800
Message-ID: <suek05$r6v$1@dont-email.me>

On 2/14/2022 8:44 AM, Stefan Monnier wrote:
>> The specializer puts a *lot* of work into stuffing as many instructions as
>> possible into the bundle format - Mill is a (very) wide architecture. But
>> for quick and dirty, i.e. JIT, code you can just put one instruction in each
>> bundle and be done with it, and ignore the fact that you are wasting all
>> that parallelism. The result will be no worse than code for any other
>> one-instruction-at-a-time code.
>
> There's a lot of money poured into making JIT'ed code (Java, Javascript,
> ...) run as fast as possible, so JIT should not be confused with "quick
> and dirty".
>
> Whether it will be harder to generate good JIT'ed code for Mill than for
> other machines, I don't know. But the "one-instruction-at-a-time"
> option will likely result in a very poor showing: the Mill will look bad
> when compared to other architectures, since those will use superscalar
> execution to get tolerable speed from the quick&dirty JIT code.

Whether it is worthwhile to do target-independent optimization (inline,
outline, CFG folds, etc.) is the same tradeoff in a Mill as it is for
any other ISA target. Mill differs only because of static scheduling,
which the super-scalar does dynamically in hardware. How much you want
to pull the stops out in the scheduler is a market trade-off, not a
technical one, and we don't know where the balance point is. Current
guess is that scheduling cost will be in the noise.

>> BIG.little migration is straightforward, so long as both BIG and little
>> accept the same binary format. We expect to make "little" configurations
>> that are half- or quarter-size one of our large configurations, accepting
>> the same binary, and double- or quad-pumping the FUs. After all, all our
>> configs have the same instructions set. The differ in parallelism and
>> binary format, but they all use the same FUs.
>
> Indeed, another way to deal with the problem is to refrain from using
> the flexibility offered by the difference between the abstract ISA and
> the actual machine code.
>
>>> My provisional verdict: Mill has been designed with the supercomputer
>>> mindset, and suffers from the usual problems of such designs.
>> There are answers; perhaps asking for them before verdict would be useful?
>
> I wonder tho: is it the case that there's a "supercomputer mindset"?
> That doesn't seem to match your background, at least.
> I also got the impression that the Mill might target more
> embedded-style markets (maybe because of your mention of DSPs in some
> of your talks), but I can't remember mentions of Mills targeting HPC.

It's a general purpose architecture, suitable for all except the very
low end.

>
>
> Stefan

Re: Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23454&group=comp.arch#23454

From: jsav...@ecn.ab.ca (Quadibloc)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 14:19:18 -0800 (PST)
Message-ID: <663ead4b-2d8f-484d-8329-ae5692b88bean@googlegroups.com>

On Monday, February 14, 2022 at 9:29:35 AM UTC-7, Stefan Monnier wrote:
> > Clever. I don't see what's wrong about that either, although register
> > assignment in the compiler might be "interesting"

> BTW, for those who like "interesting", you can go a step further and use
> different groups of 8 registers depending on the "base" register so you
> can avoid the "silo"ing problem.

Unfortunately, my designs can't take advantage of that; they already only
use one set of eight base registers, as they don't have room for a larger
field. And I expect that most code would use all 32 registers in conjunction
with the area of memory to which a single base register points in the
most common case.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23456&group=comp.arch#23456

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 22:17:56 GMT
Message-ID: <2022Feb14.231756@mips.complang.tuwien.ac.at>

BGB <cr88192@gmail.com> writes:
>Hitting this limit is not likely to bode well for the long term
>"superiority" of OoO,

Why? If OoO is superior now at the current transistor budget, why
would it stop being superior if the transistor budget does not
increase?

>and may well lead to push-back towards
>simpler/cheaper cores.

If/When the increase in transistors/$ finally stops, it will be
interesting to see what happens then. Will we see more specialized
hardware, because now you can amortize the development (and mask
costs) over a longer time, even if you only serve a niche? Will
software developers finally get around to designing software that
makes the maximum of the hardware, because inefficiency can no longer
be papered over with faster hardware?

I have my doubts. Network effects will neuter the advantages of
specialized hardware and of carefully designed software that is late
to the market.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

https://www.novabbs.com/devel/article-flat.php?id=23457&group=comp.arch#23457

From: yogaman...@yahoo.com (Scott Smader)
Newsgroups: comp.arch
Date: Mon, 14 Feb 2022 16:05:04 -0800 (PST)
Message-ID: <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>

On Monday, February 14, 2022 at 2:26:46 PM UTC-8, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
> >Hitting this limit is not likely to bode well for the long term
> >"superiority" of OoO,
> Why? If OoO is superior now at the current transistor budget, why
> would it stop being superior if the transistor budget does not
> increase?

Not sure if the assumption of OoO superiority is justified at current transistor budget, but for the same transistor budget at the same clock rate, statically scheduled should win because it more efficiently uses the available transistors to do useful work.
Every time a program runs on an OoO machine, it wastes power to speculate about un-taken paths that can be avoided when source code compiles to a statically scheduled target instruction set.
Once the speculation transistors are removed, a statically scheduled chip can replace them with additional FUs to get more done each clock cycle.

> >and may well lead to push-back towards
> >simpler/cheaper cores.
> If/When the increase in transistors/$ finally stops, it will be
> interesting to see what happens then. Will we see more specialized
> hardware, because now you can amortize the development (and mask
> costs) over a longer time, even if you only serve a niche? Will
> software developers finally get around to designing software that
> makes the maximum of the hardware, because inefficiency can no longer
> be papered over with faster hardware?
>
> I have my doubts. Network effects will neuter the advantages of
> specialized hardware and of carefully designed software that is late
> to the market.
> - anton

Statically scheduled isn't specialized. Optimizing transistor budgets in a general purpose chip is, to a great degree, separable from optimizing the applications that run on it, and it should be.

Imo, that's what makes Mill so keen, but as you say, it will be interesting to see.

> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suerog$cd0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23458&group=comp.arch#23458
 by: BGB - Tue, 15 Feb 2022 00:22 UTC

On 2/14/2022 4:17 PM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Hitting this limit is not likely to bode well for the long term
>> "superiority" of OoO,
>
> Why? If OoO is superior now at the current transistor budget, why
> would it stop being superior if the transistor budget does not
> increase?
>

Advantages of OoO were mostly:
"Ye Olde" scalar code goes keeps getting faster;
Not requiring as much from the compiler;
Effective at getting maximizing single thread performance;
...

It achieved this at the expense of spending considerable transistor
budget and energy on the problem.

The demand for higher performance will continue past the point where
continuously increasing transistor budget is no longer viable.

However, the push-back may favor designs which can give more performance
relative to transistor budget and energy use, namely VLIW cores
with dynamic translation from some higher-level IR (which could be a
stack-based IR, but would more likely be an existing ISA repurposed
as an IR).

Ideally, one needs an ISA that "doesn't suck" as an IR, where I suspect
both x86-64 and ARM64 making use of condition-codes does not exactly
work in their favor in this use-case.

While RISC-V doesn't use condition codes (good in this sense), it also
kinda sucks in many other areas.

I have yet to figure out what the ideal IR would look like exactly (or
even if it would necessarily be singular).

Though, there were a few considered (but mostly stalled) sub-efforts
towards trying to run x86-64 code on top of BJX2.

>> and may well lead to push-back towards
>> simpler/cheaper cores.
>
> If/When the increase in transistors/$ finally stops, it will be
> interesting to see what happens then. Will we see more specialized
> hardware, because now you can amortize the development (and mask
> costs) over a longer time, even if you only serve a niche? Will
> software developers finally get around to designing software that
> makes the maximum of the hardware, because inefficiency can no longer
> be papered over with faster hardware?
>
> I have my doubts. Network effects will neuter the advantages of
> specialized hardware and of carefully designed software that is late
> to the market.
>

My prediction is more in the form of VLIW based manycore systems likely
running software on top of a dynamic translation layer.

Unlike traditional VLIW compilers, the dynamic translation layer could
have access to live / real-time profiler data, so could make better
guesses about things like when and how to go about modulo scheduling
loops and similar.

Sadly, what I am imagining here, isn't all that much like my BJX2
project, but BJX2 is limited more to what I can fit on an FPGA, and
ended up where it is because I kept finding ways I could push it
"forwards" with mostly only small/incremental increases in resource cost.

Though, yes, a 1-wide pipelined RISC core is a little cheaper than a
3-wide VLIW, but if limited to the same clock-speed, a 3-wide core can
outperform a 1-wide core.

The situation might be different if the 1-wide RISC were running at
100MHz, but I have found that "reliably"/"easily" passing timing at
100MHz (on the Spartan-7 and Artix-7) seems to require using a 32-bit
ISA design (such as RV32I).

But, personally, I don't find RV32I all that inspiring.
Like, RV32I is like some sort of weird phantom that does pretty good on
Dhrystone but seemingly kinda sucks at nearly everything else.

....

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<ef566a39-916a-4b96-8d16-2ef419e181ecn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23459&group=comp.arch#23459
 by: MitchAlsup - Tue, 15 Feb 2022 02:03 UTC

On Monday, February 14, 2022 at 9:57:52 AM UTC-6, Quadibloc wrote:
> On Monday, February 14, 2022 at 2:15:21 AM UTC-7, Anton Ertl wrote:
>
> > My provisional verdict: Mill has been designed with the supercomputer
> > mindset, and suffers from the usual problems of such designs.
<
> Hmm. While I suspect _my_ designs have that flaw, my perception was
> that the focus in Mill was low power consumption rather than maximum
> performance.
<
Other than not having a 32-bit subset: My 66000 ISA can be implemented
as anything from a 1-wide in-order implementation up through at least the
6-wide GBOoO implementation I am looking at.
<
At the system level: we are at the density where we expect multiple cores
on a die, along with plenty of cache, multiple DRAM channels, several PCIe
links, and a Chip repeater.
<
This is nowhere near a Super Computer today. It takes a basketball stadium
filled with racks, each rack filled with systems, all connected as a great big
network, to qualify as a Super Computer today.
<
Now, with that in mind:: It seems to me that we are approaching the density
where one would box up 16 cores, a memory controller, and PCIe links as
<pick some name> and this becomes the unit of replication in the Chip--not
the cores themselves!
>
> Of course, a focus on throughput, rather than single-thread performance,
> may be exactly what you are referring to.
>
> John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<acdd5308-1666-4888-aa86-ab2d82380f1en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23460&group=comp.arch#23460
 by: MitchAlsup - Tue, 15 Feb 2022 02:13 UTC

On Monday, February 14, 2022 at 2:02:56 PM UTC-6, BGB wrote:
> On 2/14/2022 9:57 AM, Quadibloc wrote:
> > On Monday, February 14, 2022 at 2:15:21 AM UTC-7, Anton Ertl wrote:
> >
> >> My provisional verdict: Mill has been designed with the supercomputer
> >> mindset, and suffers from the usual problems of such designs.
> >
> > Hmm. While I suspect _my_ designs have that flaw, my perception was
> > that the focus in Mill was low power consumption rather than maximum
> > performance.
> >
> > Of course, a focus on throughput, rather than single-thread performance,
> > may be exactly what you are referring to.
> >
> FWIW: In my case I am mostly trying to get decent performance while
> being economical with resource costs.
>
>
> Much as clock speeds couldn't get faster indefinitely, transistor
> budgets are likely already rapidly approaching the limits of what is
> practical.
<
Check out PCIe 6.0 with links operating at 64 GT/s (16 GHz, PAM4 modulation,
Double Data Rate) per wire (unidirectional twisted pair).
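(Spelling out the arithmetic implied by those figures: 16 GHz x 2 for double
data rate = 32 Gbaud per wire, and PAM4 carries 2 bits per symbol, so
32 Gbaud x 2 = 64 Gb/s per wire.)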
>
> Hitting this limit is not likely to bode well for the long term
> "superiority" of OoO, and may well lead to push-back towards
> simpler/cheaper cores.
>
>
> Though, this may be limited to the extent that L1 cache can't really be
> made much cheaper (lots of cores with tiny L1 caches wouldn't exactly be
> great either).
>
> IME, the L1 caches seem to reach a stable state at ~ 8K..32K, where hit
> rates are upwards of 95%, and increasing cache size only slightly
> increases hit rate. Much below this point, hit rate starts to drop off
> pretty rapidly.
<
Ahem:: It is not hit rate that matters--it is the miss rate and the miss
multiplier which matter.
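
A quick back-of-the-envelope illustration of that point (the latency
numbers below are made up purely for illustration, not taken from any
real design):

#include <stdio.h>

/* Average memory access time (AMAT) for a single cache level.
   With an assumed 1-cycle hit and 60-cycle miss penalty, cutting the
   miss rate from 10% to 1% changes AMAT far more than the "hit rate
   went from 90% to 99%" framing suggests.                            */
int main(void)
{
    double hit_time = 1.0;       /* cycles, assumed L1 hit latency */
    double miss_penalty = 60.0;  /* cycles, assumed miss cost      */
    double hit_rates[] = { 0.90, 0.95, 0.99 };

    for (int i = 0; i < 3; i++) {
        double miss_rate = 1.0 - hit_rates[i];
        double amat = hit_time + miss_rate * miss_penalty;
        printf("hit rate %.2f -> AMAT %4.1f cycles\n", hit_rates[i], amat);
    }
    return 0;
}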
<
L1s are stable because if they get bigger than 64 KB they take more than
one cycle to "access", where SRAM and wire routing are both considered.
<
Speed Demons use smaller caches and target numerical applications.
Servers use larger caches, lower clock rates, and target server
"stuff". A Speed Demon running a database is at a performance disadvantage
to a server-oriented design running the same.
>
> A core with a 2K L1 cache suffers from a fairly high miss rate.
<
<ahem> Long memory latency.
>
>
> I have yet to find an "optimal lower limit" for the TLB, as it seems to
> depend a lot on the program. Have noted: 256..1K TLBEs works fairly
> well, but 16/32/64 TLBE's (4x/8x/16x 4-way) kinda sucks.
>
> However, for these smaller TLB sizes, have noted that address-modulo
> indexing seems to have a better hit rate than hashed indexing.
>
> Having a 16x 1-way TLB or similar within the L1 caches can also help
> (can do address translation locally within the L1 cache if it hits).
>
> ...
>
>
> But, less clear are the optimal tradeoffs for a hardware implementation
> (for example, 16K L1's with 16 TLBE's in each L1 might potentially be
> "too expensive" for a cost-conscious core).
>
> But, then one could also argue about the relative cost of spending the
> cost of all of the bits in the register file (say, 32 or 64 GPRs, each
> 64 bits, isn't exactly free either, ...).
>
> ...

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<44002d87-49ab-436c-b7f8-1d5830345d29n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23461&group=comp.arch#23461
 by: MitchAlsup - Tue, 15 Feb 2022 02:16 UTC

On Monday, February 14, 2022 at 4:26:46 PM UTC-6, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
> >Hitting this limit is not likely to bode well for the long term
> >"superiority" of OoO,
> Why? If OoO is superior now at the current transistor budget, why
> would it stop being superior if the transistor budget does not
> increase?
<
Since 2003 the hope was that SW would figure out how to use
as many processors as chips could hold. So far this has been
mostly a failure.
<
> >and may well lead to push-back towards
> >simpler/cheaper cores.
<
> If/When the increase in transistors/$ finally stops, it will be
> interesting to see what happens then. Will we see more specialized
> hardware, because now you can amortize the development (and mask
> costs) over a longer time, even if you only serve a niche? Will
> software developers finally get around to designing software that
> makes the maximum of the hardware, because inefficiency can no longer
> be papered over with faster hardware?
<
Until the transistor wall gets hit, designers will simply consume transistors.
<
Afterwards, one must measure and use them wisely.
>
> I have my doubts. Network effects will neuter the advantages of
> specialized hardware and of carefully designed software that is late
> to the market.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<496d73fc-58d7-4dd8-8b98-ddcd920bd13en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23462&group=comp.arch#23462
 by: MitchAlsup - Tue, 15 Feb 2022 02:23 UTC

On Monday, February 14, 2022 at 6:22:45 PM UTC-6, BGB wrote:
> On 2/14/2022 4:17 PM, Anton Ertl wrote:
> > BGB <cr8...@gmail.com> writes:
> >> Hitting this limit is not likely to bode well for the long term
> >> "superiority" of OoO,
> >
> > Why? If OoO is superior now at the current transistor budget, why
> > would it stop being superior if the transistor budget does not
> > increase?
> >
> Advantages of OoO were mostly:
> "Ye Olde" scalar code goes keeps getting faster;
> Not requiring as much from the compiler;
<
A good OoO implementation wants the compiler to do LESS--encode
the subroutine with as few instructions as possible, don't bother to
schedule them, ... In the early 1990s the Mc88120 was underperforming
on code optimized for the Mc88100, where the code was heavily scheduled,
loops were software-pipelined, and instructions were put in "inconsiderate"
places for the GBOoO machine we had. We actually got faster code most
of the time with -O1, and sometimes -O2, than with -O3.
<
> Effective at maximizing single-thread performance;
> ...
>
> It achieved this at the expense of spending considerable transistor
> budget and energy on the problem.
>
> The demand for higher performance will continue past the point where
> continuously increasing transistor budget is no longer viable.
>
>
> However, the push-back may favor designs which can give more performance
> relative to transistor budget and energy use, namely VLIW cores
> with dynamic translation from some higher-level IR (which could be a
> stack-based IR, but would more likely be an existing ISA repurposed
> as an IR).
>
>
>
> Ideally, one needs an ISA that "doesn't suck" as an IR, where I suspect
> both x86-64 and ARM64 making use of condition-codes does not exactly
> work in their favor in this use-case.
<
x86 basically has 3 condition codes (C, O, and ZAPS) and HW tracks each
independently.
>
> While RISC-V doesn't use condition codes (good in this sense), it also
> kinda sucks in many other areas.
<
Academic quality.
>
> I have yet to figure out what the ideal IR would look like exactly (or
> even if it would necessarily be singular).
<
LLVM intermediate is actually quite good.
>
> Though, there were a few considered (but mostly stalled) sub-efforts
> towards trying to run x86-64 code on top of BJX2.
> >> and may well lead to push-back towards
> >> simpler/cheaper cores.
> >
> > If/When the increase in transistors/$ finally stops, it will be
> > interesting to see what happens then. Will we see more specialized
> > hardware, because now you can amortize the development (and mask
> > costs) over a longer time, even if you only serve a niche? Will
> > software developers finally get around to designing software that
> > makes the maximum of the hardware, because inefficiency can no longer
> > be papered over with faster hardware?
> >
> > I have my doubts. Network effects will neuter the advantages of
> > specialized hardware and of carefully designed software that is late
> > to the market.
> >
> My prediction is more in the form of VLIW based manycore systems likely
> running software on top of a dynamic translation layer.
>
> Unlike traditional VLIW compilers, the dynamic translation layer could
> have access to live / real-time profiler data, so could make better
> guesses about things like when and how to go about modulo scheduling
> loops and similar.
>
>
> Sadly, what I am imagining here, isn't all that much like my BJX2
> project, but BJX2 is limited more to what I can fit on an FPGA, and
> ended up where it is because I kept finding ways I could push it
> "forwards" with mostly only small/incremental increases in resource cost.
>
>
> Though, yes, a 1-wide pipelined RISC core is a little cheaper than a
> 3-wide VLIW, but if limited to the same clock-speed, a 3-wide core can
> outperform a 1-wide core.
>
>
> The situation might be different if the 1-wide RISC were running at
> 100MHz, but I have found that "reliably"/"easily" passing timing at
> 100MHz (on the Spartan-7 and Artix-7) seems to require using a 32-bit
> ISA design (such as RV32I).
>
> But, personally, I don't find RV32I all that inspiring.
> Like, RV32I is like some sort of weird phantom that does pretty good on
> Dhrystone but seemingly kinda sucks at nearly everything else.
>
> ...
<
Also note: I now have over 100 web sites where JavaScript is turned off
just to get rid of annoying advertising.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<e360bac3-6a11-49e0-b089-f2158b0c762fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23463&group=comp.arch#23463
 by: Quadibloc - Tue, 15 Feb 2022 02:49 UTC

On Monday, February 14, 2022 at 5:22:45 PM UTC-7, BGB wrote:
> On 2/14/2022 4:17 PM, Anton Ertl wrote:
> > BGB <cr8...@gmail.com> writes:
> >> Hitting this limit is not likely to bode well for the long term
> >> "superiority" of OoO,

> > Why? If OoO is superior now at the current transistor budget, why
> > would it stop being superior if the transistor budget does not
> > increase?

> Advantages of OoO were mostly:
> "Ye Olde" scalar code goes keeps getting faster;
> Not requiring as much from the compiler;
> Effective at getting maximizing single thread performance;

> It achieved this at the expense of spending considerable transistor
> budget and energy on the problem.

> The demand for higher performance will continue past the point where
> continuously increasing transistor budget is no longer viable.

You - and Mitch Alsup - are such *optimists*!

Given that out-of-order execution requires a _lot_ of transistors
for the extra performance it provides, it logically follows
that one could get more throughput by putting a large number
of in-order processors on the same die, does it not?

If so, why did Intel change the Atom over from in-order to
out-of-order, and why did the Xeon Phi fail?

I have two answers to offer:

1) We don't know how to write parallel code well, and we
never will. At least for the values of "well" which would
allow speed to scale as the number of processors applied
over a much larger range than is the case at present for
a given application.

2) The potential memory bandwidth feeding a single die
is a serious constraint. Thus, it will not be possible to
make use of the maximum throughput that could be achieved
on a die through a very large number of in-order processors.
The best achievable would be a lesser value, which is about what
a smaller number of OoO processors would require anyway, and those
would offer the benefit of dealing better with constraint
(1) above.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<8b9e0e5a-ec7b-4511-b401-78dd1f211499n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23464&group=comp.arch#23464
 by: Quadibloc - Tue, 15 Feb 2022 03:05 UTC

On Monday, February 14, 2022 at 7:13:13 PM UTC-7, MitchAlsup wrote:

> Speed Deamons use smaller caches and target numerical applications.
> Servers use larger applications, lower clock rates, and target server
> "stuff". A speed deamon running data base is at a performance disadvantage
> to a server oriented design running the same.

Currently, big.LITTLE designs are oriented around the big processors
being the ones designed to do the actual work, and the little processors
being enough to handle updating the GUI, providing the processor with
enough oomph to let the operating system handle its routine busywork,
and to run highly interactive applications which, not being computationally
intensive, could have run just as well on a 386SX running Windows 3.1.

However, the Efficiency cores in Intel's designs with P and E cores are
apparently somewhat more powerful than LITTLE cores in the ARM chips
in people's smartphones, so this generalization may be less true in that
case.

Still, though, given that a PC might be used for number-crunching at
one moment, and for database queries at another, to me this sounds
like an argument for putting both speed demon performance cores
and server performance cores in a design, as well as cores oriented
towards power efficiency.

That is, though, only if data base still needs OoO cores, just less
enormous than those suited to number-crunching. If data base
would prefer a larger number of really small cores, then we don't
need a third kind of core, we just need chips with, say, four performance
cores and sixteen efficiency cores instead of four performance cores
and two efficiency cores.

A high-performance chip would have four of those dies in it;
sixteen performance cores to keep up with an AMD Ryzen 9 5950
or its current successor, and sixty-four efficiency cores to supply
a more appropriate hardware resource when database processing
is what is being done.

I presume that by the time one gets to 4nm, it will be possible to
have such a chip running at around 3 GHz without requiring any
exotic cooling solution? And while sixty-four efficiency cores would
require a lot of memory bandwidth, there would be room for enough
cache so that the requirement could actually be met, even if a
modest exercise of ingenuity will be required?

And, of course, since AMD already makes Threadrippers, it
could be that I'm being entirely too cautious in predicting what
could be achieved in just a couple of years if they really tried.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23465&group=comp.arch#23465
 by: Stefan Monnier - Tue, 15 Feb 2022 03:10 UTC

> Not sure if the assumption of OoO superiority is justified at current
> transistor budget, but for the same transistor budget at the same clock
> rate, statically scheduled should win because it more efficiently uses the
> available transistors to do useful work.

What makes you think so? AFAIK OoO cores are better at keeping their
FUs busy, and they're probably also better at keeping the memory
hierarchy busy. Maybe they do that at the cost of more transistors
"wasted" doing "administrative overhead", but it's far from obvious that
this overhead is the only thing that matters.

Stefan

Re: Encoding 20 and 40 bit instructions in 128 bits

<jwvr184vn71.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23466&group=comp.arch#23466
 by: Stefan Monnier - Tue, 15 Feb 2022 03:13 UTC

> Unfortunately, my designs can't take advantage of that; they already
> only use one set of eight base registers, as they don't have room for
> a larger field.

My suggestion doesn't require more bits in the instruction.
[ I don't claim it's a good idea, tho. It might be, but if so, it's
probably at the cost of a nasty register allocation problem. ]

Stefan

Re: Encoding 20 and 40 bit instructions in 128 bits

<12a4ee1c-6ec4-41ca-91b2-a8a7f8884035n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23467&group=comp.arch#23467
 by: Quadibloc - Tue, 15 Feb 2022 03:38 UTC

On Monday, February 14, 2022 at 8:13:38 PM UTC-7, Stefan Monnier wrote:
> > Unfortunately, my designs can't take advantage of that; they already
> > only use one set of eight base registers, as they don't have room for
> > a larger field.

> My suggestion doesn't require more bits in the instruction.
> [ I don't claim it's a good idea, tho. It might be, but if so, it's
> probably at the cost of a nasty register allocation problem. ]

I wouldn't be qualified to say if the register allocation problem
would be so nasty as to negate its benefits.

My problem, instead, is that I can't see its potential benefits;
while I want multiple strands of execution going on, using
different registers for their work, I didn't imagine that they
would be using separate regions of memory.

If I _did_ have multiple strands of execution _like that_
to deal with, instead of playing with the base registers,
I would put them on different physical CPUs with different
buses to different arrays of DRAM, connected to each other
with Ethernet cables. That way, I could scale the memory
bandwidth with no headaches.

Now, of course, if they shared _some_ memory, but some
of the memory they used was different - and they really
were part of the same program - possibly enlarging the
pool of base registers available for a given number of
bits in the instruction by using the high bits of the
destination register... *would* make sense.

I'm just not familiar with a class of applications with such
a large thirst for memory.

Actually, though, I *did* come up with an idea to address
a case sort of like that. Array Mode.

Given that some programs require really big arrays, arrays
so big as to be equal to the memory expanse covered through
the use of a single base register...

instead of trying to figure out a way to use 32 base registers
instead of 8, I refused to settle for half measures, and instead
went with a form of indirect addressing - the displacement
field of the instruction is replaced by an array selector, which chooses
the memory location containing the base of the array, to which the
index register contents are added.
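
A minimal sketch of what that effective-address calculation could look
like in C terms (the names, table layout, and field widths here are my
own hypothetical choices, not part of the design described above):

/* Hypothetical "Array Mode" effective address: the instruction's
   displacement field is reused as an array selector that picks a
   memory word holding the array's base; the index register is then
   added to that base.  All names and sizes are illustrative only.  */
typedef unsigned long long u64;

u64 array_mode_ea(const u64 *array_table, /* assumed in-memory table of array bases */
                  unsigned selector,      /* from the former displacement field     */
                  u64 index_value)        /* contents of the index register         */
{
    u64 array_base = array_table[selector]; /* the extra (indirect) memory access */
    return array_base + index_value;        /* final effective address            */
}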

Registers are a scarce resource. Using *more* of them as static
base registers just didn't seem like a good idea to me. The pain of
a second memory access for indirect addressing can be taken care
of by a good cache.

So, basically, I didn't consider anything like your idea because I
went straight for the nuclear option.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<suf799$b04$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23468&group=comp.arch#23468
 by: BGB - Tue, 15 Feb 2022 03:38 UTC

On 2/14/2022 8:49 PM, Quadibloc wrote:
> On Monday, February 14, 2022 at 5:22:45 PM UTC-7, BGB wrote:
>> On 2/14/2022 4:17 PM, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> Hitting this limit is not likely to bode well for the long term
>>>> "superiority" of OoO,
>
>>> Why? If OoO is superior now at the current transistor budget, why
>>> would it stop being superior if the transistor budget does not
>>> increase?
>
>> Advantages of OoO were mostly:
>> "Ye Olde" scalar code goes keeps getting faster;
>> Not requiring as much from the compiler;
>> Effective at getting maximizing single thread performance;
>
>> It achieved this at the expense of spending considerable transistor
>> budget and energy on the problem.
>
>> The demand for higher performance will continue past the point where
>> continuously increasing transistor budget is no longer viable.
>
> You - and Mitch Alsup - are such *optimists*!
>
> Given that out-of-order execution requires a _lot_ of transistors
> for the extra performance it provides, then it logically follows
> that one could get more throughput by putting a large number
> of in-order processors on the same die, does it not?
>
> If so, why did Intel change the Atom over from in-order to
> out-of-order, and why did the Xeon Phi fail?
>

One issue with x86 (and x86-64) is that the design of the ISA kinda
sucks for in-order cores as well.

What I am imagining here is not x86.

I am not aware of any (particularly mainstream) attempts at VLIW
processors, at least much beyond the (ill-fated) Itanium.

Though, something more like a manycore Itanium is closer to what I am
imagining than something like an Atom or Xeon Phi.

> I have two answers to offer:
>
> 1) We don't know how to write parallel code well, and we
> never will. At least for the values of "well" which would
> allow speed to scale as the number of processors applied
> over a much larger range than is the case at present for
> a given application.
>
> 2) The potential memory bandwidth feeding a single die
> is a serious constraint. Thus, it will not be possible to
> make use of the maximum throughput that could be achieved
> on a die through a very large number of in-order processors.
> The best possible would be a lesser value, which is what
> is required by the case of a smaller number of OoO processors,
> which would offer the benefit of dealing better with constraint
> (1) above.
>

Possible, though there is also the possibility that memory bandwidth
could still be improved further, such as by making memory modules that
have large numbers of lanes.

Though, this partly becomes a question of how much IO it is viable to
route through a PCB, or if we start seeing RAM modules mounted on top of
the CPU.

Re: Encoding 20 and 40 bit instructions in 128 bits

<jwvleycvleh.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23469&group=comp.arch#23469
 by: Stefan Monnier - Tue, 15 Feb 2022 04:12 UTC

> My problem, instead, is that I can't see its potential benefits;
> while I want multiple strands of execution going on, using
> different registers for their work, I didn't imagine that they
> would be using separate regions of memory.
[...]
> Registers are a scarce resource. Using *more* of them as static
> base registers just didn't seem like a good idea to me. The pain of
> a second memory access for indirect addressing can be taken care
> of by a good cache.

I don't understand what this has to do with my suggestion.

AFAIU the design mentioned earlier was to have 2-reg instructions
for things like add/sub/... but instead of allowing both of those
registers to be any of the 32 regs in the register file (eating up
2x5bit = 10bit of the instruction), one of the operands could be any of
the 32 registers, while the other was limited to only 8 registers, and
those 8 registers depend on which register is used for the
first operand.
IOW

base register    allowed second register
=============    =======================
00-07            00-07
08-15            08-15
16-23            16-23
24-31            24-31

So you need 5bits to encode the base register but only 3bits to encode
the second register, thus saving 2 precious bits, which can be quite
valuable in the tight space of 16bit instructions. The downside is that
combining data from registers 00-07 with data from registers 08-15
(for example) can't be done with those instructions, so you need to
resort to "full size" instructions for that.

My suggestion is to make the mapping more complex so that instead of
having 3 big walls splitting your register file into completely disjoint
sets, you have many small walls all over the place: they still hinder
data movement, but the register file is not partitioned any more, so you
might be able to avoid using a full size instruction by (instead)
carefully choosing which registers you use.

In a way this is a similar idea to Seznec's skewed associative caches,
applied in a different context.

Finding which registers to use for which operation so as to avoid extra
register moves then becomes a kind of search for the exit of
a labyrinth.

The example of mapping I gave is probably a very poor choice
(e.g. it doesn't seem to lend itself to a simple&efficient
implementation), but something like:

base register    allowed second register
=============    =======================
00-01            00-07
02-03            08-15
04-05            16-23
06-07            24-31
08-09            00-07
10-11            08-15
12-13            16-23
14-15            24-31
...

seems cheap to implement in hardware (it might still be challenging for
the register allocator, OTOH).
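
To make the two mappings above concrete, here is a minimal decode sketch
(the 5-bit/3-bit field widths and the exact bit slicing are my own
assumptions, chosen only to reproduce the tables above, not a claim about
how it would actually be encoded):

/* Which register may the 3-bit second-operand field address, given the
   5-bit base-register field?  Both functions return the full 5-bit
   register number of the second operand.                               */

/* First mapping: four disjoint banks; the second operand must come
   from the same bank of 8 as the base register.                        */
unsigned second_reg_banked(unsigned base_reg, unsigned field3)
{
    unsigned bank = (base_reg >> 3) & 3;  /* base 00-07 -> bank 0, 08-15 -> 1, ... */
    return (bank << 3) | (field3 & 7);
}

/* Second (skewed) mapping: the bank is taken from bits 1..2 of the
   base register instead, so neighbouring base registers reach
   different banks and the file is no longer partitioned.               */
unsigned second_reg_skewed(unsigned base_reg, unsigned field3)
{
    unsigned bank = (base_reg >> 1) & 3;  /* base 00-01 -> bank 0, 02-03 -> 1, ... */
    return (bank << 3) | (field3 & 7);
}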

Stefan

Re: Encoding 20 and 40 bit instructions in 128 bits

<571c3186-4875-49b3-bc0f-9236edfbd9d6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23470&group=comp.arch#23470
 by: Quadibloc - Tue, 15 Feb 2022 05:02 UTC

On Monday, February 14, 2022 at 9:12:06 PM UTC-7, Stefan Monnier wrote:

> I don't understand what this has to do with my suggestion.

> My suggestion is to make the mapping more complex so that instead of
> having 3 big walls splitting your register file into completely disjoint
> sets, you have many small walls all over the place:

I am sorry. I completely misunderstood your suggestion. When I
saw the phrase "base register", I automatically took it to mean
a register used to indicate a region of memory, so that displacements
in instructions didn't need to be the full width of an address: like
segment registers in the x86, or like what are referred to as base
registers on the System/360.

You were using that phrase in a different sense, as the register
indicating which group of registers was to be used.

John Savard
