devel / comp.arch / Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <sulnop$2j0$1@dont-email.me>
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Thu, 17 Feb 2022 06:57:27 -0800
Link: https://www.novabbs.com/devel/article-flat.php?id=23627&group=comp.arch#23627

On 2/16/2022 8:18 AM, Stephen Fuld wrote:
> On 2/13/2022 9:52 AM, John Levine wrote:
>> According to Ivan Godard  <ivan@millcomputing.com>:
>>> In Mill the translation happens once, at install time, via a transparent
>>> invocation of the specializer (not a recompile, and the source code is
>>> not needed). In a microcoded machine (or a packing/cracking scheme) the
>>> translation takes place at execution time, every time. Either way, it
>>> gets done, and either way permits ISA extension freely. They differ in
>>> power, area, and time; the choice is an engineering design dimension.
>>>
>>> I assert without proof that Mill-style software translation in fact
>>> permits ISA extension that is *more* flexible and powerful than what can
>>> be achieved by hardware approaches. YMMV.
>>
>> If one believes in proof by example, this approach has been wildly
>> successful in the IBM S/38, AS/400, and whatever it is called now,
>> something something i.
>>
>> You can take 30-year-old TIMI object code and run it on current
>> hardware at full speed, supporting their exotic object architecture
>> with 128-bit pointers.  I believe the architecture had one significant
>> change in the 1990s, with addresses expanding from 48 to 64 bits, but
>> the old object code still works.
>>
>> I'm kind of surprised nobody else does this.  I suppose only IBM had
>> sufficient control over both the hardware and software to make it work.
>
> You have raised an interesting question.  I have been thinking about it
> for a few days, and I don't think it is as simple as HW/SW control.
>
> First, you have to recognize that there is a perhaps subtle, but very
> important difference between what the Mill is going to do, and what the
> S/38 etc. did.  Specifically, the Mill "respecializes" the OS.  I
> believe that IBM rewrote the lowermost kernel stuff from scratch when
> they went from the proprietary CISC to the Power based systems.
>
> This difference means that the Mill is "limited" to a range of changes,
> i.e. number of FUs, belt size, presence or absence of certain features,
> etc.  It could not be used for a more fundamental change at the OS
> level, such as if some hypothetical future Mill abandoned the SAS model
> or went to supervisor-state-based security (I realize that's unlikely to
> happen for other reasons).  So while IBM's approach allows more basic
> changes in the HW design, it comes at the cost of non-automatic OS
> migration.
>
> So, with that in mind, let's look at some hypothetical vendor considering
> a new system.  What alternatives does he have?
>
> 1.    He could go Mill like, with automatic migration of everything at
> the cost of limiting his flexibility
>
> 2.    He could go sort of S/38 like but instead of keeping some
> intermediate level, keep the (required to be portable) source code with
> the executable, and recompile if needed for a future machine.  After
> all, if we are talking of an infrequent migration, the extra cost of a
> full compile versus the "partial compile" of respecializing would be
> lost in the noise.
>
> 3.    He could expect to limit future HW designs to provide compatibility.
>
> 4.    He could expect much/most application software to be written to a
> defined intermediate language such as the JVM.  This is much like the
> S/38 approach, but substitutes JIT compiles for install-time recompiles.
>
> 5.    Probably others. . .
>
> Now let's look at what actually happened in some real situations.
>
> While WinTel was two companies, they worked closely together and had a
> near monopoly for some years.  Intel mostly went with #3 (with the
> exception of Itanium), while Microsoft sort of went with the S/38 model
> (different user-compatible interfaces for different HW), but didn't do
> any automatic migration (e.g., from X86 to Alpha or MIPS).
>
> Apple had complete control over the Mac environment and went with
> essentially #2, but without the automatic recompiles.  This allowed
> major changes to the underlying hardware, mostly without too many problems.
>
> Apple again, but this time with the iPhone.  Seems like mostly #4.
>
> While not a "company", Linux and the GCC programs essentially went with
> a different variant.  They used their version of C as essentially a
> public (as opposed to S/38's proprietary) intermediate language, and
> expect small changes and recompiles for different underlying architectures.
>
> While most hardware CPU vendors have little interest in making it easy
> to migrate to non compatible future hardware, they have great interest
> in allowing easy migration to their future CPUs.  They accomplish this
> by doing transparently some of what the Mill requires a respecialization
> for (e.g., increasing the number of FUs), and providing "compatibility"
> modes such as what allows you to run a 30 year old S/360 program on a
> current system.
>
> So, overall, I think the answer to your question is "It's complicated!"
>  I think other vendors have evolved different solutions that provide
> similar capabilities (not as good in some areas, better in others).
> While what S/38 did was revolutionary at the time, and quite "elegant",
> the requirements that drove it led to multiple solutions.
>
> However, I do think it might be a good idea for a system to
> automatically keep the source code with the object code to allow for
> automatic recompiles.

What's the distribution medium? Which app vendors would be willing to
ship their source to their customers?
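
The install-time-translation scheme being debated here can be caricatured in a few lines. This is a hedged sketch under invented names (the IR, `specialize`, and the machine tables are all made up for illustration; this is not Mill's actual specializer): the vendor ships a target-independent IR, and the installer lowers it once for whatever the local hardware looks like.

```python
# Toy model of install-time specialization: ship machine-independent IR,
# lower it once per machine at install time. All names are invented.

IR = [("add", "a", "b", "t1"), ("mul", "t1", "c", "t2")]

def specialize(ir, machine):
    """Lower generic IR to 'native' ops for one machine description."""
    out = []
    for op, *args in ir:
        # A richer specializer would pattern-match (e.g., fuse add+mul
        # on a machine with FMA); this toy just consults an opcode table.
        out.append((machine["opcodes"][op], *args))
    return out

gen1 = {"opcodes": {"add": "ADD32", "mul": "MUL32"}}
gen2 = {"opcodes": {"add": "ADDX", "mul": "MULX"}}   # later, incompatible encoding

native1 = specialize(IR, gen1)
native2 = specialize(IR, gen2)   # same shipped artifact, new hardware
```

The point of the sketch: `gen2` can change encodings or FU mixes freely and the same shipped IR still installs, which is the flexibility claim; what it cannot absorb is a change that invalidates the IR's own model, which is Stephen's "limited to a range of changes" caveat.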

Re: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)

Message-ID: <96fd5b80-3549-4f5d-8736-deb7a3c2bf68n@googlegroups.com>
From: jsav...@ecn.ab.ca (Quadibloc)
Newsgroups: comp.arch
Subject: Re: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)
Date: Thu, 17 Feb 2022 07:08:18 -0800 (PST)
Link: https://www.novabbs.com/devel/article-flat.php?id=23628&group=comp.arch#23628

On Wednesday, February 16, 2022 at 11:46:31 AM UTC-7, MitchAlsup wrote:

> What is missing is a vonNeumann model of parallelism where HW provides a
> few instructions to perform this new model and because of its simplicity
> elegance and functionality, SW can easily find ways to utilize that new model.

> Right now HW does not know what to build and SW does not know exactly
> what to ask for. HW stumbles around adding TS, CAS, DCAW, LL-SC, ad
> infinitum. SW tries to use the new feature and finds it very difficult to use in
> practice. This CLEARLY shows that what HW is trying to supply is not what
> SW wants to consume; yet SW does not know on the intimate level what to
> ask HW to build, so the circle continues.

If this is the problem, does that mean that the problem has a solution?

It's certainly true that experts in writing programs aren't experts in building
CPUs and vice versa. But there are people who study the science of
designing algorithms, and whether or not certain problems can be solved
in O(n), O(n log n), or O(n^2) and stuff like that.

Thus, if there were problems that could theoretically be solved in parallel,
but aren't getting solved in parallel because our CPUs don't present their
parallel capabilities in the right way, that would get sorted out.

In any case, doing "embarrassingly parallel" problems in parallel is _not_
currently a problem. If a lot of processors, working in parallel, can solve
a problem efficiently with little or no intercommunication, that's pretty
much a solved problem.

If the parts of a problem are dependent on one another, so that you
have to solve the first part before you can start working on the second,
and so on, your problem is serial. Maybe you can come up with an
entirely new algorithm.

So the only space for improvement is in improving intercommunication
between multiple computing elements. And there's lots of work being
done in this field.

Software, after all, does know *what arithmetic it wants done*, and
that pretty well is what the hardware needs to know. So I think the
problem is not a trivial one of miscommunication, but a fundamental
one of a lack of suitable algorithms.

John Savard
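
The serial-versus-parallel distinction drawn above can be made concrete with a toy (the data and functions are invented for illustration): an independent map parallelizes trivially, while a fold in which each step consumes the previous result is inherently serial no matter how good the intercommunication is.

```python
from functools import reduce

data = list(range(8))

# Embarrassingly parallel: each element's work is independent, so the
# map could be farmed out to any number of workers in any order.
parallel_results = [x * x for x in data]

# Serial: step i consumes step i-1's result, so the dependency chain
# forces sequential execution regardless of available hardware.
def step(acc, x):
    return acc * 2 + x

serial_result = reduce(step, data, 0)
print(sum(parallel_results), serial_result)  # → 140 247
```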

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <suloft$i1r$1@dont-email.me>
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Thu, 17 Feb 2022 07:09:47 -0800
Link: https://www.novabbs.com/devel/article-flat.php?id=23629&group=comp.arch#23629

On 2/17/2022 2:08 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> On 2/15/2022 3:49 AM, Anton Ertl wrote:
>>> Also: OoO allows scheduling across many hundreds of instructions (512
>>> in Alder Lake) rather than typically a few or a few dozen for static
>>> scheduling,
>>
>> I am not suggesting that isn't true, but I do question why it is true.
>> That is, if it is beneficial, I presume a compiler could do its
>> scheduling across a window at least as big as HW. The compiler can use
>> more memory and time than is available to the HW. As a minimum, it
>> could emulate what the HW does, so it should be equal (except for
>> variable-length delays).
>>
>> So is such a big window not beneficial to the compiler?
>
> The problem is that a 512-instruction window contains ~70 conditional
> branches (taking our LaTeX benchmark as example):
>
> /home/anton/tmp/pmu-tools/ocperf.py stat -e br_inst_retired.conditional -e instructions latex bench >/dev/null
>
> Performance counter stats for 'latex bench':
>
> 400088535 br_inst_retired_conditional
> 3010329693 instructions
>
> [Unfortunately, Skylake does not have a separate event counter for
> indirect branches, but they would also have to be predicted by the
> compiler.]
>
> So the compiler would have to branch-predict across ~70 branches to
> schedule across such a window.
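
As a sanity check, the quoted counters do give roughly that figure; scaling the measured conditional-branch fraction to a 512-instruction window:

```python
# Back-of-envelope from the perf counters quoted above.
cond_branches = 400_088_535     # br_inst_retired_conditional
instructions = 3_010_329_693    # instructions retired

branch_fraction = cond_branches / instructions   # ≈ 0.133
branches_in_window = 512 * branch_fraction       # ≈ 68, i.e. "~70"
print(round(branch_fraction, 3), round(branches_in_window))
```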
>
> Why not just perform non-speculative scheduling, i.e., move
> instructions further down in the execution rather than up? Because
> that means that in many cases the dependencies of conditional branches
> limit the IPC very much (e.g., see Section 6 and Figure 9 of
> <http://www.complang.tuwien.ac.at/papers/ertl-krall94cc.ps.gz>).
>
> So you want speculative execution in a statically scheduled system
> if you want scheduling windows competing with hardware scheduling windows.
>
> Now, architectures up to now have no way for letting the compiler make
> use of dynamic branch prediction. Instead, they are limited to static
> branch prediction, which has a much higher miss rate than dynamic
> branch prediction (some very rough numbers are ~10% for static with
> profile data (20% without), 1% for dynamic).
>
> joshua.landau.ws@gmail.com has proposed (in the thread including
> <389485e7-b6dc-4b10-ac49-7883ae9fff0e@googlegroups.com>) an
> architectural mechanism for letting compilers make use of dynamic
> branch prediction accuracy: He splits the branch into a branch predict
> (brp) part, where the branch is taken based on the dynamic prediction,
> and branch-verify (brv) where the condition is actually available and
> the direction is verified; if the prediction was wrong, execution
> takes a recovery path and eventually continues on the alternative
> branch of the corresponding brp instruction.
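
The brp/brv mechanism described above can be modeled with a toy interpreter. Everything here is invented to illustrate the described semantics, not any real ISA: brp commits to the predicted direction so speculatively scheduled work can issue early, and brv squashes and replays when the resolved condition disagrees.

```python
import random

random.seed(1)

def run_trace(conditions, predictor):
    """Toy brp/brv execution: count issued work units and brv recoveries."""
    executed, recoveries = 0, 0
    for cond in conditions:
        predicted = predictor(cond)   # brp: take the predicted path now
        executed += 10                # speculative work issued after brp
        if predicted != cond:         # brv: resolved condition disagrees
            recoveries += 1           # squash, branch to recovery path
            executed += 10            # redo the work down the other arm
    return executed, recoveries

conds = [random.random() < 0.9 for _ in range(1000)]
static_taken = lambda c: True         # crude static predict-taken
print(run_trace(conds, static_taken))
```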
>
> So how would a compiler make use of this architectural idea? It would
> place a brp before the first speculative instruction the compiler may
> want to schedule from a block controlled by a branch (brv), maybe 70
> brvs earlier. I expect a lot of code duplication; the worst case
> would be a factor of 2^70 code duplication, but realistically the
> compiler would limit the code duplication to the maybe 100 most
> probable paths (as statically predicted), with very limited
> speculation on the other paths. The questions in this respect are:
>
> 1) How much do these code duplication limitations reduce the average
> IPC?
>
> 2) Even with such limitations, what is the effect of the code
> duplications on I-cache misses and L2-cache misses?
>
> If the predicted (and verified) path is always the same (and the
> static prediction actually included this path), the limitations will
> not reduce the IPC, and the I-cache misses will not become a problem,
> but if there are thousands of paths (that dynamic branch prediction
> nevertheless can predict well), things look less rosy.
>
> Despite these issues, if static scheduling wants to have any chance
> against OoO at all, IMO it needs this architectural mechanism.

Funny, we have no problem doing static scheduling together with runtime
branch prediction. Runahead prediction, even.

Given how emphatic you are that this is impossible, I conclude that you
must mean something different by "static scheduling" than I do. But
just what escapes me - can you clarify?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <jwvy22934m1.fsf-monnier+comp.arch@gnu.org>
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Thu, 17 Feb 2022 10:20:03 -0500
Link: https://www.novabbs.com/devel/article-flat.php?id=23632&group=comp.arch#23632

Ivan Godard [2022-02-17 07:09:47] wrote:
> Funny, we have no problem doing static scheduling together with runtime
> branch prediction. Runahead prediction, even.

AFAICT he was talking about having the ability for the compiler to
change the static schedule according to runtime branch prediction.

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<3ba67d1f-01f1-47ff-aa3f-d755f8096624n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23634&group=comp.arch#23634

Newsgroups: comp.arch
X-Received: by 2002:a5d:47ce:0:b0:1e8:88b7:446a with SMTP id o14-20020a5d47ce000000b001e888b7446amr2754862wrc.459.1645112124439;
Thu, 17 Feb 2022 07:35:24 -0800 (PST)
X-Received: by 2002:a05:6808:1208:b0:2d4:419d:8463 with SMTP id
a8-20020a056808120800b002d4419d8463mr1265717oil.227.1645112123837; Thu, 17
Feb 2022 07:35:23 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 17 Feb 2022 07:35:23 -0800 (PST)
In-Reply-To: <sukuuk$1qnd$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:dc5c:7df4:98bd:eb15;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:dc5c:7df4:98bd:eb15
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org> <2022Feb15.194310@mips.complang.tuwien.ac.at>
<suh34c$3pd$1@newsreader4.netcologne.de> <suidol$1b8d$1@gioia.aioe.org>
<suigkc$nkv$1@dont-email.me> <vG8PJ.15176$GjY3.1981@fx01.iad> <sukuuk$1qnd$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3ba67d1f-01f1-47ff-aa3f-d755f8096624n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 17 Feb 2022 15:35:24 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 17 Feb 2022 15:35 UTC

On Thursday, February 17, 2022 at 1:54:03 AM UTC-6, Terje Mathisen wrote:
> EricP wrote:
> > BGB wrote:

> >
> > Mitch's My66K predicate value state flags are implied by the predicate
> > instruction shadow and tracked internally, not architectural ISA
> > registers like Itanium. To have multiple predicates in flight at once
> > just use multiple PRED instructions.
> Mitch's baby is the poster child for such branchless code; it fits very
> well indeed with this model to avoid poorly-predicted branches:
>
> while (l < r) {
> pivot = (l+r)>>1;
> if (arr[pivot] < target) l = pivot;
> else r = pivot;
> }
<
while_loop:
CMP Rt,Rl,Rr
BGE while_exit
while_loop_1:
ADD Rt,Rl,Rr
SR Rpivot,Rt,#1
LDD Rarray,[Rarrayp+Rpivot<<3]
CMP Rt,Rarray,Rtarget
PLT Rt,{10}
MOV Rl,Rpivot
MOV Rr,Rpivot
CMP Rt,Rl,Rr // deroll the top of loop
BLT Rt,while_loop_1
while_exit:
>
> Notice that there is no attempt to detect an early exact match, so the
> number of iterations will always be log2(arr.size), and therefore (close
> to) perfectly predictable, while the loop body fits the predicate shadow
> very nicely.
>
> Written in this form, even an x86 can use the same approach, just taking
> a cycle more per iteration due to the slow CMOV operations:
>
> next:
> lea ebx,[esi+edi]
> shr ebx,1
> mov eax,arr[ebx*4]
> cmp eax,edx ;; EDX == target value
> CMOVB esi,ebx ;; Opposite predicates, so
> CMOVAE edi,ebx ;; only one will execute!
>
> cmp esi,edi
> jb next
>
> This will take 6-8 clock cycles/iteration (assuming the entire arr[] in
> $L1 cache, add miss cycles otherwise), so ~90 cycles to find an entry in
> an 8K array.
>
> Using branching code instead would add half a branch mispredict per
> iteration for random searches, but much less if you would search
> repeatedly for close target values.
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Advantages of in-order execution (was: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits)

<68ac4530-123a-4bfd-bbe1-5e75beb3db1cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23635&group=comp.arch#23635

Newsgroups: comp.arch
X-Received: by 2002:a05:600c:2552:b0:37b:c7dd:d376 with SMTP id e18-20020a05600c255200b0037bc7ddd376mr6616860wma.113.1645113079769;
Thu, 17 Feb 2022 07:51:19 -0800 (PST)
X-Received: by 2002:aca:a9c5:0:b0:2d4:373d:98c8 with SMTP id
s188-20020acaa9c5000000b002d4373d98c8mr1391050oie.272.1645113079268; Thu, 17
Feb 2022 07:51:19 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.87.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 17 Feb 2022 07:51:19 -0800 (PST)
In-Reply-To: <96fd5b80-3549-4f5d-8736-deb7a3c2bf68n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:dc5c:7df4:98bd:eb15;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:dc5c:7df4:98bd:eb15
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>
<70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com> <2022Feb16.133409@mips.complang.tuwien.ac.at>
<465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com> <jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>
<5d5ee8d1-bfdb-4ec8-837f-b4bb0ed85bf0n@googlegroups.com> <96fd5b80-3549-4f5d-8736-deb7a3c2bf68n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <68ac4530-123a-4bfd-bbe1-5e75beb3db1cn@googlegroups.com>
Subject: Re: Advantages of in-order execution (was: instruction set binding
time, was Encoding 20 and 40 bit instructions in 128 bits)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 17 Feb 2022 15:51:19 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 17 Feb 2022 15:51 UTC

On Thursday, February 17, 2022 at 9:08:22 AM UTC-6, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 11:46:31 AM UTC-7, MitchAlsup wrote:
>
> > What is missing is a vonNeumann model of parallelism where HW provides a
> > few instructions to perform this new model and because of its simplicity
> > elegance and functionality, SW can easily find ways to utilize that new model.
>
> > Right now HW does not know what to build and SW does not know exactly
> > what to ask for. HW stumbles around adding TS, CAS, DCAW, LL-SC, ad
> infinitum. SW tries to use the new feature and finds it very difficult to use in
> > practice. This CLEARLY shows that what HW is trying to supply is not what
> > SW wants to consume; yet SW does not know on the intimate level what to
> > ask HW to build, so the circle continues.
<
> If this is the problem, does that mean that the problem has a solution?
<
I believe there is a solution.
Arguably, the best solution at the moment is arithmetic in memory.
<
Order of arrival dictates who sees what, and there is no (or minimal) cache
line thrashing.
<
But look at how paper flows through an office: mail clerk walks around
and drops batches of paper on various workers. Workers start taking
paper off the top and work on whatever showed up. Lock free.
<
Janet over in the corner asks Jay 2 desks down if he has seen <blah>
and Jason drops 2 papers on her desk, but Jane overheard and drops
3 more papers on Janet's desk. It is the Jane-overhearing part that
we do not understand at the HW level. The mechanics we understand, but
how to filter the wheat from the chaff gets us.
>
> It's certainly true that experts in writing programs aren't experts in building
> CPUs and vice versa. But there are people who study the science of
> designing algorithms, and whether or not certain problems can be solved
> in O(n), O(n log n), or O(n^2) and stuff like that.
<
When it takes BigO(n^2) cache line movements to allow 1 unit of work
under contention, it ends up taking BigO( n^3 ) cache line moves to
get n producers matched with n consumers.
<
Under contention even CAS is BigO( N^2 ).
<
There are certain circumstances where ESM can reduce BigO( n^3 ) to BigO( 3 )
{yes, that is right: no n},
<
However, arithmetic in memory is also BigO( 3 ) and required in several SW
architectures (OpenGL for instance).
>
> Thus, if there were problems that could theoretically be solved in parallel,
> but aren't getting solved in parallel because our CPUs don't present their
> parallel capabilities in the right way, that would get sorted out.
<
I have been watching this since 1975 and am failing to see much progress.
>
> In any case, doing "embarassingly parallel" problems in parallel is _not_
> currently a problem. If a lot of processors, working in parallel, can solve
> a problem efficiently with little or no intercommunication, that's pretty
> much a solved problem.
>
> If the parts of a problem are dependent on one another, so that you
> have to solve the first part before you can start working on the second,
> and so on, your problem is serial. Maybe you can come up with an
> entirely new algorithm.
<
Amdahl's Law is ferocious.
>
> So the only space for improvement is in improving intercommunication
> between multiple computing elements. And there's lots of work being
> done in this field.
>
> Software, after all, does know *what arithmetic it wants done*, and
> that pretty well is what the hardware needs to know. So I think the
> problem is not a trivial one of miscommunication, but a fundamental
> one of a lack of suitable algorithms.
<
Also note: the calculational cell-structure of Excel allows for software
not to care about the order of things, only that the system "relaxes"
to a stable state. FEMs do this, too; so does Spice, and galactic simulations.
>
> John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sulsti$9e0$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=23636&group=comp.arch#23636

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!news.freedyn.de!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 16:25:22 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sulsti$9e0$1@newsreader4.netcologne.de>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.194310@mips.complang.tuwien.ac.at>
<suh34c$3pd$1@newsreader4.netcologne.de> <suidol$1b8d$1@gioia.aioe.org>
<suigkc$nkv$1@dont-email.me> <vG8PJ.15176$GjY3.1981@fx01.iad>
<sukuuk$1qnd$1@gioia.aioe.org>
Injection-Date: Thu, 17 Feb 2022 16:25:22 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd6-19cd-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd6:19cd:0:7285:c2ff:fe6c:992d";
logging-data="9664"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Thu, 17 Feb 2022 16:25 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

> while (l < r) {
> pivot = (l+r)>>1;
> if (arr[pivot] < target) l = pivot;
> else r = pivot;
> }

How will this loop terminate if, for example, l=0 and r=1 ?

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sult8j$ou6$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23637&group=comp.arch#23637

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 08:31:13 -0800
Organization: A noiseless patient Spider
Lines: 143
Message-ID: <sult8j$ou6$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<suj84u$cjv$1@dont-email.me> <sulnop$2j0$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 17 Feb 2022 16:31:15 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8866e3f7de7abd673e743c03a9c5105b";
logging-data="25542"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/yEfWQrCusgXHsVH0kQU/AlWuN0SnJ56g="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.0
Cancel-Lock: sha1:SdtWyza2SRpfh2eqhJHQOgBpI1k=
In-Reply-To: <sulnop$2j0$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Thu, 17 Feb 2022 16:31 UTC

On 2/17/2022 6:57 AM, Ivan Godard wrote:
> On 2/16/2022 8:18 AM, Stephen Fuld wrote:
>> On 2/13/2022 9:52 AM, John Levine wrote:
>>> According to Ivan Godard  <ivan@millcomputing.com>:
>>>> In Mill the translation happens once, at install time, via a
>>>> transparent
>>>> invocation of the specializer (not a recompile, and the source code is
>>>> not needed). In a microcoded machine (or a packing/cracking scheme) the
>>>> translation takes place at execution time, every time. Either way, it
>>>> gets done, and either way permits ISA extension freely. They differ in
>>>> power, area, and time; the choice is an engineering design dimension.
>>>>
>>>> I assert without proof that Mill-style software translation in fact
>>>> permits ISA extension that is *more* flexible and powerful than what
>>>> can
>>>> be achieved by hardware approaches. YMMV.
>>>
>>> If one believes in proof by example, this approach has been wildly
>>> successful
>>> in the IBM S/38, AS/400. and whatever it is called now, something
>>> something i.
>>>
>>> You can take 30 year TIMI object code and run it on current hardware
>>> at full speed,
>>> supporting their exotic object architecture with 128 bit pointers.  I
>>> believe the
>>> architecture had one significant change in the 1990s, with addresses
>>> expanding
>>> from 48 to 64 bits but the old object code still works.
>>>
>>> I'm kind of surprised nobody else does this.  I suppose only IBM had
>>> sufficient
>>> control over both the hardware and software to make it work.
>>
>> You have raised an interesting question.  I have been thinking about
>> it for a few days, and I don't think it is as simple as HW/SW control.
>>
>> First, you have to recognize that there is a perhaps subtle, but very
>> important difference between what the Mill is going to do, and what
>> the S/38 etc. did.  Specifically, the Mill "respecializes" the OS.  I
>> believe that IBM rewrote the lowermost kernel stuff from scratch when
>> they went from the proprietary CISC to the Power based systems.
>>
>> This difference means that the Mill is "limited" to a range of
>> changes, i.e. number of FUs, belt size, presence or absence of certain
>> features, etc.  It could not be used for a more fundamental change at
>> the OS level, such as if some hypothetical future Mill abandoned the
>> SAS model or went to a supervisor state based security (I realize
>> unlikely to happen for other reasons).  So while IBM's approach allows
>> more basic changes in the HW design, it comes at the cost of
>> non-automatic OS migration.
>>
>> So, with that in mind, lets look at some hypothetical vendor
>> considering a new system.  What alternatives does he have?
>>
>> 1.    He could go Mill like, with automatic migration of everything at
>> the cost of limiting his flexibility
>>
>> 2.    He could go sort of S/38 like but instead of keeping some
>> intermediate level, keep the (required to be portable) source code
>> with the executable, and recompile if needed for a future machine.
>> After all, if we are talking of an infrequent migration, the extra
>> cost of a full compile versus the "partial compile" of respecializing
>> would be lost in the noise.
>>
>> 3.    He could expect to limit future HW designs to provide
>> compatibility.
>>
>> 4.    He could expect much/most application software to be written to
>> a defined intermediate language such as the JVM.  This is much like
>> the S/38 approach, but substitutes JIT compiles for install time
>> recompiles.
>>
>> 5.    Probably others. . .
>>
>> Now let's look at what actually happened in some real situations.
>>
>> While WinTel was two companies, they worked closely together and had a
>> near monopoly for some years.  Intel mostly went with #3 (with the
>> exception of Itanium), but Microsoft sort of went with the S/38 model
>> (different user compatible interfaces for different HW), but didn't do
>> any automatic migration, e.g. from X86 to Alpha or MIPS.
>>
>> Apple had complete control over the Mac environment and went with
>> essentially #2, but without the automatic recompiles.  This allowed
>> major changes to the underlying hardware, mostly without too many
>> problems.
>>
>> Apple again, but this time with the iPhone.  Seems like mostly #4.
>>
>> While not a "company", Linux and the GCC programs essentially went
>> with a different variant.  They used their version of C as essentially
>> a public (as opposed to S/38's proprietary) intermediate language, and
>> expect small changes and recompiles for different underlying
>> architectures.
>>
>> While most hardware CPU vendors have little interest in making it easy
>> to migrate to non compatible future hardware, they have great interest
>> in allowing easy migration to their future CPUs.  They accomplish this
>> by doing transparently some of what the Mill requires a
>> respecialization for (i.e. increasing the number of FUs), and providing
>> "compatibility" modes such as what allows you to run a 30 year old
>> S/360 program on a current system.
>>
>> So, overall, I think the answer to your question is "It's
>> complicated!"   I think other vendors have evolved different solutions
>> that provide similar capabilities (not as good in some areas, better
>> in others). While what S/38 did was revolutionary at the time, and
>> quite "elegant", the requirements that drove it led to multiple
>> solutions.
>>
>> However, I do think it might be a good idea for a system to
>> automatically keep the source code with the object code to allow for
>> automatic recompiles.
>
> What's the distribution medium? Which app vendors would be willing to
> ship their source to their customers?

Yes, similar to John's point. I admit I was thinking of customer written
code. So let me respond two ways.

One, there could be a distinction between customer *written* software
and customer *purchased* (actually, I suppose licensed) software. The
vendor licensed software would be allowed to not provide the source.
This seems reasonable, as the vendor would amortize the cost of
recompile, etc. over many customers, so the support cost per customer is
less. Also, presumably, a vendor is less likely to "lose" the source
code, than a user shop.

My second response is a question that you, Ivan are uniquely positioned
to answer. If a vendor delivers code to users in Genasm form, how much
more difficult would it be for a user to do whatever he wanted to do
than if he had the source? I presume it is somewhere between source and
actually executable code, but where on the spectrum is it? Do you
expect any problems with third party vendors? BTW, I don't know the
answer to this for S/38. :-(

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Advantages of in-order execution

<sumegr$6t2$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=23646&group=comp.arch#23646

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Advantages of in-order execution
Date: Thu, 17 Feb 2022 22:25:47 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sumegr$6t2$1@gioia.aioe.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<jwvwnhwvnlj.fsf-monnier+comp.arch@gnu.org>
<70f6512f-243d-4492-80e6-299b40361ea5n@googlegroups.com>
<2022Feb16.133409@mips.complang.tuwien.ac.at>
<465d3ca4-af43-4200-9789-8007628e7565n@googlegroups.com>
<jwv7d9u657q.fsf-monnier+comp.arch@gnu.org>
<5d5ee8d1-bfdb-4ec8-837f-b4bb0ed85bf0n@googlegroups.com>
<96fd5b80-3549-4f5d-8736-deb7a3c2bf68n@googlegroups.com>
<68ac4530-123a-4bfd-bbe1-5e75beb3db1cn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="7074"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10.2
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 17 Feb 2022 21:25 UTC

MitchAlsup wrote:
> On Thursday, February 17, 2022 at 9:08:22 AM UTC-6, Quadibloc wrote:
>> On Wednesday, February 16, 2022 at 11:46:31 AM UTC-7, MitchAlsup wrote:
>>
>>> What is missing is a vonNeumann model of parallelism where HW provides a
>>> few instructions to perform this new model and because of its simplicity
>>> elegance and functionality, SW can easily find ways to utilize that new model.
>>
>>> Right now HW does not know what to build and SW does not know exactly
>>> what to ask for. HW stumbles around adding TS, CAS, DCAW, LL-SC, ad
>> infinitum. SW tries to use the new feature and finds it very difficult to use in
>>> practice. This CLEARLY shows that what HW is trying to supply is not what
>>> SW wants to consume; yet SW does not know on the intimate level what to
>>> ask HW to build, so the circle continues.
> <
>> If this is the problem, does that mean that the problem has a solution?
> <
> I believe there is a solution.
> Arguably, the best solution at the moment is arithmetic in memory.
> <
> Order of arrival dictates who sees what, and there is no (or minimal) cache
> line thrashing.
> <
> But look at how paper flows through an office: mail clerk walks around
> and drops batches of paper on various workers. Workers start taking
> paper off the top and work on whatever showed up. Lock free.
> <
> Janet over in the corner asks Jay 2 desks down if he has seen <blah>
> and Jason drops 2 papers on her desk, but Jane overheard and drops
> 3 more papers on Janet's desk. It is the Jane-overhearing part that
> we do not understand at the HW level. The mechanics we understand, but
> how to filter the wheat from the chaff gets us.
>>
>> It's certainly true that experts in writing programs aren't experts in building
>> CPUs and vice versa. But there are people who study the science of
>> designing algorithms, and whether or not certain problems can be solved
>> in O(n), O(n log n), or O(n^2) and stuff like that.
> <
> When it takes BigO(n^2) cache line movements to allow 1 unit of work
> under contention, it ends up taking BigO( n^3 ) cache line moves to
> get n producers matched with n consumers.
> <
> Under contention even CAS is BigO( N^2 ).
> <
> There are certain circumstances where ESM can reduce BigO( n^3 ) to BigO( 3 )
> {yes, that is right: no n},
> <
> However, arithmetic in memory is also BigO( 3 ) and required in several SW
> architectures (OpenGL for instance).

The closest x86 gets to arithmetic in memory might be when you use XADD
to pick work items off a queue: you have the exact same cache line
thrashing issue, but you are at least making forward progress, so O(n) or
thereabouts cache lines to let one of the N threads make progress.

For heavy contention situations it does work to create a dual-stage lock
array as I've outlined here before, but this is still just a workaround;
the real solution is to find an approach which removes the need for N
threads to access the same cache line.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sumeru$btn$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=23647&group=comp.arch#23647

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!rd9pRsUZyxkRLAEK7e/Uzw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 22:31:42 +0100
Organization: Aioe.org NNTP Server
Message-ID: <sumeru$btn$1@gioia.aioe.org>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at>
<jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org>
<2022Feb15.194310@mips.complang.tuwien.ac.at>
<suh34c$3pd$1@newsreader4.netcologne.de> <suidol$1b8d$1@gioia.aioe.org>
<suigkc$nkv$1@dont-email.me> <vG8PJ.15176$GjY3.1981@fx01.iad>
<sukuuk$1qnd$1@gioia.aioe.org> <sulsti$9e0$1@newsreader4.netcologne.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="12215"; posting-host="rd9pRsUZyxkRLAEK7e/Uzw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10.2
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 17 Feb 2022 21:31 UTC

Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>
>> while (l < r) {
>> pivot = (l+r)>>1;
>> if (arr[pivot] < target) l = pivot;
>> else r = pivot;
>> }
>
> How will this loop terminate if, for example, l=0 and r=1 ?
>
Oops! :-)

while (l+1 < r)

would probably work better; you still need to check for nearest vs exact
match at the end.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sumj0d$ok$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23651&group=comp.arch#23651

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 14:42:20 -0800
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <sumj0d$ok$1@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org>
<2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com>
<suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at>
<suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at>
<sugjhv$v6u$1@dont-email.me> <2022Feb17.110809@mips.complang.tuwien.ac.at>
<suloft$i1r$1@dont-email.me> <jwvy22934m1.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 17 Feb 2022 22:42:22 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4cdbdab641c35097c9e27d444e25d5dd";
logging-data="788"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+eTG34KBYscXn6pVGmRaJi"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:CmzbToQqsbujVhfyhVrr52xp6Js=
In-Reply-To: <jwvy22934m1.fsf-monnier+comp.arch@gnu.org>
Content-Language: en-US
 by: Ivan Godard - Thu, 17 Feb 2022 22:42 UTC

On 2/17/2022 7:20 AM, Stefan Monnier wrote:
> Ivan Godard [2022-02-17 07:09:47] wrote:
>> Funny, we have no problem doing static scheduling together with runtime
>> branch prediction. Runahead prediction, even.
>
> AFAICT he was talking about having the ability for the compiler to
> change the static schedule according to runtime branch prediction.
>
>
> Stefan

Why would you want to do that? The static schedule is perfect within a
block, and width and branch prediction take care of what happens between blocks.

I suppose on a narrow legacy machine you might want to redo the schedule
to merge two (predicted consecutive) blocks so as to hoist the loads
across the branch. But, as explained at length here and in our doc, Mill
hoists loads across branches anyway.

Anton is very skilled and knowledgeable; I just wish he'd complain about
the architecture we have instead of the architecture we don't.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<sumj5h$ok$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23652&group=comp.arch#23652

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
Date: Thu, 17 Feb 2022 14:45:05 -0800
Organization: A noiseless patient Spider
Lines: 142
Message-ID: <sumj5h$ok$2@dont-email.me>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de>
<su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de>
<suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com>
<suj84u$cjv$1@dont-email.me> <sulnop$2j0$1@dont-email.me>
<sult8j$ou6$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 17 Feb 2022 22:45:05 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4cdbdab641c35097c9e27d444e25d5dd";
logging-data="788"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Tglki9lo0FJkKCKU2tCHt"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:BZzknR62QYdTBiCpyO+gbLLwamw=
In-Reply-To: <sult8j$ou6$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Thu, 17 Feb 2022 22:45 UTC

On 2/17/2022 8:31 AM, Stephen Fuld wrote:
> On 2/17/2022 6:57 AM, Ivan Godard wrote:
>> On 2/16/2022 8:18 AM, Stephen Fuld wrote:
>>> On 2/13/2022 9:52 AM, John Levine wrote:
>>>> According to Ivan Godard  <ivan@millcomputing.com>:
>>>>> In Mill the translation happens once, at install time, via a
>>>>> transparent
>>>>> invocation of the specializer (not a recompile, and the source code is
>>>>> not needed). In a microcoded machine (or a packing/cracking scheme)
>>>>> the
>>>>> translation takes place at execution time, every time. Either way, it
>>>>> gets done, and either way permits ISA extension freely. They differ in
>>>>> power, area, and time; the choice is an engineering design dimension.
>>>>>
>>>>> I assert without proof that Mill-style software translation in fact
>>>>> permits ISA extension that is *more* flexible and powerful than
>>>>> what can
>>>>> be achieved by hardware approaches. YMMV.
>>>>
>>>> If one believes in proof by example, this approach has been wildly
>>>> successful
>>>> in the IBM S/38, AS/400, and whatever it is called now, something
>>>> something i.
>>>>
>>>> You can take 30-year-old TIMI object code and run it on current hardware
>>>> at full speed,
>>>> supporting their exotic object architecture with 128 bit pointers.
>>>> I believe the
>>>> architecture had one significant change in the 1990s, with addresses
>>>> expanding
>>>> from 48 to 64 bits but the old object code still works.
>>>>
>>>> I'm kind of surprised nobody else does this.  I suppose only IBM had
>>>> sufficient
>>>> control over both the hardware and software to make it work.
>>>
>>> You have raised an interesting question.  I have been thinking about
>>> it for a few days, and I don't think it is as simple as HW/SW control.
>>>
>>> First, you have to recognize that there is a perhaps subtle, but very
>>> important difference between what the Mill is going to do, and what
>>> the S/38 etc. did.  Specifically, the Mill "respecializes" the OS.  I
>>> believe that IBM rewrote the lowermost kernel stuff from scratch when
>>> they went from the proprietary CISC to the Power based systems.
>>>
>>> This difference means that the Mill is "limited" to a range of
>>> changes, i.e. number of FUs, belt size, presence or absence of
>>> certain features, etc.  It could not be used for a more fundamental
>>> change at the OS level, such as if some hypothetical future Mill
>>> abandoned the SAS model or went to a supervisor state based security
>>> (I realize unlikely to happen for other reasons).  So while IBM's
>>> approach allows more basic changes in the HW design, it comes at the
>>> cost of non-automatic OS migration.
>>>
>>> So, with that in mind, let's look at some hypothetical vendor
>>> considering a new system.  What alternatives does he have?
>>>
>>> 1.    He could go Mill like, with automatic migration of everything
>>> at the cost of limiting his flexibility
>>>
>>> 2.    He could go sort of S/38 like but instead of keeping some
>>> intermediate level, keep the (required to be portable) source code
>>> with the executable, and recompile if needed for a future machine.
>>> After all, if we are talking of an infrequent migration, the extra
>>> cost of a full compile versus the "partial compile" of respecializing
>>> would be lost in the noise.
>>>
>>> 3.    He could expect to limit future HW designs to provide
>>> compatibility.
>>>
>>> 4.    He could expect much/most application software to be written to
>>> a defined intermediate language such as the JVM.  This is much like
>>> the S/38 approach, but substitutes JIT compiles for install time
>>> recompiles.
>>>
>>> 5.    Probably others. . .
>>>
>>> Now let's look at what actually happened in some real situations.
>>>
>>> While WinTel was two companies, they worked closely together and had
>>> a near monopoly for some years.  Intel mostly went with #3 (with the
>>> exception of Itanium), but Microsoft sort of went with the S/38 model
>>> (different user compatible interfaces for different HW), but didn't
>>> do any automatic migration (e.g. from X86 to Alpha or MIPS).
>>>
>>> Apple had complete control over the Mac environment and went with
>>> essentially #2, but without the automatic recompiles.  This allowed
>>> major changes to the underlying hardware, mostly without too many
>>> problems.
>>>
>>> Apple again, but this time with the iPhone.  Seems like mostly #4.
>>>
>>> While not a "company", Linux and the GCC programs essentially went
>>> with a different variant.  They used their version of C as
>>> essentially a public (as opposed to S/38's proprietary) intermediate
>>> language, and expect small changes and recompiles for different
>>> underlying architectures.
>>>
>>> While most hardware CPU vendors have little interest in making it
>>> easy to migrate to non compatible future hardware, they have great
>>> interest in allowing easy migration to their future CPUs.  They
>>> accomplish this by doing transparently some of what the Mill requires
>>> a respecialization for (i.e. increasing the number of FUs), and
>>> providing "compatibility" modes such as what allows you to run a 30
>>> year old S/360 program on a current system.
>>>
>>> So, overall, I think the answer to your question is "It's
>>> complicated!"   I think other vendors have evolved different
>>> solutions that provide similar capabilities (not as good in some
>>> areas, better in others). While what S/38 did was revolutionary at
>>> the time, and quite "elegant", the requirements that drove it led to
>>> multiple solutions.
>>>
>>> However, I do think it might be a good idea for a system to
>>> automatically keep the source code with the object code to allow for
>>> automatic recompiles.
>>
>> What's the distribution medium? Which app vendors would be willing to
>> ship their source to their customers?
>
> Yes, similar to John's point. I admit I was thinking of customer written
> code.  So let me respond two ways.
>
> One, there could be a distinction between customer *written* software
> and customer *purchased* (actually, I suppose licensed) software.  The
> vendor licensed software would be allowed to not provide the source.
> This seems reasonable, as the vendor would amortize the cost of
> recompile, etc. over many customers, so the support cost per customer is
> less.  Also, presumably, a vendor is less likely to "lose" the source
> code, than a user shop.
>
> My second response is a question that you, Ivan are uniquely positioned
> to answer.  If a vendor delivers code to users in Genasm form, how much
> more difficult would it be for a user to do whatever he wanted to do
> than if he had the source?  I presume it is somewhere between source and
> actually executable code, but where on the spectrum is it?  Do you
> expect any problems with third party vendors?  BTW, I don't know the
> answer to this for S/38.  :-(

GenAsm can be decompiled, like any assembly language or binary. I
suppose it might be a little easier as it is already in SSA form.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb17.231845@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23654&group=comp.arch#23654

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Thu, 17 Feb 2022 22:18:45 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 70
Message-ID: <2022Feb17.231845@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="4d6cc5dec2d8add5b56da1d9cd0a2277";
logging-data="27876"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GOYS2Yg/bD3hsH5D1frIP"
Cancel-Lock: sha1:AcLG9bKBflgqkgJjQYteMg7sSnE=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 17 Feb 2022 22:18 UTC

Scott Smader <yogaman101@yahoo.com> writes:
>Fine results. Thank you. But at least the A53/A73 numbers don't prove
>your claim. The A73 cores are twice as big as the A53 cores, according
>to https://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled/3.
>So a 2x performance improvement indicates equal efficiency per unit
>area, not superior.

Which would be relevant if the transistor budget had stopped at the
A53. Currently we are at transistor budgets where Ampere puts 80 or
128 Neoverse N1 cores (bigger than the A73) on a die. Why do they put
that many N1 cores on a die instead of N times the number of A53 cores
or M>N times the number of single-issue cores? My guess is that 1)
even on these CPUs intended for the cloud, single-thread performance
sometimes matters; and 2) the overhead of connecting N or M times as
many cores costs enough that having fewer, but bigger, cores is
preferable. Note how Tilera did not succeed with its many simple
cores against the competition's fewer, bigger cores. Sun went from
the narrow in-order UltraSPARC T1 through the OoO SPARC T4 to
eventually the wide OoO SPARC M7.

>> That's the fallacy that caused Intel and HP to waste billions on
>> IA-64, and Transmeta investors to invest $969M, most of which was
>> lost.
>
>Agreed that Transmeta and IA-64 failed. Disagree that two failures
>prove static scheduling's efficiency to be a fallacy.

>> "Speculation transistors"? Ok, you can remove the branch predictor
>> and spend the transistors for an additional FU. The result for much
>> of the software will be that there will be done much less in each
>> cycle, because there are far fewer instructions ready for execution on
>> average without speculative execution, and therefore less work will be
>> done each cycle. If you also remove OoO execution, even fewer
>> instructions will be executed each cycle. So you have more FUs
>> available, but they will be idle.
>
>I've said statically scheduled needs VLIW which you've ignored by
>comparing equal issue width cores.

There are not that many wide in-order cores, because in-order has low
utilization of additional FUs; and as the example of the three-wide
180nm Celeron beating the six-wide 180nm Itanium on performance, IPC,
power consumption, and area shows, wider cores don't help if the code
does not have that much instruction-level parallelism.

>Whether fewer instructions are executed per cycle is less important
>than the number of results produced by the FUs per cycle. VLIW lets
>more FUs be active per cycle.

If there is not enough parallelism in the program, having more FUs at
your disposal does not help.

>And VLIW doesn't necessarily see execution delays for NOP
>placeholders.

So what? Ordinary architectures don't need such placeholders in the
first place.

> OoO complexity goes up approximately O(n^2) with the number of
> FUs, right?

That has been the argument for EPIC. I don't know if it's right, and
if it plays a role compared to the quadratic number of bypasses that a
VLIW implementation probably has just like an implementation of an
ordinary architecture.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb18.072804@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23659&group=comp.arch#23659

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 06:28:04 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 14
Message-ID: <2022Feb18.072804@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <suerog$cd0$1@dont-email.me> <2022Feb15.124937@mips.complang.tuwien.ac.at> <sugjhv$v6u$1@dont-email.me> <jwvfsokdrwq.fsf-monnier+comp.arch@gnu.org> <sugkji$6vi$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="4d6cc5dec2d8add5b56da1d9cd0a2277";
logging-data="23428"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+SWGVHi7jJI0SOSa3ZmwBK"
Cancel-Lock: sha1:Fw22gLirXRo6prgctrkrOOOUA1Y=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 18 Feb 2022 06:28 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>BTW, does that present an
>opportunity for some sort of profile driven optimization where the run
>time branch history is fed back to a future compilation for better
>optimization?

The work on static branch prediction used profile results as the gold
standard, but the result was still ~10% mispredictions (and ~20%
without profile data).
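As a sketch of what "feeding profile knowledge to the compiler" looks like at the source level: GCC and Clang provide the real builtin `__builtin_expect`, which biases block layout at compile time the same way profile data would (the `LIKELY`/`UNLIKELY` macro names and the `count_errors` example are invented for illustration):

```c
#include <assert.h>

/* Static branch hints: the hand-written analogue of profile-driven
   prediction.  They influence code layout at compile time only; they
   cannot adapt at run time, which is one reason ~10% mispredictions
   remain even with perfect profile data. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: nonzero error codes are assumed to be rare. */
static int count_errors(const int *codes, int n)
{
    int errs = 0;
    for (int i = 0; i < n; i++)
        if (UNLIKELY(codes[i] != 0))
            errs++;
    return errs;
}
```

Whether the hint helps depends on whether the assumed bias matches the data, which is exactly the gap between static and dynamic prediction.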

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb18.073552@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23660&group=comp.arch#23660

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 06:35:52 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 129
Message-ID: <2022Feb18.073552@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com> <sugu4m$9au$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="4d6cc5dec2d8add5b56da1d9cd0a2277";
logging-data="22096"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18gRRSPpgK7HL8UtbNjJgbj"
Cancel-Lock: sha1:g5M+x3lu9lZKZziWDgKGUQQ0KP4=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 18 Feb 2022 06:35 UTC

BGB <cr88192@gmail.com> writes:
>Also the thing I was getting at:
> Assume a future which is transistor-budget limited.

That's not just the future, that is the present and the past.

>Unless the OoO cores deliver superior performance relative to their
>transistor budget, they have a problem in this case.

Just like they have now.

>OoO has tended to be faster, but at the cost of a higher transistor budget.

Yes, and OoO has been worth it for the past 25 years, with only ARM
disagreeing for their little cores (A53, A55, A710).

>Though, one can't ignore one big drawback of in-order designs:
> They do not deal well with cache misses.

If that was the problem, we would have seen Itaniums and other
in-order cores (e.g. 21164) with large L1 caches. Instead, they have
small L1 caches relative to competing OoO designs. That's because the
bigger problem for in-order cores is cache hit latency.

>One can use prefetching to good effect, but this may require involvement
>from the programmer, and C doesn't really have a "good" way to express a
>prefetch operation.

Actually software prefetching has been reported as often being
counterproductive. My guess is that (automatic) hardware prefetching
works pretty well, and that software prefetching too often consumes
more resources than justified by the benefit it gives. And yes, as
you point out, prefetching too early is also a problem.
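For concreteness, this is what an explicit software prefetch looks like with the GCC/Clang builtin `__builtin_prefetch` (a real builtin; the linked-list walk around it is a made-up sketch, and per the caveat above it may well cost more than it saves):

```c
#include <assert.h>
#include <stdlib.h>

struct node { struct node *next; long val; };

/* Walk a linked list, prefetching one node ahead of the one being
   summed.  __builtin_prefetch never faults, so passing a possibly
   NULL pointer is safe.  Whether this wins anything depends on the
   allocator's layout and the hardware prefetcher -- measure first. */
static long sum_list(const struct node *p)
{
    long s = 0;
    while (p) {
        if (p->next)
            __builtin_prefetch(p->next->next, /*rw=*/0, /*locality=*/1);
        s += p->val;
        p = p->next;
    }
    return s;
}

/* Build the list 1..n so the sketch is testable (leaks; demo only). */
static struct node *make_list(long n)
{
    struct node *head = NULL;
    for (long i = n; i >= 1; i--) {
        struct node *x = malloc(sizeof *x);
        x->val = i;
        x->next = head;
        head = x;
    }
    return head;
}
```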

>If one could potentially put another (watered down) set of fetch/decode
>stages just ahead of the current position, it could be possible to use
>them as a prefetcher. Less clear is how to do this "effectively".
>
>One option would be to partly duplicate and delay the execute stages, say:
> IF ID1 (V_ID2 V_EX1 V_EX2 V_EX3) ID2 EX1 EX2 EX3 WB
>
>With the V_* stages mostly serving as "what if" stages, their results
>not being saved to the register file, but any loads/stores resulting in
>an L1 prefetch (stores are not written, and the L1 does not stall on a
>miss).

I really doubt that would help. If the load does not depend on some
address/index computation that arrives just before the load, IA-64 has
no problem moving the load to an earlier time, absorbing the cache
miss latency. If the load does depend on an earlier address/index
computation, your what-if stages will fetch from the wrong address.

But we do have hardware predictors that work by watching for linear
patterns in actual accessed addresses. Does not help for non-linear
patterns, though, but sometimes you get linear patterns even in
pointer-chasing code, because the nodes of the data structure happened
to be allocated linearly in memory.
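The linear-pattern watching described above can be sketched as a per-PC stride table entry; the field names and the confidence threshold of 2 here are invented for illustration, not any specific CPU's design:

```c
#include <assert.h>
#include <stdint.h>

/* One toy stride-predictor entry, as a hardware prefetcher might keep
   per load PC: remember the last address and stride; once the same
   stride repeats, predict last + stride.  Returns the predicted next
   address, or 0 while there is no confident prediction.  Non-linear
   (e.g. random pointer-chasing) addresses never build confidence. */
struct stride_entry { uint64_t last; int64_t stride; int conf; };

static uint64_t stride_observe(struct stride_entry *e, uint64_t addr)
{
    int64_t d = (int64_t)(addr - e->last);
    if (e->last != 0 && d == e->stride) {
        e->conf++;
    } else {
        e->stride = d;
        e->conf = 0;
    }
    e->last = addr;
    return e->conf >= 2 ? addr + e->stride : 0;
}
```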

>One could also possibly try to work around some things if the compiler
>were able to figure out when and where cache misses will occur (vs
>current compilers which have no real way of knowing this).

Interestingly, work on computing the worst-case execution time (WCET)
has figured out in some cases where cache hits will occur; it does
require true LRU, though; the usual pseudo-LRU means that a 4-way
(8-way) associative cache has to be treated as 2 (4) times smaller
2-way associative cache, because only for the two most recent accesses
to a set pseudo-LRU has the same results as true LRU.
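The "two most recent accesses" property can be seen directly in tree pseudo-LRU for a 4-way set. This is the textbook scheme; the particular bit convention below is one common choice, not any specific CPU's:

```c
#include <assert.h>

/* Tree pseudo-LRU for one 4-way set: three bits, each pointing at the
   side holding the next victim (1 = higher-numbered side).  After
   touching any two distinct ways, the victim is guaranteed to be
   neither of them -- but that is ALL pseudo-LRU guarantees, which is
   why WCET analysis must treat the set like a smaller true-LRU one. */
struct plru4 { unsigned root, left, right; };

static void plru_touch(struct plru4 *s, int way)
{
    if (way < 2) { s->root = 1; s->left  = (way == 0); }  /* point away */
    else         { s->root = 0; s->right = (way == 2); }
}

static int plru_victim(const struct plru4 *s)
{
    return s->root ? (s->right ? 3 : 2) : (s->left ? 1 : 0);
}
```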

>But, rather, "Could the core one could fit in the same transistor budget
>as an OoO core deliver better performance?".

It also depends on workload. For stuff where SIMD shines, a scalar
>in-order can be competitive with a scalar OoO; but for this stuff it's
better to add SIMD.

>Thus far, it hasn't really come to this, mostly because the transistor
>budget has been steadily increasing; but that will not necessarily
>hold true once it stops.

Why would that make a difference?

>It is like the issue of all the ridiculously inefficient coding
>practices during the "MHz keeps getting bigger" era, and then single
>threaded performance hits a wall, and suddenly people are finding that
>writing efficient code matters again.

Are they?

>We have had around a decade without much improvement in single-thread
>performance.

It's actually close to 20 years.

Looking at factors of 2 (or more) in performance on our LaTeX
benchmark:

      sec
1992  93.4    486DX2/66
1994  33.9    Pentium 100
1997  15.4    Pentium MMX 233
1998   7.6    Celeron 333
1999   3.04   Pentium III 750
2001   1.44   Athlon XP1800+ (1533MHz)
2006   0.652  Core 2 Duo E6600 (2400MHz)
2015   0.294  Core i5-6600K 4000MHz
2021   0.175  Xeon W-1370P 5200MHz

The last result is less than a factor of 2, but it is the fastest
result to date.

>People will also have a lot more time to try to work out the "magic
>compiler" issues, ...

We have had from the appearance of microcode in the 1960s through
Metaflow and Cydrome in the 1980s until people lost faith in IA-64
around 2010 to work them out. It's unlikely that a breakthrough will
be found later.

>I am also not thinking of "OoO core vs in-order core running same ISA",
>but rather, OoO core vs a VLIW core at a similar transistor budget.

Itanium II 900MHz loses against Celeron 800MHz despite having a higher
transistor and power budget.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb18.093152@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23661&group=comp.arch#23661

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 08:31:52 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 20
Distribution: world
Message-ID: <2022Feb18.093152@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <su9j56$r9h$1@dont-email.me> <suajgb$mk6$1@newsreader4.netcologne.de> <suaos8$nhu$1@dont-email.me> <subggb$2vj5$1@gal.iecc.com> <subiog$cp8$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <sudb0g$rq3$1@dont-email.me> <2022Feb15.104639@mips.complang.tuwien.ac.at> <sugurh$h1$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="4d6cc5dec2d8add5b56da1d9cd0a2277";
logging-data="28920"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18aieIMdoNLY95TflvdRp33"
Cancel-Lock: sha1:lihtxI63lTKrxbPm1utGJyn03E0=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 18 Feb 2022 08:31 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>> In any case, I expect no problems when moving virtual
>> machines from one microarchitecture to another, as long as the second
>> microarchitecture supports the same instruction set extensions as the
>> first one.
>
>I remember reading that ARM has a problem with SVE when moving
>between microarchitectures with different lengths of execution
>units, in effect restricting the vector length to the minimum,
>128 bits.

Yes, that's a design flaw in SVE (where the major goal was to make it
SIMD-width-agnostic). I have no idea how to fix it while still having
architectural SIMD registers, though. It should be solvable for
memory-to-memory vector machines, and certainly for VVM.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<2022Feb18.093920@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23662&group=comp.arch#23662

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 08:39:20 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 49
Distribution: world
Message-ID: <2022Feb18.093920@mips.complang.tuwien.ac.at>
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at> <7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me> <2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com> <2022Feb15.120729@mips.complang.tuwien.ac.at> <jwv35kkfe8h.fsf-monnier+comp.arch@gnu.org> <2022Feb15.194310@mips.complang.tuwien.ac.at> <suh34c$3pd$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="4d6cc5dec2d8add5b56da1d9cd0a2277";
logging-data="27737"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX183kyb9BaQspqB5CWoClTc3"
Cancel-Lock: sha1:tDrK8q8bEwI/uYZHNl4jVR9UiIM=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 18 Feb 2022 08:39 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> And because compiler
>> branch prediction (~10% miss rate)
>
>That seems optimistic.

That's the results I remember from papers for profile-based static
branch prediction. Without profile data it was more like 20%.

>>is much worse than dynamic branch
>> prediction (~1% miss rate, both numbers vary strongly with the
>> application, so take them with a grain of salt),
>
>What is the branch miss rate on a binary search, or a sort?
>Should be close to 50%, correct?

Typically, in a binary search, you have a loop branch that's
predictable (even the loop exit is very predictable), and a
data-dependent branch which may be less predictable. Whether it
really is less predictable depends on the actual data. Yes, if you
have a uniform random distribution of values to search for, you will
see 50% mispredictions on that branch (i.e., ~25% overall).

But a uniform random distribution of data is not that common, and a
power law (e.g., Zipf's law) is more common; plus, there is typically
some local clustering and patterns that help dynamic branch predictors
and that is not reflected in global observations like Zipf's law.
Still, if-converting that branch as Terje Mathisen has suggested is a
good idea.
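The if-conversion point can be sketched in C: the data-dependent comparison becomes a conditional select that compilers typically emit as CMOV/CSEL, leaving only the easily predicted loop branch. This is the standard branchless lower-bound formulation, not anyone's posted code:

```c
#include <assert.h>
#include <stddef.h>

/* Branchless binary search (lower bound): the ?: on the comparison
   usually compiles to a conditional move, so there is no data-dependent
   branch for the predictor to miss.  Returns the index of the first
   element >= key in the sorted array a[0..n-1]. */
static size_t lower_bound_branchless(const int *a, size_t n, int key)
{
    if (n == 0)
        return 0;
    const int *base = a;
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half - 1] < key) ? base + half : base;  /* cmov */
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < key);
}
```

The trade-off: both halves' comparisons are on the critical path every iteration, so this wins only when the eliminated branch really was unpredictable.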

The major question in this context is how frequent binary search and
sorting is. I use binary search very rarely (the last time was
probably in the 2000s, for finding a value in a set of intervals);
hashing is more efficient for exact searches. I use sorting only for
preparing data for human consumption, or for join (the Unix tool,
which unfortunately requires sorted input).

In any case, the branch prediction results are what they are. LaTeX
(in Debian 8) has 1.05% branch mispredictions on our benchmark on
Skylake, so maybe the proportion of data-dependent branches is not
that high, or the branch predictor is quite good at using the patterns
in the branches (that result from patterns in the data) to predict
pretty accurately. Probably a mixture of both.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

<d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23663&group=comp.arch#23663

X-Received: by 2002:a05:600c:35d1:b0:37c:d45c:af57 with SMTP id r17-20020a05600c35d100b0037cd45caf57mr9702967wmq.149.1645175416494;
Fri, 18 Feb 2022 01:10:16 -0800 (PST)
X-Received: by 2002:a05:6870:505:b0:c4:7dc0:d726 with SMTP id
j5-20020a056870050500b000c47dc0d726mr2369424oao.249.1645175415930; Fri, 18
Feb 2022 01:10:15 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.128.88.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Feb 2022 01:10:15 -0800 (PST)
In-Reply-To: <2022Feb18.073552@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:ddec:e03:7010:d614;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:ddec:e03:7010:d614
References: <ssu0r5$p2m$1@newsreader4.netcologne.de> <subiog$cp8$1@newsreader4.netcologne.de>
<jwva6euz9bv.fsf-monnier+comp.arch@gnu.org> <2022Feb14.094955@mips.complang.tuwien.ac.at>
<7edb642d-b9c6-4f8d-b7e6-2cc77838d4c6n@googlegroups.com> <suechc$d2p$1@dont-email.me>
<2022Feb14.231756@mips.complang.tuwien.ac.at> <212a9416-9770-41d0-949e-ddffb6fd8757n@googlegroups.com>
<2022Feb15.120729@mips.complang.tuwien.ac.at> <3df393f9-c10f-4a12-9e5e-a2dc856ba5c0n@googlegroups.com>
<sugu4m$9au$1@dont-email.me> <2022Feb18.073552@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d588d582-68f7-41a1-ab1e-1e873fb826b9n@googlegroups.com>
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
instructions in 128 bits
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Fri, 18 Feb 2022 09:10:16 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Fri, 18 Feb 2022 09:10 UTC

On Friday, February 18, 2022 at 12:32:05 AM UTC-7, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:

> >People will also have a lot more time to try to work out the "magic
> >compiler" issues, ...

> We have had from the appearance of microcode in the 1960s through
> Metaflow and Cydrome in the 1980s until people lost faith in IA-64
> around 2010 to work them out. It's unlikely that a breakthrough will
> be found later.

This is an issue that continues to confuse me.

There is no way that a compiler can take the place of out-of-order
circuitry for allowing a computer to deal with cache misses. These aren't
predictable.

But when it comes to register hazards, _that_ part of what OoO does,
I would have thought, can of course be fully matched by changing the
rename registers to explicit registers, and having the compiler allocate
them and schedule the instructions accordingly. So I would have thought
this was already a solved problem.

If the burden of circuitry of OoO is very heavy, therefore, I would have
thought that the burden of making instructions wider, of having more
registers to save and restore, and so on, would be well worth it. But
maybe it isn't, since memory bandwidth is a big issue in its own right
as well.

What I _suspect_, though, is that the main reason the choice is usually
made to go with OoO rather than to go with big register files... is because
of what happened with RISC. Going from 8 or 16 registers to 32 registers
was supposed to be an example of this approach. And at one level of
hardware complexity and performance, it was.

But quickly, as more transistors became available, and more performance
was desired, we got OoO implementations of RISC architectures. Along
came Itanium - and even I quickly saw that this design was built around
one particular generation of microelectronics, and it would soon have to
be replaced with something even bigger and hence incompatible.

Whereas, if you just take a popular CISC architecture, and implement it
with the aid of OoO execution, nobody has to change all their programs, and
everyone is happy.

So the issue isn't that doing away with OoO by explicitly allocating registers
is technically infeasible, or technically inferior. You might be able to get
better performance at lower cost that way. But as long as transistor counts
keep going up, doing it that way means you keep having to change your ISA,
whereas using OoO means you don't.

In that case, the end of Moore's Law doesn't give us "time to solve the
compiler problem" - that problem is already solved. Instead, it means we
can finally stop at 256 registers or however many, as the next generation of
silicon isn't going to come along shortly and make the ISA obsolete.

So that's how I see the end of Moore's Law making VLIW designs possible
once again.

John Savard

Re: instruction set binding time, was Encoding 20 and 40 bit

Message-ID: <2022Feb18.100825@mips.complang.tuwien.ac.at>
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Fri, 18 Feb 2022 09:08:25 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien

Ivan Godard <ivan@millcomputing.com> writes:
>EPIC hardware assumed that there are no intra-bundle hazards,

IA-64 has 128-bit bundles, which are an encoding (and branch target)
unit. On IA-64, a group has no internal register dependencies. A
bundle can contain instructions from two groups, and instructions from
a single group can be in an unlimited number of bundles. This may be
confusing to people coming from classical VLIW, where the VLIW
"instruction" maps to both bundle and group concepts of IA-64 (and
VLIW "operation" maps to IA-64 "instruction").

IA-64 can have memory dependencies within a group. At least that's
what has been reported here quite a while ago. Thinking some more
about it, I guess such cases had to be dealt with through the ALAT
mechanism (i.e., if there was a memory dependency within a group,
software would later execute recovery code).
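The ALAT recovery pattern Anton guesses at can be sketched in C terms. This is a hand-written analogue, not compiler output; on real IA-64 the advanced load is `ld.a`, the check is `chk.a`, and the hardware tracks the load address in an associative table rather than comparing pointers:

```c
#include <assert.h>

/* A C analogue of IA-64 data speculation: hoist a load above a store
   that might alias it (ld.a), then check and recover (chk.a).  Here
   the "ALAT" check is a simple pointer comparison standing in for the
   hardware's address-matching table. */
static int speculate_then_check(int *p, int *q, int v)
{
    int t = *p;      /* advanced load, hoisted above the store   */
    *q = v;          /* store that may or may not alias *p       */
    if (q == p)      /* "ALAT miss": the store hit our address   */
        t = *p;      /* recovery code: redo the load             */
    return t;
}
```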

In the following, you obviously mean "group" when you write "bundle".

>so the
>issue queue and its associated hazard check can be omitted. However,
>there were no such guarantees for inter-bundle hazards, so either the
>entire bundle had to run to completion, or, in later Itaniums, the
>hardware did OOO-like retire hazard checking to deal with varying
>instruction latency.

OoO-like or like an in-order superscalar? This is the first time I
have read about OoO features in IA-64. Where can I read more about
it? And I think the problem with multi-cycle instructions already
existed in the earliest IA-64 implementations (multiply, FP
instructions), so yes, the hardware had to check whether a result is
not yet available, and can only execute the part of the group up to
that instruction if it is; and yes, the hardware designer might have
decided that due to compiler scheduling it does not pay off to
complicate the hardware with such group splitting, and just stalled
the whole core in case of one register not being ready.

>The Itanium used fixed size
>bundles;

Here you use the IA-64 meaning of "bundle"; groups are arbitrarily
long.

>classic VLIW uses variable-size bundles;

According to
<https://en.wikipedia.org/wiki/Multiflow#Innovative_architecture>, the
Trace 7/200 used 256-bit long instructions, the 14/ models 512-bit
instructions, and the 28/ models 1024-bit instruction words, but my
understanding is that each model had a fixed instruction size. Do you
consider Multiflow to not be a classic VLIW?

According to <https://en.wikipedia.org/wiki/Cydrome#Product>, the
numeric processor used a 256-bit wide instruction word, and my
understanding is that this is a fixed size. Do you consider the
Cydrome to not be a classic VLIW?

I consider these machines to be the classic VLIWs, and later VLIWs to
be modern VLIWs.

>This makes the Mill suitable for general-purpose work that would cause a
>classic VLIW to spend too much time in stall, yet still take advantage
>of any schedule variability to do other work while (possibly) waiting
>for external data, and without needing OOO retire hazard hardware.

I assume you mean in-order readiness checking (the interlock that the
MIPS R2000 famously omitted for loads, an omission that gave the
architecture its name), sometimes called a scoreboard (although Mitch
Alsup tells us that that name is inappropriate because the original
CDC 6600 scoreboard was a more sophisticated mechanism with some OoO
potential). The question is how expensive this is. The earliest
RISCs, such as ARM and SPARC, could afford this feature, so the cost
cannot be that high.

>The gain from Mill's split load is limited by the maximal gap (in time)
>between the load-issue and load-retire instructions, which is in turn
>determined by how much work is available to do that neither depends on
>the load result nor is depended on by the load issue. That's easily
>determined by conventional dataflow analysis in the compiler, and is
>exactly the same as the amount of work that an OOO can do while waiting
>for a load to retire.

Do you mean an in-order superscalar? An OoO superscalar often has
fetched and decoded hundreds of instructions ahead, many of which have
already been processed and advanced to the retire queue, many of which
are blocked by dependencies on the load or other non-ready
instructions, but there are also often a number of instructions that
have just become ready (e.g., because they depended on the same
instruction as the load, or because they depended on an instruction
that produced its result in the same cycle as the last parent of the
load), and which may be from places that a compiler cannot easily
schedule (e.g., from behind a polymorphic method call).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <suo3pt$ai1$1@gioia.aioe.org>
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 13:35:09 +0100
Organization: Aioe.org NNTP Server

Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Also the thing I was getting at:
>> Assume a future which is transistor-budget limited.
>
> That's not just the future, that is the present and the past.
>
>> Unless the OoO cores deliver superior performance relative to their
>> transistor budget, they have a problem in this case.
>
> Just like they have now.
>
>> OoO has tended to be faster, but at the cost of a higher transistor budget.
>
> Yes, and OoO has been worth it for the past 25 years, with only ARM
> disagreeing for their little cores (A53, A55, A510).
>
>> Though, one can't ignore one big drawback of in-order designs:
>> They do not deal well with cache misses.
>
> If that was the problem, we would have seen Itaniums and other
> in-order cores (e.g. 21164) with large L1 caches. Instead, they have
> small L1 caches relative to competing OoO designs. That's because the
> bigger problem for in-order cores is cache hit latency.
>
>> One can use prefetching to good effect, but this may require involvement
>> from the programmer, and C doesn't really have a "good" way to express a
>> prefetch operation.
>
> Actually software prefetching has been reported as often being
> counterproductive. My guess is that (automatic) hardware prefetching
> works pretty well, and that software prefetching too often consumes
> more resources than justified by the benefit it gives. And yes, as
> you point out, prefetching too early is also a problem.
>

This is not conclusive, but in my own code I have never found an actual
case where adding prefetch instructions x cache lines in front of the
current position helped.

The best example I have seen didn't use prefetch instructions, but
instead actual byte loads, one per 64 byte cache line, to force load a
page (4K) worth of input data from each of the sources, then a tight
loop merging (FADD? I don't remember the actual op) the two input
buffers and writing the results back to one of them, followed by a final
tight loop that used non-temporal stores to push the 4K out to the real
destination without polluting any caches.

This was on a particular AMD model; it ran 2-3X faster than the
naive/natural code that only touched each data item once instead of
three times. But please note that it had to use actual loads instead of
prefetch ops, since the latter, being hints, would be skipped when all
load buffers were full.
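Roughly, the forced-load half of Terje's trick might look like the sketch below (my reconstruction, not his code). The 64-byte line size is an assumption, and the non-temporal store pass (`_mm_stream_*` on x86) is omitted for portability; the point is that a real load, unlike a prefetch hint, cannot be dropped when the load buffers are full.

```c
#include <assert.h>
#include <stddef.h>

#define LINE 64  /* assumed cache-line size */

/* Force-load a buffer by touching one byte per cache line with an
   actual (volatile) load rather than a prefetch hint.  The running
   sum is returned so the compiler cannot delete the loads. */
static unsigned force_load(const unsigned char *buf, size_t len)
{
    unsigned sink = 0;
    for (size_t i = 0; i < len; i += LINE)
        sink += ((const volatile unsigned char *)buf)[i];
    return sink;
}
```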

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: instruction set binding time, was Encoding 20 and 40 bit

Message-ID: <suoabu$rfk$1@dont-email.me>
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Fri, 18 Feb 2022 06:27:09 -0800
Organization: A noiseless patient Spider

On 2/18/2022 1:08 AM, Anton Ertl wrote:
> Ivan Godard <ivan@millcomputing.com> writes:
>> EPIC hardware assumed that there are no intra-bundle hazards,
>
> IA-64 has 128-bit bundles, which are an encoding (and branch target)
> unit. On IA-64, a group has no internal register dependencies. A
> bundle can contain instructions from two groups, and instructions from
> a single group can be in an unlimited number of bundles. This may be
> confusing to people coming from classical VLIW, where the VLIW
> "instruction" maps to both bundle and group concepts of IA-64 (and
> VLIW "operation" maps to IA-64 "instruction").
>
> IA-64 can have memory dependencies within a group. At least that's
> what has been reported here quite a while ago. Thinking some more
> about it, I guess such cases had to be dealt with through the ALAT
> mechanism (i.e., if there was a memory dependency within a group,
> software would later execute recovery code).
>
> In the following, you obviously mean "group" when you write "bundle".

Yes, if "bundle" means "encoded together" and "group" means "issued
together" in your (and Intel's?) lexicon. In our lexicon, "bundle" means
"issued together", "block" means "encoded together", and there are six
blocks per bundle. Confusing, I agree.

Most of the rest of the discussion seems to be similar terminology
issues and correcting my ignorance of historic details.

<snip>

>
>> The gain from Mill's split load is limited by the maximal gap (in time)
>> between the load-issue and load-retire instructions, which is in turn
>> determined by how much work is available to do that neither depends on
>> the load result nor is depended on by the load issue. That's easily
>> determined by conventional dataflow analysis in the compiler, and is
>> exactly the same as the amount of work that an OOO can do while waiting
>> for a load to retire.
>
> Do you mean an in-order superscalar? An OoO superscalar often has
> fetched and decoded hundreds of instructions ahead, many of which have
> already been processed and advanced to the retire queue, many of which
> are blocked by dependencies on the load or other non-ready
> instructions, but there are also often a number of instructions that
> have just become ready (e.g., because they depended on the same
> instruction as the load, or because they depended on an instruction
> that produced its result in the same cycle as the last parent of the
> load), and which may be from places that a compiler cannot easily
> schedule (e.g., from behind a polymorphic method call).

For other than dependencies on load results (which are the only
variable-latency instructions), any instruction that becomes ready
solely because a depended-on just produced result reflects a dataflow
dependence that is visible in the source: in a+b*c the + depends on the
result of the *. All these have a perfect static schedule and no reorder
will benefit.

Of course, instructions may have other dependencies than pure static
dataflow: they may depend on load (variable) results, or on control
flow. Mill addresses most of these via wide speculation, much like an
OOO with its issue queue entries. The difference appears at a
mispredict. I'll assume that the Mill predictor is roughly as good as
anybody else's predictor, so the only speculation difference between
Mill and typical OOO is how much is speculated.

Mill does massive if-conversion. For the remaining branches, Mill does
just what OOO does: speculatively execute down the predicted path, and
if the prediction was wrong it backs out what it had done. The
difference is that Mill executes many fewer branches, and so has many
fewer mispredicts. The cost, of course, is that it spends resources to
execute instructions from untaken paths when the predictor would have
been right. This was a design choice: pay in the hardware for width to
get equal performance on predictable branches and much better on
unpredictable ones.

There are limits to if-conversion: you run out of width, or you hit the
boundary of compiler sight into the control flow graph. Our experience
is that width is the limit at the low end of the family, but rarely is
at the high end. It is common for the specializer to remove all branches
from a function except the backwards branches of loops.
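If-conversion itself can be shown with a scalar C sketch. This is not Mill code (the Mill predicates whole instruction streams); it just shows the branch-to-select transformation and its cost, namely that both sides are always evaluated:

```c
#include <assert.h>

/* Branchy form: the hardware must predict the comparison. */
static int max_branchy(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* If-converted form: a 0/1 predicate and a select replace the branch,
   so there is nothing to mispredict, at the cost of always doing the
   work of both sides. */
static int max_select(int a, int b)
{
    int take_a = (a > b);            /* predicate */
    return take_a * a + !take_a * b; /* select    */
}
```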

The conversion boundary in the Mill specializer is the whole function;
it does not attempt to optimize across the call boundary, which an OOO
can do. This is a long known, acknowledged, and documented drawback to
the Mill's scheduling. The drawback is countered in part by the hardware
branch predictor, which does run ahead through call boundaries. Still,
in principle an OOO can issue a load from inside a called body before
the call itself has been issued, and a Mill can't do that.

Of course, the specializer can inline functions at non-polymorphic
calls, and the width makes it profitable to do that with more and bigger
functions. And the Mill call protocol eliminates all protocol overhead
(one thing I like about Mitch's ISA is that he too has effectively
eliminated call overhead). Mill phasing executes three cycles into a
called function before the call itself executes, which is sort of a poor
man's OOO with the same advantages, especially when the callee is (after
scheduling to width) very short.

Put all this together and the Mill ISA has eliminated all the OOO
advantages except: OOO will excel on 1) frequently called 2) polymorphic
functions that 3) immediately execute a load whose address arguments are
4) ready and that 5) misses when 6) the hardware is not bandwidth
limited. For that case a Mill will stall in the callee and an OOO likely
will not.

We decided that this case was not worth 12x.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <jwv5ypcxly1.fsf-monnier+comp.arch@gnu.org>
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 10:06:34 -0500
Organization: A noiseless patient Spider

> But when it comes to register hazards, _that_ part of what OoO does,
> I would have thought, can of course be fully matched by changing the
> rename registers to explicit registers, and having the compiler allocate
> them and schedule the instructions accordingly. So I would have thought
> this was already a solved problem.

You forgot control flow. Even when control flow can be fully predicted
(fixed repetitions of loop and direct function calls), it can be really
difficult for the compiler to do as good a job as an OoO, because an OoO
engine can schedule operations across those control flow edges, whereas
doing that for a compiler requires code duplication (loop unrolling and
function inlining), which comes with its own tradeoffs.

IIUC the Mill's "phasing" feature aims to reduce the cost of this
problem by doing some of the scheduling at runtime, in a sense: during
any given cycle, 3 phases are executed, each one from a different
instruction, not necessarily all from the same basic block so the mix of
operations is actually somewhat dynamic.

Stefan

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <suof81$vv$1@dont-email.me>
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 07:50:23 -0800
Organization: A noiseless patient Spider

On 2/18/2022 12:39 AM, Anton Ertl wrote:
> Thomas Koenig <tkoenig@netcologne.de> writes:
>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>> And because compiler
>>> branch prediction (~10% miss rate)
>>
>> That seems optimistic.
>
> That's the results I remember from papers for profile-based static
> branch prediction. Without profile data it was more like 20%.
>
>>> is much worse than dynamic branch
>>> prediction (~1% miss rate, both numbers vary strongly with the
>>> application, so take them with a grain of salt),
>>
>> What is the branch miss rate on a binary search, or a sort?
>> Should be close to 50%, correct?
>
> Typically, in a binary search, you have a loop branch that's
> predictable (even the loop exit is very predictable), and a
> data-dependent branch which may be less predictable. Whether it
> really is less predictable depends on the actual data. Yes, if you
> have a uniform random distribution of values to search for, you will
> see 50% mispredictions on that branch (i.e., ~25% overall).
>
> But a uniform random distribution of data is not that common, and a
> power law (e.g., Zipf's law) is more common; plus, there is typically
> some local clustering and patterns that help dynamic branch predictors
> and that is not reflected in global observations like Zipf's law.
> Still, if-converting that branch as Terje Mathisen has suggested is a
> good idea.
>
> The major question in this context is how frequent binary search and
> sorting is.

Of course, it is application dependent. In the database world, they are
both quite frequent.
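The if-conversion Anton credits to Terje for the data-dependent branch in binary search can be sketched like this (a minimal C sketch; a production version would rely on the compiler turning the predicate multiply into a conditional move):

```c
#include <assert.h>
#include <stddef.h>

/* If-converted ("branchless") binary search: the data-dependent
   comparison feeds an arithmetic index update instead of a branch,
   leaving only the highly predictable loop-exit branch.  Returns the
   first index whose element is >= key, or n if there is none. */
static size_t lower_bound(const int *a, size_t n, int key)
{
    size_t lo = 0, len = n;
    while (len > 0) {
        size_t half = len / 2;
        /* 0/1 predicate times the step: no taken/not-taken decision */
        lo += (size_t)(a[lo + half] < key) * (len - half);
        len = half;
    }
    return lo;
}
```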

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: instruction set binding time, was Encoding 20 and 40 bit

Message-ID: <5276f69f-07bc-4082-bb5b-c371d059d403n@googlegroups.com>
From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit
Date: Fri, 18 Feb 2022 16:42:36 +0000

On Friday, February 18, 2022 at 9:27:14 AM UTC-5, Ivan Godard wrote:
[snip]
> Still,
> in principle an OOO can issue a load from inside a called body before
> the call itself has been issued, and a Mill can't do that.

In theory a caller could start a pickup load that is picked up by the
callee. This would be similar to function-specific interfaces rather
than using a generic ABI. It is not obvious how useful this would be.

> Put all this together and the Mill ISA has eliminated all the OOO
> advantages except: OOO will excel on 1) frequently called 2) polymorphic
> functions that 3) immediately execute a load whose address arguments are
> 4) ready and that 5) misses when 6) the hardware is not bandwidth
> limited. For that case a Mill will stall in the callee and an OOO likely
> will not.

I am not convinced this is the case, but writing on Google Groups from a
tablet is painful.

> We decided that this case was not worth 12x.

I doubt the Mill will get a 12x PPA advantage over an OoO implementation of a conventional ISA like AArch64.

Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits

Message-ID: <suoior$8h4$1@dont-email.me>
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: instruction set binding time, was Encoding 20 and 40 bit instructions in 128 bits
Date: Fri, 18 Feb 2022 08:50:35 -0800
Organization: A noiseless patient Spider

On 2/18/2022 7:06 AM, Stefan Monnier wrote:
>> But when it comes to register hazards, _that_ part of what OoO does,
>> I would have thought, can of course be fully matched by changing the
>> rename registers to explicit registers, and having the compiler allocate
>> them and schedule the instructions accordingly. So I would have thought
>> this was already a solved problem.
>
> You forgot control flow. Even when control flow can be fully predicted
> (fixed repetitions of loop and direct function calls), it can be really
> difficult for the compiler to do as good a job as an OoO, because an OoO
> engine can schedule operations across those control flow edges, whereas
> doing that for a compiler requires code duplication (loop unrolling and
> function inlining), which comes with its own tradeoffs.

True for inlining. In general we pipe rather than unroll.

>
> IIUC the Mill's "phasing" feature aims to reduce the cost of this
> problem by doing some of the scheduling at runtime, in a sense: during
> any given cycle, 3 phases are executed, each one from a different
> instruction, not necessarily all from the same basic block so the mix of
> operations is actually somewhat dynamic.

Yes.

