devel / comp.arch / Re: IA-64 and other parallel failures

Re: IA-64

<s7ij30$q3s$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=16693&group=comp.arch#16693
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-2ea4-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 07:05:04 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <s7ij30$q3s$1@newsreader4.netcologne.de>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
<memo.20210512230643.13980R@jgd.cix.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 13 May 2021 07:05:04 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-2ea4-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:2ea4:0:7285:c2ff:fe6c:992d";
logging-data="26748"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Thu, 13 May 2021 07:05 UTC

John Dallman <jgd@cix.co.uk> wrote:
> In article <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>,
> monnier@iro.umontreal.ca (Stefan Monnier) wrote:
>
>> Or in more constructive terms, from which aspects of IA64 might
>> future computer architects find good inspirations?
>
> It provides examples of ideas that look clever, but should be avoided.
> They include, but are not limited to:
>
[...]
> * Relying on the compiler for problems you can't solve in hardware.

They may be forgiven (in part) for thinking that, because compilers
had certainly solved a similar problem when RISC was introduced.

Without decent register allocation, that architecture approach
would have been dead in the water.

> * Assuming you can solve long-standing compsci problems with manpower.

Ah, the famous Thomas Watson / Seymour Cray quotes...

Thomas Watson, in 1963:

# Last week CDC had a press conference during which they officially
# announced their 6600 system. I understand that in the laboratory
# developing this system there are only 34 people, “including the
# janitor.” Of these, 14 are engineers and 4 are programmers,
# and only one has a Ph. D., a relatively junior programmer. To
# the outsider, the laboratory appeared to be cost conscious, hard
# working and highly motivated.
#
# Contrasting this modest effort with our own vast development
# activities, I fail to understand why we have lost our industry
# leadership position by letting someone else offer the world’s
# most powerful computer. At Jenny Lake, I think top priority should
# be given to a discussion as to what we are doing wrong and how we
# should go about changing it immediately.

Seymour is rumored to have answered...

# It seems like Mr. Watson has answered his own question.

Re: IA-64

<s7j691$siv$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=16700&group=comp.arch#16700
Path: i2pn2.org!i2pn.org!aioe.org!+9JlleTFc3MOERf2LU/SVA.user.gioia.aioe.org.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 14:32:34 +0200
Organization: Aioe.org NNTP Server
Lines: 39
Message-ID: <s7j691$siv$1@gioia.aioe.org>
References: <s7gtk1$csi$1@dont-email.me>
<memo.20210512183631.13980N@jgd.cix.co.uk> <s7h71q$76h$1@dont-email.me>
<s7hcrm$sp1$1@gal.iecc.com> <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
NNTP-Posting-Host: +9JlleTFc3MOERf2LU/SVA.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.7
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 13 May 2021 12:32 UTC

Stefan Monnier wrote:
> John Levine [2021-05-12 20:12:38] wrote:
>> According to Marcus <m.delete@this.bitsnbites.eu>:
>>> I wonder how many actually used and enjoyed using the IA-64. I have a
>>> friend who worked on it (not sure which parts), but if I've read him
>>> correctly he wasn't very fond of the design.
>> VLIW was a good idea in the 1980s but as densities have increased and we can now do in
>> hardware what VLIW did in software, it has become more trouble than it's worth except
>> perhaps in some signal processing niches.
>
> I find this answer entertaining because AFAIK the IA64 architecture
> wasn't VLIW at all (tho it was marketed as some kind of successor to the
> principles of VLIW).
>
> In retrospect, I wonder what the IA64 architecture had going for it
> (other than manpower/money). Or in more constructive terms, from which
> aspects of IA64 might future computer architects find good inspirations?
> The one feature I remember from it is the idea of "bundling"
> instructions such that you can define instructions of funny sizes rather
> than being stuck with 16bit, 32bit, ... (it probably wasn't the first
> ISA that did that, but it was the first where I saw it).

When I studied the first architecture manuals, the CPU was supposed to
drop in just a few years (1997?), while in reality they were 5 years
late, with a partial implementation.

Anyway, at that first inspection it seemed really nice for us asm
programmers, i.e. I could easily see how I could abuse the huge register
set to implement RSA style 512-1024 bit operations at high speed.
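
To make the register-set remark concrete, here is a minimal sketch (in Python, not IA-64 assembly) of schoolbook multiplication on 64-bit limbs. A 1024-bit operand is 16 limbs, so two operands plus their 32-limb product are 64 limb-sized values, which would fit in IA-64's 128 general registers; that fit is the only point being illustrated. The function names are made up for this example.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]

def from_limbs(limbs):
    """Reassemble little-endian limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def mul_limbs(a, b):
    """Schoolbook multiply: one 64x64->128 bit product per inner step."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = out[i + j] + ai * bj + carry   # always fits in 128 bits
            out[i + j] = t & LIMB_MASK         # low half stays in place
            carry = t >> LIMB_BITS             # high half moves up one limb
        out[i + len(b)] = carry
    return out
```

On a real machine each limb would live in a register and the inner step would be a multiply-add with carry; the sketch only shows the data flow.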

There also seemed to be some nice synergies between some of the parts,
but also some obvious warts added to simplify porting/emulating
low-level PA-RISC code.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: IA-64

<31b0ae46-d232-4863-9a2f-e5203cd03f99n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16701&group=comp.arch#16701
X-Received: by 2002:a37:4484:: with SMTP id r126mr1853023qka.18.1620911837983;
Thu, 13 May 2021 06:17:17 -0700 (PDT)
X-Received: by 2002:aca:f40a:: with SMTP id s10mr2903298oih.122.1620911837737;
Thu, 13 May 2021 06:17:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 13 May 2021 06:17:17 -0700 (PDT)
In-Reply-To: <3063c78f-c413-43b4-9ad0-f1842885ad85n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=199.203.251.52; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 199.203.251.52
References: <s7gtk1$csi$1@dont-email.me> <memo.20210512183631.13980N@jgd.cix.co.uk>
<s7h71q$76h$1@dont-email.me> <s7hcrm$sp1$1@gal.iecc.com> <3063c78f-c413-43b4-9ad0-f1842885ad85n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <31b0ae46-d232-4863-9a2f-e5203cd03f99n@googlegroups.com>
Subject: Re: IA-64
From: already5...@yahoo.com (Michael S)
Injection-Date: Thu, 13 May 2021 13:17:17 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Michael S - Thu, 13 May 2021 13:17 UTC

On Thursday, May 13, 2021 at 3:57:29 AM UTC+3, Quadibloc wrote:
> On Wednesday, May 12, 2021 at 2:12:40 PM UTC-6, John Levine wrote:
>
> > VLIW was a good idea in the 1980s but as densities have increased and we can now do in
> > hardware what VLIW did in software, it has become more trouble than it's worth except
> > perhaps in some signal processing niches.
> But if that's true... then how is it that these increased densities haven't prevented Apple's
> M1 chip from outperforming x86 chips, thus showing that RISC has genuine advantages?
>

In the benchmarks I have seen, the Apple M1 is consistently slower than the top bins of AMD Zen3.
That's despite the M1 being built on a more advanced TSMC process.

> Of course, that may be comparing apples and oranges. The complexity of the x86 instruction
> set versus RISC is not necessary, and brings no direct benefits to offset the burden it
> imposes... and, so, it can't be compared to spending transistors on out-of-order hardware,
> which does do something useful, versus trying to do it more simply with VLIW.
>
> John Savard

Re: IA-64

<2021May13.163749@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16703&group=comp.arch#16703
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 14:37:49 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 50
Message-ID: <2021May13.163749@mips.complang.tuwien.ac.at>
References: <s7gtk1$csi$1@dont-email.me> <memo.20210512183631.13980N@jgd.cix.co.uk> <s7h71q$76h$1@dont-email.me> <s7hcrm$sp1$1@gal.iecc.com> <s7hffs$19e$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="9b6fdafab5a0ed867dc02613f2ce9d52";
logging-data="13570"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FLTHIE/Oem8eExafmK/aE"
Cancel-Lock: sha1:cHEeiiO6n+bm8ng5imK5WzSiECM=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 13 May 2021 14:37 UTC

Sarr Blumson <sarr.blumson@alum.dartmouth.org> writes:
>That's certainly been the decision of the market, but it's less
> clear it was wise.

It was the decision of the market based on IA-64 implementations' lack
of competitiveness.

>The stuff "we can now do in hardware" happened
> at compile time, so less critical in performance.

That is the theory that Intel and HP (and others, e.g., Transmeta)
invested a lot of money in. That's understandable, because it's such
a seductive idea that people still fall for it despite the example of
IA-64. But if you look at how IA-64 implementations performed
compared to AMD64 implementations, AMD64 generally had better
integer performance; from <2019Jan7.095118@mips.complang.tuwien.ac.at>:

|System                 Cint        CFP
|                       res   base  res   base  CPU                   Tested    Published
|Intel D850GB           656   640   714   704   Pentium 4 2000MHz     Aug-2001  Sep-2001
|hp rx4610              ---   379   715   715   Intel Itanium 800MHz  Aug-2001  Sep-2001
|IBM pSeries 690 Turbo  839   804   1266  1202  POWER4 1300MHz        Apr-2002  May-2002
|Dell Precision WS 340  922   893   901   878   Pentium 4 2533MHz     May-2002  Jun-2002
|hp workstation zx6000  ---   807   1356  1356  Itanium 2 1000MHz     Jul-2002  Jul-2002

> And putting
> complexity in hardware makes it harder to correct
> errors.

Hmm, it's not as if the IA-64 hardware was simple (that was not its
point).

And it seems to me that errors in performance features can be
corrected relatively straightforwardly these days: The engineers
expect that the hardware may have errors, and put in "chicken bits"
that can disable the feature. So when it turns out that there is an
error, the processor manufacturer releases firmware/microcode that (in
the worst case) turns the feature off. There are also ways to
redirect instruction execution to programmable microcode.
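
A rough sketch of the chicken-bit idea follows. The feature names and bit positions are invented for illustration; real processors keep such bits in implementation-specific configuration registers whose layout is not described in this thread.

```python
# Hypothetical feature-disable ("chicken") bits; names and positions are made up.
DISABLE_LOOP_PREDICTOR = 1 << 0
DISABLE_STORE_FORWARDING = 1 << 1

class Core:
    """Toy model of a core whose performance features can be switched off."""

    def __init__(self):
        self.chicken_bits = 0  # all performance features enabled at reset

    def apply_microcode_update(self, disable_mask):
        """Firmware sets bits to switch off features found to be buggy."""
        self.chicken_bits |= disable_mask

    def feature_enabled(self, bit):
        """A feature runs only while its disable bit is clear."""
        return not (self.chicken_bits & bit)
```

The design point is that the fallback path (feature off) is always present in the silicon, so a fix costs performance rather than a new stepping.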

>Does anyone think x86 won because it's the ideal ISA?

It won because it has a huge ecosystem behind it. But if IA-64
really had delivered what it promised, it would have won, given the
weight of Intel behind it. But IA-64 did not deliver.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64

<2021May13.171529@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16704&group=comp.arch#16704
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 15:15:29 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 68
Message-ID: <2021May13.171529@mips.complang.tuwien.ac.at>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk>
Injection-Info: reader02.eternal-september.org; posting-host="9b6fdafab5a0ed867dc02613f2ce9d52";
logging-data="22531"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+US3j5hVIf5Er5o8kMfr2k"
Cancel-Lock: sha1:9tb2sjvx0/BxnBqolgj5BgFJtCY=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 13 May 2021 15:15 UTC

jgd@cix.co.uk (John Dallman) writes:
>In article <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>,
>monnier@iro.umontreal.ca (Stefan Monnier) wrote:
>
>> Or in more constructive terms, from which aspects of IA64 might
>> future computer architects find good inspirations?
>
>It provides examples of ideas that look clever, but should be avoided.
>They include, but are not limited to:
>
>* Being far too complicated overall.

Yes.

>* Excessive use of predicate registers.

Not sure about excessive. Given the premise of compiler-controlled
instruction-level parallelism, using predication is a good idea.

>* Excessive use of register windowing.

Is there any use that you would not call excessive? It seems to me
that you either use it or not. Again, given the premise of
compiler-controlled ILP, it seems like the way to go for high performance.

>* Modulo-scheduled loops.

Modulo scheduling is pretty close to optimal for large trip counts.
The problem is that it only works for simple loops, and not all loops
have large trip counts. You can if-convert some loops with ifs into
simple loops, but you get away from being close to optimal with that.
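
The trip-count dependence can be seen in a back-of-the-envelope model (a simplification, not taken from any IA-64 document): a software-pipelined loop with initiation interval `ii` and `stages` pipeline stages needs `(trip_count + stages - 1) * ii` cycles, because filling and draining the pipeline costs `stages - 1` extra intervals. The parameter values below are arbitrary.

```python
def modulo_scheduled_cycles(trip_count, ii, stages):
    """Cycles for a software-pipelined loop: fill/drain costs (stages - 1) intervals."""
    return (trip_count + stages - 1) * ii

def efficiency(trip_count, ii, stages):
    """Fraction of the steady-state ideal (trip_count * ii) actually achieved."""
    return trip_count * ii / modulo_scheduled_cycles(trip_count, ii, stages)

for n in (4, 100, 10_000):
    print(n, round(efficiency(n, ii=2, stages=5), 3))
```

With these assumed values the efficiency is 0.5 at 4 iterations and above 0.99 at 10,000, which is the sense in which the schedule is close to optimal only for large trip counts.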

McKinley (Itanium 2) did pretty well on SPECfp, and my guess is that's
because SPECfp spends much of its time in simple loops.

The ISA additions in IA-64 for modulo scheduling (rotating register
files, some of the predication stuff) did not look well-balanced to
me.

>* Relying on the compiler for problems you can't solve in hardware.

If you cannot do it in hardware, by all means do it in the compiler.
The downfall of IA-64 was the reverse: It turned out that
instruction-level parallelism can be solved better in hardware than in
the compiler (certainly outside high-trip-count simple loops), and
that the resulting CPUs can be clocked higher.

>* Assuming you can solve long-standing compsci problems with manpower.

They assumed that they would be able to do more in the compiler than
was eventually achieved; after all, modulo scheduling works nicely, so
why not extend it to more complex control flow.

They also assumed that OoO execution could not be implemented
competitively, and the failure to foresee that it could be done was
the real killer.

The 200MHz Pentium Pro in 1995 might have been the warning sign, but
at the time the in-order 21164 was available at, IIRC, 366MHz, so they
probably thought that with a similar clock advantage over the OoO
competition their ISA features would win the day; and with such an
advantage they would have. But when IA-64 implementations finally
appeared, they had a big clock disadvantage (a factor of 2.5 compared
to the Pentium 4).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64

<2021May13.174624@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16705&group=comp.arch#16705
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 15:46:24 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 29
Message-ID: <2021May13.174624@mips.complang.tuwien.ac.at>
References: <s7gtk1$csi$1@dont-email.me> <memo.20210512183631.13980N@jgd.cix.co.uk> <s7h71q$76h$1@dont-email.me> <s7hcrm$sp1$1@gal.iecc.com> <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <efb88c0e-d657-4be7-a42b-fea513003513n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="9b6fdafab5a0ed867dc02613f2ce9d52";
logging-data="22531"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+3xhwFf5en3yO1v2cLQDim"
Cancel-Lock: sha1:281vM1upMTHR+mRhJoBr2qbzAOQ=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 13 May 2021 15:46 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>But like those which are acknowledged as VLIW, it explicitly indicated
>which of the three component instructions could be executed in
>parallel.

Note that there are two different concepts:

1) bundle: 128 bits encode three instructions and some meta-data.

2) group: instructions that can be executed in parallel on a
sufficiently parallel implementation, except if there are memory
dependencies (the compiler or programmer guarantees that there are no
register dependencies).

Note that a group can start somewhere in the middle of a bundle, then
encompass an arbitrary number of full bundles, and end somewhere in
the middle of a bundle.
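
The distinction can be sketched in a few lines. This is a simplified model: real IA-64 bundles encode stops through a 5-bit template field that sits next to three 41-bit instruction slots, whereas here each slot simply carries an explicit stop flag, and the instruction names are placeholders.

```python
from typing import List, Tuple

Slot = Tuple[str, bool]          # (instruction, stop_after); the flag is a modelling shortcut
Bundle = Tuple[Slot, Slot, Slot]

def groups(bundles: List[Bundle]) -> List[List[str]]:
    """Split a bundle stream into instruction groups at the stops."""
    out, current = [], []
    for bundle in bundles:
        for insn, stop_after in bundle:
            current.append(insn)
            if stop_after:       # a stop closes the current group
                out.append(current)
                current = []
    if current:                  # trailing instructions without a final stop
        out.append(current)
    return out

stream = [
    (("i0", False), ("i1", True),  ("i2", False)),
    (("i3", False), ("i4", False), ("i5", False)),
    (("i6", True),  ("i7", False), ("i8", True)),
]
print(groups(stream))
```

In this example the second group starts in the middle of the first bundle, covers the whole second bundle, and ends in the middle of the third, so bundle boundaries and group boundaries are independent.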

IA-64 has enough deviations from classical VLIW (IMO it's closer to a
RISC with a funny encoding) that their new term "EPIC" is
well-justified.

For an earlier discussion by me on the topic, read
<2003Feb8.091458@a0.complang.tuwien.ac.at>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64

<2021May13.175652@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16706&group=comp.arch#16706
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 15:56:52 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 18
Distribution: world
Message-ID: <2021May13.175652@mips.complang.tuwien.ac.at>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de>
Injection-Info: reader02.eternal-september.org; posting-host="9b6fdafab5a0ed867dc02613f2ce9d52";
logging-data="22531"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ckb3cHKbI/gxNq+n7QXjr"
Cancel-Lock: sha1:IDJ9v4wWh0vSfEk8+F2dDnOg4KA=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 13 May 2021 15:56 UTC

Thomas Koenig <tkoenig@netcologne.de> writes:
>John Dallman <jgd@cix.co.uk> wrote:
>> * Relying on the compiler for problems you can't solve in hardware.
>
>They may be forgiven (in part) for thinking that because compilers
>certainly had solved a similar problem with the RISC introduction.

Indeed. RISC had shown that by relying on compilers the instruction
set could be simplified to allow a fast pipelined implementation.

The idea behind IA-64 was to apply the same approach to
instruction-level parallelism. But it did not work out, because
hardware designers could solve that problem better.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64

<s7jinq$aj6$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16707&group=comp.arch#16707
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 18:05:14 +0200
Organization: A noiseless patient Spider
Lines: 108
Message-ID: <s7jinq$aj6$1@dont-email.me>
References: <s7gtk1$csi$1@dont-email.me>
<memo.20210512183631.13980N@jgd.cix.co.uk> <s7h71q$76h$1@dont-email.me>
<s7hcrm$sp1$1@gal.iecc.com>
<52760db1-07bc-4c94-82b7-bbb81000ddd7n@googlegroups.com>
<s7hoa0$kad$1@dont-email.me> <s7hqs9$2al$1@dont-email.me>
<98b79f68-edc2-48eb-a627-9c9d7548699dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 13 May 2021 16:05:14 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="46af5385d054e079ff2316dbeaf4723f";
logging-data="10854"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+vnVd+h/QP8wgP5bK0bX5AGuFdT7ier0Q="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.8.1
Cancel-Lock: sha1:fEsFJTzTAs5zcMmeiUl2FiDnjEU=
In-Reply-To: <98b79f68-edc2-48eb-a627-9c9d7548699dn@googlegroups.com>
Content-Language: en-US
 by: Marcus - Thu, 13 May 2021 16:05 UTC

On 2021-05-13, MitchAlsup wrote:
> On Wednesday, May 12, 2021 at 7:11:55 PM UTC-5, BGB wrote:
>> On 5/12/2021 6:28 PM, Ivan Godard wrote:
>>> On 5/12/2021 2:37 PM, MitchAlsup wrote:
>>>> On Wednesday, May 12, 2021 at 3:12:40 PM UTC-5, John Levine wrote:
>>>>> According to Marcus <m.de...@this.bitsnbites.eu>:
>>>>>> I wonder how many actually used and enjoyed using the IA-64. I have a
>>>>>> friend who worked on it (not sure which parts), but if I've read him
>>>>>> correctly he wasn't very fond of the design.
>>>> <
>>>>> VLIW was a good idea in the 1980s
>>>> <
>>>> I question this premise. The best I can say about VLIW is that in the
>>>> 1980s we did not know enough about VLIW to state that it was not a good
>>>> idea. I propose that now we lean heavily in that direction (not good)
>>>> hoping that Mill will save VLIW.
>>>> <
>>>> In before Ivan claims Mill is not VLIW.
>>>> <
>>>
>>> <rising to bait>
>>>
>>> What's a VLIW? - rigorous definition please :-)
>>>
>>> Some people make it synonymous with "wide issue", i.e. when instructions
>>> are issued in groups that are asserted to be free of inter-instruction
>>> dependencies. By that definition Mitch's is a pre-parse feeding a VLIW,
>>> which is an idea I think he would choke on. That would also make IA64
>>> and Mill VLIWs, which I feel is an unhelpful blending of
>>> distinct architectural choices.
>>>
>>> I feel that a finer division based on the retire semantics is useful.
>>> There seem to be three categories: 1) everything retires a statically
>>> known number of cycles after issue, and other bundles may issue in the
>>> time between issue and retire; 2) everything retires as soon as
>>> possible, but nothing else issues or enters execution until all the
>>> instructions of the bundle have retired; 3) everything retires as soon
>>> as possible, and instructions from subsequent bundles issue together but
>>> do not enter execution until their dependencies retire. Thus with
>>> respect to retire semantics Mill=classic VLIW=1, IA64=2, and Mitch=3.
>>>
>> If I understand this, BJX2 would be 2 here.
>>
>> Generally, writeback happens after a fixed dependency, but the next
>> bundle may begin executing as soon as there are no longer any interlock
>> dependencies, which depends on the latency of each instruction. Prior to
>> write-back, there is forwarding.
> <
> With interlocks, one can make "early out" function units, so IDIV can
> special case 1<<k divisors and deliver a result in 3-ish cycles rather than
> 64 DIV {2,3,4} cycles. Or FDIV with a fraction of 0 but a normalized number.
> <
> Notice that a cache is simply an early out memory access !
>>
>> Or, the latency of the longest instruction for which an interlock occurs.
>>
>>
>> AFAIK, some classic VLIW machines don't have interlocks or forwarding,
>> so if one tries to use a result before writeback happens, they will just
>> get a stale value.
> <
> I actually used this semantic in some PDP-11/40 microcode where I was
> microcoding the PDP-11/45 FP operations in the writable control store
> of the PDP-11/40 we had at CMU.
>>
>> For adverse cases, one might need to emit bundles full of NOPs.
>>
>>
>> I didn't really want to go this route in my ISA though, so I felt that
>> interlocks were kind of a "necessary evil" for general usability.
> <
> Interlocks preserve your ability to make other implementations without
> needing specializers (ala Mill).

For me this is one of the key takeaways from studying ISAs of
the past: If you design an ISA for a specific microarchitecture design
(which is *very* common) the ISA will get in the way when you want to
evolve the microarchitecture (e.g. delay slots vs longer and wider
pipelines, VLIW vs wider issue, etc.).

For niche markets (like power optimized DSP) it can make perfect sense,
but to reach a wider market (e.g. to support both narrow in-order and
wide OoO with the same ISA)...

> <
>> Though, if one can organize instructions to minimize triggering
>> interlocks, this is better.
> <
> Yes, have but do not use.
>>
>> ...
>>> However, while Mill retires like a VLIW, it does not issue like one.
>>> Instead of bundles issuing one by one in order, Mill issue is temporally
>>> interleaved (in a statically schedulable way) just as Mill retire is
>>> interleaved. That is, Mill bundles are an encoding notion, not a
>>> temporal notion; that's true for Mitch too, except Mitch's bundle is in
>>> the trace cache and Mill's are in DRAM, and Mitch's use dynamic
>>> scheduling of execution and Mill uses static.
>>>
>>> If Mill didn't have phasing then I too would call it a VLIW. But it does
>>> have phasing, and calling it a VLIW confuses it with true VLIW
>>> architectures like Trimedia, Texas Instruments C64, and Hexagon, which
>>> are very different from Mill in both design and implementation.
>>>
>>>
>> Hmm...

Re: IA-64

<2021May13.183707@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16713&group=comp.arch#16713
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 16:37:07 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 54
Distribution: world
Message-ID: <2021May13.183707@mips.complang.tuwien.ac.at>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at>
Injection-Info: reader02.eternal-september.org; posting-host="9b6fdafab5a0ed867dc02613f2ce9d52";
logging-data="3597"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18n696b+pOBZMD3XDMep9+J"
Cancel-Lock: sha1:Lhk0DaEJeg252uETj3nbsHXS5KU=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 13 May 2021 16:37 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>The idea behind IA-64 was to apply the same approach [shift work to
>the compiler] to instruction-level parallelism. But it did not work
>out, because hardware designers could solve that problem better.

Here's my take why the IA-64 approach did not work out:

* Hardware scheduling works better because:

  * Hardware branch prediction is a lot better than compiler branch
    prediction (and that's the foundation of getting a lot of ILP).

  * Hardware instruction windows are larger than compiler instruction
    windows: the hardware limits are branch mispredicts and the size
    of the reorder buffer (224 instructions on Skylake ff., 256 on
    Zen3, 352 on Ice Lake ff., 630 on M1). The compiler limits are
    compilation units (even with link-time optimization there are
    dynamically linked calls), conditional branch prediction accuracy,
    indirect branches. And you want to avoid the code explosion that
    would result from scheduling across 224-630 instructions and the
    ~40-100 branches in these instructions; admittedly you can be a
    little more selective in a compiler about which instructions to
    reorder, but such a big scheduling window is still impractical.

* The clock rate of IA-64 implementations was low compared to the OoO
  competition. And that's the case for all in-order implementations
  in this century: The step from in-order Bonnell (first generation
  Atom) to OoO Silvermont saw a big clock rate increase, likewise
  SPARC in-order implementations had low clock rate, but once they
  switched to OoO, the clock rates were high, and ARM in-order
  implementations also have lower clock rates than their OoO siblings
  (they are admittedly intended for low-power usage). From
  discussions here it seems to me that the reason is that in in-order
  cores there are feedback loops that affect the whole pipeline (which
  therefore impose a relatively low limit on the clock rate), while in
  OoO cores feedback loops tend to be more local and therefore allow
  higher clock rates.

* Where the IA-64 approach worked well is for simple loops with high
  trip counts, most of which are vectorizable (in principle, not
  necessarily auto-vectorizable). But the SIMD instructions that
  architectures grew since the mid-1990s provided a way to deal with
  these loops with less hardware (wider functional units rather than
  wider machines), so it stole the little thunder that IA-64 had.
  Admittedly auto-vectorization is more hit-and-miss than modulo
  scheduling, but manual vectorization is also an option. The fact
  that SIMD instructions won shows that unsolved (and probably
  unsolvable) problems like reliable auto-vectorization are no barrier
  to success.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
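
The instruction-window point in the post above can be made concrete with
a little arithmetic. The sketch below is only an illustration: the
reorder-buffer sizes are the ones quoted in the post, but the branch
density (one branch per 6 instructions, chosen because it roughly
reproduces the "~40-100 branches" estimate) and the two prediction
accuracies are assumptions, and branch outcomes are treated as
independent, which real code does not guarantee.

```python
# Reorder-buffer sizes quoted in the post above.
ROB_SIZES = {"Skylake": 224, "Zen3": 256, "Ice Lake": 352, "M1": 630}

# Assumption for illustration only: one branch per 6 instructions,
# which roughly matches the "~40-100 branches" estimate in the post.
INSTRUCTIONS_PER_BRANCH = 6


def branches_in_window(window_size, per_branch=INSTRUCTIONS_PER_BRANCH):
    """Approximate number of branches inside a window of instructions."""
    return window_size // per_branch


def full_window_probability(window_size, accuracy):
    """Chance that every branch in the window is predicted correctly,
    treating the predictions as independent (a simplification)."""
    return accuracy ** branches_in_window(window_size)


if __name__ == "__main__":
    for name, size in ROB_SIZES.items():
        n = branches_in_window(size)
        # 0.90 and 0.99 are illustrative accuracies, not measurements.
        for acc in (0.90, 0.99):
            p = full_window_probability(size, acc)
            print(f"{name:9s} window={size:3d} branches={n:3d} "
                  f"accuracy={acc:.2f} P(no mispredict)={p:.3f}")
```

Under these assumptions a window only stays useful across dozens of
branches when the per-branch accuracy is very high, which is the sense
in which prediction accuracy limits the scheduling window.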

Re: IA-64

<jwvo8dea6sj.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=16714&group=comp.arch#16714

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 13:25:11 -0400
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <jwvo8dea6sj.fsf-monnier+comp.arch@gnu.org>
References: <s7gtk1$csi$1@dont-email.me>
<memo.20210512183631.13980N@jgd.cix.co.uk>
<s7h71q$76h$1@dont-email.me> <s7hcrm$sp1$1@gal.iecc.com>
<jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
<s7j691$siv$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="9f7934e4fb7540004edc4826411742e5";
logging-data="15375"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18e5nH3LyLqQS+PjM0+hyzT"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:8G5Gq46h+nxuWZ5dqiiV833J/F0=
sha1:JuXIfrd7KKArX7/ZtwPu+MuYzog=
 by: Stefan Monnier - Thu, 13 May 2021 17:25 UTC

> Anyway, at that first inspection it seemed really nice for us asm
> programmers, i.e. I could easily see how I could abuse the huge register set
> to implement RSA style 512-1024 bit operations at high speed.

Interesting. As a developer of software development tools (i.e. where
`gcc` is arguably the part of SPEC that best matches my world),
I instead was wondering how we would ever be able to make efficient use
of such a huge register file for code traversing pointer-heavy
tree data structures.

> There also seemed to be some nice synergies between some of the parts,

Do you remember specific cases?

Stefan

Re: IA-64

<memo.20210513193435.13980X@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=16718&group=comp.arch#16718

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 19:34 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 59
Message-ID: <memo.20210513193435.13980X@jgd.cix.co.uk>
References: <2021May13.171529@mips.complang.tuwien.ac.at>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="f074d70d39e3275258b61e5059dc910f";
logging-data="27363"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1970o8NYfKUDHM+ljWn0wGtyw0DcLm5WRA="
Cancel-Lock: sha1:1CoXn5mbxSAaWW55vp/nLI6psf8=
 by: John Dallman - Thu, 13 May 2021 18:34 UTC

In article <2021May13.171529@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> >* Excessive use of predicate registers.
> Not sure about excessive. Given the premise of compiler-controlled
> instruction-level parallelism, using predication is a good idea.

Indeed, but having 64 of them (IIRC) seems like far too many.

> >* Excessive use of register windowing.
> Is there any use that you would not call excessive? It seems to me
> that you either use it or not. Again, given the premise of
> compiler-controlled ILP, it seems like the way to go for high
> performance.

Badly expressed: complicated register windowing, requiring a complex and
hidden Register Stack Engine, undermining the goal of having execution
under the compiler's control. This also led to the designers forgetting
that the floating-point registers did not have register windowing,
creating the advance-load floating-point fiasco, which wrecked the
performance of non-leaf functions that used floating point.

> >* Modulo-scheduled loops.
> Modulo scheduling is pretty close to optimal for large trip counts.
> The problem is that it only works for simple loops, and not all
> loops have large trip counts. You can if-convert some loops with
> ifs into simple loops, but you get away from being close to optimal
> with that.
>
> McKinley (Itanium 2) did pretty well on SPECfp, and my guess is
> that's because SPECfp spends much of its time in simple loops.
>
> The ISA additions in IA-64 for modulo scheduling (rotating register
> files, some of the predication stuff) did not look well-balanced to
> me.

Yup. Itanium displayed a major fallacy that Intel suffered from in the
1990s. I call it the supercomputing fallacy. The fastest computers in the
world were supercomputers with ISAs designed for specialised types of
computation. To make general-purpose computers faster, it was seen as
necessary to imitate those ISAs and rely on software developers to
transform their software to use them. This neglected the vast differences
between classical HPC code and more general-purpose software, but it did
provide someone to blame.

> They also assumed that OoO execution could not be implemented
> competitively, and the failure to foresee that it could be done was
> the real killer.

Indeed. The scuttlebutt from inside Intel hinted at lots of people who'd
been away from design for a few years wanting to go back to it so that
their prints would be on Itanium. They certainly started out absolutely
certain it would be a huge success, with a very large team.

In contrast, AMD knew that if they didn't get AMD64 right they were
history. That concentrates the mind, and reduces the clamour to get
onboard.

John
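
The if-conversion mentioned in the quoted text above can be sketched as
follows. This is a toy illustration rather than anything from the
thread: the function names are made up, and since Python has no
predicate registers, multiplying by a 0/1 value stands in for
predicated execution. The point it shows is that after if-conversion
the work of both arms is done on every iteration, which is one reason
the resulting schedule moves away from optimal.

```python
def abs_sum_branchy(xs):
    """Loop with a data-dependent branch in the body."""
    total = 0
    for x in xs:
        if x < 0:
            total += -x
        else:
            total += x
    return total


def abs_sum_if_converted(xs):
    """Same loop after if-conversion: both arms are computed on every
    iteration and a predicate selects which result is kept."""
    total = 0
    for x in xs:
        p = int(x < 0)      # predicate: 1 if negative, else 0
        neg_arm = -x        # 'then' arm, always computed
        pos_arm = x         # 'else' arm, always computed
        total += p * neg_arm + (1 - p) * pos_arm
    return total
```

Both functions return the same value for any list of integers; only
the amount of work per iteration differs.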

Re: IA-64

<s7jsp2$stk$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=16719&group=comp.arch#16719

Path: i2pn2.org!i2pn.org!aioe.org!+9JlleTFc3MOERf2LU/SVA.user.gioia.aioe.org.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 20:56:36 +0200
Organization: Aioe.org NNTP Server
Lines: 28
Message-ID: <s7jsp2$stk$1@gioia.aioe.org>
References: <s7gtk1$csi$1@dont-email.me>
<memo.20210512183631.13980N@jgd.cix.co.uk> <s7h71q$76h$1@dont-email.me>
<s7hcrm$sp1$1@gal.iecc.com> <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
<s7j691$siv$1@gioia.aioe.org> <jwvo8dea6sj.fsf-monnier+comp.arch@gnu.org>
NNTP-Posting-Host: +9JlleTFc3MOERf2LU/SVA.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.7
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 13 May 2021 18:56 UTC

Stefan Monnier wrote:
>> Anyway, at that first inspection it seemed really nice for us asm
>> programmers, i.e. I could easily see how I could abuse the huge register set
>> to implement RSA style 512-1024 bit operations at high speed.
>
> Interesting. As a developer of software development tools (i.e. where
> `gcc` is arguably the part of SPEC that best matches my world),
> I instead was wondering how we would ever be able to make efficient use
> of such a huge register file for code traversing pointer-heavy
> tree data structures.
>
>> There also seemed to be some nice synergies between some of the parts,
>
> Do you remember specific cases?

Not the details, just that I thought I could implement a
super-accumulator using a register array, dynamically addressing the
right part. This seems like it would have required sw control of the
rotating register windows, so I'm probably missing something. :-(

I did write some pseudo-asm for RSA-type 0.5 to 2 Kbit unsigned integer
arithmetic.

Terje
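
For readers unfamiliar with the RSA-type arithmetic mentioned above, a
minimal sketch of multi-limb addition follows. It is not Terje's
pseudo-asm, which is not shown in the thread; the 64-bit limb size and
all function names are assumptions chosen for illustration. A 1024-bit
operand is 16 such limbs, so two operands and a result come to 48
limbs, which is the kind of working set that fits in IA-64's 128
general registers.

```python
LIMB_BITS = 64                      # assumed limb width
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value, n_limbs):
    """Split a non-negative integer into little-endian 64-bit limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Add two equal-length limb arrays, rippling the carry upward.
    Returns (sum_limbs, carry_out)."""
    assert len(a) == len(b)
    out = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry
        out.append(total & LIMB_MASK)   # low 64 bits stay in this limb
        carry = total >> LIMB_BITS      # 0 or 1 goes to the next limb
    return out, carry
```

In assembly each loop iteration would be one add-with-carry on a pair
of registers; with every limb held in a register the whole operation
needs no loads or stores.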

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: IA-64

<s7ju3h$e3o$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16722&group=comp.arch#16722

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 14:18:01 -0500
Organization: A noiseless patient Spider
Lines: 81
Message-ID: <s7ju3h$e3o$1@dont-email.me>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
<memo.20210512230643.13980R@jgd.cix.co.uk>
<s7ij30$q3s$1@newsreader4.netcologne.de>
<2021May13.175652@mips.complang.tuwien.ac.at>
<2021May13.183707@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 13 May 2021 19:19:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a19b083030be3b0ecf0547aacfd41d40";
logging-data="14456"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX193VIxVx+Hq+OcndFXE1l2r"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:rCwXMc2w6Et+NCv44bxUx5gKmqY=
In-Reply-To: <2021May13.183707@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: BGB - Thu, 13 May 2021 19:18 UTC

On 5/13/2021 11:37 AM, Anton Ertl wrote:
> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> The idea behind IA-64 was to apply the same approach [shift work to
>> the compiler] to instruction-level parallelism. But it did not work
>> out, because hardware designers could solve that problem better.
>
> Here's my take why the IA-64 approach did not work out:
>
> * Hardware scheduling works better because:
>
>   * Hardware branch prediction is a lot better than compiler branch
>     prediction (and that's the foundation of getting a lot of ILP).
>
>   * Hardware instruction windows are larger than compiler instruction
>     windows: the hardware limits are branch mispredicts and the size
>     of the reorder buffer (224 instructions on Skylake ff., 256 on
>     Zen3, 352 on Ice Lake ff., 630 on M1). The compiler limits are
>     compilation units (even with link-time optimization there are
>     dynamically linked calls), conditional branch prediction accuracy,
>     indirect branches. And you want to avoid the code explosion that
>     would result from scheduling across 224-630 instructions and the
>     ~40-100 branches in these instructions; admittedly you can be a
>     little more selective in a compiler about which instructions to
>     reorder, but such a big scheduling window is still impractical.
>
> * The clock rate of IA-64 implementations was low compared to the OoO
> competition. And that's the case for all in-order implementations
> in this century: The step from in-order Bonnell (first generation
> Atom) to OoO Silvermont saw a big clock rate increase, likewise
> SPARC in-order implementations had low clock rate, but once they
> switched to OoO, the clock rates were high, and ARM in-order
> implementations also have lower clock rates than their OoO siblings
> (they are admittedly intended for low-power usage). From
> discussions here it seems to me that the reason is that in in-order
> cores there are feedback loops that affect the whole pipeline (which
> therefore impose a relatively low limit on the clock rate), while in
> OoO cores feedback loops tend to be more local and therefore allow
> higher clock rates.
>

Kinda wondering if this is related to an observation in my own project:
To keep the pipeline consistent, there is a big "Hold" signal that
basically stops everything connected to the pipeline from moving.

Things related to this hold signal are a big source of timing issues,
and pretty much anything dependent on this signal has a harder time with
timing (particularly if it is used on inputs to a stage).

The presence of some big global hold/stall signal seems to be implicit
in the design of an in-order pipeline.

Meanwhile, parts of the core which operate independently of this signal
can seemingly do more work without failing timing as easily.

If an OoO has all of the parts of the core operating independently,
without a global stall, it is possible this makes timing easier, and
allows for higher clock speeds?...

E.g., in an effort to reduce timing issues, I have ended up needing to
make FADD/FSUB/FMUL 1 cycle longer (now 7 cycles), but with the gain
that these units no longer need to respect the Hold signal. This does
not affect SIMD ops, which had already ignored the global Hold signal.

> * Where the IA-64 approach worked well is for simple loops with high
> trip counts, most of which are vectorizable (in principle, not
> necessarily auto-vectorizable). But the SIMD instructions that
> architectures grew since the mid-1990s provided a way to deal with
> these loops with less hardware (wider functional units rather than
> wider machines), so it stole the little thunder that IA-64 had.
> Admittedly auto-vectorization is more hit-and-miss than modulo
> scheduling, but manual vectorization is also an option. The fact
> that SIMD instructions won shows that unsolved (and probably
> unsolvable) problems like reliable auto-vectorization are no barrier
> to success.
>
> - anton
>

Re: IA-64

<HDfnI.351001$2A5.310022@fx45.iad>

https://www.novabbs.com/devel/article-flat.php?id=16724&group=comp.arch#16724

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!npeer.as286.net!npeer-ng0.as286.net!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx45.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: IA-64
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at> <2021May13.183707@mips.complang.tuwien.ac.at> <s7ju3h$e3o$1@dont-email.me>
In-Reply-To: <s7ju3h$e3o$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 85
Message-ID: <HDfnI.351001$2A5.310022@fx45.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Thu, 13 May 2021 20:04:55 UTC
Date: Thu, 13 May 2021 16:04:20 -0400
X-Received-Bytes: 5096
 by: EricP - Thu, 13 May 2021 20:04 UTC

BGB wrote:
> On 5/13/2021 11:37 AM, Anton Ertl wrote:
>> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> The idea behind IA-64 was to apply the same approach [shift work to
>>> the compiler] to instruction-level parallelism. But it did not work
>>> out, because hardware designers could solve that problem better.
>>
>> Here's my take why the IA-64 approach did not work out:
>>
>> * Hardware scheduling works better because:
>>
>>   * Hardware branch prediction is a lot better than compiler branch
>>     prediction (and that's the foundation of getting a lot of ILP).
>>
>>   * Hardware instruction windows are larger than compiler instruction
>>     windows: the hardware limits are branch mispredicts and the size
>>     of the reorder buffer (224 instructions on Skylake ff., 256 on
>>     Zen3, 352 on Ice Lake ff., 630 on M1). The compiler limits are
>>     compilation units (even with link-time optimization there are
>>     dynamically linked calls), conditional branch prediction accuracy,
>>     indirect branches. And you want to avoid the code explosion that
>>     would result from scheduling across 224-630 instructions and the
>>     ~40-100 branches in these instructions; admittedly you can be a
>>     little more selective in a compiler about which instructions to
>>     reorder, but such a big scheduling window is still impractical.
>>
>> * The clock rate of IA-64 implementations was low compared to the OoO
>> competition. And that's the case for all in-order implementations
>> in this century: The step from in-order Bonnell (first generation
>> Atom) to OoO Silvermont saw a big clock rate increase, likewise
>> SPARC in-order implementations had low clock rate, but once they
>> switched to OoO, the clock rates were high, and ARM in-order
>> implementations also have lower clock rates than their OoO siblings
>> (they are admittedly intended for low-power usage). From
>> discussions here it seems to me that the reason is that in in-order
>> cores there are feedback loops that affect the whole pipeline (which
>> therefore impose a relatively low limit on the clock rate), while in
>> OoO cores feedback loops tend to be more local and therefore allow
>> higher clock rates.
>>
>
> Kinda wondering if this is related to an observation in my own project:
> To keep the pipeline consistent, there is a big "Hold" signal that
> basically stops everything connected to the pipeline from moving.
>
> Things related to this hold signal are a big source of timing issues,
> and pretty much anything dependent on this signal has a harder time with
> timing (particularly if it is used on inputs to a stage).
>
> The presence of some big global hold/stall signal seems to be implicit
> in the design of an in-order pipeline.
>
>
> Meanwhile, parts of the core which operate independently of this signal
> can seemingly do more work without failing timing as easily.
>
> If an OoO has all of the parts of the core operating independently,
> without a global stall, it is possible this makes timing easier, and
> allows for higher clock speeds?...
>
>
> E.g., in an effort to reduce timing issues, I have ended up needing to
> make FADD/FSUB/FMUL 1 cycle longer (now 7 cycles), but with the gain
> that these units no longer need to respect the Hold signal. This does
> not affect SIMD ops, which had already ignored the global Hold signal.

There is an alternative to a "global pipeline stall" signal whereby the
stall signal propagates backwards stage by stage through the pipeline.
It has the advantages that each stage makes its own
stall decision locally so no global propagation delay,
and pipeline bubbles automatically fill themselves in
so stages only stall when they have to.

This design uses latches between stages:

Synchronous Interlocked Pipelines, 2002
https://faculty.cs.byu.edu/~egm/papers/async_2002.pdf

This one uses edge-triggered flip-flops:

A New Synchronous circuit for Elastic Pipeline Architecture, 2015
http://dept.ru.ac.bd/ic4me2/2015/proceedings/pdfs/107.pdf
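
The stage-by-stage stall described above can be sketched as a small
cycle-level simulation. This is a behavioural toy under stated
assumptions, not the circuit from either paper: each stage is given two
storage slots (a stand-in for the extra latch or skid storage such
designs need), and the names step, run and CAPACITY are invented for
the example. What it shows is that every stage looks only at its own
occupancy and at its immediate neighbour's state from the start of the
cycle, so there is no pipeline-wide hold signal.

```python
from collections import deque

CAPACITY = 2  # assumed slots per stage; the second slot absorbs a late stall


def step(stages, source, sink, sink_ready):
    """Advance the pipeline by one clock cycle.

    All decisions use state sampled at the start of the cycle, so a
    stage never depends on what a distant stage does in the same cycle.
    """
    n = len(stages)
    ready = [len(s) < CAPACITY for s in stages]  # stage can accept an item
    valid = [len(s) > 0 for s in stages]         # stage has an item to send
    for i in reversed(range(n)):
        downstream_ready = sink_ready if i == n - 1 else ready[i + 1]
        if valid[i] and downstream_ready:
            item = stages[i].popleft()           # oldest item moves on
            if i == n - 1:
                sink.append(item)
            else:
                stages[i + 1].append(item)
    if source and ready[0]:
        stages[0].append(source.popleft())


def run(items, n_stages, sink_ready_pattern):
    """Run one cycle per entry of sink_ready_pattern; return the drained
    items and the per-cycle occupancy of every stage."""
    stages = [deque() for _ in range(n_stages)]
    source, sink = deque(items), []
    occupancy_log = []
    for sink_ready in sink_ready_pattern:
        step(stages, source, sink, sink_ready)
        occupancy_log.append([len(s) for s in stages])
    return sink, occupancy_log
```

With three stages, six items, and the sink stalled for four cycles
after the pipeline has filled, the occupancy goes [1, 1, 2], then
[1, 2, 2], then [2, 2, 2]: the stall moves back one stage per cycle,
and items still leave in their original order once the sink resumes.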

Re: IA-64

<s7k49r$6vb$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16726&group=comp.arch#16726

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Thu, 13 May 2021 14:04:59 -0700
Organization: A noiseless patient Spider
Lines: 72
Message-ID: <s7k49r$6vb$1@dont-email.me>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
<memo.20210512230643.13980R@jgd.cix.co.uk>
<s7ij30$q3s$1@newsreader4.netcologne.de>
<2021May13.175652@mips.complang.tuwien.ac.at>
<2021May13.183707@mips.complang.tuwien.ac.at> <s7ju3h$e3o$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 13 May 2021 21:04:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="06fcc820a36a296d009e14bf251c0493";
logging-data="7147"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/lz++rDwfZzKlaGLwIZ3qr"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:z0f/zu+Q9iqk7xFkuNuk/pvHC2c=
In-Reply-To: <s7ju3h$e3o$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Thu, 13 May 2021 21:04 UTC

On 5/13/2021 12:18 PM, BGB wrote:
> On 5/13/2021 11:37 AM, Anton Ertl wrote:
>> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> The idea behind IA-64 was to apply the same approach [shift work to
>>> the compiler] to instruction-level parallelism.  But it did not work
>>> out, because hardware designers could solve that problem better.
>>
>> Here's my take why the IA-64 approach did not work out:
>>
>> * Hardware scheduling works better because:
>>
>>    * Hardware branch prediction is a lot better than compiler branch
>>      prediction (and that's the foundation of getting a lot of ILP).
>>
>>    * Hardware instruction windows are larger than compiler instruction
>>      windows: the hardware limits are branch mispredicts and the size
>>      of the reorder buffer (224 instructions on Skylake ff., 256 on
>>      Zen3, 352 on Ice Lake ff., 630 on M1).  The compiler limits are
>>      compilation units (even with link-time optimization there are
>>      dynamically linked calls), conditional branch prediction accuracy,
>>      indirect branches.  And you want to avoid the code explosion that
>>      would result from scheduling across 224-630 instructions and the
>>      ~40-100 branches in these instructions; admittedly you can be a
>>      little more selective in a compiler about which instructions to
>>      reorder, but such a big scheduling window is still impractical.
>>
>> * The clock rate of IA-64 implementations was low compared to the OoO
>>    competition.  And that's the case for all in-order implementations
>>    in this century: The step from in-order Bonnell (first generation
>>    Atom) to OoO Silvermont saw a big clock rate increase, likewise
>>    SPARC in-order implementations had low clock rate, but once they
>>    switched to OoO, the clock rates were high, and ARM in-order
>>    implementations also have lower clock rates than their OoO siblings
>>    (they are admittedly intended for low-power usage).  From
>>    discussions here it seems to me that the reason is that in in-order
>>    cores there are feedback loops that affect the whole pipeline (which
>>    therefore impose a relatively low limit on the clock rate), while in
>>    OoO cores feedback loops tend to be more local and therefore allow
>>    higher clock rates.
>>
>
> Kinda wondering if this is related to an observation in my own project:
> To keep the pipeline consistent, there is a big "Hold" signal that
> basically stops everything connected to the pipeline from moving.
>
> Things related to this hold signal are a big source of timing issues,
> and pretty much anything dependent on this signal has a harder time with
> timing (particularly if it is used on inputs to a stage).
>
> The presence of some big global hold/stall signal seems to be implicit
> in the design of an in-order pipeline.

No, it's not implicit; saying you must choose OOO or single-issue is a
false dichotomy.

The key is exception and miss handling - yes, if these must be stalled
then you are screwed. That's why Mill is designed to execute ahead into
a putative stall condition by recording these conditions in in-flight
metadata, the NaR flag. Execution can run two (all that's needed in in
present implementation) full cycles into skid buffers and then replay
those buffers with the same timing as if the stall hadn't happened.
That's result replay, in contrast to the usual wisdom that discards the
pipe and does issue replay.

Result replay does require that skids do not make any permanent change
to persistent state. Of course, OOOs better not make any such change
either; it's called Spectre, and Mill with result replay is immune. The
skid buffers are not a separate piece of hardware; they are the same FU
and spiller bypass latches that hold data normally, and run-out is just
a matter of keeping two cycles' worth of history of the logical->physical
belt numbering.

Re: IA-64

<cf01a60b-65eb-4bd2-a257-0a40e805ab91n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16727&group=comp.arch#16727

X-Received: by 2002:ac8:57c5:: with SMTP id w5mr34166402qta.166.1620943778328;
Thu, 13 May 2021 15:09:38 -0700 (PDT)
X-Received: by 2002:a05:6808:3a3:: with SMTP id n3mr28403388oie.157.1620943778129;
Thu, 13 May 2021 15:09:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 13 May 2021 15:09:37 -0700 (PDT)
In-Reply-To: <s7ju3h$e3o$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk>
<s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at>
<2021May13.183707@mips.complang.tuwien.ac.at> <s7ju3h$e3o$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <cf01a60b-65eb-4bd2-a257-0a40e805ab91n@googlegroups.com>
Subject: Re: IA-64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 13 May 2021 22:09:38 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 13 May 2021 22:09 UTC

On Thursday, May 13, 2021 at 2:19:15 PM UTC-5, BGB wrote:
> On 5/13/2021 11:37 AM, Anton Ertl wrote:
> > an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >> The idea behind IA-64 was to apply the same approach [shift work to
> >> the compiler] to instruction-level parallelism. But it did not work
> >> out, because hardware designers could solve that problem better.
> >
> > Here's my take why the IA-64 approach did not work out:
> >
> > * Hardware scheduling works better because:
> >
> >   * Hardware branch prediction is a lot better than compiler branch
> >     prediction (and that's the foundation of getting a lot of ILP).
> >
> >   * Hardware instruction windows are larger than compiler instruction
> >     windows: the hardware limits are branch mispredicts and the size
> >     of the reorder buffer (224 instructions on Skylake ff., 256 on
> >     Zen3, 352 on Ice Lake ff., 630 on M1). The compiler limits are
> >     compilation units (even with link-time optimization there are
> >     dynamically linked calls), conditional branch prediction accuracy,
> >     indirect branches. And you want to avoid the code explosion that
> >     would result from scheduling across 224-630 instructions and the
> >     ~40-100 branches in these instructions; admittedly you can be a
> >     little more selective in a compiler about which instructions to
> >     reorder, but such a big scheduling window is still impractical.
> >
> > * The clock rate of IA-64 implementations was low compared to the OoO
> > competition. And that's the case for all in-order implementations
> > in this century: The step from in-order Bonnell (first generation
> > Atom) to OoO Silvermont saw a big clock rate increase, likewise
> > SPARC in-order implementations had low clock rate, but once they
> > switched to OoO, the clock rates were high, and ARM in-order
> > implementations also have lower clock rates than their OoO siblings
> > (they are admittedly intended for low-power usage). From
> > discussions here it seems to me that the reason is that in in-order
> > cores there are feedback loops that affect the whole pipeline (which
> > therefore impose a relatively low limit on the clock rate), while in
> > OoO cores feedback loops tend to be more local and therefore allow
> > higher clock rates.
> >
> Kinda wondering if this is related to an observation in my own project:
> To keep the pipeline consistent, there is a big "Hold" signal that
> basically stops everything connected to the pipeline from moving.
>
> Things related to this hold signal are a big source of timing issues,
> and pretty much anything dependent on this signal has a harder time with
> timing (particularly if it is used on inputs to a stage).
<
In one of my designs, we did not prevent the pipeline flip-flops from flopping;
instead, we added a 3rd latch 1/2 cycle later, so that if we advanced when we
should not have, we had the value to cycle back and make it appear that the
pipeline had not advanced. This gives that big global "hold" signal another
1/2 cycle to be processed, buffered, and wired to where it needs to be.
>
> The presence of some big global hold/stall signal seems to be implicit
> in the design of an in-order pipeline.
>
>
> Meanwhile, parts of the core which operate independently of this signal
> can seemingly do more work without failing timing as easily.
>
> If an OoO has all of the parts of the core operating independently,
> without a global stall, it is possible this makes timing easier, and
> allows for higher clock speeds?...
>
>
> E.g., in an effort to reduce timing issues, I have ended up needing to
> make FADD/FSUB/FMUL 1 cycle longer (now 7 cycles), but with the gain
> that these units no longer need to respect the Hold signal. This does
> not affect SIMD ops, which had already ignored the global Hold signal.
> > * Where the IA-64 approach worked well is for simple loops with high
> > trip counts, most of which are vectorizable (in principle, not
> > necessarily auto-vectorizable). But the SIMD instructions that
> > architectures grew since the mid-1990s provided a way to deal with
> > these loops with less hardware (wider functional units rather than
> > wider machines), so it stole the little thunder that IA-64 had.
> > Admittedly auto-vectorization is more hit-and-miss than modulo
> > scheduling, but manual vectorization is also an option. The fact
> > that SIMD instructions won shows that unsolved (and probably
> > unsolvable) problems like reliable auto-vectorization are no barrier
> > to success.
> >
> > - anton
> >
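
The extra-latch approach described in the reply above can be modelled,
very loosely, as a register that always captures its input but keeps
the previous contents so that a late hold can undo the capture. This is
a behavioural guess at the idea, not the actual circuit: the class and
method names are hypothetical, and the half-cycle timing of the third
latch is not represented at all.

```python
class ReplayRegister:
    """Pipeline register that always advances, with a shadow copy of
    the previous value so a late 'hold' can roll the advance back."""

    def __init__(self, initial=None):
        self.value = initial
        self.shadow = initial

    def clock(self, new_value):
        # Always capture the input; remember what was here before.
        self.shadow = self.value
        self.value = new_value

    def late_hold(self):
        # The hold arrived after the clock edge: restore the old value
        # so the stage looks as if it never advanced.
        self.value = self.shadow
```

In this model the upstream stage would still have to present the
overwritten input again on the next cycle; the sketch only covers the
roll-back half of the mechanism.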

Re: IA-64

<17590a99-9103-4564-968f-4715c5cdcd05n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16728&group=comp.arch#16728

X-Received: by 2002:a05:6214:20e7:: with SMTP id 7mr42656755qvk.36.1620943869273;
Thu, 13 May 2021 15:11:09 -0700 (PDT)
X-Received: by 2002:a4a:e512:: with SMTP id r18mr33650749oot.40.1620943869061;
Thu, 13 May 2021 15:11:09 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 13 May 2021 15:11:08 -0700 (PDT)
In-Reply-To: <HDfnI.351001$2A5.310022@fx45.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk>
<s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at>
<2021May13.183707@mips.complang.tuwien.ac.at> <s7ju3h$e3o$1@dont-email.me> <HDfnI.351001$2A5.310022@fx45.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <17590a99-9103-4564-968f-4715c5cdcd05n@googlegroups.com>
Subject: Re: IA-64
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 13 May 2021 22:11:09 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 13 May 2021 22:11 UTC

On Thursday, May 13, 2021 at 3:04:58 PM UTC-5, EricP wrote:

>
> This design uses latches between stages:
>
> Synchronous Interlocked Pipelines, 2002
> https://faculty.cs.byu.edu/~egm/papers/async_2002.pdf
>
> This one uses edge triggered FF:
>
> A New Synchronous circuit for Elastic Pipeline Architecture, 2015
> http://dept.ru.ac.bd/ic4me2/2015/proceedings/pdfs/107.pdf
<
Thanks EricP

Re: IA-64

<0a51413d-6995-435c-82c1-f8045a4fcbb4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16729&group=comp.arch#16729

X-Received: by 2002:a0c:dc08:: with SMTP id s8mr43439968qvk.12.1620945615208; Thu, 13 May 2021 15:40:15 -0700 (PDT)
X-Received: by 2002:a9d:5c11:: with SMTP id o17mr36819859otk.178.1620945614964; Thu, 13 May 2021 15:40:14 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!tr2.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 13 May 2021 15:40:14 -0700 (PDT)
In-Reply-To: <s7jinq$aj6$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:f8e3:d700:89d2:dbb3:1aa2:bbad; posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:f8e3:d700:89d2:dbb3:1aa2:bbad
References: <s7gtk1$csi$1@dont-email.me> <memo.20210512183631.13980N@jgd.cix.co.uk> <s7h71q$76h$1@dont-email.me> <s7hcrm$sp1$1@gal.iecc.com> <52760db1-07bc-4c94-82b7-bbb81000ddd7n@googlegroups.com> <s7hoa0$kad$1@dont-email.me> <s7hqs9$2al$1@dont-email.me> <98b79f68-edc2-48eb-a627-9c9d7548699dn@googlegroups.com> <s7jinq$aj6$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0a51413d-6995-435c-82c1-f8045a4fcbb4n@googlegroups.com>
Subject: Re: IA-64
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Thu, 13 May 2021 22:40:15 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 12
 by: Quadibloc - Thu, 13 May 2021 22:40 UTC

On Thursday, May 13, 2021 at 10:05:17 AM UTC-6, Marcus wrote:

> For me this is one of the key takeaways when I've studied ISAs from
> the past: If you design an ISA for a specific microarchitecture design
> (which is *very* common) the ISA will get in the way when you want to
> evolve the microarchitecture (e.g. delay slots vs longer and wider
> pipelines, VLIW vs wider issue, etc etc).

Yes, and I know that when the Itanium was first introduced and I
had the chance to read up on its ISA, that was one thing that I felt
was wrong with it.

John Savard

Re: IA-64

<f73cf9c4-dde8-4b2d-a239-a76efc297265n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16730&group=comp.arch#16730

X-Received: by 2002:a05:620a:b1b:: with SMTP id t27mr33991842qkg.42.1620945992634;
Thu, 13 May 2021 15:46:32 -0700 (PDT)
X-Received: by 2002:aca:30cc:: with SMTP id w195mr32107107oiw.78.1620945992448;
Thu, 13 May 2021 15:46:32 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 13 May 2021 15:46:32 -0700 (PDT)
In-Reply-To: <memo.20210513193435.13980X@jgd.cix.co.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:f8e3:d700:89d2:dbb3:1aa2:bbad;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:f8e3:d700:89d2:dbb3:1aa2:bbad
References: <2021May13.171529@mips.complang.tuwien.ac.at> <memo.20210513193435.13980X@jgd.cix.co.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f73cf9c4-dde8-4b2d-a239-a76efc297265n@googlegroups.com>
Subject: Re: IA-64
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Thu, 13 May 2021 22:46:32 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Quadibloc - Thu, 13 May 2021 22:46 UTC

On Thursday, May 13, 2021 at 12:34:39 PM UTC-6, John Dallman wrote:

> Yup. Itanium displayed a major fallacy that Intel suffered from in the
> 1990s. I call it the supercomputing fallacy. The fastest computers in the
> world were supercomputers with ISAs designed for specialised types of
> computation. To make general-purpose computers faster, it was seen as
> necessary to imitate those ISAs and rely on software developers to
> transform their software to use them. This neglected the vast differences
> between classical HPC code and more general-purpose software, but it did
> provide someone to blame.

And here I was thinking that since the Pentium Pro and Pentium II were similar
in microarchitecture to the IBM 360/195 - cache plus an OoO floating-point unit
plus a fast division algorithm - the only thing left (besides making the integer
unit OoO too) would be going to the only faster architecture ever tried...

the vector architecture of the Cray-1 and its successors.

As currently exemplified by the NEC SX-Aurora TSUBASA and no other current
computer, although _perhaps_ the scalable vector stuff in the latest Arm spec
comes close.

But that is a chimera and a fallacy!

John Savard

Re: IA-64 and other parallel failures

<s7kk42$2sku$1@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=16732&group=comp.arch#16732

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!usenet.csail.mit.edu!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: IA-64 and other parallel failures
Date: Fri, 14 May 2021 01:34:58 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <s7kk42$2sku$1@gal.iecc.com>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at>
Injection-Date: Fri, 14 May 2021 01:34:58 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="94878"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Fri, 14 May 2021 01:34 UTC

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
>The idea behind IA-64 was to apply the same approach to
>instruction-level parallelism. But it did not work out, because
>hardware designers could solve that problem better.

Right. It turned out that even with a very clever compiler there is a lot of
stuff you cannot schedule until runtime because it depends on the data.

I can think of a bunch of attempts to make a fast processor that exposed internal
parallel features so software could take advantage of them, and they always ended up
losing to designs that hid the parallelism in a nominally sequential instruction set
and scheduled in hardware on the fly.

The Intel i860 was a fast RISC-y chip in the late 1980s, same
generation as the 486. They had the bright idea to expose the floating
point pipeline so you could do a pipelined floating instruction that
ran in one cycle and advanced the three stage pipeline by one stage
and returned the result from three instructions back. No compiler ever
could deal with that so I think there were assembler routines for
things like dot product, but nothing else used it.
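
To make the explicitly advanced pipeline concrete, here is a toy Python model (a sketch only, not i860 code). Each pipelined issue starts a new operation and hands back whatever was started three issues earlier, so software has to issue dummy operations to drain the last results. The names ExplicitPipeline and pipelined_products are made up for illustration, and the model ignores everything else about the chip.

```python
from collections import deque
from operator import mul


class ExplicitPipeline:
    """Toy FP pipeline that only advances when an instruction is issued."""

    def __init__(self, depth=3):
        # One slot per stage; None marks an empty stage.
        self.stages = deque([None] * depth)

    def issue(self, op, a, b):
        # Start a new operation and return the one started `depth` issues ago.
        self.stages.appendleft(op(a, b))
        return self.stages.pop()


def pipelined_products(xs, ys, depth=3):
    """Multiply element-wise, collecting results as they leave the pipeline."""
    pipe = ExplicitPipeline(depth)
    out = []
    for x, y in zip(xs, ys):
        r = pipe.issue(mul, x, y)
        if r is not None:      # nothing comes out during the first `depth` issues
            out.append(r)
    for _ in range(depth):     # dummy issues to drain the remaining results
        r = pipe.issue(mul, 0, 0)
        if r is not None:
            out.append(r)
    return out


print(pipelined_products([1, 2, 3, 4], [5, 6, 7, 8]))  # [5, 12, 21, 32]
```

The point of the toy is the bookkeeping: the caller, not the hardware, has to know that the value returned by an issue belongs to an operation started three issues ago.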

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: IA-64 and other parallel failures

<2ba85ecb-4259-4fb2-843e-3870495079b1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16733&group=comp.arch#16733

X-Received: by 2002:a37:9e12:: with SMTP id h18mr41892301qke.483.1620956771312; Thu, 13 May 2021 18:46:11 -0700 (PDT)
X-Received: by 2002:a4a:d442:: with SMTP id p2mr16558818oos.89.1620956771063; Thu, 13 May 2021 18:46:11 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsfeed.xs4all.nl!newsfeed8.news.xs4all.nl!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 13 May 2021 18:46:10 -0700 (PDT)
In-Reply-To: <s7kk42$2sku$1@gal.iecc.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at> <s7kk42$2sku$1@gal.iecc.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2ba85ecb-4259-4fb2-843e-3870495079b1n@googlegroups.com>
Subject: Re: IA-64 and other parallel failures
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 14 May 2021 01:46:11 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 29
 by: MitchAlsup - Fri, 14 May 2021 01:46 UTC

On Thursday, May 13, 2021 at 8:35:01 PM UTC-5, John Levine wrote:
> According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
> >The idea behind IA-64 was to apply the same approach to
> >instruction-level parallelism. But it did not work out, because
> >hardware designers could solve that problem better.
<
> Right. It turned out that even with a very clever compiler there is a lot of
> stuff you cannot schedule until runtime because it depends on the data.
<
One of the obvious things that is hard to schedule is LDs from a cache
line, 7 hits 1 miss 7 more hits 1 miss <repeat ad nauseam>
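
A back-of-the-envelope Python sketch of why that pattern hurts a static schedule. The latencies (3 cycles for a hit, 20 for a miss) are made-up numbers, and the model is deliberately crude: one load at a time, no overlapping misses. If the compiler schedules every load for the hit latency, the machine stalls on each miss; if it schedules for the miss latency, it has to find filler work that most loads never needed.

```python
HIT, MISS = 3, 20  # assumed latencies in cycles, purely illustrative


def load_pattern(lines, hits_per_line=7):
    """Latency of each load: 7 hits then 1 miss, repeated once per cache line."""
    return ([HIT] * hits_per_line + [MISS]) * lines


def stall_cycles(latencies, assumed):
    """Cycles spent waiting when the schedule assumed `assumed` cycles per load."""
    return sum(max(0, lat - assumed) for lat in latencies)


def wasted_slack(latencies, assumed):
    """Cycles of filler work the schedule reserved but the load did not need."""
    return sum(max(0, assumed - lat) for lat in latencies)


pattern = load_pattern(lines=4)
print(stall_cycles(pattern, assumed=HIT))    # 68  (4 misses * 17 cycles)
print(wasted_slack(pattern, assumed=MISS))   # 476 (28 hits * 17 cycles)
```

Hardware that schedules at run time sees the actual latency of each load and avoids having to pick one of the two bad guesses.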
>
> I can think of a bunch of attempts to make a fast processor that exposed internal
> parallel features so software could take advantage of them, and they always ended up
> losing to designs that hid the parallelism in a nominally sequential instruction set
> and scheduled in hardware on the fly.
>
> The Intel i860 was a fast RISC-y chip in the late 1980s, same
> generation as the 486. They had the bright idea to expose the floating
> point pipeline so you could do a pipelined floating instruction that
> ran in one cycle and advanced the three stage pipeline by one stage
> and returned the result from three instructions back. No compiler ever
> could deal with that so I think there were assembler routines for
> things like dot product, but nothing else used it.
<
Yech !
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

Re: IA-64 and other parallel failures

<2021May14.101841@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16737&group=comp.arch#16737

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64 and other parallel failures
Date: Fri, 14 May 2021 08:18:41 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 82
Message-ID: <2021May14.101841@mips.complang.tuwien.ac.at>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at> <s7kk42$2sku$1@gal.iecc.com>
Injection-Info: reader02.eternal-september.org; posting-host="5d5f7c2fd86717a462d647ead1d1489d";
logging-data="29438"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18G5EHFp9VcS2Qla2rdw9B/"
Cancel-Lock: sha1:/yizisfluUzIrqE7hUvka0xvstA=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 14 May 2021 08:18 UTC

John Levine <johnl@taugh.com> writes:
>The Intel i860 was a fast RISC-y chip in the late 1980s, same
>generation as the 486. They had the bright idea to expose the floating
>point pipeline so you could do a pipelined floating instruction that
>ran in one cycle and advanced the three stage pipeline by one stage
>and returned the result from three instructions back. No compiler ever
>could deal with that so I think there were assembler routines for
>things like dot product, but nothing else used it.

Bradlee et al. certainly targeted the i860 in their work, and I dimly
remember that they did something about the explicitly advanced
pipeline.

But this idea works well only where modulo scheduling works
well, and all the other code has to suffer the costs (even if the
compiler can deal with it).

And it pretty much fixes the microarchitecture, which is not a great
long-term idea: If you want to make a superscalar CPU with a second
FPU, it's interesting how you reconcile the appearance of a single
explicitly advanced FPU pipeline with the reality of two FPUs.

And the benefit of this idea? You have lessened the FP register
pressure by two registers in some places, and have avoided FP register
interlocking (but FP delay slots could have achieved that, too).

As for compiling for it, apart from using it in modulo-scheduled loops
(and I leave it to someone else to explore how much that complicates
the modulo scheduler), it seems to be a relatively straightforward
modification of a typical list scheduling algorithm (used for basic
blocks and superblocks); I'll assume forward scheduling in the
following: Schedule ready FP instructions as usual, but make the
results only available after three instructions have been started; if
the scheduler runs out of ready FP instructions and there are still
results in the pipeline, schedule a dummy FP instruction to get the
next result out. The heuristic priority of FP instructions could
benefit from an adjustment that takes the priority of the results that
it advances into account. Seems doable to me, but it apparently was
not worth doing in a production compiler.
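
The list-scheduling modification described above can be sketched roughly as follows, in Python. This is a minimal illustration under stated assumptions, not what any real compiler did: every instruction is a pipelined FP operation, the machine issues one per slot, a result issued at slot i is popped by the instruction at slot i + depth and is usable from the slot after that, and the priority heuristic is simply alphabetical order. The function name and the DUMMY marker are invented for the example.

```python
DUMMY = "<dummy fp op>"


def schedule_explicit_advance(deps, depth=3):
    """Forward list scheduling for a pipeline that advances only on issue.

    deps maps each instruction name to the names whose results it reads.
    """
    issued_at = {}            # instruction -> slot in which it was issued
    order = []
    remaining = set(deps)
    while remaining:
        slot = len(order)
        # Ready = every operand has already been pushed out of the pipeline.
        ready = sorted(
            n for n in remaining
            if all(d in issued_at and issued_at[d] + depth < slot
                   for d in deps[n])
        )
        if ready:
            issued_at[ready[0]] = slot
            remaining.remove(ready[0])
            order.append(ready[0])
        elif any(s + depth >= slot for s in issued_at.values()):
            # Results are still in flight: a dummy op pushes the next one out.
            order.append(DUMMY)
        else:
            raise ValueError("cyclic or unknown dependency")
    if issued_at:
        # Drain the pipeline so the last real result is delivered.
        last = max(issued_at.values())
        while len(order) <= last + depth:
            order.append(DUMMY)
    return order


print(schedule_explicit_advance({"a": [], "b": [], "c": ["a", "b"]}))
```

For that small graph the sketch yields a, b, three dummies, c, and three more dummies to drain the pipeline. A real scheduler would fill those slots with independent FP work where it exists, which is where the priority adjustment mentioned above would come in.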

@InProceedings{bradlee+91pldi,
author = "David G. Bradlee and Robert R. Henry and Susan J. Eggers",
title = "The {Marion} System for Retargetable Instruction Scheduling",
booktitle = "SIGPLAN '91 Conference on
Programming Language Design and Implementation",
year = "1991",
pages = "229--240",
address = "Toronto",
OPTjournal = sigplan,
OPTvolume = "26",
OPTnumber = "6",
OPTmonth = jun,
annote = "A back end generator for RISCs, consisting of simple
instruction selection and several strategies for the
combination of register allocation and instruction
scheduling (Postpass, IPS, RASE). The machine
description describes the resources needed by an
instruction and contains means to describe
explicitly advanced pipelines (i860), too. Machine
descriptions were developed for the 88100, the R2000
and the i860. The quality of the generated code is
between the MIPS compiler's -O1 and -O2 levels."
}

@Proceedings{sigplan91,
key = "SIGPLAN~'91",
booktitle = "SIGPLAN~'91 Conference on
Programming Language Design and Implementation",
title = "SIGPLAN~'91 Conference on
Programming Language Design and Implementation",
year = "1991",
OPTaddress = "Toronto",
OPTjournal = sigplan,
OPTvolume = "26",
OPTnumber = "6",
OPTmonth = jun,
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64

<2021May14.105737@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16738&group=comp.arch#16738

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Fri, 14 May 2021 08:57:37 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 46
Message-ID: <2021May14.105737@mips.complang.tuwien.ac.at>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org> <memo.20210512230643.13980R@jgd.cix.co.uk> <s7ij30$q3s$1@newsreader4.netcologne.de> <2021May13.175652@mips.complang.tuwien.ac.at> <2021May13.183707@mips.complang.tuwien.ac.at> <s7ju3h$e3o$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="5d5f7c2fd86717a462d647ead1d1489d";
logging-data="29438"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/S1f+EyvVM7xS4SkdlkbCY"
Cancel-Lock: sha1:SBbvxz6Ty9/vp2EyGieXX8+Xe1Q=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 14 May 2021 08:57 UTC

BGB <cr88192@gmail.com> writes:
>On 5/13/2021 11:37 AM, Anton Ertl wrote:
>> * The clock rate of IA-64 implementations was low compared to the OoO
>> competition. And that's the case for all in-order implementations
>> in this century: The step from in-order Bonnell (first generation
>> Atom) to OoO Silvermont saw a big clock rate increase, likewise
>> SPARC in-order implementations had low clock rate, but once they
>> switched to OoO, the clock rates were high, and ARM in-order
>> implementations also have lower clock rates than their OoO siblings
>> (they are admittedly intended for low-power usage). From
>> discussions here it seems to me that the reason is that in in-order
>> cores there are feedback loops that affect the whole pipeline (which
>> therefore impose a relatively low limit on the clock rate), while in
>> OoO cores feedback loops tend to be more local and therefore allow
>> higher clock rates.
>>
>
>Kinda wondering if this is related to an observation in my own project:
>To keep the pipeline consistent, there is a big "Hold" signal that
>basically stops everything connected to the pipeline from moving.
>
>Things related to this hold signal are a big source of timing issues,
>and pretty much anything dependent on this signal has a harder time with
>timing (particularly if it is used on inputs to a stage).
>
>The presence of some big global hold/stall signal seems to be implicit
>in the design of an in-order pipeline.
>
>
>Meanwhile, parts of the core which operate independently of this signal
>can seemingly do more work without failing timing as easily.
>
>If an OoO has all of the parts of the core operating independently,
>without a global stall, it is possible this makes timing easier, and
>allows for higher clock speeds?...

Yes, it's these kinds of considerations that I meant. As others have
pointed out, there are ways around that, but it seems that most of
those microarchitects who were willing to go there also took another
step and went for OoO. IIRC there was a fast-clocked in-order POWER
implementation that was the exception that proved the rule.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64

<2021May14.110233@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16740&group=comp.arch#16740

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: IA-64
Date: Fri, 14 May 2021 09:02:33 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 80
Message-ID: <2021May14.110233@mips.complang.tuwien.ac.at>
References: <2021May13.171529@mips.complang.tuwien.ac.at> <memo.20210513193435.13980X@jgd.cix.co.uk>
Injection-Info: reader02.eternal-september.org; posting-host="5d5f7c2fd86717a462d647ead1d1489d";
logging-data="17175"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+gKs/Sgmvol0n0bfmJZ/xl"
Cancel-Lock: sha1:qvg7Yu5XM29DLLsVSCBdiPzxnmk=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 14 May 2021 09:02 UTC

jgd@cix.co.uk (John Dallman) writes:
>In article <2021May13.171529@mips.complang.tuwien.ac.at>,
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> >* Excessive use of predicate registers.
>> Not sure about excessive. Given the premise of compiler-controlled
>> instruction-level parallelism, using predication is a good idea.
>
>Indeed, but having 64 of them (IIRC) seems like far too many.

Fits nicely into a 64-bit word, so the cost of having that many seems
to be small. Given their use in software-pipelined loops, there may
also be good uses for having that many. What I find a bit strange is
that comparison instructions always set two predicate bits: the result
and its complement; so maybe 32 real predicates were enough.
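
A small Python sketch of the two points above: 64 one-bit predicates packed into a single 64-bit word, and a compare that writes both the result and its complement so that both arms of an if can be issued as predicated instructions. The helper names and the choice of predicate numbers 6 and 7 are arbitrary, and this is an illustration of the idea, not of the actual IA-64 encoding.

```python
MASK64 = (1 << 64) - 1


def set_pred(pr, idx, value):
    """Return the 64-bit predicate word with bit `idx` set to `value`."""
    bit = 1 << idx
    return (pr | bit) if value else (pr & ~bit & MASK64)


def get_pred(pr, idx):
    """Read one predicate bit as a bool."""
    return (pr >> idx) & 1 == 1


def cmp_lt(pr, p_true, p_false, a, b):
    """Compare that writes the result and its complement in one go."""
    cond = a < b
    pr = set_pred(pr, p_true, cond)
    return set_pred(pr, p_false, not cond)


def abs_diff(a, b):
    """If-converted |a - b|: both arms are 'issued', the predicates pick one."""
    pr = cmp_lt(0, 6, 7, a, b)
    result = 0
    if get_pred(pr, 6):       # (p6) result = b - a
        result = b - a
    if get_pred(pr, 7):       # (p7) result = a - b
        result = a - b
    return result


print(abs_diff(3, 10), abs_diff(10, 3))  # 7 7
```

Because each compare consumes two predicate numbers, code written this way has at most 32 independent conditions live at once, which is the arithmetic behind the "32 real predicates" remark.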

>> >* Excessive use of register windowing.
>> Is there any use that you would not call excessive? It seems to me
>> that you either use it or not. Again, given the premise of
>> compiler-controlled ILP, it seems like the way to go for high
>> performance.
>
>Badly expressed: complicated register windowing, requiring a complex and
>hidden Register Stack Engine, undermining the goal of having execution
>under the compiler's control.

It seems to be appropriate for the architecture to me: They thought
that they would get IPC gains from their compiler ILP stuff. One of
the remaining things is the register save and restore overhead at call
boundaries; there they did not rely on the compiler magic wand to wave
it away (virtual function calls, dynamically linked libraries, and
programmers that don't use link-time optimization (for reasons
discussed here earlier) are barriers to that), and instead decided on
an architectural approach to do it.

>This also led to the designers forgetting
>that the floating-point registers did not have register windowing,
>creating the advance-load floating-point fiasco, which wrecked the
>performance of non-leaf functions that used floating point.

The architects decided to avoid the register stack for FP; my guess is
that they did not see many function calls in FP code in their
benchmarks, so the benefit would have been small. The cost would have
been another register stack pointer, another register stack memory
area and a register stack engine capable of dealing with two register
stacks, and apparently they considered the benefit not worth that
cost. What is the advance-load floating-point fiasco?

>Yup. Itanium displayed a major fallacy that Intel suffered from in the
>1990s. I call it the supercomputing fallacy. The fastest computers in the
>world were supercomputers with ISAs designed for specialised types of
>computation. To make general-purpose computers faster, it was seen as
>necessary to imitate those ISAs and rely on software developers to
>transform their software to use them. This neglected the vast differences
>between classical HPC code and more general-purpose software, but it did
>provide someone to blame.

The Watson vs. Cray quotes posted here recently showed a similar
neglect. While the IBM S/360 family may not have had the fastest
supercomputer for much of the time, it certainly produced a lot of
revenue for IBM and its descendants are still sold, while
supercomputer companies seem to have a hard life.

And the reason for that is exactly because the software crisis is not
relevant in supercomputing: Supercomputer hardware tends to be more
expensive than software, so if the next company comes along with the
next supercomputer with some new idea how to make it faster and harder
to program, customers jump to it, and accept the cost of rewriting the
software for it.

By contrast in much of the computing world software cost is higher
than hardware cost, so instead of rewriting the software to accommodate
faster/cheaper hardware, people rather buy more expensive hardware if
they need more performance.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: IA-64 and other parallel failures

<s7lhmc$2gh$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16742&group=comp.arch#16742

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: IA-64 and other parallel failures
Date: Fri, 14 May 2021 02:59:40 -0700
Organization: A noiseless patient Spider
Lines: 89
Message-ID: <s7lhmc$2gh$1@dont-email.me>
References: <jwvsg2rfxz8.fsf-monnier+comp.arch@gnu.org>
<memo.20210512230643.13980R@jgd.cix.co.uk>
<s7ij30$q3s$1@newsreader4.netcologne.de>
<2021May13.175652@mips.complang.tuwien.ac.at> <s7kk42$2sku$1@gal.iecc.com>
<2021May14.101841@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 May 2021 09:59:41 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="965d25763e67ded1db4e08a96dc48409";
logging-data="2577"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19E6ia064Hskw58/L7eU8Ed"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:hNGceKYFC17QX8DzW/g6k0JA+As=
In-Reply-To: <2021May14.101841@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Ivan Godard - Fri, 14 May 2021 09:59 UTC

On 5/14/2021 1:18 AM, Anton Ertl wrote:
> John Levine <johnl@taugh.com> writes:
>> The Intel i860 was a fast RISC-y chip in the late 1980s, same
>> generation as the 486. They had the bright idea to expose the floating
>> point pipeline so you could do a pipelined floating instruction that
>> ran in one cycle and advanced the three stage pipeline by one stage
>> and returned the result from three instructions back. No compiler ever
>> could deal with that so I think there were assembler routines for
>> things like dot product, but nothing else used it.
>
> Bradlee et al. certainly targeted the i860 in their work, and I dimly
> remember that they did something about the explicitly advanced
> pipeline.
>
> But this idea works well only where modulo scheduling works
> well, and all the other code has to suffer the costs (even if the
> compiler can deal with it).
>
> And it pretty much fixes the microarchitecture, which is not a great
> long-term idea: If you want to make a superscalar CPU with a second
> FPU, it's interesting how you reconcile the appearance of a single
> explicitly advanced FPU pipeline with the reality of two FPUs.
>
> And the benefit of this idea? You have lessened the FP register
> pressure by two registers in some places, and have avoided FP register
> interlocking (but FP delay slots could have achieved that, too).
>
> As for compiling for it, apart from using it in modulo-scheduled loops
> (and I leave it to someone else to explore how much that complicates
> the modulo scheduler), it seems to be a relatively straightforward
> modification of a typical list scheduling algorithm (used for basic
> blocks and superblocks); I'll assume forward scheduling in the
> following: Schedule ready FP instructions as usual, but make the
> results only available after three instructions have been started; if
> the scheduler runs out of ready FP instructions and there are still
> results in the pipeline, schedule a dummy FP instruction to get the
> next result out. The heuristic priority of FP instructions could
> benefit from an adjustment that takes the priority of the results that
> it advances into account. Seems doable to me, but it apparently was
> not worth doing in a production compiler.

Or just let it come out in three cycles whether the FP pipe was used in
the meantime or not. And then schedule the consumers for when it will
come out.

Static scheduling, anyone?
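
A minimal Python sketch of that alternative, under simplifying assumptions: single issue, every FP result appears a fixed number of cycles after issue whether or not anything else enters the pipe, and ready instructions are picked in alphabetical order. The function name and the NOP marker are invented for the example.

```python
NOP = "<nop>"


def schedule_fixed_latency(deps, latency=3):
    """Static schedule where a result is ready `latency` cycles after issue.

    deps maps each instruction name to the names whose results it reads.
    """
    issued_at = {}            # instruction -> cycle in which it was issued
    order = []
    remaining = set(deps)
    while remaining:
        cycle = len(order)
        ready = sorted(
            n for n in remaining
            if all(d in issued_at and issued_at[d] + latency <= cycle
                   for d in deps[n])
        )
        if ready:
            issued_at[ready[0]] = cycle
            remaining.remove(ready[0])
            order.append(ready[0])
        elif any(s + latency > cycle for s in issued_at.values()):
            # Just wait: the pipe advances on its own, no dummy FP op needed.
            order.append(NOP)
        else:
            raise ValueError("cyclic or unknown dependency")
    return order


print(schedule_fixed_latency({"a": [], "b": [], "c": ["a", "b"]}))
```

Here the consumer c issues in cycle 4, the idle slots can hold any instruction or nothing at all, and no drain sequence is needed at the end.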

> @InProceedings{bradlee+91pldi,
> author = "David G. Bradlee and Robert R. Henry and Susan J. Eggers",
> title = "The {Marion} System for Retargetable Instruction Scheduling",
> booktitle = "SIGPLAN '91 Conference on
> Programming Language Design and Implementation",
> year = "1991",
> pages = "229--240",
> address = "Toronto",
> OPTjournal = sigplan,
> OPTvolume = "26",
> OPTnumber = "6",
> OPTmonth = jun,
> annote = "A back end generator for RISCs, consisting of simple
> instruction selection and several strategies for the
> combination of register allocation and instruction
> scheduling (Postpass, IPS, RASE). The machine
> description describes the resources needed by an
> instruction and contains means to describe
> explicitly advanced pipelines (i860), too. Machine
> descriptions were developed for the 88100, the R2000
> and the i860. The quality of the generated code is
> between the MIPS compiler's -O1 and -O2 levels."
> }
>
> @Proceedings{sigplan91,
> key = "SIGPLAN~'91",
> booktitle = "SIGPLAN~'91 Conference on
> Programming Language Design and Implementation",
> title = "SIGPLAN~'91 Conference on
> Programming Language Design and Implementation",
> year = "1991",
> OPTaddress = "Toronto",
> OPTjournal = sigplan,
> OPTvolume = "26",
> OPTnumber = "6",
> OPTmonth = jun,
> }
>
> - anton
>

