

Re: Pentium 4, Bulldozer - and IBM

<2021May28.102932@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17237&group=comp.arch#17237
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Fri, 28 May 2021 08:29:32 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 29
Message-ID: <2021May28.102932@mips.complang.tuwien.ac.at>
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com> <2QLrI.635965$nn2.430250@fx48.iad> <58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com> <s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com> <CiSrI.244$341.147@fx42.iad> <Te0sI.642819$nn2.296253@fx48.iad>
Injection-Info: reader02.eternal-september.org; posting-host="24fbea3b8cd41d70d07da302a4c7b344";
logging-data="25278"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18oGhlyB+/0Sy1qfraHb1wj"
Cancel-Lock: sha1:pZi1OEpFNpZ6frvbm4ms6Tl+q+8=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 28 May 2021 08:29 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
[z15]
>This 6-wide, zillion-stage pipeline (which henceforth shall be called
>a super-duper pipeline) makes sense if they HW multi-thread the uOps
>so that there are 6 uOps from 1 thread or 1 uOp from 6 threads,
>or whatever in between.
>
>So yes, a particular HW thread takes a branch miss and stalls,
>but 5 others are stuffing uOps into the pipeline stages
>so it really never drains.

I expect that even with a single thread, the branch predictor does not
let the pipeline drain. As soon as a misprediction is found the
correct target address is fed into the first stage of the pipeline for
fetching from there. Of course there will be a lot of stuff in flight
that is then on the mispredicted path, but I expect that these
instructions will run through the pipeline until retirement and be
thrown away there, instead of canceling them wherever they are.

As for SMT, each thread will have a shorter predicted path, and
therefore less to throw away when a misprediction appears; also,
independent threads provide more independent instructions, increasing
utilization. Well, at least that's the theory. What I have seen on
Intel Skylake and AMD Zen has been disappointing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Pentium 4, Bulldozer - and IBM

<f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17245&group=comp.arch#17245
X-Received: by 2002:ac8:4a88:: with SMTP id l8mr7928771qtq.133.1622300955191;
Sat, 29 May 2021 08:09:15 -0700 (PDT)
X-Received: by 2002:a4a:8111:: with SMTP id b17mr5704376oog.5.1622300954987;
Sat, 29 May 2021 08:09:14 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 29 May 2021 08:09:14 -0700 (PDT)
In-Reply-To: <2021May28.102932@mips.complang.tuwien.ac.at>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:2541:cd25:58ed:accd;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:2541:cd25:58ed:accd
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com>
<2QLrI.635965$nn2.430250@fx48.iad> <58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com>
<s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com> <CiSrI.244$341.147@fx42.iad>
<Te0sI.642819$nn2.296253@fx48.iad> <2021May28.102932@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>
Subject: Re: Pentium 4, Bulldozer - and IBM
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 29 May 2021 15:09:15 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 29 May 2021 15:09 UTC

On Friday, May 28, 2021 at 3:45:20 AM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> [z15]
> >This 6-wide, zillion-stage pipeline (which henceforth shall be called
> >a super-duper pipeline) makes sense if they HW multi-thread the uOps
> >so that there are 6 uOps from 1 thread or 1 uOp from 6 threads,
> >or whatever in between.
> >
> >So yes, a particular HW thread takes a branch miss and stalls,
> >but 5 others are stuffing uOps into the pipeline stages
> >so it really never drains.
<
> I expect that even with a single thread, the branch predictor does not
> let the pipeline drain. As soon as a misprediction is found the
> correct target address is fed into the first stage of the pipeline for
> fetching from there. Of course there will be a lot of stuff in flight
> that is then on the mispredicted path, but I expect that these
> instructions will run through the pipeline until retirement and be
> thrown away there, instead of canceling them wherever they are.
<
There is a 6 stage branch predictor and there is a 2 stage branch predictor
(the 2 stage predictor is new compared to Z14).
<
But also note it takes another 15 stage pipeline to perform retirement !!
{15 to get instructions to decode, 15 to retire instructions, and at least
a 15 stage pipeline to perform calculations and Dcache access} They
fetch an entire cache line in a single access and the branch predictor
has to predict at least 5 branches per fetch.
<
On the other hand there are dedicated units to perform IBM floating point,
IEEE floating point, Decimal Floating Point, Decimal Arithmetic,.....
<
One final note: the whole thing is water cooled, has 17 layers of metal,
and is 24mm^2. The water cooling probably allows for a 25% frequency
gain (4GHz->5GHz).
<
So from all of the above, and not mentioned anywhere in the IBM literature,
one can conclude that the thing is indeed SMT, nothing else makes sense.
Maybe not Niagara-like SMT, but a significant number of threads are in
progress at any clock tick of the machine.
>
> As for SMT, each thread will have a shorter predicted path, and
> therefore less to throw away when a misprediction appears; also,
> independent threads provide more independent instructions, increasing
> utilization. Well, at least that's the theory. What I have seen on
> Intel Skylake and AMD Zen has been disappointing.
<
Yes, this is the throughput side of their (IBM) argument.
<
I find it interesting that one can buy a machine with 240 cores, yet applications
only have access to 190 of these, leaving 30 cores to perform I/O !!! and
other system "things".
<
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Pentium 4, Bulldozer - and IBM

<s8tm36$7s6$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=17246&group=comp.arch#17246
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-7052-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Sat, 29 May 2021 15:20:06 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <s8tm36$7s6$1@newsreader4.netcologne.de>
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com>
<2QLrI.635965$nn2.430250@fx48.iad>
<58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com>
<s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com>
<CiSrI.244$341.147@fx42.iad> <Te0sI.642819$nn2.296253@fx48.iad>
<2021May28.102932@mips.complang.tuwien.ac.at>
<f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>
Injection-Date: Sat, 29 May 2021 15:20:06 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-7052-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:7052:0:7285:c2ff:fe6c:992d";
logging-data="8070"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sat, 29 May 2021 15:20 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> So from all of the above, and not mentioned anywhere in the IBM literature,
> one can conclude that the thing is indeed SMT, nothing else makes sense.
> Maybe not Niagara-like SMT, but a significant number of threads are in
> progress at any clock tick of the machine.

Their Redbook, IBM z15 (8561) Technical Guide, says

# The Integrated Facility for Linux (IFL) and IBM Z Integrated
# Information Processor (zIIP) processor units on the z15 server can
# be configured to run two simultaneous threads per clock cycle in
# a single processor (SMT). This feature increases the capacity of
# these processors with 25% in average over processors that are
# running single thread.

Two processes per clock cycle?

Re: Pentium 4, Bulldozer - and IBM

<jwveedpczhj.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=17248&group=comp.arch#17248
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Sat, 29 May 2021 11:52:52 -0400
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <jwveedpczhj.fsf-monnier+comp.arch@gnu.org>
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com>
<2QLrI.635965$nn2.430250@fx48.iad>
<58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com>
<s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com>
<CiSrI.244$341.147@fx42.iad> <Te0sI.642819$nn2.296253@fx48.iad>
<2021May28.102932@mips.complang.tuwien.ac.at>
<f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="2284c647c82ad9863cc162f64b0e37ff";
logging-data="4989"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/+D0t7Q73AdYL021fQZP34"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:RVJTDBx3+wd3pfO2GZ6ZvULcQQ4=
sha1:FJnV0qm1RGZM5hs8rhaHha9BL7k=
 by: Stefan Monnier - Sat, 29 May 2021 15:52 UTC

> But also note it takes another 15 stage pipeline to perform retirement !!
> {15 to get instructions to decode, 15 to retire instructions, and at least
> a 15 stage pipeline to perform calculations and Dcache access} They
> fetch an entire cache line in a single access and the branch predictor
> has to predict at least 5 branches per fetch.
[...]
> I find it interesting that one can buy a machine with 240 cores, yet applications
> only have access to 190 of these, leaving 30 cores to perform I/O !!! and
> other system "things".

Hmm... any chance these "240 cores" are really 16 cores with 15 threads
each, or something like that? The available literature suggests it's
not the case, but the numbers seem to hint in that direction ;-)

Stefan

Re: Pentium 4, Bulldozer - and IBM

<memo.20210529172611.13680B@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=17249&group=comp.arch#17249
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Sat, 29 May 2021 17:26 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <memo.20210529172611.13680B@jgd.cix.co.uk>
References: <jwveedpczhj.fsf-monnier+comp.arch@gnu.org>
Reply-To: jgd@cix.co.uk
Injection-Info: reader02.eternal-september.org; posting-host="f10b2b34770732b7eddf1b1e324a3136";
logging-data="26740"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/EA4wry50NKwIGqz2gj8ucdrKwr/0R9hI="
Cancel-Lock: sha1:nXd3cOZbUEdggb4LnzxfOP7dCd4=
 by: John Dallman - Sat, 29 May 2021 16:26 UTC

In article <jwveedpczhj.fsf-monnier+comp.arch@gnu.org>,
monnier@iro.umontreal.ca (Stefan Monnier) wrote:

> Hmm... any chance these "240 cores" are really 16 cores with 15
> threads each, or something like that? The available literature
> suggests it's not the case, but the numbers seem to hint in that
> direction ;-)

Not plausibly. The machines are very modular, built up from "Single-Chip
Modules" ("SCM") which hold variable numbers of "Processor Units" ("PU").
It's possible to upgrade the machines very incrementally, and the
customers will be very aware of the cooling requirements.

https://www.redbooks.ibm.com/abstracts/sg248850.html

John

Re: Pentium 4, Bulldozer - and IBM

<s8tq1m$2ni6$1@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=17250&group=comp.arch#17250
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.cmpublishers.com!adore2!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Sat, 29 May 2021 16:27:34 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <s8tq1m$2ni6$1@gal.iecc.com>
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com> <2021May28.102932@mips.complang.tuwien.ac.at> <f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com> <jwveedpczhj.fsf-monnier+comp.arch@gnu.org>
Injection-Date: Sat, 29 May 2021 16:27:34 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="89670"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com> <2021May28.102932@mips.complang.tuwien.ac.at> <f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com> <jwveedpczhj.fsf-monnier+comp.arch@gnu.org>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Sat, 29 May 2021 16:27 UTC

According to Stefan Monnier <monnier@iro.umontreal.ca>:
>> But also note it takes another 15 stage pipeline to perform retirement !!
>> {15 to get instructions to decode, 15 to retire instructions, and at least
>> a 15 stage pipeline to perform calculations and Dcache access} They
>> fetch an entire cache line in a single access and the branch predictor
>> has to predict at least 5 branches per fetch.
>[...]
>> I find it interesting that one can buy a machine with 240 cores, yet applications
>> only have access to 190 of these, leaving 30 cores to perform I/O !!! and
>> other system "things".

Some of them are hot spares. Z series can switch cores on the fly if one of them fails.
They really really do not want their systems to crash.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Pentium 4, Bulldozer - and IBM

<2021May29.183612@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=17251&group=comp.arch#17251
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Sat, 29 May 2021 16:36:12 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 24
Message-ID: <2021May29.183612@mips.complang.tuwien.ac.at>
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com> <2QLrI.635965$nn2.430250@fx48.iad> <58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com> <s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com> <CiSrI.244$341.147@fx42.iad> <Te0sI.642819$nn2.296253@fx48.iad> <2021May28.102932@mips.complang.tuwien.ac.at> <f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com> <jwveedpczhj.fsf-monnier+comp.arch@gnu.org>
Injection-Info: reader02.eternal-september.org; posting-host="ca7182227178d9a3349e5923eb9d036e";
logging-data="21833"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/APiKBky39OQ06z6tZ8rh5"
Cancel-Lock: sha1:dFMq/mFF5Tl4NrWSy3St0P0rPjU=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Sat, 29 May 2021 16:36 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> I find it interesting that one can buy a machine with 240 cores, yet applications
>> only have access to 190 of these, leaving 30 cores to perform I/O !!! and
>> other system "things".

240-190=50 but see below.

>Hmm... any chance these "240 cores" are really 16 cores with 15 threads
>each, or something like that?

No. Each CPU has 12 cores, and the Max190 has 5 drawers with 43 PUs
(cores) per drawer (that would be 215 PUs, so 25 PUs are for internal
things); I guess they have 4 CPUs per drawer, and 1-2 cores per CPU
are disabled for yield reasons. Each core supports 2-way SMT.
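
The counts above can be cross-checked with a little arithmetic. The following is only a sketch: the 4 CPUs per drawer is the guess made above, not a documented figure, and the function and variable names are made up for illustration.

```python
# Cross-check of the PU (core) counts discussed above.
# DRAWERS, ACTIVE_PUS_PER_DRAWER, CUSTOMER_PUS and CORES_PER_CPU come from
# the post; CPUS_PER_DRAWER is the post's guess, so treat it as an assumption.

DRAWERS = 5
ACTIVE_PUS_PER_DRAWER = 43
CUSTOMER_PUS = 190
CORES_PER_CPU = 12
CPUS_PER_DRAWER = 4  # assumption (guessed in the post)


def pu_accounting(drawers, active_per_drawer, customer_pus,
                  cores_per_cpu, cpus_per_drawer):
    """Return a dict with the derived core counts."""
    active_total = drawers * active_per_drawer
    physical_per_drawer = cores_per_cpu * cpus_per_drawer
    unused_per_drawer = physical_per_drawer - active_per_drawer
    return {
        "active_total": active_total,
        "internal_pus": active_total - customer_pus,
        "physical_total": drawers * physical_per_drawer,
        "unused_per_drawer": unused_per_drawer,
        "unused_per_cpu": unused_per_drawer / cpus_per_drawer,
    }


result = pu_accounting(DRAWERS, ACTIVE_PUS_PER_DRAWER, CUSTOMER_PUS,
                       CORES_PER_CPU, CPUS_PER_DRAWER)
for name, value in result.items():
    print(f"{name}: {value}")
```

Under that guess the physical total is 5 * 4 * 12 = 240 cores, which matches the 240 quoted earlier in the thread, with 25 active PUs used internally and 5 cores per drawer (1.25 per CPU on average) left unused.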

Sources:
https://en.wikipedia.org/wiki/IBM_z15_%28microprocessor%29
https://www.redbooks.ibm.com/redbooks/pdfs/sg248851.pdf
https://www.ibm.com/downloads/cas/NN7GBPJ1

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Pentium 4, Bulldozer - and IBM

<s8tvn1$112k$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=17256&group=comp.arch#17256
Path: i2pn2.org!i2pn.org!aioe.org!/FKOcGQMirZgkZJCo9x3IA.user.gioia.aioe.org.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
Date: Sat, 29 May 2021 20:04:18 +0200
Organization: Aioe.org NNTP Server
Lines: 62
Message-ID: <s8tvn1$112k$1@gioia.aioe.org>
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com>
<2QLrI.635965$nn2.430250@fx48.iad>
<58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com>
<s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com>
<CiSrI.244$341.147@fx42.iad> <Te0sI.642819$nn2.296253@fx48.iad>
<2021May28.102932@mips.complang.tuwien.ac.at>
<f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>
NNTP-Posting-Host: /FKOcGQMirZgkZJCo9x3IA.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.7
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Sat, 29 May 2021 18:04 UTC

MitchAlsup wrote:
> On Friday, May 28, 2021 at 3:45:20 AM UTC-5, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>> [z15]
>>> This 6-wide, zillion-stage pipeline (which henceforth shall be called
>>> a super-duper pipeline) makes sense if they HW multi-thread the uOps
>>> so that there are 6 uOps from 1 thread or 1 uOp from 6 threads,
>>> or whatever in between.
>>>
>>> So yes, a particular HW thread takes a branch miss and stalls,
>>> but 5 others are stuffing uOps into the pipeline stages
>>> so it really never drains.
> <
>> I expect that even with a single thread, the branch predictor does not
>> let the pipeline drain. As soon as a misprediction is found the
>> correct target address is fed into the first stage of the pipeline for
>> fetching from there. Of course there will be a lot of stuff in flight
>> that is then on the mispredicted path, but I expect that these
>> instructions will run through the pipeline until retirement and be
>> thrown away there, instead of canceling them wherever they are.
> <
> There is a 6 stage branch predictor and there is a 2 stage branch predictor
> (the 2 stage predictor is new compared to Z14).
> <
> But also note it takes another 15 stage pipeline to perform retirement !!
> {15 to get instructions to decode, 15 to retire instructions, and at least
> a 15 stage pipeline to perform calculations and Dcache access} They
> fetch an entire cache line in a single access and the branch predictor
> has to predict at least 5 branches per fetch.
> <
> On the other hand there are dedicated units to perform IBM floating point,
> IEEE floating point, Decimal Floating Point, Decimal Arithmetic,.....
> <
> One final note: the whole thing is water cooled, has 17 layers of metal,
> and is 24mm^2. The water cooling probably allows for a 25% frequency
> gain (4GHz->5GHz).
> <
> So from all of the above, and not mentioned anywhere in the IBM literature,
> one can conclude that the thing is indeed SMT, nothing else makes sense.
> Maybe not Niagara-like SMT, but a significant number of threads are in
> progress at any clock tick of the machine.
>>
>> As for SMT, each thread will have a shorter predicted path, and
>> therefore less to throw away when a misprediction appears; also,
>> independent threads provide more independent instructions, increasing
>> utilization. Well, at least that's the theory. What I have seen on
>> Intel Skylake and AMD Zen has been disappointing.
> <
> Yes, this is the throughput side of their (IBM) argument.
> <
> I find it interesting that one can buy a machine with 240 cores, yet applications
> only have access to 190 of these, leaving 30 cores to perform I/O !!! and
> other system "things".

240 - 190 = 50 "system reserved" cores?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Pentium 4, Bulldozer - and IBM

<gxOsI.6078$jf1.2697@fx37.iad>

https://www.novabbs.com/devel/article-flat.php?id=17271&group=comp.arch#17271
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!fdc3.netnews.com!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx37.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Pentium 4, Bulldozer - and IBM
References: <b9dff470-b8b5-4514-824e-0436bb44ffd5n@googlegroups.com> <2QLrI.635965$nn2.430250@fx48.iad> <58bceb57-3323-44f5-8471-3977313954c4n@googlegroups.com> <s8ok50$tn8$1@dont-email.me> <s8on45$rte$1@gal.iecc.com> <CiSrI.244$341.147@fx42.iad> <Te0sI.642819$nn2.296253@fx48.iad> <2021May28.102932@mips.complang.tuwien.ac.at> <f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>
In-Reply-To: <f4425362-d350-4db4-a724-5ea422ba2334n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 77
Message-ID: <gxOsI.6078$jf1.2697@fx37.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 30 May 2021 15:52:44 UTC
Date: Sun, 30 May 2021 11:52:06 -0400
X-Received-Bytes: 4731
 by: EricP - Sun, 30 May 2021 15:52 UTC

MitchAlsup wrote:
> On Friday, May 28, 2021 at 3:45:20 AM UTC-5, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>> [z15]
>>> This 6-wide, zillion-stage pipeline (which henceforth shall be called
>>> a super-duper pipeline) makes sense if they HW multi-thread the uOps
>>> so that there are 6 uOps from 1 thread or 1 uOp from 6 threads,
>>> or whatever in between.
>>>
>>> So yes, a particular HW thread takes a branch miss and stalls,
>>> but 5 others are stuffing uOps into the pipeline stages
>>> so it really never drains.
> <
>> I expect that even with a single thread, the branch predictor does not
>> let the pipeline drain. As soon as a misprediction is found the
>> correct target address is fed into the first stage of the pipeline for
>> fetching from there. Of course there will be a lot of stuff in flight
>> that is then on the mispredicted path, but I expect that these
>> instructions will run through the pipeline until retirement and be
>> thrown away there, instead of canceling them wherever they are.
> <
> There is a 6 stage branch predictor and there is a 2 stage branch predictor
> (the 2 stage predictor is new compared to Z14).
> <
> But also note it takes another 15 stage pipeline to perform retirement !!
> {15 to get instructions to decode, 15 to retire instructions, and at least
> a 15 stage pipeline to perform calculations and Dcache access} They
> fetch an entire cache line in a single access and the branch predictor
> has to predict at least 5 branches per fetch.

That paper on the z15 branch predictor says that the minimum delay
for a branch mispredict is 26 cycles, and the statistical penalty
is 35 cycles because of other queue disruptions.
An I$L1 miss with L2 hit is a minimum of 8 extra clocks.

So a fair portion of that pipeline drains on a branch mispredict.
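
To get a feel for what those penalties mean for throughput, here is a rough back-of-the-envelope model. Only the 26- and 35-cycle penalties and the 6-wide width come from the discussion; the branch fraction, the mispredict rate and the ideal base CPI are assumed values picked purely for illustration, and the function name is hypothetical.

```python
# Rough model: every mispredicted branch adds a fixed number of penalty cycles.
# The penalties (26 minimum, 35 statistical) are the figures quoted from the
# z15 branch predictor paper; all other numbers are illustrative assumptions.

def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate,
                           penalty_cycles):
    """Average CPI when each mispredict costs penalty_cycles extra cycles."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


WIDTH = 6                 # pipeline width mentioned in the thread
BASE_CPI = 1 / WIDTH      # assumption: ideal case of 6 instructions per cycle
BRANCH_FRACTION = 0.20    # assumption: one instruction in five is a branch
MISPREDICT_RATE = 0.02    # assumption: 2% of branches are mispredicted

for penalty in (26, 35):
    cpi = cycles_per_instruction(BASE_CPI, BRANCH_FRACTION,
                                 MISPREDICT_RATE, penalty)
    print(f"penalty {penalty} cycles: CPI {cpi:.3f}, IPC {1 / cpi:.2f}")
```

The model ignores overlap between mispredicts and other stalls, so it only shows the trend: with a penalty this long, even a small mispredict rate takes a visible bite out of a 6-wide machine.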

> <
> On the other hand there are dedicated units to perform IBM floating point,
> IEEE floating point, Decimal Floating Point, Decimal Arithmetic,.....
> <
> One final note: the whole thing is water cooled, has 17 layers of metal,
> and is 24mm^2. The water cooling probably allows for a 25% frequency
> gain (4GHz->5GHz).
> <
> So from all of the above, and not mentioned anywhere in the IBM literature,
> one can conclude that the thing is indeed SMT, nothing else makes sense.
> Maybe not Niagara-like SMT, but a significant number of threads are in
> progress at any clock tick of the machine.

The branch paper says it is SMT and refers to "SMT2" mode
and the text implies there are 2 HW threads.

If a single branch mispredict can drain 35 clocks * 6 = 210 instructions,
and they say instructions average 5 bytes each, so that's about 1kB lost,
I would have expected that more than 1 alternate thread fetch buffer would be
required to keep the pipeline busy by providing an alternate 1kB of instructions.
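
The estimate above can be written out as a short sketch. The three inputs are the figures quoted from the paper; the function name is made up for illustration.

```python
# Work lost to one branch mispredict, using the figures quoted above.

PENALTY_CYCLES = 35   # statistical mispredict penalty, in clocks
PIPELINE_WIDTH = 6    # instructions per clock
AVG_INSN_BYTES = 5    # average instruction length, in bytes


def lost_work(penalty_cycles, width, avg_insn_bytes):
    """Return (instruction slots lost, bytes of instructions lost)."""
    slots = penalty_cycles * width
    return slots, slots * avg_insn_bytes


slots, lost_bytes = lost_work(PENALTY_CYCLES, PIPELINE_WIDTH, AVG_INSN_BYTES)
print(f"{slots} instruction slots, {lost_bytes} bytes of instructions")
```

That gives 210 slots and 1050 bytes, i.e. roughly the 1kB mentioned; with the 26-cycle minimum penalty it would be 156 slots and 780 bytes.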

>> As for SMT, each thread will have a shorter predicted path, and
>> therefore less to throw away when a misprediction appears; also,
>> independent threads provide more independent instructions, increasing
>> utilization. Well, at least that's the theory. What I have seen on
>> Intel Skylake and AMD Zen has been disappointing.
> <
> Yes, this is the throughput side of their (IBM) argument.
> <
> I find it interesting that one can buy a machine with 240 cores, yet applications
> only have access to 190 of these, leaving 30 cores to perform I/O !!! and
> other system "things".
> <
>> - anton
>> --
>> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
>> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
