Rocksolid Light



devel / comp.arch / Mercurial cores

Subject -- Author
* Mercurial cores -- Stephen Fuld
+* Re: Mercurial cores -- MitchAlsup
|+* Re: Mercurial cores -- MitchAlsup
||`* Re: Mercurial cores -- EricP
|| `* Re: Mercurial cores -- MitchAlsup
||  +* Re: Mercurial cores -- Ivan Godard
||  |`- Re: Mercurial cores -- MitchAlsup
||  +* Re: Mercurial cores -- Anton Ertl
||  |`* Re: Mercurial cores -- Stephen Fuld
||  | +* Re: Mercurial cores -- Anton Ertl
||  | |+- Re: Mercurial cores -- Anton Ertl
||  | |`* Re: Mercurial cores -- Stephen Fuld
||  | | `* Re: Mercurial cores -- Thomas Koenig
||  | |  +* Re: Mercurial cores -- Stephen Fuld
||  | |  |`- Re: Mercurial cores -- MitchAlsup
||  | |  `* Re: Mercurial cores -- MitchAlsup
||  | |   `* Re: Mercurial cores -- Quadibloc
||  | |    +- Re: Mercurial cores -- Richard Damon
||  | |    +- Re: Mercurial cores -- MitchAlsup
||  | |    +- Re: hoststuff, was Mercurial cores -- John Levine
||  | |    `- Re: Mercurial cores -- MitchAlsup
||  | `- Re: Mercurial cores -- Quadibloc
||  `* Re: Mercurial cores -- EricP
||   `* Re: Mercurial cores -- MitchAlsup
||    +- Re: Mercurial cores -- Quadibloc
||    `* Re: Mercurial cores -- Paul A. Clayton
||     +- Re: Mercurial cores -- MitchAlsup
||     +* Re: Mercurial cores -- Quadibloc
||     |`- Re: Mercurial cores -- MitchAlsup
||     `- Re: Mercurial cores -- Quadibloc
|`* Re: Mercurial cores -- Stefan Monnier
| `- Re: Mercurial cores -- MitchAlsup
+- Re: Mercurial cores -- Quadibloc
+* Re: Mercurial cores -- Anton Ertl
|+* Re: Mercurial cores -- BGB
||`* Re: Mercurial cores -- Terje Mathisen
|| +- Re: Mercurial cores -- BGB
|| `* Re: Mercurial cores -- MitchAlsup
||  +* Re: Mercurial cores -- Quadibloc
||  |`- Re: Mercurial cores -- MitchAlsup
||  `- Re: Mercurial cores -- Terje Mathisen
|`* Re: Mercurial cores -- Stephen Fuld
| `- Re: Mercurial cores -- Anton Ertl
`* Re: everything old is new again, or Mercurial cores -- John Levine
 +- Re: everything old is new again, or Mercurial cores -- Ivan Godard
 +* Re: everything old is new again, or Mercurial cores -- Terje Mathisen
 |+* Re: everything old is new again, or Mercurial cores -- Michael S
 ||+- Re: everything old is new again, or Mercurial cores -- MitchAlsup
 ||`* Re: everything old is new again, or Mercurial cores -- John Levine
 || `- Re: everything old is new again, or Mercurial cores -- Brian G. Lucas
 |+- Re: everything old is new again, or Mercurial cores -- Stefan Monnier
 |`- Re: everything old is new again, or Mercurial cores -- MitchAlsup
 `- Re: everything old is new again, or Mercurial cores -- MitchAlsup

Mercurial cores

<sf139h$p9h$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=19732&group=comp.arch#19732

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Mercurial cores
Date: Wed, 11 Aug 2021 11:01:18 -0700
Organization: A noiseless patient Spider
Lines: 19
Message-ID: <sf139h$p9h$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 11 Aug 2021 18:01:21 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7646e3e2bee7a6a9a03d108a9ff616f9";
logging-data="25905"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19cr1oFqRLGZq6KwN+6hq6oDy7lnB61H3g="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:+38FSMi0CSlACvlzF59w5TyQZRo=
Content-Language: en-US
X-Mozilla-News-Host: snews://news.eternal-september.org:563
 by: Stephen Fuld - Wed, 11 Aug 2021 18:01 UTC

I am not sure what to make of this, but as a software guy, I expect the
hardware to always do the right thing or tell me it didn't.

https://www.theregister.com/2021/06/04/google_chip_flaws/

If you don't like The Register, the article contains a link to the PDF
of the ACM paper.

So, is the phenomenon real? (I assume Google is seeing what they are
seeing.) Are they right that it will get worse as the technology
improves? As I said above, I don't like software solutions to this
problem, but what is the right hardware solution? Core duplication and
checking? Back to internal parity checks? Other solutions?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Mercurial cores

<aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19734&group=comp.arch#19734

X-Received: by 2002:a05:6214:5182:: with SMTP id kl2mr294694qvb.19.1628711839461;
Wed, 11 Aug 2021 12:57:19 -0700 (PDT)
X-Received: by 2002:aca:59c6:: with SMTP id n189mr9100842oib.44.1628711839232;
Wed, 11 Aug 2021 12:57:19 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 Aug 2021 12:57:19 -0700 (PDT)
In-Reply-To: <sf139h$p9h$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:407c:5db4:b829:fe2;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:407c:5db4:b829:fe2
References: <sf139h$p9h$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
Subject: Re: Mercurial cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 11 Aug 2021 19:57:19 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Wed, 11 Aug 2021 19:57 UTC

On Wednesday, August 11, 2021 at 1:01:23 PM UTC-5, Stephen Fuld wrote:
> I am not sure what to make of this, but as a software guy, I expect the
> hardware to always do the right thing or tell me it didn't.
>
> https://www.theregister.com/2021/06/04/google_chip_flaws/
>
> If you don't like The Register, the article contains a link to the PDF
> of the ACM paper.
>
> So, is the phenomenon real? (I assume Google is seeing what the are
> seeing). Are they right that it will get worse as the technology
> improves. As I said above, I don't like software solutions to this
> problem, but what is the right hardware solution? Core duplication and
> checking? Back to internal parity checks? Other solutions?
<
Back in the 68020 days (say around 30 MHz) we found chips that would run
at 33ns (30 MHz) forever* but if pushed to 32ns fail through various speed
paths in the chip. These speed paths were different on different chips based
on which transistors were faster (than average) and which were slower (than
average). Moto, back then, would sell '020s in bins which were 2ns slower
than the chips would run on the testers.
>
My guess is that this phenomenon is real, and that if they declocked the
chip from 20× to 19× the vast majority of them would quit creating these
kinds of failures.
<
In a physical sense, each transistor is an experiment, and at today's
lithography, one side of the transistor may see 137 ions implanted,
while the other side (7nm away) sees 142 ions implanted. So the
transistor operates differently than theory, and sooner or later one
of the transistors in one of the millions of speed paths on a die is
slower than when the path was tested on the tester. There are all
sorts of aging phenomena also going on.
<
In addition to the above, the transistors are now in the range where one
can count the number of atoms across the gate on 2 hands (soon 1 hand)!
At these kinds of scales, quantum effects start to appear which make the
operation of the transistor more difficult to describe, model, and design
around. Instead of a MOSFET having a smooth V-I curve based on gate
voltage, the curve has wiggles in it based not only on what voltage is
being applied, but upon the change in voltage over the last picosecond.
<
At modern speeds (5GHz) even wires are displaying certain effects like
"skin effects" on voltage edges faster than 6ps and Ampere's effects on
edge speeds faster than 4ps. These effects add to the problems spoken
of above.
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)
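Mitch's declocking estimate can be put in numbers: dropping the clock multiplier from 20× to 19× gives up 5% of throughput but buys every speed path roughly 5.3% more timing slack per cycle. A quick sketch (the 100 MHz reference clock is an assumed illustrative value, not from the post):

```python
# Assumed reference clock; real parts multiply some such base frequency.
base_mhz = 100.0

f_20x = base_mhz * 20          # 2000 MHz
f_19x = base_mhz * 19          # 1900 MHz

t_20x = 1000.0 / f_20x         # cycle time in ns: 0.500 ns
t_19x = 1000.0 / f_19x         # cycle time in ns: ~0.526 ns

# Extra slack every speed path gains per cycle:
extra_slack = (t_19x - t_20x) / t_20x * 100
print(f"throughput lost: 5.0%, slack gained: {extra_slack:.1f}%")
```

The asymmetry is the point: a marginal path that misses timing by a few picoseconds at 20× gets the whole 26 ps of extra slack at 19×.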

Re: Mercurial cores

<8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19735&group=comp.arch#19735

X-Received: by 2002:ac8:5ecd:: with SMTP id s13mr500129qtx.16.1628712921443;
Wed, 11 Aug 2021 13:15:21 -0700 (PDT)
X-Received: by 2002:a4a:918e:: with SMTP id d14mr392824ooh.90.1628712921203;
Wed, 11 Aug 2021 13:15:21 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 Aug 2021 13:15:21 -0700 (PDT)
In-Reply-To: <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:407c:5db4:b829:fe2;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:407c:5db4:b829:fe2
References: <sf139h$p9h$1@dont-email.me> <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com>
Subject: Re: Mercurial cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 11 Aug 2021 20:15:21 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Wed, 11 Aug 2021 20:15 UTC

On Wednesday, August 11, 2021 at 2:57:20 PM UTC-5, MitchAlsup wrote:
> On Wednesday, August 11, 2021 at 1:01:23 PM UTC-5, Stephen Fuld wrote:
> > I am not sure what to make of this, but as a software guy, I expect the
> > hardware to always do the right thing or tell me it didn't.
> >
> > https://www.theregister.com/2021/06/04/google_chip_flaws/
> >
> > If you don't like The Register, the article contains a link to the PDF
> > of the ACM paper.
> >
> > So, is the phenomenon real? (I assume Google is seeing what the are
> > seeing). Are they right that it will get worse as the technology
> > improves. As I said above, I don't like software solutions to this
> > problem, but what is the right hardware solution? Core duplication and
> > checking? Back to internal parity checks? Other solutions?
> <
> Back in the 68020 days (say around 30 MHz) we found chips that would run
> at 33ns (30 MHz) forever* but if pushed to 32ns fail through various speed
> paths in the chip. These speed paths were different on different chips based
> on which transistors were faster (than average) and which were slower (than
> average). Moto, back then, would sell '020s in bins which were 2ns slower
> than the chips would run on the testers.
> >
> My guess is that this phenomenon is real, and that if they declocked the
> chip from 20× to 19× the vast majority of them would quit creating these
> kinds of failures.
> <
> In a physical sense, each transistor is an experiment, and at todays
> lithography, one side of the transistor may see 137 ions implanted,
> while the other side (7nm away) sees 142 ions implanted. So the
> transistor operates differently than theory, and sooner or later one
> of the transistors in one of the millions of speed paths on a die is
> slower than when the path was tested on the tester. There are all
> sorts of aging phenomenon also going on.
> <
> In addition to the above, the transistors are now in the range where one
> can count the number of atoms across the gate on 2 hands (soon 1 hand)!
> At these kind of scales, quantum effects start to appear which make the
> operation of the transistor more difficult to describe, model, and design
> around. Instead of a MOSFET having a smooth V-I curve based on gate
> voltage, the curve has wiggles in it based not only on what voltage is
> being applied, but upon the change in voltage over the last picosecond.
> <
> At modern speeds (5GHz) even wires are displaying certain effects like
> "skin effects" on voltage edges faster than 6ps and Ampere's effects on
> edge speeds faster than 4ps. These effects add to the problems spoken
> of above.
<
After reading the PDF and thinking about it for an hour::
<
This reads a lot like SIMD considered bad. For example, they change a
library (probably from straight x86 to one using the vector instructions
:: SSE[k]) and all of a sudden, various corruptions start happening.
<
They describe a CPU like a set of common registers surrounded by an
amalgam of accelerators--these accelerators are not the stuff x86
had originally.
<
It sounds a bit like the Pentium FDIV bug and a bit like -- "if it hurts
QUIT doing it."
> >
> >
> > --
> > - Stephen Fuld
> > (e-mail address disguised to prevent spam)

Re: Mercurial cores

<jwvo8a3zp7b.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=19736&group=comp.arch#19736

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: Mercurial cores
Date: Wed, 11 Aug 2021 16:49:44 -0400
Organization: A noiseless patient Spider
Lines: 16
Message-ID: <jwvo8a3zp7b.fsf-monnier+comp.arch@gnu.org>
References: <sf139h$p9h$1@dont-email.me>
<aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="d022e50b853368ef5455047901183a36";
logging-data="21770"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/L2+dHGLU1dgW1Bej02oB/"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:OuLWutFNwKNSFxl7LBkdf81wjzY=
sha1:/2ufyWZBjb+9WDGw5+W2hMeK1D4=
 by: Stefan Monnier - Wed, 11 Aug 2021 20:49 UTC

> My guess is that this phenomenon is real, and that if they declocked the
> chip from 20× to 19× the vast majority of them would quit creating these
> kinds of failures.

I guess there's also the difficulty to account for all the possible
cases during testing, due to the complexity of the cores, multiplied by
the possible operating circumstances (temperature, voltage, frequency,
multiple frequency domains, ...). Also I'm not sure how voltage and
frequency transitions are implemented but I'd be surprised if they don't
make all of that yet a bit more interesting since they are applied quite
often nowadays so they need to be reasonably fast.

All in all, the reliability we experience is nothing more than astounding.

Stefan

Re: Mercurial cores

<a2b0c535-7093-44c9-8a3a-1338a8236643n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19737&group=comp.arch#19737

X-Received: by 2002:ad4:438e:: with SMTP id s14mr731438qvr.26.1628717858984;
Wed, 11 Aug 2021 14:37:38 -0700 (PDT)
X-Received: by 2002:a05:6808:aba:: with SMTP id r26mr8855334oij.30.1628717858746;
Wed, 11 Aug 2021 14:37:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!news.niel.me!usenet.pasdenom.info!usenet-fr.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 Aug 2021 14:37:38 -0700 (PDT)
In-Reply-To: <jwvo8a3zp7b.fsf-monnier+comp.arch@gnu.org>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sf139h$p9h$1@dont-email.me> <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
<jwvo8a3zp7b.fsf-monnier+comp.arch@gnu.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a2b0c535-7093-44c9-8a3a-1338a8236643n@googlegroups.com>
Subject: Re: Mercurial cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 11 Aug 2021 21:37:38 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Wed, 11 Aug 2021 21:37 UTC

On Wednesday, August 11, 2021 at 3:49:49 PM UTC-5, Stefan Monnier wrote:
> > My guess is that this phenomenon is real, and that if they declocked the
> > chip from 20× to 19× the vast majority of them would quit creating these
> > kinds of failures.
<
> I guess there's also the difficulty to account for all the possible
> cases during testing, due to the complexity of the cores, multiplied by
> the possible operating circumstances (temperature, voltage, frequency,
> multiple frequency domains, ...).
<
Back in the ½µ days, with asynchronous interfaces, it could take 1T test
vectors to properly test a part. This corresponds to 10% of the chip's service
life !! Not to mention the $5,000 per hour of tester "value"...........
<
> Also I'm not sure how voltage and
> frequency transitions are implemented but I'd be surprised if they don't
> make all of that yet a bit more interesting since they are applied quite
> often nowadays so they need to be reasonably fast.
<
You just multiplied the problem space by about 1000×
>
> All in all, the reliability we experience is nothing more than astounding.
<
If those data centers did not have 100,000s of cores, they might never notice
the problem, either.
>
>
> Stefan

Re: Mercurial cores

<ngYQI.12129$CgPc.4376@fx01.iad>


https://www.novabbs.com/devel/article-flat.php?id=19739&group=comp.arch#19739

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx01.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Mercurial cores
References: <sf139h$p9h$1@dont-email.me> <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com> <8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com>
In-Reply-To: <8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 40
Message-ID: <ngYQI.12129$CgPc.4376@fx01.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 11 Aug 2021 22:34:59 UTC
Date: Wed, 11 Aug 2021 18:34:52 -0400
X-Received-Bytes: 2493
 by: EricP - Wed, 11 Aug 2021 22:34 UTC

MitchAlsup wrote:
> On Wednesday, August 11, 2021 at 2:57:20 PM UTC-5, MitchAlsup wrote:
>> On Wednesday, August 11, 2021 at 1:01:23 PM UTC-5, Stephen Fuld wrote:
>>> I am not sure what to make of this, but as a software guy, I expect the
>>> hardware to always do the right thing or tell me it didn't.
>>>
>>> https://www.theregister.com/2021/06/04/google_chip_flaws/
>>>
>>> If you don't like The Register, the article contains a link to the PDF
>>> of the ACM paper.
>>>
>>> So, is the phenomenon real? (I assume Google is seeing what the are
>>> seeing). Are they right that it will get worse as the technology
>>> improves. As I said above, I don't like software solutions to this
>>> problem, but what is the right hardware solution? Core duplication and
>>> checking? Back to internal parity checks? Other solutions?
>> <
> <
> After reading the PDF and thinking about it for an hour::
> <
> This reads a lot like SIMD considered bad. For example, they change a
> library (probably from straight x86 to one using the vector instructions
> :: SSE[k]) and all of a sudden, various corruptions start happening.
> <
> They describe a CPU like a set of common registers surrounded by an
> amalgam of accelerators--these accelerators are not the stuff x86
> had originally.
> <
> It sounds a bit like the Pentium FDIV bug and a bit like -- "if it hurts
> QUIT doing it."

I read it as a plea for more automatic internal self checks
so they can detect the intermittent problems when they occur.
Equivalent to memory ECC but for logic.

If they can't determine which CPUs are hurting and when,
they can't quit doing it.

Re: Mercurial cores

<5dec961e-1be7-498a-9c75-0e49fa2103c7n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19741&group=comp.arch#19741

X-Received: by 2002:ac8:6611:: with SMTP id c17mr1073259qtp.392.1628723445348;
Wed, 11 Aug 2021 16:10:45 -0700 (PDT)
X-Received: by 2002:aca:59c6:: with SMTP id n189mr9683618oib.44.1628723445129;
Wed, 11 Aug 2021 16:10:45 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 Aug 2021 16:10:44 -0700 (PDT)
In-Reply-To: <ngYQI.12129$CgPc.4376@fx01.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sf139h$p9h$1@dont-email.me> <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
<8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com> <ngYQI.12129$CgPc.4376@fx01.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5dec961e-1be7-498a-9c75-0e49fa2103c7n@googlegroups.com>
Subject: Re: Mercurial cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 11 Aug 2021 23:10:45 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Wed, 11 Aug 2021 23:10 UTC

On Wednesday, August 11, 2021 at 5:35:02 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, August 11, 2021 at 2:57:20 PM UTC-5, MitchAlsup wrote:
> >> On Wednesday, August 11, 2021 at 1:01:23 PM UTC-5, Stephen Fuld wrote:
> >>> I am not sure what to make of this, but as a software guy, I expect the
> >>> hardware to always do the right thing or tell me it didn't.
> >>>
> >>> https://www.theregister.com/2021/06/04/google_chip_flaws/
> >>>
> >>> If you don't like The Register, the article contains a link to the PDF
> >>> of the ACM paper.
> >>>
> >>> So, is the phenomenon real? (I assume Google is seeing what the are
> >>> seeing). Are they right that it will get worse as the technology
> >>> improves. As I said above, I don't like software solutions to this
> >>> problem, but what is the right hardware solution? Core duplication and
> >>> checking? Back to internal parity checks? Other solutions?
> >> <
> > <
> > After reading the PDF and thinking about it for an hour::
> > <
> > This reads a lot like SIMD considered bad. For example, they change a
> > library (probably from straight x86 to one using the vector instructions
> > :: SSE[k]) and all of a sudden, various corruptions start happening.
> > <
> > They describe a CPU like a set of common registers surrounded by an
> > amalgam of accelerators--these accelerators are not the stuff x86
> > had originally.
> > <
> > It sounds a bit like the Pentium FDIV bug and a bit like -- "if it hurts
> > QUIT doing it."
<
> I read it as a plea for more automatic internal self checks
> so they can detect the intermittent problems when they occur.
> Equivalent to memory ECC but for logic.
<
Expensive. Consider, for the moment, how much logic it would take to
verify that a multiplier gave you the correct bit patterns?
<
Store {registers, caches, SRAM, DRAM} are all easily protected using various
kinds of error detecting and correcting codes. Buses and wires that simply
move bits around can utilize the same.
<
Verifying arithmetic is hard, verifying control units is even harder. I once
worked on the design of a machine which could deliver all of the correct
bit patterns to all the correct locations, taking the correct number of
cycles, and having done all this in the WRONG sequence order !!
>
> If they can't determine which cpu's are hurting and when,
> they can't quit doing it.
<
They can take all of the suspected units and turn the clock multiplier down
by 1 or 2 units of multiplication. That is what I meant by quit doing that.

Re: Mercurial cores

<sf1qch$67i$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=19743&group=comp.arch#19743

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Mercurial cores
Date: Wed, 11 Aug 2021 17:35:30 -0700
Organization: A noiseless patient Spider
Lines: 57
Message-ID: <sf1qch$67i$1@dont-email.me>
References: <sf139h$p9h$1@dont-email.me>
<aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
<8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com>
<ngYQI.12129$CgPc.4376@fx01.iad>
<5dec961e-1be7-498a-9c75-0e49fa2103c7n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 12 Aug 2021 00:35:30 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="77bde09a160b08db269157740a4b1a84";
logging-data="6386"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18tERxEtgGQXLNYJxE3xJFd"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:DbaFASnxYjYR3i0UPxfUtQUFlag=
In-Reply-To: <5dec961e-1be7-498a-9c75-0e49fa2103c7n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Thu, 12 Aug 2021 00:35 UTC

On 8/11/2021 4:10 PM, MitchAlsup wrote:
> On Wednesday, August 11, 2021 at 5:35:02 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> On Wednesday, August 11, 2021 at 2:57:20 PM UTC-5, MitchAlsup wrote:
>>>> On Wednesday, August 11, 2021 at 1:01:23 PM UTC-5, Stephen Fuld wrote:
>>>>> I am not sure what to make of this, but as a software guy, I expect the
>>>>> hardware to always do the right thing or tell me it didn't.
>>>>>
>>>>> https://www.theregister.com/2021/06/04/google_chip_flaws/
>>>>>
>>>>> If you don't like The Register, the article contains a link to the PDF
>>>>> of the ACM paper.
>>>>>
>>>>> So, is the phenomenon real? (I assume Google is seeing what the are
>>>>> seeing). Are they right that it will get worse as the technology
>>>>> improves. As I said above, I don't like software solutions to this
>>>>> problem, but what is the right hardware solution? Core duplication and
>>>>> checking? Back to internal parity checks? Other solutions?
>>>> <
>>> <
>>> After reading the PDF and thinking about it for an hour::
>>> <
>>> This reads a lot like SIMD considered bad. For example, they change a
>>> library (probably from straight x86 to one using the vector instructions
>>> :: SSE[k]) and all of a sudden, various corruptions start happening.
>>> <
>>> They describe a CPU like a set of common registers surrounded by an
>>> amalgam of accelerators--these accelerators are not the stuff x86
>>> had originally.
>>> <
>>> It sounds a bit like the Pentium FDIV bug and a bit like -- "if it hurts
>>> QUIT doing it."
> <
>> I read it as a plea for more automatic internal self checks
>> so they can detect the intermittent problems when they occur.
>> Equivalent to memory ECC but for logic.
> <
> Expensive. Consider, for the moment, how much logic would it take to
> verify a multiplier gave you the correct bit patterns ?
> <
> Store {registers, caches, SRAM, DRAM} are all easily protected using various
> kinds of error detecting and correcting codes. Buses and wires that simply
> move bits around can utilize the same.
> <
> Verifying arithmetic is hard, verifying control units is even harder. I once
> worked on the design of a machine which could deliver all of the correct
> bit patterns to all the correct locations, taking the correct number of
> cycles, and having done all this in the WRONG sequence order !!
>>
>> If they can't determine which cpu's are hurting and when,
>> they can't quit doing it.
> <
> They can take all of the suspected units and turn the clock multiplier down
> by 1 or 2 units of multiplication. That is what I meant by quit doing that.
>

At a certain point voting redundancy is cheaper.
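Ivan's alternative is triple modular redundancy: run three copies of the unit and take a majority vote, so a single faulty result is masked outright rather than merely detected. A minimal sketch of the voting step (the function name is illustrative):

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Majority vote over three redundant results.

    One disagreeing copy is outvoted; if all three differ, the
    fault can be detected but not corrected.
    """
    winner, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: multiple faults")
    return winner

# A single flaky core's answer is masked by the other two.
assert tmr_vote(42, 42, 41) == 42
```

A hardware voter does the same thing per bit with majority gates; the price is three copies of the whole unit plus the voter, which is why the thread immediately weighs it against cheaper partial checks.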

Re: Mercurial cores

<0448cd06-23f1-4a5d-b3a8-52d50cd3b220n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19744&group=comp.arch#19744

X-Received: by 2002:ac8:57c4:: with SMTP id w4mr1404495qta.39.1628729284910;
Wed, 11 Aug 2021 17:48:04 -0700 (PDT)
X-Received: by 2002:a05:6830:31a4:: with SMTP id q4mr1305824ots.82.1628729284657;
Wed, 11 Aug 2021 17:48:04 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 11 Aug 2021 17:48:04 -0700 (PDT)
In-Reply-To: <sf1qch$67i$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sf139h$p9h$1@dont-email.me> <aedb4b11-ea1d-4e9f-b9d6-75fc1d16bba7n@googlegroups.com>
<8f6cf457-9a02-422a-a32c-998eb291a211n@googlegroups.com> <ngYQI.12129$CgPc.4376@fx01.iad>
<5dec961e-1be7-498a-9c75-0e49fa2103c7n@googlegroups.com> <sf1qch$67i$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0448cd06-23f1-4a5d-b3a8-52d50cd3b220n@googlegroups.com>
Subject: Re: Mercurial cores
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 12 Aug 2021 00:48:04 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Thu, 12 Aug 2021 00:48 UTC

On Wednesday, August 11, 2021 at 7:35:32 PM UTC-5, Ivan Godard wrote:
> On 8/11/2021 4:10 PM, MitchAlsup wrote:
> > On Wednesday, August 11, 2021 at 5:35:02 PM UTC-5, EricP wrote:
> >> MitchAlsup wrote:
> >>> On Wednesday, August 11, 2021 at 2:57:20 PM UTC-5, MitchAlsup wrote:
> >>>> On Wednesday, August 11, 2021 at 1:01:23 PM UTC-5, Stephen Fuld wrote:
> >>>>> I am not sure what to make of this, but as a software guy, I expect the
> >>>>> hardware to always do the right thing or tell me it didn't.
> >>>>>
> >>>>> https://www.theregister.com/2021/06/04/google_chip_flaws/
> >>>>>
> >>>>> If you don't like The Register, the article contains a link to the PDF
> >>>>> of the ACM paper.
> >>>>>
> >>>>> So, is the phenomenon real? (I assume Google is seeing what the are
> >>>>> seeing). Are they right that it will get worse as the technology
> >>>>> improves. As I said above, I don't like software solutions to this
> >>>>> problem, but what is the right hardware solution? Core duplication and
> >>>>> checking? Back to internal parity checks? Other solutions?
> >>>> <
> >>> <
> >>> After reading the PDF and thinking about it for an hour::
> >>> <
> >>> This reads a lot like SIMD considered bad. For example, they change a
> >>> library (probably from straight x86 to one using the vector instructions
> >>> :: SSE[k]) and all of a sudden, various corruptions start happening.
> >>> <
> >>> They describe a CPU like a set of common registers surrounded by an
> >>> amalgam of accelerators--these accelerators are not the stuff x86
> >>> had originally.
> >>> <
> >>> It sounds a bit like the Pentium FDIV bug and a bit like -- "if it hurts
> >>> QUIT doing it."
> > <
> >> I read it as a plea for more automatic internal self checks
> >> so they can detect the intermittent problems when they occur.
> >> Equivalent to memory ECC but for logic.
> > <
> > Expensive. Consider, for the moment, how much logic would it take to
> > verify a multiplier gave you the correct bit patterns ?
> > <
> > Store {registers, caches, SRAM, DRAM} are all easily protected using various
> > kinds of error detecting and correcting codes. Buses and wires that simply
> > move bits around can utilize the same.
> > <
> > Verifying arithmetic is hard, verifying control units is even harder. I once
> > worked on the design of a machine which could deliver all of the correct
> > bit patterns to all the correct locations, taking the correct number of
> > cycles, and having done all this in the WRONG sequence order !!
> >>
> >> If they can't determine which cpu's are hurting and when,
> >> they can't quit doing it.
> > <
> > They can take all of the suspected units and turn the clock multiplier down
> > by 1 or 2 units of multiplication. That is what I meant by quit doing that.
> >
> At a certain point voting redundancy is cheaper.
<
DIV 3 is cheaper than -5%

Re: Mercurial cores

<2c9e4d3f-fbc4-4bd5-aa5b-18e2f12588a6n@googlegroups.com>

 by: Quadibloc - Thu, 12 Aug 2021 05:37 UTC

On Wednesday, August 11, 2021 at 12:01:23 PM UTC-6, Stephen Fuld wrote:

> So, is the phenomenon real?

Well, since chips today are being made on ever-finer process nodes, it
makes sense that in addition to yields being affected - which, of course,
the foundries work on until chips can be economically produced - the
chance of other, subtler problems that are harder to detect also
increases.

IBM designed their mainframes with a lot of internal arithmetic
checking to prevent part failures from leading to erroneous results,
as these would be unacceptable to commercial data processing
customers. So techniques are known for dealing with this - the
kind that IBM used, and more exotic techniques used in spacecraft
like triple redundancy.
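The core of such a triple-redundancy scheme is a bitwise 2-of-3 majority
voter, which is cheap to sketch (a minimal C illustration of the idea, not
any vendor's actual design; `vote3` is a made-up name):

```c
#include <stdint.h>

/* Triple modular redundancy: run the computation three times and
   take a bitwise 2-of-3 majority vote of the results.  Each output
   bit is whatever at least two of the three inputs agree on, so a
   fault in any single copy is masked; two matching simultaneous
   faults are not. */
static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}
```

In hardware the voter is one gate level per bit; the real cost is running
the computation three times, not the vote.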

When I saw the title "mercurial cores", though, right away it made
me think of "Jellicle songs for Jellicle cats"!

John Savard

Re: Mercurial cores

<2021Aug12.153153@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Thu, 12 Aug 2021 13:31 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>I am not sure what to make of this, but as a software guy, I expect the
>hardware to always do the right thing or tell me it didn't.
>
>https://www.theregister.com/2021/06/04/google_chip_flaws/
>
>If you don't like The Register, the article contains a link to the PDF
>of the ACM paper.

Is it in any way forbidden to post that link here?

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf

There, I did it.

>So, is the phenomenon real? (I assume Google is seeing what they are
>seeing).

I have had two personal machines (IIRC in 1993 and 2003) that
corrupted data rarely enough that the machines did not crash. On the
job a few years ago a Northwood-based machine started acting up after
~15 years in service, but did not crash outright, so our technician at
first did not suspect a hardware failure; I did, because we had not
changed anything in the software, and eventually the technician agreed
with me.

>Are they right that it will get worse as the technology
>improves.

What kind of improvement would that be?

>As I said above, I don't like software solutions to this
>problem, but what is the right hardware solution? Core duplication and
>checking?

We already have core duplication; checking the stuff that, say, goes
outside the per-core state (i.e., outside L2 in current CPUs) should
be achievable (although probably not easy, or CPU manufacturers would
already put it in their server CPUs).

>Back to internal parity checks?

I guess that current Intel and AMD CPUs have a lot of that already.
Overclockers report that overclocking these CPUs with ordinary (not
LN2) cooling and within their usual power limits produces hardly any
improvement, so the CPU manufacturers manage to run these CPUs pretty
close to their limits. I doubt that they ran these CPUs for hours and
hours on the expensive testers to determine which core works at which
clock frequency. Instead, I guess that the CPUs do much of
the testing themselves, and repeat a part of it on startup or maybe
now and then in SMM to account for degradation. And maybe they also
have something like parity checks during production work.

Maybe what Google is seeing are some cases where the startup testing
is not good enough. In that case, the problem may be solvable with
additional test hardware and test vectors.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Mercurial cores

<2021Aug12.161034@mips.complang.tuwien.ac.at>

 by: Anton Ertl - Thu, 12 Aug 2021 14:10 UTC

MitchAlsup <MitchAlsup@aol.com> writes:
>Expensive. Consider, for the moment, how much logic would it take to
>verify a multiplier gave you the correct bit patterns ?

Compute the hexadecimal digital root (repeated digital sum) dr of the
operands a and b, and of the result p. Then a correct product satisfies

dr(p)=dr(dr(a)*dr(b))

This should be pretty easy to check, and you can use other bases than
hex to vary cost and coverage. You can also use the digital root for
checking addition and subtraction.
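The check can be sketched in C (a minimal software illustration, assuming
32-bit operands and a 64-bit product; `dr16` and `mul_check` are made-up
names, and a hardware unit would compute the residues with small mod-15
trees in parallel with the multiply):

```c
#include <stdint.h>

/* Hexadecimal digital root: repeatedly sum the base-16 digits.
   This preserves the value mod 15 (nonzero multiples of 15 map
   to 15) -- the hex analogue of casting out nines. */
static unsigned dr16(uint64_t x)
{
    while (x > 15) {
        uint64_t s = 0;
        while (x) { s += x & 15; x >>= 4; }
        x = s;
    }
    return (unsigned)x;
}

/* Residue check for a multiplier: a correct product p = a*b must
   satisfy dr(p) == dr(dr(a)*dr(b)).  A mismatch proves a fault;
   a match is only probabilistic, since roughly 1 error in 15
   lands on the same residue and escapes. */
static int mul_check(uint32_t a, uint32_t b, uint64_t p)
{
    return dr16(p) == dr16((uint64_t)dr16(a) * dr16(b));
}
```

Using a larger base than 16 shrinks the escape probability at the cost of
a wider residue adder, which is the cost/coverage trade-off mentioned above.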

>Verifying arithmetic is hard,

For data in GPRs, if computation is wrong now and then, it will affect
address computation after a while, and that will show up as
exceptions. For data in FPRs and SIMD registers, errors tend to show
up only as wrong results (witness how long it took to notice the
Pentium FDIV bug). So for that we certainly want extra checking.

> verifying control units is even harder.

But then, control units have an effect through the functional units
they control. So, again, when it's a functional unit that deals with
addresses and control flow, errors tend to result in exceptions. When
it's a functional unit that deals with FP or SIMD data, errors tend to
result in silent wrong results; and when the wrong control input also
goes to the checking circuit, even that does not help; so the checking
circuit should have its own control signal circuitry as far as
possible.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Mercurial cores

<sf3eoj$9pu$1@dont-email.me>

 by: BGB - Thu, 12 Aug 2021 15:29 UTC

On 8/12/2021 8:31 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
> [...]
>
> Maybe what Google is seeing are some cases where the startup testing
> is not good enough. In that case, the problem may be solvable with
> additional test hardware and test vectors.
>

I had a Phenom II which worked fine for years at the stock speed (though
did run a little hot), but at one point started having problems so I
ended up needing to underclock it for stability.

Then, later, I ran an AMD FX for which I also ended up needing to
disable the turbo feature and also underclock it slightly (running it at
3.8 rather than 4.2). Partly this was because in my tests, the turbo
on-average actually made performance worse under load, and it still ran
pretty hot.

At least for the Ryzen I have, I am able to run it at stock speeds...

Though it does still have some bizarre scheduling behavior whose cause
I have not been able to figure out:
Namely, by default, any given process seems only able to use 2 "logical
processors" on a single core; though Cinebench somehow sidesteps this
and uses all the cores.

Though, I guess it does sort of work out OK as I am frequently running 4
or so Verilog simulations at the same time in the background.

....

Re: Mercurial cores

<sf3q6o$1lru$1@gioia.aioe.org>

 by: Terje Mathisen - Thu, 12 Aug 2021 18:44 UTC

BGB wrote:
> On 8/12/2021 8:31 AM, Anton Ertl wrote:
>> [...]
>
> I had a Phenom II which worked fine for years at the stock speed (though
> did run a little hot), but at one point started having problems so I
> ended up needing to underclock it for stability.
>
>
> Then, later, I ran an AMD FX for which I also ended up needing to
> disable the turbo feature and also underclock it slightly (running it at
> 3.8 rather than 4.2). Partly this was because in my tests, the turbo
> on-average actually made performance worse under load, and it still ran
> pretty hot.
>
>
> At least for the Ryzen I have, I am able to run it at stock speeds...
>
> Though it does still have some bizarre scheduling behavior that I have
> not been able to figure out the cause:
> Namely, by default, any given process seems only able to use 2 "logical
> processors" on a single core; though Cinebench somehow sidesteps this
> and uses all the cores.
>
> Though, I guess it does sort of work out OK as I am frequently running 4
> or so Verilog simulations at the same time in the background.

I have seen the same with at least two laptops which have been used for
batch processing of ~25 TB of lidar data: I had to underclock them to
avoid crashes or sudden shutdowns, with no warnings or crash dumps
(probably caused by vector ops on all cores increasing the core
temperature too quickly?).

No problems after that 20% reduction in max speed, and only a much
smaller throughput degradation.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Mercurial cores

<sf3sho$gb5$1@dont-email.me>

 by: BGB - Thu, 12 Aug 2021 19:24 UTC

On 8/12/2021 1:44 PM, Terje Mathisen wrote:
> BGB wrote:
>> [...]
>
> I have seen the same with at least two laptops which have been used for
> batch processing of ~25 TB of lidar data: I had to underclock them to
> avoid crashes or sudden shutdowns, with no warnings or crash dumps
> (probably caused by vector ops on all cores increasing the core
> temperature too quickly?).
>
> No problems after that 20% reduction in max speed, and only a much
> smaller throughput degradation.
>

Yeah, pretty much.

The Phenom II became unstable at the base clock and needed to be
underclocked to be moderately stable.

Mostly it was thermal issues with the AMD FX (Piledriver). In its stock
settings, it would only run for short bursts at the turbo frequency, and
then throttle down really bad.

Say, it would jump to 4.3 GHz briefly, overheat, drop to 1.3 GHz, stay
there for a while, jump back to 4.3 GHz briefly, ...

Disabling turbo reduced this issue, but it still ran pretty hot and was
prone to throttling or crashing under load.

Dropping it down to 3.8 (with turbo disabled) caused a notable reduction
in temperature with minimal impact on performance (and I could run it at
close to full load without it going into thermal throttle or crashing).

Granted, I was using an air-cooled setup. Maybe people had better
results with water cooling, but alas.

When I later got a Ryzen, it has a lower Base and Turbo frequency than
the FX had, but still gives better performance, and runs a lot cooler
under load.

Not sure how much was due to efficiency, lower clock speeds, or that the
Ryzen came with a comparatively larger stock heatsink (~ 5x5x4 inches).

But, yeah, possibly it is a somewhat different workload between a "Leet
Gamer" and "someone who leaves their PC at 40-70% CPU load for extended
periods while running Verilog simulations and genetic-algorithms, and
similar...".

Re: Mercurial cores

<080fcc84-7327-4ea0-ac08-8d4ea4339587n@googlegroups.com>

 by: MitchAlsup - Thu, 12 Aug 2021 21:02 UTC

On Thursday, August 12, 2021 at 1:44:46 PM UTC-5, Terje Mathisen wrote:
> BGB wrote:

> I have seen the same with at least two laptops which have been used for
> batch processing of ~25 TB of lidar data: I had to underclock them to
> avoid crashes or sudden shutdowns, with no warnings or crash dumps
> (probably caused by vector ops on all cores increasing the core
> temperature too quickly?).
>
> No problems after that 20% reduction in max speed, and only a much
> smaller throughput degradation.
<
Wouldn't it be nice if more applications were like games ? In most games
a poor calculation only results in a single pixel having the wrong shade of
{RGB} on the screen, which vanishes the next frame--the phenomenon is
called "shimmer".
<
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Mercurial cores

<a50db3d8-6ed0-4024-ae88-8980cd012986n@googlegroups.com>

 by: Quadibloc - Thu, 12 Aug 2021 23:39 UTC

On Thursday, August 12, 2021 at 3:02:02 PM UTC-6, MitchAlsup wrote:

> Wouldn't it be nice if more applications were like games ? In most games
> a poor calculation only results in a single pixel having the wrong shade of
> {RGB} on the screen, which vanishes the next frame--the phenomenon is
> called "shimmer".

Well, given the increasing importance of AI based on matrices of low-precision
floats, you may get your wish.

But a lot of applications are not anything like games, and demand perfect
accuracy from trillions of calculations. As the technology to provide this is
known, the problem is the intensity of the competition to provide the highest
possible performance at the lowest possible cost...

which leads manufacturers to claim to provide more performance than is
possible at a lower price than is possible, at least with the constraint of
error-free computation.

Since many people only really drive the computational capabilities of their
computers heavily when playing games... perhaps there's room here for
some sort of dual-mode operation.

My initial reaction to that sentence, though, was of the order of "dream
on". Unless _all_ applications are "like games" in the sense you outline,
a computer has to be capable of error-free operation. I don't see that there's
really an escape from that.

John Savard

Re: Mercurial cores

<ec3a43e4-5c8e-40a2-bca0-ffa3adc07f41n@googlegroups.com>

 by: MitchAlsup - Fri, 13 Aug 2021 00:08 UTC

On Thursday, August 12, 2021 at 6:39:42 PM UTC-5, Quadibloc wrote:
> On Thursday, August 12, 2021 at 3:02:02 PM UTC-6, MitchAlsup wrote:
>
> > Wouldn't it be nice if more applications were like games ? In most games
> > a poor calculation only results in a single pixel having the wrong shade of
> > {RGB} on the screen and vanishes the next frame--the phenomenon is
> > called "shimmer"
> Well, given the increasing importance of AI based on matrices of low-precision
> floats, you may get your wish.
>
> But a lot of applications are not anything like games, and demand perfect
> accuracy from trillions of calculations. As the technology to provide this is
> known, the problem is the intensity of the competition to provide the highest
> possible performance at the lowest possible cost...
<
There is the old adage:: Good, fast, cheap -- choose any 2.
<
So, for the most part they throw out good in order to get fast and cheap.
>
> which leads manufacturers to claim to provide more performance than is
> possible at a lower price than is possible, at least with the constraint of
> error-free computation.
<
Reminds me of 1950s automobiles and the public resistance to seat belts.
>
> Since many people only really drive the computational capabilities of their
> computers heavily when playing games... perhaps there's room here for
> some sort of dual-mode operation.
>
> My initial reaction to that sentence, though, was of the order of "dream
> on". Unless _all_ applications are "like games" in the sense you outline,
> a computer has to be capable of error-free operation. I don't see that there's
> really an escape from that.
>
> John Savard

Re: Mercurial cores

<sf4m9d$n4r$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=19765&group=comp.arch#19765

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Mercurial cores
Date: Thu, 12 Aug 2021 19:43:56 -0700
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <sf4m9d$n4r$1@dont-email.me>
References: <sf139h$p9h$1@dont-email.me>
<2021Aug12.153153@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 13 Aug 2021 02:43:57 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="23384bce6dc1dec31cc384e274f283ca";
logging-data="23707"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ZT9G1txYlQhCbhLFGv3NfwadhFmhlErI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:CSYsqCITDtdXPbST3kVzORwEQGQ=
In-Reply-To: <2021Aug12.153153@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: Stephen Fuld - Fri, 13 Aug 2021 02:43 UTC

On 8/12/2021 6:31 AM, Anton Ertl wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>> I am not sure what to make of this, but as a software guy, I expect the
>> hardware to always do the right thing or tell me it didn't.
>>
>> https://www.theregister.com/2021/06/04/google_chip_flaws/
>>
>> If you don't like The Register, the article contains a link to the PDF
>> of the ACM paper.
>
> Is it in any way forbidden to post that link here?
>
> https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf
>
> There, I did it.

No reason not to post the link. I first came upon the story in The
Register through my Apple news feed, so that is the link I used. On
behalf of those who wanted the full story right away, thanks for posting
the link.

>
>> So, is the phenomenon real? (I assume Google is seeing what they are
>> seeing).
>
> I have had two personal machines (IIRC in 1993 and 2003) that
> corrupted data rarely enough that the machines did not crash. On the
> job a few years ago a Northwood-based machine started acting up after
> ~15 years in service, but did not crash outright, so our technician at
> first did not suspect a hardware failure; I did, because we had not
> changed anything in the software, and eventually the technician agreed
> with me.
>
>> Are they right that it will get worse as the technology
>> improves.
>
> What kind of improvement would that be?

They indicate things like smaller lithography.

>
>> As I said above, I don't like software solutions to this
>> problem, but what is the right hardware solution? Core duplication and
>> checking?
>
> We already have core duplication, doing checking of the stuff that,
> say, goes outside the per-core state (i.e., outside L2 in current
> CPUs) should be achievable (although probably not easy, or CPU
> manufacturers would put it in their server CPUs already).

Presumably it also increases cost, as it halves the number of cores per
chip that can do useful work.
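The checking Anton describes amounts to dual modular redundancy: do the work twice and compare before committing the result. A minimal software sketch of the idea (the function names here are illustrative, not any real CPU or library interface):

```python
# Dual modular redundancy, sketched in software: run the same
# computation twice and compare the results before trusting them.
# 'compute' stands in for any deterministic function; a mismatch
# signals that one of the two runs was corrupted.

def checked(compute, *args):
    a = compute(*args)
    b = compute(*args)
    if a != b:
        raise RuntimeError("lockstep mismatch: %r vs %r" % (a, b))
    return a

# A deterministic computation passes the check unchanged.
result = checked(lambda x: x * x + 1, 12)   # 145
```

In hardware the two runs happen on paired cores in lockstep, which is exactly where the halving of useful cores comes from.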

>> Back to internal parity checks?
>
> I guess that current Intel and AMD CPUs have a lot of that already.

Certainly someone like Mitch would be better able to answer that, but my
impression was that they didn't.
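For concreteness, the internal parity check in question is cheap to state: one extra bit per word makes the count of set bits even, so any single-bit upset is detected (though not corrected). A toy sketch, not how any particular CPU implements it:

```python
# Even parity over a 32-bit word: the stored parity bit makes the
# total number of set bits even, so one flipped bit is detectable.

def parity_bit(word):
    return bin(word & 0xFFFFFFFF).count("1") & 1

stored = 0xDEADBEEF
p = parity_bit(stored)

# On read-back, recomputed parity must match the stored bit...
assert parity_bit(stored) == p

# ...and a single-bit upset changes the parity, so it is caught.
corrupted = stored ^ (1 << 7)
assert parity_bit(corrupted) != p
```

Correcting, rather than merely detecting, the error takes a SECDED code such as Hamming plus an overall parity bit.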

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: everything old is new again, or Mercurial cores

<sf4n7m$2ffm$1@gal.iecc.com>


https://www.novabbs.com/devel/article-flat.php?id=19766&group=comp.arch#19766

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.cmpublishers.com!adore2!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: everything old is new again, or Mercurial cores
Date: Fri, 13 Aug 2021 03:00:06 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <sf4n7m$2ffm$1@gal.iecc.com>
References: <sf139h$p9h$1@dont-email.me>
Injection-Date: Fri, 13 Aug 2021 03:00:06 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="81398"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <sf139h$p9h$1@dont-email.me>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Fri, 13 Aug 2021 03:00 UTC

According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>So, is the phenomenon real? (I assume Google is seeing what they are seeing).

Back in the 1940s and 1950s computers were, by our standards, extremely unreliable.

In 1948 the IBM SSEC did everything twice in parallel and compared.
Programmers had to hang around while their programs ran to recover
manually when the compares failed, or other things failed, and get it
restarted.

The IAS machine in Princeton sort of started working in 1951 but it
required huge effort and another year (and as it turned out huge
cooling) to get it to work for more than a few seconds at a time.

The IBM 704 was more reliable than its predecessors, partly because they had learned
more about making reliable vacuum tubes, and mostly because it used core memory rather
than flaky Williams tubes or Selectrons. Nonetheless, I hear that the maximum practical
size for a Fortran program was limited not by the size of memory but by how long the
compiler could run before the hardware flaked and it crashed.

The details of the mercurial failures aren't the same as the failures of vacuum tubes,
of course, but they share the fundamental problem of trying to build digital systems
out of fundamentally analog components.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: everything old is new again, or Mercurial cores

<sf4o05$usn$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=19767&group=comp.arch#19767

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: everything old is new again, or Mercurial cores
Date: Thu, 12 Aug 2021 20:13:09 -0700
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <sf4o05$usn$1@dont-email.me>
References: <sf139h$p9h$1@dont-email.me> <sf4n7m$2ffm$1@gal.iecc.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 13 Aug 2021 03:13:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="a8cd4a1f7d9781973b00cf666e1db0d9";
logging-data="31639"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18q99hgLHfP3Yp89g1J4ltR"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.12.0
Cancel-Lock: sha1:6lhSfG3npIGlU+U4i+ENjP64fTc=
In-Reply-To: <sf4n7m$2ffm$1@gal.iecc.com>
Content-Language: en-US
 by: Ivan Godard - Fri, 13 Aug 2021 03:13 UTC

On 8/12/2021 8:00 PM, John Levine wrote:
> According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>> So, is the phenomenon real? (I assume Google is seeing what they are seeing).
>
> Back in the 1940s and 1950s computers were, by our standards, extremely unreliable.
>
> In 1948 the IBM SSEC did everything twice in parallel and compared.
> Programmers had to hang around while their programs ran to recover
> manually when the compares failed, or other things failed, and get it
> restarted.
>
> The IAS machine in Princeton sort of started working in 1951 but it
> required huge effort and another year (and as it turned out huge
> cooling) to get it to work for more than a few seconds at a time.
>
> The IBM 704 was more reliable than its predecessors, partly because they had learned
> more about making reliable vacuum tubes, and mostly because it used core memory rather
> than flaky Williams tubes or Selectrons. Nonetheless, I hear that the maximum practical
> size for a Fortran program was limited not by the size of memory but by how long the
> compiler could run before the hardware flaked and it crashed.
>
> The details of the mercurial failures aren't the same as the failures of vacuum tubes,
> of course, but they share the fundamental problem of trying to build digital systems
> out of fundamentally analog components.
>

When I was writing some of the system software for the B6500, using the
System 101 hardware prototype, the others on the software team saw an
MTTF of ~five minutes. For me, however, the 101 would stay up for ~half
an hour. I am convinced my success came because, when it finished a
compile, I would thank it and pat the console.

Re: Mercurial cores

<2021Aug13.094325@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=19771&group=comp.arch#19771

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Mercurial cores
Date: Fri, 13 Aug 2021 07:43:25 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 22
Message-ID: <2021Aug13.094325@mips.complang.tuwien.ac.at>
References: <sf139h$p9h$1@dont-email.me> <2021Aug12.153153@mips.complang.tuwien.ac.at> <sf4m9d$n4r$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="0fd349d651368433610776fd56ad7a3b";
logging-data="6064"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/cAoobre++FGoO0AR5Ln5u"
Cancel-Lock: sha1:+3nvn/Qsh1N8JPOG+pv6p5DAr8o=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Fri, 13 Aug 2021 07:43 UTC

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
>On 8/12/2021 6:31 AM, Anton Ertl wrote:
>> We already have core duplication, doing checking of the stuff that,
>> say, goes outside the per-core state (i.e., outside L2 in current
>> CPUs) should be achievable (although probably not easy, or CPU
>> manufacturers would put it in their server CPUs already).
>
>Presumably it also increases cost, as it halves the number of cores per
>chip that can do useful work.

Only if you run the cores in the checking mode. For the manufacturers
that would be just an optional feature that they offer to their
customers. You might consider that the customers may not want to
double the number of cores they buy/rent in order to get this
checking, but I think that there are customers that would use that
option. After all, Tandem was a successful business for a while, and
I think their later fate was not due to the customer base vanishing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Mercurial cores

<sf5bjl$2f4$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=19774&group=comp.arch#19774

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!pIhVuqI7njB9TMV+aIPpbg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Mercurial cores
Date: Fri, 13 Aug 2021 10:47:48 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sf5bjl$2f4$1@gioia.aioe.org>
References: <sf139h$p9h$1@dont-email.me>
<2021Aug12.153153@mips.complang.tuwien.ac.at> <sf3eoj$9pu$1@dont-email.me>
<sf3q6o$1lru$1@gioia.aioe.org>
<080fcc84-7327-4ea0-ac08-8d4ea4339587n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="2532"; posting-host="pIhVuqI7njB9TMV+aIPpbg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.8.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Fri, 13 Aug 2021 08:47 UTC

MitchAlsup wrote:
> On Thursday, August 12, 2021 at 1:44:46 PM UTC-5, Terje Mathisen wrote:
>> BGB wrote:
>
>> I have seen the same with at least two laptops which have been used for
>> batch processing of ~25 TB of lidar data: I had to underclock them to
>> avoid crashes or sudden shutdowns (probably caused by vector ops on all
>> cores that increased the core temperature too quickly?) with no warnings
>> or crash dumps.
>>
>> No problems after that 20% reduction in max speed, and only a much
>> smaller throughput degradation.
> <
> Wouldn't it be nice if more applications were like games ? In most games
> a poor calculation only results in a single pixel having the wrong shade of
> {RGB} on the screen and vanishes the next frame--the phenomenon is
> called "shimmer"

I've told the story about how id Software (Mike Abrash/John Carmack/etc.)
spent a couple of weeks chasing exactly such a shimmer; it was Mike's
daughter, with fresh young eyes, who saw the glitch. They finally found
out that their white-box PC vendor had overclocked a 90 MHz CPU to 100 MHz
("because testing has shown it to be good").
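The scale of such a glitch is easy to put numbers on: flip one low-order bit in one 8-bit channel of an RGB pixel and the shade is off by a nearly invisible amount for a single frame. A small illustration (the pixel values are made up):

```python
# One-frame "shimmer": a single-bit upset in one 8-bit colour
# channel nudges the shade slightly; the next frame recomputes
# the pixel and the glitch vanishes.

pixel = (200, 180, 90)        # (R, G, B) as computed correctly
upset = 1 << 1                # a low-order bit flips in red
glitched = (pixel[0] ^ upset, pixel[1], pixel[2])

delta = abs(glitched[0] - pixel[0])   # shade is off by 2 of 255
```

Fresh young eyes, or a slow frame rate, is about the only way such a delta gets noticed.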

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: everything old is new again, or Mercurial cores

<sf5cbc$cre$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=19776&group=comp.arch#19776

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!pIhVuqI7njB9TMV+aIPpbg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: everything old is new again, or Mercurial cores
Date: Fri, 13 Aug 2021 11:00:26 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sf5cbc$cre$1@gioia.aioe.org>
References: <sf139h$p9h$1@dont-email.me> <sf4n7m$2ffm$1@gal.iecc.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="13166"; posting-host="pIhVuqI7njB9TMV+aIPpbg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.8.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Fri, 13 Aug 2021 09:00 UTC

John Levine wrote:
> According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
>> So, is the phenomenon real? (I assume Google is seeing what they are seeing).
>
> Back in the 1940s and 1950s computers were, by our standards, extremely unreliable.

Even in the seventies mainframes competed on how fast they could
snapshot, crash, reboot the OS, reload the snapshot, and recover in order
to handle faults.
:-)

I think they got that process down to a few seconds?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: everything old is new again, or Mercurial cores

<e936d2a4-c42b-4b8b-914a-a80afa0d7f23n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=19779&group=comp.arch#19779

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:2101:: with SMTP id l1mr1461433qkl.104.1628852529246;
Fri, 13 Aug 2021 04:02:09 -0700 (PDT)
X-Received: by 2002:aca:2117:: with SMTP id 23mr1668965oiz.0.1628852529071;
Fri, 13 Aug 2021 04:02:09 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.mixmin.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 13 Aug 2021 04:02:08 -0700 (PDT)
In-Reply-To: <sf5cbc$cre$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=87.68.183.224; posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 87.68.183.224
References: <sf139h$p9h$1@dont-email.me> <sf4n7m$2ffm$1@gal.iecc.com> <sf5cbc$cre$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <e936d2a4-c42b-4b8b-914a-a80afa0d7f23n@googlegroups.com>
Subject: Re: everything old is new again, or Mercurial cores
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 13 Aug 2021 11:02:09 +0000
Content-Type: text/plain; charset="UTF-8"
 by: Michael S - Fri, 13 Aug 2021 11:02 UTC

On Friday, August 13, 2021 at 12:00:30 PM UTC+3, Terje Mathisen wrote:
> John Levine wrote:
> > According to Stephen Fuld <sf...@alumni.cmu.edu.invalid>:
> >> So, is the phenomenon real? (I assume Google is seeing what they are seeing).
> >
> > Back in the 1940s and 1950s computers were, by our standards, extremely unreliable.
> Even in the seventies mainframes competed on how fast they could
> snapshot, crash, reboot the OS, reload the snapshot and recover in order
> to handle faults.
> :-)
>
> I think they got that process down to a few seconds?
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

My impression was that in the second half of the 70s IBM started to take reliability
seriously and very quickly achieved huge progress on this front.
The same happened to [mainframe] security a few years later.
But I was not around back then.
