Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<memo.20220817141407.11400O@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=27339&group=comp.arch#27339

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)
Date: Wed, 17 Aug 2022 14:14 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <memo.20220817141407.11400O@jgd.cix.co.uk>
References: <2022Aug17.101938@mips.complang.tuwien.ac.at>
Reply-To: jgd@cix.co.uk
Injection-Info: reader01.eternal-september.org; posting-host="49ec787dd6559b0dde75bb4065754a7a";
logging-data="490102"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/KcVf2HEMu54HF5XsnXmaUeB4jU/H+6CI="
Cancel-Lock: sha1:NoQF7NGlLbsljDjSxxHYnQKEGEw=
 by: John Dallman - Wed, 17 Aug 2022 13:14 UTC

In article <2022Aug17.101938@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> And then you have Intel marketing, which disables AVX512 in SKUs for
> chips that actually support AVX512 (and likewise disable AVX in some
> SKUs), ensuring that most programmers will steer clear of AVX for
> many years to come and even longer for AVX512.

There are parts of Intel that don't seem to understand the difference
between firmware and applications. They behave as if machines are sold
with all their software already on them, and run that load for their
lives. This is not how the application software market actually works.

If you're an application software producer, doing an extra version for an
instruction set extension needs to give the end-user something big for it
to be cost-effective. I've experimented with AVX, and it does very little
for the software I work on. I understand why, and AVX-512 isn't going to
do any better. Improving performance at the algorithm level works better
and applies to all platforms.

John

Re: Power cost of IEEE754

<tdiql4$96b$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=27340&group=comp.arch#27340

Path: i2pn2.org!i2pn.org!aioe.org!y0sttPrO1OAcON/g+jAtOw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Wed, 17 Aug 2022 15:32:27 +0200
Organization: Aioe.org NNTP Server
Message-ID: <tdiql4$96b$1@gioia.aioe.org>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="9419"; posting-host="y0sttPrO1OAcON/g+jAtOw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.13
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Wed, 17 Aug 2022 13:32 UTC

Stefan Monnier wrote:
>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
>> with IEEE754 FP accuracy will punish you with a 400% power/area
>> penalty compared to modern 3D-optimised GPUs which, as explicitly
>
> What is the origin of this extra cost? IOW, where's the meat of the
> savings (I mean: which part of the computation of the last few bits
> costs so much)?
> Does the saving vary significantly between instructions?

I think that is pretty much bogus. Mitch has shown repeatedly that
having FMAC makes subnormal handling nearly free, as in zero cycles and
single-digit percentage gates/power. 400% is simply rubbish.
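
For readers who have not met subnormals before, here is a minimal sketch of what is being argued about. It is only a software model: `flush_to_zero` is a hypothetical helper that imitates an FTZ/DAZ unit, not a real hardware control, and Binary64 is used simply because that is what Python floats are.

```python
import math
import sys

# Smallest positive *normal* Binary64 value (2**-1022).
MIN_NORMAL = sys.float_info.min

def flush_to_zero(x: float) -> float:
    """Hypothetical model of FTZ/DAZ: any subnormal becomes a signed zero."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

a = 1.5 * MIN_NORMAL
b = 1.0 * MIN_NORMAL

# With gradual underflow (IEEE 754), a != b implies a - b != 0.
ieee_diff = a - b                    # 2**-1023, a subnormal

# With flush-to-zero, the same subtraction collapses to zero.
ftz_diff = flush_to_zero(ieee_diff)

print(ieee_diff > 0.0, ftz_diff == 0.0)
```

The point of contention in the thread is how much hardware it costs to keep the first behaviour rather than the second, not whether the two behaviours differ.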

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<tdiqmd$52v$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=27341&group=comp.arch#27341

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-1faa-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)
Date: Wed, 17 Aug 2022 13:33:01 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <tdiqmd$52v$2@newsreader4.netcologne.de>
References: <2022Aug17.101938@mips.complang.tuwien.ac.at>
<memo.20220817141407.11400O@jgd.cix.co.uk>
Injection-Date: Wed, 17 Aug 2022 13:33:01 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-1faa-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:1faa:0:7285:c2ff:fe6c:992d";
logging-data="5215"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Wed, 17 Aug 2022 13:33 UTC

John Dallman <jgd@cix.co.uk> schrieb:
> In article <2022Aug17.101938@mips.complang.tuwien.ac.at>,
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> And then you have Intel marketing, which disables AVX512 in SKUs for
>> chips that actually support AVX512 (and likewise disable AVX in some
>> SKUs), ensuring that most programmers will steer clear of AVX for
>> many years to come and even longer for AVX512.
>
> There are parts of Intel that don't seem to understand the difference
> between firmware and applications. They behave as if machines are sold
> with all their software already on them, and run that load for their
> lives. This is not how the application software market actually works.

Either that, or self-compiled software.

The latter makes sense for a lot of people in scientific computing,
but those are probably not so frequent.

But even there, SIMD is only useful for some fields.

>
> If you're an application software producer, doing an extra version for an
> instruction set extension needs to give the end-user something big for it
> to be cost-effective. I've experimented with AVX, and it does very little
> for the software I work on. I understand why, and AVX-512 isn't going to
> do any better.

Which field is that, can you tell?

> Improving performance at the algorithm level works better
> and applies to all platforms.

That is always a given, but sometimes you would like to do both :-)

Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<tdiqth$f2i5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27342&group=comp.arch#27342

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)
Date: Wed, 17 Aug 2022 15:36:47 +0200
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <tdiqth$f2i5$1@dont-email.me>
References: <2022Aug17.101938@mips.complang.tuwien.ac.at>
<memo.20220817141407.11400O@jgd.cix.co.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 17 Aug 2022 13:36:49 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9955c720f573f08c5c170ea882a73e45";
logging-data="494149"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX183jZ0525ur1c8TKW8jq9Q3xV38AVVFJYU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:ELv/D6qnWEay9NKkv0uvHzz2Nq8=
Content-Language: en-US
In-Reply-To: <memo.20220817141407.11400O@jgd.cix.co.uk>
 by: Marcus - Wed, 17 Aug 2022 13:36 UTC

On 2022-08-17, John Dallman wrote:
> In article <2022Aug17.101938@mips.complang.tuwien.ac.at>,
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> And then you have Intel marketing, which disables AVX512 in SKUs for
>> chips that actually support AVX512 (and likewise disable AVX in some
>> SKUs), ensuring that most programmers will steer clear of AVX for
>> many years to come and even longer for AVX512.
>
> There are parts of Intel that don't seem to understand the difference
> between firmware and applications. They behave as if machines are sold
> with all their software already on them, and run that load for their
> lives. This is not how the application software market actually works.
>
> If you're an application software producer, doing an extra version for an
> instruction set extension needs to give the end-user something big for it
> to be cost-effective. I've experimented with AVX, and it does very little
> for the software I work on. I understand why, and AVX-512 isn't going to
> do any better. Improving performance at the algorithm level works better
> and applies to all platforms.
>
> John

I haven't played with AVX512 nor bothered to learn it (because, hey,
what's the point when there's no hardware around to run it on anyway),
but my impression is that they made some actual improvements w.r.t.
predication and such, so that it should be easier to vectorize code
than with SSE or regular AVX.

If that is true, I would really like to see narrower implementations of
AVX512, as it could potentially bring some of the benefits of vector
processing (compared to plain-ol-SIMD). E.g. 128-bit wide AVX512 (if
that's doable) would give you immunity against latency hazards up to
four clock cycles.
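
A rough sketch of the two ideas in the paragraphs above, written as a plain Python model rather than real intrinsics. `masked_add` imitates per-lane merge masking (lanes whose mask bit is clear keep the old destination value), and `beats_per_instruction` is the simple width arithmetic behind the "four clock cycles" figure. Both function names are made up for illustration.

```python
def masked_add(src, a, b, mask):
    """Per-lane add with merge masking: lanes whose mask bit is 0 keep src."""
    return [x + y if (mask >> i) & 1 else s
            for i, (s, x, y) in enumerate(zip(src, a, b))]

def beats_per_instruction(vector_bits: int, datapath_bits: int) -> int:
    """Cycles a narrow datapath needs to step through one full-width vector."""
    return vector_bits // datapath_bits

# Only lanes 0 and 2 are enabled; lanes 1 and 3 pass src through unchanged.
result = masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101)
print(result)

# A 512-bit vector on a 128-bit datapath occupies the unit for 4 cycles,
# which is the window in which a dependent instruction's latency can hide.
print(beats_per_instruction(512, 128))
```

With masking, a loop body containing an `if` can be vectorized by computing the condition into a mask instead of branching, which is the kind of improvement over SSE and plain AVX that the post is alluding to.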

/Marcus

Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<memo.20220817153648.11400P@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=27346&group=comp.arch#27346

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)
Date: Wed, 17 Aug 2022 15:36 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 17
Message-ID: <memo.20220817153648.11400P@jgd.cix.co.uk>
References: <tdiqmd$52v$2@newsreader4.netcologne.de>
Reply-To: jgd@cix.co.uk
Injection-Info: reader01.eternal-september.org; posting-host="49ec787dd6559b0dde75bb4065754a7a";
logging-data="505855"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ap2i2xsXgH4tpXB7K2soakJuwbZ/2pFQ="
Cancel-Lock: sha1:ws8pxC9nuQygsJqSOELXYKV46zM=
 by: John Dallman - Wed, 17 Aug 2022 14:36 UTC

In article <tdiqmd$52v$2@newsreader4.netcologne.de>,
tkoenig@netcologne.de (Thomas Koenig) wrote:

> Either that, or self-compiled software.
>
> The latter makes sense for a lot of people in scientific computing,
> but those are probably not so frequent.
>
> But even there, SIMD is only useful for some fields.

True.

> Which field is that, can you tell?

I can speak more freely if I don't.

John

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<5ce1c416-7bb7-4262-9008-8ce9eb9aecdcn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27354&group=comp.arch#27354

X-Received: by 2002:ae9:e903:0:b0:6ba:e5aa:d59e with SMTP id x3-20020ae9e903000000b006bae5aad59emr15823061qkf.214.1660749150790;
Wed, 17 Aug 2022 08:12:30 -0700 (PDT)
X-Received: by 2002:a05:6808:212a:b0:344:3c48:222d with SMTP id
r42-20020a056808212a00b003443c48222dmr1798589oiw.186.1660749150499; Wed, 17
Aug 2022 08:12:30 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 08:12:30 -0700 (PDT)
In-Reply-To: <3d8a28d0-afb7-4650-b59c-34fc8a92a637n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:5879:d7aa:3989:85e9;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:5879:d7aa:3989:85e9
References: <t6gush$p5u$1@dont-email.me> <0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<3d8a28d0-afb7-4650-b59c-34fc8a92a637n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5ce1c416-7bb7-4262-9008-8ce9eb9aecdcn@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 17 Aug 2022 15:12:30 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2883
 by: MitchAlsup - Wed, 17 Aug 2022 15:12 UTC

On Wednesday, August 17, 2022 at 1:35:54 AM UTC-5, Quadibloc wrote:
> On Sunday, July 24, 2022 at 4:28:47 AM UTC-6, luke.l...@gmail.com wrote:
>
> > 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
> > with IEEE754 FP accuracy will punish you with a 400% power/area
> > penalty compared to modern 3D-optimised GPUs which, as explicitly
> > spelled out and allowed in the Vulkan(tm) Spec by the Khronos
> > Group, are permitted significant accuracy reductions. Mitch
> > will (or will have already) filled you in on that.
>
> Well, if your purpose is building a device that is helpful to scientists
> wishing to perform large-scale computations on FP64 numbers, you'll
> just have to pay that penalty. After paying that penalty, though, something
> based on a GPU architecture instead of that of a conventional non-vector
> CPU is *still* vastly more powerful.
<
Those scientists were perfectly happy with CDC6600 quality floating
point. They were the ones on the side of a) "IEEE 754 containers are
fine", b) "leave out all that denorm and NaN stuff".
>
> And we have the NEC Aurora TSUBASA as an example of a Cray-style
> vector ISA. At a significantly higher cost, it was capable of about half
> of the FP64 FLOPS that a video card with high FP64 capabilities made
> on the same technology could provide. But because this architecture is
> more flexible, one could use it for a higher proportion of the floating-point
> calculations in typical scientific programs.
>
> That is a tradeoff that might be well worth it.
>
> John Savard

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<9e7e834e-889e-4a61-a357-ae47061ef766n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27355&group=comp.arch#27355

X-Received: by 2002:a05:620a:f12:b0:6b6:64e9:260c with SMTP id v18-20020a05620a0f1200b006b664e9260cmr18291088qkl.538.1660749229528;
Wed, 17 Aug 2022 08:13:49 -0700 (PDT)
X-Received: by 2002:a05:620a:3705:b0:6b9:6ff:559d with SMTP id
de5-20020a05620a370500b006b906ff559dmr18281671qkb.365.1660749229326; Wed, 17
Aug 2022 08:13:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 08:13:49 -0700 (PDT)
In-Reply-To: <88646c7a-3e1d-406e-b3ec-7171cfd4e235n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:5879:d7aa:3989:85e9;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:5879:d7aa:3989:85e9
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com> <88646c7a-3e1d-406e-b3ec-7171cfd4e235n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9e7e834e-889e-4a61-a357-ae47061ef766n@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 17 Aug 2022 15:13:49 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1837
 by: MitchAlsup - Wed, 17 Aug 2022 15:13 UTC

On Wednesday, August 17, 2022 at 1:39:05 AM UTC-5, Quadibloc wrote:
> On Sunday, July 24, 2022 at 4:41:19 PM UTC-6, luke.l...@gmail.com wrote:
>
> > AVX-512 *shudder*
>
> And here I was wondering why Intel hadn't been rushing to
> put AVX-512 on its consumer CPUs, so that they would
> squash AMD with vastly superior games and graphic
> performance.
<
Running 512-bit AVX goes for a dozen milliseconds before someone
has to throttle back the CPU frequency, due to the generation of heat.
>
> John Savard

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<a4c963e1-2e02-4a14-aafd-2c4f3046b5a3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27356&group=comp.arch#27356

X-Received: by 2002:a05:6214:2aa6:b0:474:844b:24ff with SMTP id js6-20020a0562142aa600b00474844b24ffmr23429153qvb.51.1660749307482;
Wed, 17 Aug 2022 08:15:07 -0700 (PDT)
X-Received: by 2002:a0c:8081:0:b0:496:7822:c55a with SMTP id
1-20020a0c8081000000b004967822c55amr3952922qvb.87.1660749307218; Wed, 17 Aug
2022 08:15:07 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 08:15:07 -0700 (PDT)
In-Reply-To: <tdi9co$ddqi$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:5879:d7aa:3989:85e9;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:5879:d7aa:3989:85e9
References: <t6gush$p5u$1@dont-email.me> <0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<3d8a28d0-afb7-4650-b59c-34fc8a92a637n@googlegroups.com> <tdi9co$ddqi$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a4c963e1-2e02-4a14-aafd-2c4f3046b5a3n@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 17 Aug 2022 15:15:07 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4775
 by: MitchAlsup - Wed, 17 Aug 2022 15:15 UTC

On Wednesday, August 17, 2022 at 3:37:47 AM UTC-5, BGB wrote:
> On 8/17/2022 1:35 AM, Quadibloc wrote:
> > On Sunday, July 24, 2022 at 4:28:47 AM UTC-6, luke.l...@gmail.com wrote:
> >
> >> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
> >> with IEEE754 FP accuracy will punish you with a 400% power/area
> >> penalty compared to modern 3D-optimised GPUs which, as explicitly
> >> spelled out and allowed in the Vulkan(tm) Spec by the Khronos
> >> Group, are permitted significant accuracy reductions. Mitch
> >> will (or will have already) filled you in on that.
> >
> > Well, if your purpose is building a device that is helpful to scientists
> > wishing to perform large-scale computations on FP64 numbers, you'll
> > just have to pay that penalty. After paying that penalty, though, something
> > based on a GPU architecture instead of that of a conventional non-vector
> > CPU is *still* vastly more powerful.
> >
> > And we have the NEC Aurora TSUBASA as an example of a Cray-style
> > vector ISA. At a significantly higher cost, it was capable of about half
> > of the FP64 FLOPS that a video card with high FP64 capabilities made
> > on the same technology could provide. But because this architecture is
> > more flexible, one could use it for a higher proportion of the floating-point
> > calculations in typical scientific programs.
> >
> > That is a tradeoff that might be well worth it.
> >
> Pretty much, there are reasons I am mostly going for DAZ + FTZ and FPU
> designs which don't give exact 0.5 ULP rounding. These make the FPU so
> much cheaper.
>
>
> For GPU-like uses, it is possible there could be a use-case for a
> further reduced-precision Binary64 variant, say:
> S.E11.F32.Z20
<
Z20 ???
>
> This is partly because there are still a few cases where a 16-bit
> mantissa is not sufficient (and where needing to keep the main FPU
> around is also fairly expensive).
>
>
> Namely, there is the point in my rasterizer where it goes from
> single-precision to 32-bit fixed-point.
>
> On the floating-point side, it can use low precision calculations just
> fine. On the integer side, it is 16.16 fixed point. However, the
> conversion step from floating point to fixed point integer exceeds the
> accuracy requirements of both the truncated-single and full single
> precision (so, would effectively need something with a 32-bit mantissa
> to deal with this).
>
>
> This may be added as another possible sub-feature to the low-precision
> unit, namely scalar FADD/FMUL units with 32-bit mantissa, and also needs
> to support integer conversion. Mostly to serve as a cheaper stand-in for
> the main FPU.
>
>
> Though, it is possible the "GPU core" idea might be dead for now, given
> I still can't quite make stuff cheap enough.
>
>
> Well, either that, or try to cheap out on the FP->Int conversion process
> and add some additional "low cost" SIMD converters (2x FP32 <-> Packed
> Int32). Say, if one adds a bias of 1024, this is enough to get approx
> 10.6 output, which can then be shifted to give values in the 16.16
> format (but would also require doing this step via a specialized blob of
> ASM; at present this part of the rasterizer is still written in C).
>
> I guess this could potentially side-step the need for a higher precision
> unit (but would limit maximum viewport size).
>
>
> > John Savard

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tdj3ut$g0cl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27358&group=comp.arch#27358

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Wed, 17 Aug 2022 11:11:07 -0500
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <tdj3ut$g0cl$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<3d8a28d0-afb7-4650-b59c-34fc8a92a637n@googlegroups.com>
<tdi9co$ddqi$1@dont-email.me>
<a4c963e1-2e02-4a14-aafd-2c4f3046b5a3n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 17 Aug 2022 16:11:10 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="e0f4f55b45f43c5c8d3c11b71a881169";
logging-data="524693"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19pCYUdNes3/NcegUMh6wwF"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.12.0
Cancel-Lock: sha1:D1ANjBo6npdxuL7Z+cDWQr5hWTI=
Content-Language: en-US
In-Reply-To: <a4c963e1-2e02-4a14-aafd-2c4f3046b5a3n@googlegroups.com>
 by: BGB - Wed, 17 Aug 2022 16:11 UTC

On 8/17/2022 10:15 AM, MitchAlsup wrote:
> On Wednesday, August 17, 2022 at 3:37:47 AM UTC-5, BGB wrote:
>> On 8/17/2022 1:35 AM, Quadibloc wrote:
>>> On Sunday, July 24, 2022 at 4:28:47 AM UTC-6, luke.l...@gmail.com wrote:
>>>
>>>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
>>>> with IEEE754 FP accuracy will punish you with a 400% power/area
>>>> penalty compared to modern 3D-optimised GPUs which, as explicitly
>>>> spelled out and allowed in the Vulkan(tm) Spec by the Khronos
>>>> Group, are permitted significant accuracy reductions. Mitch
>>>> will (or will have already) filled you in on that.
>>>
>>> Well, if your purpose is building a device that is helpful to scientists
>>> wishing to perform large-scale computations on FP64 numbers, you'll
>>> just have to pay that penalty. After paying that penalty, though, something
>>> based on a GPU architecture instead of that of a conventional non-vector
>>> CPU is *still* vastly more powerful.
>>>
>>> And we have the NEC Aurora TSUBASA as an example of a Cray-style
>>> vector ISA. At a significantly higher cost, it was capable of about half
>>> of the FP64 FLOPS that a video card with high FP64 capabilities made
>>> on the same technology could provide. But because this architecture is
>>> more flexible, one could use it for a higher proportion of the floating-point
>>> calculations in typical scientific programs.
>>>
>>> That is a tradeoff that might be well worth it.
>>>
>> Pretty much, there are reasons I am mostly going for DAZ + FTZ and FPU
>> designs which don't give exact 0.5 ULP rounding. These make the FPU so
>> much cheaper.
>>
>>
>> For GPU-like uses, it is possible there could be a use-case for a
>> further reduced-precision Binary64 variant, say:
>> S.E11.F32.Z20
> <
> Z20 ???

Z: Ignored on input, Set to Zero on output.

Basically, format compatible with Binary64, but with reduced precision,
and thus cheaper.

A 32-bit mantissa allows using 3 DSPs for FMUL.

FADD would use ~36 bits internally, which would be enough to handle
basic operations, and potentially deal with 32-bit integer conversion.
Here, it would stick 001 on the top, and a single sub-ULP bit on the bottom
(during FADD, anything which falls below this point is simply discarded).

It is sorta like with the S.E8.F16.Z7 format.
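
As a rough illustration of the "Z" rule (ignored on input, zero on output), the following sketch masks the low 20 fraction bits of a Binary64 value. The helper names are hypothetical, and truncation (rather than rounding) is an assumption; NaN payloads that live only in the low 20 bits are not treated specially here.

```python
import struct

Z_BITS = 20                     # low fraction bits treated as "Z"
Z_MASK = (1 << Z_BITS) - 1

def to_bits(x: float) -> int:
    """Reinterpret a Binary64 value as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b: int) -> float:
    """Reinterpret a 64-bit unsigned integer as a Binary64 value."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def truncate_z20(x: float) -> float:
    """Clear the low 20 fraction bits, leaving S.E11.F32 plus 20 zero bits."""
    return from_bits(to_bits(x) & ~Z_MASK)

# 2**-32 is the last fraction bit that survives; 2**-33 falls into the Z field.
kept = truncate_z20(1.0 + 2.0**-32)
dropped = truncate_z20(1.0 + 2.0**-33)
```

Applying the same mask to both the inputs and the output of an operation gives a value that is still a valid Binary64 bit pattern, just with fewer significant bits.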

>>
>> This is partly because there are still a few cases where a 16-bit
>> mantissa is not sufficient (and where needing to keep the main FPU
>> around is also fairly expensive).
>>
>>
>> Namely, there is the point in my rasterizer where it goes from
>> single-precision to 32-bit fixed-point.
>>
>> On the floating-point side, it can use low precision calculations just
>> fine. On the integer side, it is 16.16 fixed point. However, the
>> conversion step from floating point to fixed point integer exceeds the
>> accuracy requirements of both the truncated-single and full single
>> precision (so, would effectively need something with a 32-bit mantissa
>> to deal with this).
>>
>>
>> This may be added as another possible sub-feature to the low-precision
>> unit, namely scalar FADD/FMUL units with 32-bit mantissa, and also needs
>> to support integer conversion. Mostly to serve as a cheaper stand-in for
>> the main FPU.
>>
>>
>> Though, it is possible the "GPU core" idea might be dead for now, given
>> I still can't quite make stuff cheap enough.
>>
>>
>> Well, either that, or try to cheap out on the FP->Int conversion process
>> and add some additional "low cost" SIMD converters (2x FP32 <-> Packed
>> Int32). Say, if one adds a bias of 1024, this is enough to get approx
>> 10.6 output, which can then be shifted to give values in the 16.16
>> format (but would also require doing this step via a specialized blob of
>> ASM; at present this part of the rasterizer is still written in C).
>>
>> I guess this could potentially side-step the need for a higher precision
>> unit (but would limit maximum viewport size).
>>
>>
>>> John Savard

Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<2022Aug17.181136@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=27360&group=comp.arch#27360

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)
Date: Wed, 17 Aug 2022 16:11:36 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2022Aug17.181136@mips.complang.tuwien.ac.at>
References: <2022Aug17.101938@mips.complang.tuwien.ac.at> <memo.20220817141407.11400O@jgd.cix.co.uk> <tdiqth$f2i5$1@dont-email.me>

Marcus <m.delete@this.bitsnbites.eu> writes:
>If that is true, I would really like to see narrower implementations of
>AVX512,

The EVEX prefix supports 128, 256, and 512-bit operations through the
L and L' bits.

>as it could potentially bring some of the benefits of vector
>processing (compared to plain-ol-SIMD). E.g. 128-bit wide AVX512 (if
>that's doable) would give you immunity against latency hazards up to
>four clock cycles.

How so?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<2869eb73-0329-49d5-8ea6-2021382bc82bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27361&group=comp.arch#27361

Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 09:47:04 -0700 (PDT)
In-Reply-To: <9e7e834e-889e-4a61-a357-ae47061ef766n@googlegroups.com>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com> <88646c7a-3e1d-406e-b3ec-7171cfd4e235n@googlegroups.com>
<9e7e834e-889e-4a61-a357-ae47061ef766n@googlegroups.com>
Message-ID: <2869eb73-0329-49d5-8ea6-2021382bc82bn@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: already5...@yahoo.com (Michael S)

On Wednesday, August 17, 2022 at 6:13:50 PM UTC+3, MitchAlsup wrote:
> On Wednesday, August 17, 2022 at 1:39:05 AM UTC-5, Quadibloc wrote:
> > On Sunday, July 24, 2022 at 4:41:19 PM UTC-6, luke.l...@gmail.com wrote:
> >
> > > AVX-512 *shudder*
> >
> > And here I was wondering why Intel hadn't been rushing to
> > put AVX-512 on its consumer CPUs, so that they would
> > squash AMD with vastly superior games and graphic
> > performance.
> <
> Running 512-bit AVX goes for a dozen milliseconds before someone
> has to throttle back the CPU frequency, due to the generation of heat.

According to what Anton told us a few weeks ago, it does not apply to Rocket Lake.
It probably does not apply to Tiger Lake either, as long as only 1 or 2 cores are
crunching 512-bit stuff.
That makes sense if the big heat source is not 512-bit operations in general, but
specifically FP arithmetic, i.e. FADD/FMUL/FMADD, since for those operations the
max. throughput on the above-mentioned cores is the same for 512-bit opcodes (1 per
clock) and for 256-bit opcodes (2 per clock).

Re: Power cost of IEEE754

<tdj7qc$gdh3$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27364&group=comp.arch#27364

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Wed, 17 Aug 2022 12:16:58 -0500
Organization: A noiseless patient Spider
Message-ID: <tdj7qc$gdh3$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
In-Reply-To: <tdiql4$96b$1@gioia.aioe.org>

On 8/17/2022 8:32 AM, Terje Mathisen wrote:
> Stefan Monnier wrote:
>>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
>>>       with IEEE754 FP accuracy will punish you with a 400% power/area
>>>       penalty compared to modern 3D-optimised GPUs which, as explicitly
>>
>> What is the origin of this extra cost?  IOW, where's the meat of the
>> savings (I mean: which part of the computation of the last few bits
>> costs so much)?
>> Does the saving vary significantly between instructions?
>
> I think that is pretty much bogus, Mitch has shown repeatedly that
> having FMAC makes subnormal handling nearly free, as in zero cycles and
> single-digit percentage gates/power. 400% is simply rubbish.
>

Not sure whether it is anywhere near 400%, but it is not strictly zero either.

Nearly free, for the FMAC, but IME still adds a latency cost mostly in
that an FMAC would still require several additional clock cycles vs a
bare FMUL if using DAZ/FTZ, due mostly to the additional adder and
normalization steps.

For performance, FMAC would likely only win over FMUL if there were
significantly more 'A*B+C' cases than 'A*B' cases.

There is one way of using FMAC to fake higher-precision math, but making
use of it would require a wider mantissa than a bare FADD needs.
Supporting this mode would add cost.

As for 0.5 ULP, this would likely require:
FMUL to produce a full-width result, rather than just the high-order bits;
FADD needs to have more sub ULP bits.

So, say, for a double-precision FMAC:
A*B, produces 108 bits of output (more expensive than ~ 56);
May need to expand further in the adder (TBD).

Mostly to deal with semantic edge cases involving catastrophic
cancellation and similar...

Without FMAC, one can mostly ignore the catastrophic cancellation issue:
In cases where it would happen, there is very little below the ULP that
would need to be preserved.
But with FMAC, one may find that, suddenly, these low-order bits can
actually have a visible effect on the result.

One could reduce cost by always producing the "double rounded" output
(as-if it had been done as two separate operations), but this does
eliminate some use cases.
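
To make the difference concrete, here is a small sketch comparing a single-rounded A*B+C with the "double rounded" variant. The fused result is emulated with exact rational arithmetic (an assumption for illustration, not how hardware does it), and the operand values are picked so that the product cancels almost completely against C.

```python
from fractions import Fraction

def fused_mul_add(a: float, b: float, c: float) -> float:
    """a*b + c with one rounding at the end, emulated with exact rationals."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def separate_mul_add(a: float, b: float, c: float) -> float:
    """a*b rounded to Binary64, then + c rounded again (two roundings)."""
    return a * b + c

# a*b is exactly 1 - 2**-60, which rounds to 1.0 as a Binary64 product.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fused_mul_add(a, b, c)        # keeps the -2**-60 residue
separate = separate_mul_add(a, b, c)  # residue was rounded away before the add
```

Here the only information in the fused result comes from product bits far below the ULP of A*B, which is the case where a multiplier that only produces the high-order bits cannot give the single-rounded answer.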

....

Vs, say, a cheaper version:
A*B produces ~ 56 bits (high bits only);
Can use 6 DSPs and a few extra LUT based multipliers for the rest.
A+B uses 68 bits;
Could have also been 56,
but need a few extra to deal with Int64 conversion.

Though, this is still kinda expensive, a version which uses 3 DSPs for
FMUL and a 36-bit adder would be cheaper still, but as can be noted,
would effectively cut the low 20 bits off the mantissa.

Could still be sufficient for 3D rendering tasks, but would be mostly
insufficient for "general purpose" use (code which assumes full
precision double could break in some fairly major ways).

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<2022Aug17.190034@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=27365&group=comp.arch#27365

From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Wed, 17 Aug 2022 17:00:34 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2022Aug17.190034@mips.complang.tuwien.ac.at>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org> <ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com> <88646c7a-3e1d-406e-b3ec-7171cfd4e235n@googlegroups.com> <9e7e834e-889e-4a61-a357-ae47061ef766n@googlegroups.com> <2869eb73-0329-49d5-8ea6-2021382bc82bn@googlegroups.com>

Michael S <already5chosen@yahoo.com> writes:
>According to what Anton told us a few weeks ago, it does not apply to Rocket Lake.
>It probably does not apply to Tiger Lake either, as long as only 1 or 2 cores are
>crunching 512-bit stuff.

Actually, for the Ice Lake i5-1035G4, we see an AVX512 downclock with
1 active core from 3.7GHz to 3.6GHz, but no downclock with more active
cores (2 cores are already at 3.6GHz without AVX512, 3 and 4 at
3.3GHz)
<https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html>.

I guess that the increased voltage for 3.7GHz in combination with
AVX512 can result in too much current draw for the core, so they
downclock (and lower voltage) to 3.6GHz. For the other cases the
voltage is already low enough, so no proactive downclocking is
necessary, and you can leave reactive downclocking to the power and
temperature limits. If my guess is correct, I expect that Tiger Lake
will be similar.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and fast(ish) GPU)

<tdjb0v$gokt$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27366&group=comp.arch#27366

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and
fast(ish) GPU)
Date: Wed, 17 Aug 2022 13:11:40 -0500
Organization: A noiseless patient Spider
Message-ID: <tdjb0v$gokt$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org>
<4b919cba-b547-427a-9151-554139e3cca4n@googlegroups.com>
In-Reply-To: <4b919cba-b547-427a-9151-554139e3cca4n@googlegroups.com>

On 8/17/2022 8:13 AM, robf...@gmail.com wrote:
>> For GPU-like uses, it is possible there could be a use-case for a
>> further reduced-precision Binary64 variant, say:
>> S.E11.F32.Z20
>
>> This is partly because there are still a few cases where a 16-bit
>> mantissa is not sufficient (and where needing to keep the main FPU
>> around is also fairly expensive).
>
> I have been thinking of going with 21-bit floats S1.E5.F15 because
> three can be fit into 64-bits and so extra vector lanes can be had
> cheaply.
>

Actually exists as a SIMD vector sub-type in BJX2.

First 3 elements are Binary16;
Last element is divided into 3x 5-bit fields, which extend the mantissa
to 15 bits.

Or, 64-bit field:
0xPPPP_ZZZZ_YYYY_XXXX;
Where PPPP is:
0zzz_zzyy_yyyx_xxxx
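
A minimal sketch of this layout, working on raw bit patterns. The function names are hypothetical, and it assumes each 5-bit field holds the low-order fraction bits of its element, so that an element is a 21-bit S1.E5.F15 value whose top 16 bits form an ordinary Binary16.

```python
def pack3(x: int, y: int, z: int) -> int:
    """Pack three 21-bit S1.E5.F15 patterns into 0xPPPP_ZZZZ_YYYY_XXXX."""
    word = 0
    ext = 0
    for i, v in enumerate((x, y, z)):
        if not 0 <= v < (1 << 21):
            raise ValueError("element must fit in 21 bits")
        half = v >> 5                 # S1.E5.F10: a Binary16 bit pattern
        low5 = v & 0x1F               # assumed: extra low-order fraction bits
        word |= half << (16 * i)      # X, Y, Z in the low three 16-bit lanes
        ext |= low5 << (5 * i)        # P = 0zzz_zzyy_yyyx_xxxx
    return word | (ext << 48)

def unpack3(word: int) -> tuple:
    """Inverse of pack3: recover the three 21-bit patterns."""
    ext = (word >> 48) & 0x7FFF
    return tuple(
        (((word >> (16 * i)) & 0xFFFF) << 5) | ((ext >> (5 * i)) & 0x1F)
        for i in range(3)
    )

packed = pack3(0x123456, 0x0FEDCB, 0x1FFFFF)
```

With this arrangement, code that only understands packed Binary16 can still read the first three lanes directly and simply loses the extra 5 bits of precision.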

There is another vector type where the first 3 elements are Binary32,
and the last element extends each mantissa by 10 bits (so ~ 33 bits),
organized along a pretty much similar pattern.

An earlier form of the latter was originally developed and used in my
BGBTech2 3D engine, mostly because:
3x Binary32 was not sufficient for the world size (1024x1024 km);
The cost of passing these around, and doing Unpack/Compute/Repack on
them, was at the time faster than working directly with 3x Binary64 in
this case (a lot more time with these was spent on passing them around
rather than actually doing math with them).

That version differed though in that it was less symmetric:
Added 12 bits to X and Y, but only 8 to Z.

> I like the low precision idea. Keeping the significand under 18 bits
> means just a single DSP can be used for the multiply.
>

Yes, this was a major reason:
16-bit is padded to 18-bit by adding 01 to the top;
A single DSP is used to multiply the result.

Keeping the low-order bits as zero (for both multiply and add) was
mostly because the alternative tends to cause instability issues with
some calculations (such as when doing divide or square root via
Newton-Raphson). The alternative here being to ignore these bits for
multiply but leave them non-zero in the output (because, after all,
18*18->36), while keeping full Binary32 precision for add/subtract.

So, it seemed like a better idea to ignore these bits entirely on input,
and force them to zero on output, across both multiply and add/subtract
in this case.
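
A rough software model of such a multiply, as a sketch only. It assumes truncation rather than rounding, DAZ/FTZ behavior, and no Inf/NaN handling; the function names are hypothetical and the single Python integer multiply stands in for one 18x18 DSP.

```python
import struct

def f32_bits(x: float) -> int:
    """Binary32 bit pattern of x (x is rounded to Binary32 first)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Binary32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def fmul_f16z7(a: int, b: int) -> int:
    """Multiply two Binary32 patterns as S.E8.F16.Z7 (low 7 bits ignored/zeroed)."""
    sign = (a ^ b) & 0x80000000
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    if ea == 0xFF or eb == 0xFF:
        raise ValueError("Inf/NaN not modelled in this sketch")
    if ea == 0 or eb == 0:
        return sign                       # DAZ: zero or subnormal input -> zero
    # 16 fraction bits with the hidden 1 on top: fits an 18-bit input as "01...".
    ma = 0x10000 | ((a >> 7) & 0xFFFF)
    mb = 0x10000 | ((b >> 7) & 0xFFFF)
    prod = ma * mb                        # 2**32 <= prod < 2**34
    exp = ea + eb - 127
    if prod & (1 << 33):                  # product in [2, 4): renormalize
        frac = (prod >> 17) & 0xFFFF
        exp += 1
    else:                                 # product in [1, 2)
        frac = (prod >> 16) & 0xFFFF
    if exp <= 0:
        return sign                       # FTZ: underflow -> zero
    if exp >= 255:
        return sign | 0x7F800000          # overflow -> infinity
    return sign | (exp << 23) | (frac << 7)

r = bits_f32(fmul_f16z7(f32_bits(1.5), f32_bits(1.5)))
```

Because the low 7 bits are dropped on input and forced to zero on output, feeding a result back in never reintroduces bits the multiplier would ignore.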

As it so happens, this format is also basically sufficient for the
projection process in OpenGL and similar (though with the main "choke
point" being when going from floating-point in the front-end to fixed
point in the back-end).

Though, I guess, "fake it using integer math" is possibly also still an
option worth looking into (and probably cheaper than having a dedicated
FMUL and FADD units effectively just to pull off Binary32 -> Fixed 16.16
conversion).
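
For reference, the integer-only route could look roughly like the sketch below. It is an illustration under assumptions (truncation toward zero, saturation on overflow, subnormals flushed to zero, Inf/NaN saturating as well), not the code from the rasterizer, and the names are hypothetical.

```python
import struct

def f32_bits(x: float) -> int:
    """Binary32 bit pattern of x (x is rounded to Binary32 first)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f32_to_fixed16_16(bits: int) -> int:
    """Convert a Binary32 pattern to signed 16.16 fixed point using integer ops."""
    negative = bits >> 31
    exp = (bits >> 23) & 0xFF
    if exp == 0:
        return 0                            # zero or subnormal -> 0
    mant = 0x800000 | (bits & 0x7FFFFF)     # 24-bit significand, hidden 1 restored
    # value = mant * 2**(exp - 150); fixed = value * 2**16 = mant * 2**(exp - 134)
    shift = exp - 134
    if shift >= 8:                          # magnitude >= 2**15: does not fit
        return -0x80000000 if negative else 0x7FFFFFFF
    if shift >= 0:
        mag = mant << shift
    elif shift > -24:
        mag = mant >> -shift                # truncate toward zero
    else:
        mag = 0
    return -mag if negative else mag

one = f32_to_fixed16_16(f32_bits(1.0))
```

Only a shift and a sign fix-up are needed once the exponent has been extracted, which is the reason this path may be cheaper than routing the conversion through FMUL/FADD units.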

....

Re: Power cost of IEEE754

<70e450df-dfc9-41b5-9f69-418efd01aaden@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27368&group=comp.arch#27368

Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 12:05:31 -0700 (PDT)
In-Reply-To: <tdj7qc$gdh3$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org> <tdj7qc$gdh3$1@dont-email.me>
Message-ID: <70e450df-dfc9-41b5-9f69-418efd01aaden@googlegroups.com>
Subject: Re: Power cost of IEEE754
From: MitchAl...@aol.com (MitchAlsup)

On Wednesday, August 17, 2022 at 12:17:04 PM UTC-5, BGB wrote:
> On 8/17/2022 8:32 AM, Terje Mathisen wrote:
> > Stefan Monnier wrote:
> >>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
> >>> with IEEE754 FP accuracy will punish you with a 400% power/area
> >>> penalty compared to modern 3D-optimised GPUs which, as explicitly
> >>
> >> What is the origin of this extra cost? IOW, where's the meat of the
> >> savings (I mean: which part of the computation of the last few bits
> >> costs so much)?
> >> Does the saving vary significantly between instructions?
> >
> > I think that is pretty much bogus, Mitch has shown repeatedly that
> > having FMAC makes subnormal handling nearly free, as in zero cycles and
> > single-digit percentage gates/power. 400% is simply rubbish.
> >
> Not sure whether it is anywhere near 400%, but it is not strictly zero either.
>
The adder for adding denorms to a 64-bit 754 FMAC* is on the order of 2%
(*) FMAC that can also perform FMUL and FADD/FSUB.
<
Which, while not actually zero, is close enough to not matter in the long run.
>
> Nearly free, for the FMAC, but IME still adds a latency cost mostly in
> that an FMAC would still require several additional clock cycles vs a
> bare FMUL if using DAZ/FTZ, due mostly to the additional adder and
> normalization steps.
<
1-gate of delay added to the overall FMAC latency (over the typical 4 cycles)
>
> For performance, FMAC would likely only win over FMUL if there were
> significantly more 'A*B+C' cases than 'A*B' cases.
>
Actually, even systems with an FMAC will often have a spare FADD because
{FADD/FSUB (and for those poor ISAs with separate register files) FCMP}
represent 57%-63% of FP calculations.
>
> There is one way of using FMAC to fake higher-precision math, but making
> use of it would require a wider mantissa than a bare FADD needs.
> Supporting this mode would add cost.
>
That is why you do these things in the FMAC unit where you have a 168-bit adder
(typically Kogge-Stone) and a 52-bit incrementer.
>
> As for 0.5 ULP, this would likely require:
> FMUL to produce a full-width result, rather than just the high-order bits;
> FADD needs to have more sub ULP bits.
>
These are IEEE REQUIREMENTS.
>
> So, say, for a double-precision FMAC:
> A*B, produces 108 bits of output (more expensive than ~ 56);
> May need to expand further in the adder (TBD).
>
> Mostly to deal with semantic edge cases involving catastrophic
> cancellation and similar...
>
It's like Air-Bags in a car. Would you buy a car without air-bags (for street use)
today ?
>
> Without FMAC, one can mostly ignore the catastrophic cancellation issue:
> In cases where it would happen, there is very little below the ULP that
> would need to be preserved.
<
It is PRECISELY the catastrophic cancellation issue which drove IEEE
to specify FMAC in the way they did.
<
> But with FMAC, one may find that, suddenly, these low-order bits can
> actually have a visible effect on the result.
>
Because they should! Like the ABS system in your car driving downhill
in freezing rain.
<
<Apparently you just don't get the motive for doing quality work.......>

Re: Power cost of IEEE754

<c1eac1c0-1196-4ff0-a9a5-a88ac66d961en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27372&group=comp.arch#27372

Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 12:54:08 -0700 (PDT)
In-Reply-To: <tdj7qc$gdh3$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org> <tdj7qc$gdh3$1@dont-email.me>
Message-ID: <c1eac1c0-1196-4ff0-a9a5-a88ac66d961en@googlegroups.com>
Subject: Re: Power cost of IEEE754
From: MitchAl...@aol.com (MitchAlsup)

On Wednesday, August 17, 2022 at 12:17:04 PM UTC-5, BGB wrote:
> On 8/17/2022 8:32 AM, Terje Mathisen wrote:
>
>
> As for 0.5 ULP, this would likely require:
> FMUL to produce a full-width result, rather than just the high-order bits;
> FADD needs to have more sub ULP bits.
>
You keep arguing that IEEE is hard. Yes, it is, and yes it was supposed to
be--just to eliminate all the inferior ways to perform FP arithmetic, so
programmers could trust FP arithmetics across implementations.
<
And I keep beating you up on this issue.
<
If you added the caveat that "in your expressive medium (FPGA)
getting the last IEEE binary digit correct is excessively expensive,"
I would be able to avoid these <seemingly constant> beatings.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<7e3cd26f-aa39-48b7-866f-a0f04402c5a6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27373&group=comp.arch#27373

Newsgroups: comp.arch
Date: Wed, 17 Aug 2022 13:03:39 -0700 (PDT)
In-Reply-To: <2022Aug17.190034@mips.complang.tuwien.ac.at>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com> <88646c7a-3e1d-406e-b3ec-7171cfd4e235n@googlegroups.com>
<9e7e834e-889e-4a61-a357-ae47061ef766n@googlegroups.com> <2869eb73-0329-49d5-8ea6-2021382bc82bn@googlegroups.com>
<2022Aug17.190034@mips.complang.tuwien.ac.at>
Message-ID: <7e3cd26f-aa39-48b7-866f-a0f04402c5a6n@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: already5...@yahoo.com (Michael S)

On Wednesday, August 17, 2022 at 8:33:06 PM UTC+3, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >According to what Anton told us a few weeks ago, it does not apply to Rocket Lake.
> >It probably does not apply to Tiger Lake either, as long as only 1 or 2 cores are
> >crunching 512-bit stuff.
> Actually, for the Ice Lake i5-1035G4, we see an AVX512 downclock with
> 1 active core from 3.7GHz to 3.6GHz, but no downclock with more active
> cores (2 cores are already at 3.6GHz without AVX512, 3 and 4 at
> 3.3GHz)
> <https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html>.
>
> I guess that the increased voltage for 3.7GHz in combination with
> AVX512 can result in too much current draw for the core, so they
> downclock (and lower voltage) to 3.6GHz. For the other cases the
> voltage is already low enough, so no proactive downclocking is
> necessary, and you can leave reactive downclocking to the power and
> temperature limits. If my guess is correct, I expect that Tiger Lake
> will be similar.

The equivalent of the i5-1035G4 in the Tiger Lake family is the i5-1155G7,
which has a max. frequency of 4.5 GHz.
I think it's too optimistic to expect that with 3 or 4 cores
running AVX512 stuff it would be downclocked by a mere 9%, to 4.1GHz.

Would be glad to find out that I am wrong about it.

> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Power cost of IEEE754

<tdk425$lits$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27380&group=comp.arch#27380

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Wed, 17 Aug 2022 20:18:58 -0500
Organization: A noiseless patient Spider
Message-ID: <tdk425$lits$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
<tdj7qc$gdh3$1@dont-email.me>
<70e450df-dfc9-41b5-9f69-418efd01aaden@googlegroups.com>
In-Reply-To: <70e450df-dfc9-41b5-9f69-418efd01aaden@googlegroups.com>

On 8/17/2022 2:05 PM, MitchAlsup wrote:
> On Wednesday, August 17, 2022 at 12:17:04 PM UTC-5, BGB wrote:
>> On 8/17/2022 8:32 AM, Terje Mathisen wrote:
>>> Stefan Monnier wrote:
>>>>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
>>>>> with IEEE754 FP accuracy will punish you with a 400% power/area
>>>>> penalty compared to modern 3D-optimised GPUs which, as explicitly
>>>>
>>>> What is the origin of this extra cost? IOW, where's the meat of the
>>>> savings (I mean: which part of the computation of the last few bits
>>>> costs so much)?
>>>> Does the saving vary significantly between instructions?
>>>
>>> I think that is pretty much bogus, Mitch has shown repeatedly that
>>> having FMAC makes subnormal handling nearly free, as in zero cycles and
>>> single-digit percentage gates/power. 400% is simply rubbish.
>>>
>> Not sure whether it is anywhere near 400%, but it is not strictly zero either.
>>
> The adder for adding denorms to a 64-bit 754 FMAC* is on the order of 2%
> (*) FMAC that can also perform FMUL and FADD/FSUB.
> <
> Which, while not actually zero, is close enough to not matter in the long run.

Bigger cost is needing to go from only producing the high-order bits of
the multiplier result (cheap case), to producing all of the bits.

It is sort of like the difference in area between a right triangle and a
square.

With just an FMUL, all these bits fall off into the void, so it doesn't
really matter if they were never created in the first place.
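
The triangle-vs-square comparison can be put in numbers with a small counting sketch. It assumes a plain n x n array multiplier where each bit product a[i]*b[j] lands in column i+j, and counts how many of those products exist when only the top columns are built; the function name is hypothetical, and carries out of the dropped columns are ignored (which is exactly why such a multiplier is not 0.5 ULP).

```python
def partial_product_bits(n: int, keep_columns: int) -> int:
    """Count bit products a[i]*b[j] in the top `keep_columns` columns.

    Columns are numbered i + j = 0 .. 2n-2, so there are 2n-1 in total.
    """
    lowest_kept = (2 * n - 1) - keep_columns
    return sum(1 for i in range(n) for j in range(n) if i + j >= lowest_kept)

n = 53                                        # Binary64 significand width
full = partial_product_bits(n, 2 * n - 1)     # the whole square
high_only = partial_product_bits(n, n)        # roughly the upper triangle
```

For n = 53 this gives 2809 bit products for the full array versus 1431 for the high-order part alone, so roughly half the array before any extra guard columns are added back.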

Full FMAC (and also strict 0.5 ULP) can kind of throw a wrench in these
sorts of optimizations.

Free is not free if it already means one has to pay the cost for a bunch
of other features in order for a given feature to be "free".

>>
>> Nearly free, for the FMAC, but IME still adds a latency cost mostly in
>> that an FMAC would still require several additional clock cycles vs a
>> bare FMUL if using DAZ/FTZ, due mostly to the additional adder and
>> normalization steps.
> <
> 1-gate of delay added to the overall FMAC latency (over the typical 4 cycles)

The FADD part can handle subnormal numbers pretty much for free, but
that is not what I am arguing here.

It is FMUL where it is more of an issue, and bolting the units together
to allow handling this case comes at a latency cost (and needing to make
FMUL slower than it might be otherwise, is not free).

>>
>> For performance, FMAC would likely only win over FMUL if there were
>> significantly more 'A*B+C' cases than 'A*B' cases.
>>
> Actually, even systems with an FMAC will often have a spare FADD because
> {FADD/FSUB (and for those poor ISAs with separate register files) FCMP}
> represent 57%-63% of FP calculations.

OK.

Seems like one could handle FADD by skipping over the multiplier part.
But, yeah, always needing to spend the full latency is not desirable.

>>
>> There is one way of using FMAC to fake higher-precision math, but making
>> use of it would require a wider mantissa than a bare FADD needs.
>> Supporting this mode would add cost.
>>
> That is why you do these things in the FMAC unit where you have a 168-bit adder
> (typically Kogge-Stone) and a 52-bit incrementer.

Granted.

Pros/Cons if one has a cheaper but faster FMAC that can't actually do
this operation (because it doesn't actually have the bits to do so).

>>
>> As for 0.5 ULP, this would likely require:
>> FMUL to produce a full-width result, rather than just the high-order bits;
>> FADD needs to have more sub ULP bits.
>>
> These are IEEE REQUIREMENTS.

Which partly I skip over, because my FPU designs aren't really IEEE
conformant anyways, nor is strict conformance necessarily a design goal...

This would turn into a game of whack-a-mole.

Nor would it actually gain enough here to make it worth the added cost.

>>
>> So, say, for a double-precision FMAC:
>> A*B, produces 108 bits of output (more expensive than ~ 56);
>> May need to expand further in the adder (TBD).
>>
>> Mostly to deal with semantic edge cases involving catastrophic
>> cancellation and similar...
>>
> It's like Air-Bags in a car. Would you buy a car without air-bags (for street use)
> today ?

In normal FADD/FSUB, the issue is effectively hidden:
The only way it could happen usually is if the mantissas have the same
exponent;
By extension, there are hardly any non-zero sub-ULP bits in this scenario.

In cases where the exponents differ, this scenario can't occur, so much
beyond a few bits, the stuff that falls off the bottom can be ignored.

For subtracting the values, one could detect whether or not the bits
that fell off the bottom were non-zero and use this to twiddle the
carry-in to the adder. But, this isn't free either (and if one has, say,
8 sub ULP bits, there is only around a 1 in 256 chance of this actually
having a visible effect on the result).
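
The carry-in trick can be modelled with plain integers. This is a sketch under assumptions: mantissas are unsigned integers, the smaller operand is shifted right by `shift` places with no guard bits kept, and the function names are made up for illustration.

```python
def sub_truncated(a: int, b: int, shift: int) -> int:
    """a - (b >> shift): bits shifted out of b are simply discarded."""
    return a - (b >> shift)


def sub_with_sticky(a: int, b: int, shift: int) -> int:
    """Same subtract, but borrow one if any discarded bit of b was non-zero."""
    sticky = 1 if b & ((1 << shift) - 1) else 0
    return a - (b >> shift) - sticky


def sub_exact_floor(a: int, b: int, shift: int) -> int:
    """Reference: subtract at full width, then drop the low bits (floor)."""
    return ((a << shift) - b) >> shift


print(sub_truncated(0x100, 0x101, 8), sub_with_sticky(0x100, 0x101, 8))
```

Because this model keeps no guard bits, the two versions differ in the last place whenever a non-zero bit fell off the bottom. With extra sub-ULP bits kept, the correction would only occasionally reach the visible result.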

But, as for cars, I think a lot might depend on where one is living.

A relative of mine has a vehicle which is basically built out of welded
steel tubing and diamond plate, no doors or windows, no windshield, no
muffler ("straight pipe exhaust"), minimal "over the lap" style
seatbelts, etc.

Has turn signals and a license plate, so "good enough" it seems.

Not necessarily sure road safety was all that much of a design
consideration in this case.

Some of the other vehicles around are only partially painted, say with
much of the vehicle still being gray primer, ...

Well, or other scenarios one might see, say, where a car got T-boned, so
the owner removed the car-doors and similar on that side and mostly
replaced them with a piece of plywood or similar with some window holes
cut into it (say, with some clear plastic or similar stapled over the
holes), etc...

Functional airbags at that point, not necessarily a given.

>>
>> Without FMAC, one can mostly ignore catastrophic cancellation issue:
>> In cases where it would happen, there is very little below the ULP that
>> would need to be preserved.
> <
> It is PRECISELY the catastrophic cancellation issue which drove IEEE
> to specify FMAC in the way they did.
> <

But, it is neither necessarily the cheapest nor fastest way to implement
this operator.

Granted, does allow using SIMD to pull off extended precision
calculations (rather than, say, doing this using integer math or
something). This would not otherwise be possible with a cheaper version.

Making this work effectively though would require both making the FADD
part wider, and also calculating the low half of the multiplier result.
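
The extended-precision trick rests on a fused multiply-add recovering the rounding error of a product. A minimal sketch, emulating the single rounding with exact rationals from Python's `fractions` module rather than assuming a hardware FMA is available; `fma_emulated` is a hypothetical helper name.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """a*b + c computed exactly, then rounded once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = b = 1.0 + 2.0 ** -30
hi = a * b                      # rounded product: the 2**-60 term is lost
lo = fma_emulated(a, b, -hi)    # fused: recovers the lost part
unfused = a * b - hi            # separate multiply and subtract

print(hi, lo, unfused)
```

Here the pair (hi, lo) carries the full double-width product, while the unfused version just returns zero. Getting `lo` right is exactly what needs the low half of the multiplier result to exist.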

Meanwhile, integer math is easier to get right...
Is it because there is some rigidly enforced integer math standard?
Not really; The standard integer math rules happen to already mostly
align with the cheapest ways of implementing usable integer math operators.

If floating point math were defined in terms of the cheapest possible
ways of implementing it, then there wouldn't be anything to argue about.

But, I guess, maybe there would still be room for debate, say,
between person A's system where (5.0-3.0)==2.0, and person B's, which
gives 1.999999, etc...

>> But with FMAC, one may find that, suddenly, these low-order bits can
>> actually have a visible effect on the result.
>>
> Because they should ! Like the ABS system in your car driving downhill
> in freezing rain.
> <
> <Apparently you just don't get the motive for doing quality work.......>

It is more trade-offs...

Would IEEE-754 conformance still be worthwhile, if it meant one could
*only* afford to have Single Precision?...

I don't consider this a win.

Though, this is basically what MicroBlaze does.

Sometimes, one can be like "close enough".

It is sorta similar for things like OpenGL:
* Does it implement the API?
** Well, more or less 1.3 levels (of the parts that aren't absent).
* Does it produce pixel-exact results with a reference implementation:
** No, not even close (looks almost more like the original PlayStation).

What about BC7 compressed texture support, does it look pixel-exact to
the reference version?
* Heh, no, also not even close.

What about BC1(DXT1) or BC3(DXT5)?
* Closer, but still not quite.
* It just sort of approximates these via its own format internally.

But, being like "Nope, can't afford to do this perfectly, so not going
to do it at all" is not so useful either.

And, so on, down the list...

Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<tdknpe$g76$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=27381&group=comp.arch#27381

Path: i2pn2.org!i2pn.org!aioe.org!10O9MudpjwoXIahOJRbDvA.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)
Date: Thu, 18 Aug 2022 08:55:44 +0200
Organization: Aioe.org NNTP Server
Message-ID: <tdknpe$g76$1@gioia.aioe.org>
References: <2022Aug17.101938@mips.complang.tuwien.ac.at>
<memo.20220817141407.11400O@jgd.cix.co.uk> <tdiqth$f2i5$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="16614"; posting-host="10O9MudpjwoXIahOJRbDvA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.13
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 18 Aug 2022 06:55 UTC

Marcus wrote:
> On 2022-08-17, John Dallman wrote:
>> In article <2022Aug17.101938@mips.complang.tuwien.ac.at>,
>> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> And then you have Intel marketing, which disables AVX512 in SKUs for
>>> chips that actually support AVX512 (and likewise disable AVX in some
>>> SKUs), ensuring that most programmers will steer clear of AVX for
>>> many years to come and even longer for AVX512.
>>
>> There are parts of Intel that don't seem to understand the difference
>> between firmware and applications. They behave as if machines are sold
>> with all their software already on them, and run that load for their
>> lives. This is not how the application software market actually works.
>>
>> If you're an application software producer, doing an extra version for an
>> instruction set extension needs to give the end-user something big for it
>> to be cost-effective. I've experimented with AVX, and it does very little
>> for the software I work on. I understand why, and AVX-512 isn't going to
>> do any better. Improving performance at the algorithm level works better
>> and applies to all platforms.
>>
>> John
>
> I haven't played with AVX512 nor bothered to learn it (because, hey,
> what's the point when there's no hardware around to run it on anyway),
> but my impression is that they made some actual improvements w.r.t.
> predication and such, so that it should be easier to vectorize code
> than with SSE or regular AVX.
>
> If that is true, I would really like to see narrower implementations of
> AVX512, as it could potentially bring some of the benefits of vector
> processing (compared to plain-ol-SIMD). E.g. 128-bit wide AVX512 (if
> that's doable) would give you immunity against latency hazards up to
> four clock cycles.

It seems to be true, since AVX-512 at 64-byte cache line chunks has
finally reached parity with the original Larrabee design in this regard.

LRB was of course primarily designed to be able to handle GPU tasks, and
that included an easy way to mask leading/trailing parts that failed to
fill up a full cache line.

I do agree that it would have been a lot better for Intel to supply slow
512-bit support to all SKUs, then use the more power-hungry versions for
high-end cpus. I waited a year some years ago to get a (xeon) cpu which
included this, only to learn after the fact that AVX-512 actually wasn't
enabled on the mobile/portable workstation version.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Power cost of IEEE754

<tdun4o$2ifk6$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27465&group=comp.arch#27465

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Sun, 21 Aug 2022 21:45:59 -0400
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <tdun4o$2ifk6$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 22 Aug 2022 01:46:00 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="14d2226ce1c7eb43e75cc17585335620";
logging-data="2702982"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19/vYE/Yl7NOG38ICdmHtR2mmLEPiwtyow="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.0
Cancel-Lock: sha1:r60skmzbMC/Wzkf90S1TyOVX+Cc=
In-Reply-To: <tdiql4$96b$1@gioia.aioe.org>
 by: Paul A. Clayton - Mon, 22 Aug 2022 01:45 UTC

Terje Mathisen wrote:
> Stefan Monnier wrote:
>>> 3) RVV or any other Cray-Style Vector ISA requiring strict
>>> compliance
>>>       with IEEE754 FP accuracy will punish you with a 400%
>>> power/area
>>>       penalty compared to modern 3D-optimised GPUs which, as
>>> explicitly
>>
>> What is the origin of this extra cost?  IOW, where's the meat of
>> the
>> savings (I mean: which part of the computation of the last few bits
>> costs so much)?
>> Does the saving vary significantly between instructions?
>
> I think that is pretty much bogus, Mitch has shown repeatedly
> that having FMAC makes subnormal handling nearly free, as in zero
> cycles and single-digit percentage gates/power. 400% is simply
> rubbish.

I am guessing that double-rounding FMADD would not provide as much
savings. Single-rounding is typically assumed, but I think the
original implementation on MIPS did double rounding (perhaps to
provide the same result as unfused FMUL and FADD).

(I kind of wonder what the costs would have been to implement
floating point multiplication as a high power-of-two bits
truncated result (i.e., ignoring carry-in from lower bits). I tend
to agree that the benefit of standardization limits different
interfaces to specialized uses, but there may have been a better
alternative. I also think that the probabilistic rounding that Nick
Maclaren championed (a little) would have been interesting and
reduced or exposed some problems with floating point.)

Re: Power cost of IEEE754

<3888602b-9fcb-4c6b-9cb1-b3ae8d3c4fe7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27467&group=comp.arch#27467

X-Received: by 2002:a05:622a:205:b0:343:282:3d0e with SMTP id b5-20020a05622a020500b0034302823d0emr13897401qtx.436.1661134698859;
Sun, 21 Aug 2022 19:18:18 -0700 (PDT)
X-Received: by 2002:a05:6214:2a8b:b0:496:9c18:e928 with SMTP id
jr11-20020a0562142a8b00b004969c18e928mr14379485qvb.94.1661134698708; Sun, 21
Aug 2022 19:18:18 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 21 Aug 2022 19:18:18 -0700 (PDT)
In-Reply-To: <tdun4o$2ifk6$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:ed38:7cf:b605:9e55;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:ed38:7cf:b605:9e55
References: <t6gush$p5u$1@dont-email.me> <0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org> <tdun4o$2ifk6$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3888602b-9fcb-4c6b-9cb1-b3ae8d3c4fe7n@googlegroups.com>
Subject: Re: Power cost of IEEE754
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 22 Aug 2022 02:18:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4405
 by: MitchAlsup - Mon, 22 Aug 2022 02:18 UTC

On Sunday, August 21, 2022 at 8:46:03 PM UTC-5, Paul A. Clayton wrote:
> Terje Mathisen wrote:
> > Stefan Monnier wrote:
> >>> 3) RVV or any other Cray-Style Vector ISA requiring strict
> >>> compliance
> >>> with IEEE754 FP accuracy will punish you with a 400%
> >>> power/area
> >>> penalty compared to modern 3D-optimised GPUs which, as
> >>> explicitly
> >>
> >> What is the origin of this extra cost? IOW, where's the meat of
> >> the
> >> savings (I mean: which part of the computation of the last few bits
> >> costs so much)?
> >> Does the saving vary significantly between instructions?
> >
> > I think that is pretty much bogus, Mitch has shown repeatedly
> > that having FMAC makes subnormal handling nearly free, as in zero
> > cycles and single-digit percentage gates/power. 400% is simply
> > rubbish.
<
Who here has actually built a floating point unit at the gate level (FPGA
does not count) ??
<
> I am guessing that double-rounding FMADD would not provide as much
> savings. Single-rounding is typically assumed, but I think the
> original implementation on MIPS did double rounding (perhaps to
> provide the same result as unfused FMUL and FADD).
<
Various people have espoused several points of view recently,
including one who has actually built an FPU at the gate (and
transistor) level !
<
If you go to:: https://repositories.lib.utexas.edu/handle/2152/3082
you can get the whole story.
<
Basically, the Tree is about ½ of the FMAC unit. Since the tree is
53-bits tall, and CEIL( ln2( 53 ) ) = 8; In order to get the last bit
sort of correct, one cannot chop the tree to less than 61 bits. Thus
one ends up with a 53×53 triangle, and then 8 more bits with
dimensions and angles of (53 up, 8 horizontal, 45 down, and
8 at 45°). This ends up being closer to 63% of the tree size compared
to a full 53×53 parallelogram.
<
Given that the tree was 50% of the unit and is now 50%×63% = 32%, the
unit is now 82% as big as it used to be. A savings of 18%.
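
As a rough cross-check of these figures, one can count partial-product bits directly. This sketch assumes a plain 53×53 array of AND terms with no Booth recoding, takes the 8 extra columns as given, and counts the middle (tallest) column as part of the triangle. The counting convention is an assumption, so the percentage comes out slightly different from the 63% quoted.

```python
N = 53        # significand width in bits
GUARD = 8     # extra columns kept below the middle (figure taken from the post)


def column_height(k: int, n: int = N) -> int:
    """Number of partial-product bits a_i & b_j with i + j == k."""
    return max(0, min(k, n - 1) - max(0, k - (n - 1)) + 1)


full = sum(column_height(k) for k in range(2 * N - 1))
kept = sum(column_height(k) for k in range(N - 1 - GUARD, 2 * N - 1))

tree_fraction = kept / full
unit_fraction = 0.5 + 0.5 * tree_fraction   # tree assumed to be half the unit

print(full, kept, round(tree_fraction, 3), round(unit_fraction, 3))
```

With these assumptions the tree shrinks to roughly 65% of its full size and the unit to about 82%, in line with the estimate above.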
<
This is a very far cry from 400%.
>
> (I kind of wonder what the costs would have been to implement
> floating point multiplication as a high power-of-two bits
> truncated result (i.e., ignoring carry-in from lower bits). I tend
> to agree that the benefit of standardization limits different
> interfaces to specialized uses, but there may have been a better
> alternative. I also think that the probabilistic rounding that Nick
> Maclaren championed (a little) would have been interesting and
> reduced or exposed some problems with floating point.)
<
At the time IEEE 754 was being done (the first version) there were
2 kinds of FP users, those who wanted fast FP (zillions) and those
who wanted correct answers (5--yes 5, maybe 6). In a quirky turn of
fate, 5 (who did not buy their own computers) won out over zillions
(who were willing to spend multi-millions per computer).
<
And now we are condemned to follow suit.

Re: Power cost of IEEE754

<tduucp$2j1ao$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27470&group=comp.arch#27470

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Sun, 21 Aug 2022 20:49:44 -0700
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <tduucp$2j1ao$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
<tdun4o$2ifk6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 22 Aug 2022 03:49:45 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="ce3fc0f90b680eacb2a9886b8dd42f99";
logging-data="2721112"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Mpii8tfvwuxbTPtUBl0fE"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.12.0
Cancel-Lock: sha1:bR30ftpx5wikybHUOJF/Q60T1yw=
In-Reply-To: <tdun4o$2ifk6$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Mon, 22 Aug 2022 03:49 UTC

On 8/21/2022 6:45 PM, Paul A. Clayton wrote:
> Terje Mathisen wrote:
>> Stefan Monnier wrote:
>>>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
>>>>       with IEEE754 FP accuracy will punish you with a 400% power/area
>>>>       penalty compared to modern 3D-optimised GPUs which, as explicitly
>>>
>>> What is the origin of this extra cost?  IOW, where's the meat of the
>>> savings (I mean: which part of the computation of the last few bits
>>> costs so much)?
>>> Does the saving vary significantly between instructions?
>>
>> I think that is pretty much bogus, Mitch has shown repeatedly that
>> having FMAC makes subnormal handling nearly free, as in zero cycles
>> and single-digit percentage gates/power. 400% is simply rubbish.
>
> I am guessing that double-rounding FMADD would not provide as much
> savings. Single-rounding is typically assumed, but I think the
> original implementation on MIPS did double rounding (perhaps to
> provide the same result as unfused FMUL and FADD).
>
> (I kind of wonder what the costs would have been to implement
> floating point multiplication as a high power-of-two bits
> truncated result (i.e., ignoring carry-in from lower bits). I tend
> to agree that the benefit of standardization limits different
> interfaces to specialized uses, but there may have been a better
> alternative. I also think that the probabilistic rounding that Nick
> Maclaren championed (a little) would have been interesting and
> reduced or exposed some problems with floating point.)
>

I originally thought stochastic rounding was a great idea and put it in.
Kahan talked me out of it.

Re: Power cost of IEEE754

<tdvpn5$2lci2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27475&group=comp.arch#27475

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Mon, 22 Aug 2022 07:36:03 -0400
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <tdvpn5$2lci2$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
<tdun4o$2ifk6$1@dont-email.me> <tduucp$2j1ao$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 22 Aug 2022 11:36:05 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="f76e7c4bd707e9c8261eeda6e96a2000";
logging-data="2798146"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18GsxELc2OOWwIdUfrlcFZc8VxCgi/ZkSw="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.0
Cancel-Lock: sha1:sgx4bOSJxgQIqDtXrqGM4Dw9q8k=
In-Reply-To: <tduucp$2j1ao$1@dont-email.me>
 by: Paul A. Clayton - Mon, 22 Aug 2022 11:36 UTC

Ivan Godard wrote:
> On 8/21/2022 6:45 PM, Paul A. Clayton wrote:
[snip]
>> alternative. I also think that the probabilistic rounding that Nick
>> Maclaren championed (a little) would have been interesting and
>> reduced or exposed some problems with floating point.)
>>
>
> I originally thought stochastic rounding was a great idea and put
> it in. Kahan talked me out of it.

I suspect Kahan was concerned with accuracy and provability while
Nick Maclaren was concerned with ordinary use where providing
jitter could either (rarely?) avoid some issues with practical
function (getting a meaningfully correct result) or make more
visible the existence of numerical issues with the computation. I
*think* Nick Maclaren considered variable results a positive
feature (i.e., FP should not be treated as integer with bit-exact
results).

Without hardware that supports this as an option, examining the
effects becomes much more difficult. Getting others to run their
code in real life conditions seems important for getting enough
data with fewer systematic 'errors'; even with a roughly equally
fast and reasonably available option, motivating such testing
seems likely to be challenging.
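
For experimenting without hardware support, stochastic rounding can be emulated in software, slowly. A minimal sketch, assuming Python 3.9+ for `math.nextafter` and using exact rationals for the unrounded value; `stochastic_round` is a hypothetical helper, not an existing library call.

```python
import math
import random
from fractions import Fraction


def stochastic_round(x: Fraction, rng: random.Random) -> float:
    """Round x to one of its two neighbouring doubles, with probability
    proportional to how close x is to each."""
    near = float(x)                       # round-to-nearest result
    if Fraction(near) == x:
        return near                       # exactly representable
    if Fraction(near) < x:
        lo, hi = near, math.nextafter(near, math.inf)
    else:
        lo, hi = math.nextafter(near, -math.inf), near
    p_up = (x - Fraction(lo)) / (Fraction(hi) - Fraction(lo))
    return hi if rng.random() < p_up else lo


rng = random.Random(1)
x = Fraction(1) + Fraction(1, 2 ** 54)    # a quarter of an ULP above 1.0
samples = [stochastic_round(x, rng) for _ in range(10_000)]
print(sum(s > 1.0 for s in samples) / len(samples))   # expected to be near 0.25
```

Round-to-nearest would return 1.0 every time for this input; the stochastic version rounds up about a quarter of the time, so the average over many runs tracks the exact value.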

Re: Power cost of IEEE754

<tdvs11$18fj$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=27477&group=comp.arch#27477

Path: i2pn2.org!i2pn.org!aioe.org!10O9MudpjwoXIahOJRbDvA.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Power cost of IEEE754
Date: Mon, 22 Aug 2022 14:15:35 +0200
Organization: Aioe.org NNTP Server
Message-ID: <tdvs11$18fj$1@gioia.aioe.org>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
<tdun4o$2ifk6$1@dont-email.me> <tduucp$2j1ao$1@dont-email.me>
<tdvpn5$2lci2$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="41459"; posting-host="10O9MudpjwoXIahOJRbDvA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.13
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Mon, 22 Aug 2022 12:15 UTC

Paul A. Clayton wrote:
> Ivan Godard wrote:
>> On 8/21/2022 6:45 PM, Paul A. Clayton wrote:
> [snip]
>>> alternative. I also think that the probabilistic rounding that Nick
>>> Maclaren championed (a little) would have been interesting and
>>> reduced or exposed some problems with floating point.)
>>>
>>
>> I originally thought stochastic rounding was a great idea and put it
>> in. Kahan talked me out of it.
>
> I suspect Kahan was concerned with accuracy and provability while Nick
> Maclaren was concerned with ordinary use where providing jitter could
> either (rarely?) avoid some issues with practical function (getting a
> meaningfully correct result) or make more visible the existence of
> numerical issues with the computation. I *think* Nick Maclaren
> considered variable results a positive feature (i.e., FP should not be
> treated as integer with bit-exact results).
>
> Without hardware that supports this as an option, examining the effects
> becomes much more difficult. Getting others to run their code in real
> life conditions seems important for getting enough data with fewer
> systematic 'errors'; even with a roughly equally fast and reasonably
> available option, motivating such testing seems likely to be challenging.

The best you can do these days to determine if you are close to the
edge of stability is to either rerun the calculation with a different fixed
rounding mode (truncate/floor/ceil), which will take the exact same time,
or you can manually truncate/jitter one or more of the bottom mantissa
bits at regular points in the calculation, but this requires source code
changes.
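
The second option can be sketched in a few lines. This assumes Python 3.9+ for `math.nextafter`; the `jitter` helper and the test sum are made up for illustration, and applying the jitter after every addition is only one of many possible choices.

```python
import math
import random


def jitter(x: float, rng: random.Random) -> float:
    """Return x moved by -1, 0, or +1 ULP, chosen at random."""
    step = rng.choice((-1, 0, 1))
    if step == 0 or not math.isfinite(x):
        return x
    return math.nextafter(x, math.inf if step > 0 else -math.inf)


def jittered_sum(values, rng: random.Random) -> float:
    """Sum values left to right, perturbing the running total each step."""
    total = 0.0
    for v in values:
        total = jitter(total + v, rng)
    return total


values = [1e16, 1.0, -1e16, 1.0] * 100     # cancellation-heavy on purpose
rng = random.Random(0)
runs = [jittered_sum(values, rng) for _ in range(20)]
print(min(runs), max(runs))                # a wide spread hints at instability
```

A well-conditioned calculation should give nearly the same answer on every run; a large spread between runs suggests the result depends on the last few mantissa bits.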

I can see both sides of Nick's argument:

a) It is good to teach FP programmers to never expect bit-exact
reproducibility, and instead verify error bars.

b) Having bit-reproducibility makes verification far easier.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Power cost of IEEE754

<55c4f336-ba24-4812-a9cc-b313a62ee5d2n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27478&group=comp.arch#27478

X-Received: by 2002:a05:6214:2387:b0:496:c9db:82b0 with SMTP id fw7-20020a056214238700b00496c9db82b0mr10225862qvb.111.1661171992548;
Mon, 22 Aug 2022 05:39:52 -0700 (PDT)
X-Received: by 2002:a05:6214:2aa2:b0:477:1882:3dc with SMTP id
js2-20020a0562142aa200b00477188203dcmr15331314qvb.11.1661171992386; Mon, 22
Aug 2022 05:39:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 22 Aug 2022 05:39:52 -0700 (PDT)
In-Reply-To: <3888602b-9fcb-4c6b-9cb1-b3ae8d3c4fe7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.181.56; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.181.56
References: <t6gush$p5u$1@dont-email.me> <0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org> <tdiql4$96b$1@gioia.aioe.org>
<tdun4o$2ifk6$1@dont-email.me> <3888602b-9fcb-4c6b-9cb1-b3ae8d3c4fe7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <55c4f336-ba24-4812-a9cc-b313a62ee5d2n@googlegroups.com>
Subject: Re: Power cost of IEEE754
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Mon, 22 Aug 2022 12:39:52 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2530
 by: luke.l...@gmail.com - Mon, 22 Aug 2022 12:39 UTC

On Monday, August 22, 2022 at 3:18:19 AM UTC+1, MitchAlsup wrote:

> At the time IEEE 754 was being done (the first version) there were
> 2 kinds of FP users, those who wanted fast FP (zillions) and those
> who wanted correct answers (5--yes 5, maybe 6).

3D GPUs as you know are still focussed on speed (and power)
due to simple pragmatic issues that the screen resolutions simply
don't need high-accuracy to work out where a pixel goes. i.e.: if the
screen is 1280x800 you certainly don't need accuracy above
~14 bits, it is wasted silicon and power. this was why MIPS 3D ASE
ops were designed with low-accuracy.
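
A back-of-envelope check of that figure. The split into whole-pixel bits plus a few sub-pixel bits is an assumption for illustration, as is the choice of 3 sub-pixel bits; the post itself only gives the ~14-bit total.

```python
import math

width, height = 1280, 800
pixel_bits = math.ceil(math.log2(max(width, height)))   # bits to address one pixel
subpixel_bits = 3                                        # assumed, for illustration
print(pixel_bits, pixel_bits + subpixel_bits)
```

Under that assumption, 11 bits locate a pixel along the longer axis and the rest is sub-pixel placement, well short of the 24-bit significand of FP32.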

the *influence* of 3D GPU manufacturers although they are
small in number is much higher (behind closed doors such
as the Khronos Group) and so we just don't hear about
it.

Tom Forsyth on the other hand in his talk on Larrabee mentions
that they concentrated on FP32, missed the goal of being a
commercially-viable 3D GPU Card (performance and power)
so attempted to target the scientific market instead, only to
be told that the FP64 performance sucked and consequently
they fell between two stools.

wark-wark.

damned if you do, damned if you don't.

l.
