devel / comp.arch / Re: Why separate 32-bit arithmetic on a 64-bit architecture?

Subject / Author
* Why separate 32-bit arithmetic on a 64-bit architecture?Thomas Koenig
+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?BGB
|`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?David Brown
| +- Re: Why separate 32-bit arithmetic on a 64-bit architecture?BGB
| `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
|  `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
|`- Re: Why separate 32-bit arithmetic on a 64-bit architecture?BGB
+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
|`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Marcus
| +- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Marcus
| `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?EricP
|+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Thomas Koenig
||`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Thomas Koenig
|| `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?EricP
||  `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Thomas Koenig
||   `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||    +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Terje Mathisen
||    |`* The cost of gradual underflow (was: Why separate 32-bit arithmetic on a 64-bit aStefan Monnier
||    | `- Re: The cost of gradual underflowTerje Mathisen
||    +- Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||    `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?antispam
||     +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Terje Mathisen
||     |`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Terje Mathisen
||     | |`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | | +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     | | |`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | | | `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     | | |  `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | | |   +- Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     | | |   `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
||     | | `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Terje Mathisen
||     | |  `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | |   `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     | |    `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | |     `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     | |      `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | |       `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     | |        `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     | |         +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Thomas Koenig
||     | |         |+- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     | |         |`- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     | |         `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Terje Mathisen
||     | `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     |  +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  |+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||`- Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     |  |+* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     |  ||`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  || `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     |  ||  `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     |  ||   |`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Thomas Koenig
||     |  ||   | `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     |  ||   |  +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
||     |  ||   |  |+- Re: Why separate 32-bit arithmetic on a 64-bit architecture?EricP
||     |  ||   |  |`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
||     |  ||   |  | `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     |  ||   |  `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   |   +- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     |  ||   |   +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   |   |`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?George Neuner
||     |  ||   |   | `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   |   `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   |    +- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   |    +- Spectre ane EPIC (was: Why separate 32-bit arithmetic...)Anton Ertl
||     |  ||   |    +* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     |  ||   |    |`* Spectre (was: Why separate 32-bit arithmetic ...)Anton Ertl
||     |  ||   |    | +* Re: Spectre (was: Why separate 32-bit arithmetic ...)Michael S
||     |  ||   |    | |+* Re: SpectreEricP
||     |  ||   |    | ||+* Re: SpectreMitchAlsup
||     |  ||   |    | |||`* Re: SpectreEricP
||     |  ||   |    | ||| `- Re: SpectreMitchAlsup
||     |  ||   |    | ||`- Re: SpectreAnton Ertl
||     |  ||   |    | |`* Re: Spectre (was: Why separate 32-bit arithmetic ...)Anton Ertl
||     |  ||   |    | | +* Re: Spectre (was: Why separate 32-bit arithmetic ...)MitchAlsup
||     |  ||   |    | | |`- Re: Spectre (was: Why separate 32-bit arithmetic ...)Thomas Koenig
||     |  ||   |    | | `- Re: Spectre (was: Why separate 32-bit arithmetic ...)Anton Ertl
||     |  ||   |    | +* Re: SpectreEricP
||     |  ||   |    | |`* Re: SpectreAnton Ertl
||     |  ||   |    | | +* Memory encryption (was: Spectre)Thomas Koenig
||     |  ||   |    | | |`* Re: Memory encryption (was: Spectre)Anton Ertl
||     |  ||   |    | | | `* Re: Memory encryption (was: Spectre)Elijah Stone
||     |  ||   |    | | |  +- Re: Memory encryption (was: Spectre)Michael S
||     |  ||   |    | | |  `* Re: Memory encryption (was: Spectre)Anton Ertl
||     |  ||   |    | | |   +- Re: Memory encryption (was: Spectre)MitchAlsup
||     |  ||   |    | | |   `* Re: Memory encryption (was: Spectre)Thomas Koenig
||     |  ||   |    | | |    `- Re: Memory encryption (was: Spectre)Anton Ertl
||     |  ||   |    | | `* Re: SpectreTerje Mathisen
||     |  ||   |    | |  `* Re: SpectreThomas Koenig
||     |  ||   |    | |   +* Re: SpectreAnton Ertl
||     |  ||   |    | |   |`* Re: SpectreThomas Koenig
||     |  ||   |    | |   | +- Re: SpectreAnton Ertl
||     |  ||   |    | |   | `- Re: SpectreMichael S
||     |  ||   |    | |   `- Re: SpectreMitchAlsup
||     |  ||   |    | `* Re: Spectre (was: Why separate 32-bit arithmetic ...)MitchAlsup
||     |  ||   |    |  `- Re: Spectre (was: Why separate 32-bit arithmetic ...)Anton Ertl
||     |  ||   |    `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   |     `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc
||     |  ||   `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
||     |  |+- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Bill Findlay
||     |  |`* Re: Imprecision, was Why separate 32-bit arithmetic on a 64-bit architecture?John Levine
||     |  `- Re: Why separate 32-bit arithmetic on a 64-bit architecture?Michael S
||     `* Re: Why separate 32-bit arithmetic on a 64-bit architecture?MitchAlsup
|`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Anton Ertl
`* Re: Why separate 32-bit arithmetic on a 64-bit architecture?Quadibloc

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<2022Jan27.184124@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23175&group=comp.arch#23175

 by: Anton Ertl - Thu, 27 Jan 2022 17:41 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>Anton Ertl wrote:
>> MitchAlsup <MitchAlsup@aol.com> writes:
>>> On Wednesday, January 26, 2022 at 12:13:38 PM UTC-6, Thomas Koenig wrote:
>>>> EricP <ThatWould...@thevillage.com> schrieb:
>>>>> They left out the byte and word load and stores, supposedly because
>>>>> that would require the byte shifter network on the critical path.
>>
>> The story I have heard is that they required ECC for write-back caches
>> (and parity for write-through caches); while the first implementations
>> had write-through D-caches, they designed the architecture for
>> implementations with write-back D-caches, and there byte stores would
>> have cost either ECC on the byte level (50% overhead compared to ~20%
>> for ECC on 32-bit units) or implementing stores by performing a
>> read-modify-write of a larger unit.
>
>That seems reasonable too, except Alpha has 32 bit stores and,
>assuming they used the standard off-the-shelf 64+8 bit ECC DRAM,
>those dword stores would require a read-modify-write memory cycle
>and recomputing SECDED ECC.

I don't know how they implemented it, but they had a write-back L2
cache on the 21064 and 21164, and already a write-back L1 on the
21264. For Alpha without BWX I would then store 32-bit units + 5 bits
ECC in the write-back cache, and when a cache line is evicted from the
write-back cache, check and correct the ECC of these units and
generate ECC for 64-bit units, resulting in stuff that fits in
standard ECC DRAM.
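
The check-bit counts behind the overhead figures quoted above follow from
the Hamming SECDED bound: the smallest r with 2^r >= d + r + 1, plus one
bit for double-error detection. A quick back-of-the-envelope sketch in C,
nothing Alpha-specific:

#include <stdio.h>

/* Hamming SEC needs the smallest r with 2^r >= d + r + 1;
   one extra check bit turns it into SECDED. */
static int secded_bits(int d)
{
    int r = 0;
    while ((1 << r) < d + r + 1)
        r++;
    return r + 1;
}

int main(void)
{
    int widths[] = { 8, 32, 64 };
    for (int i = 0; i < 3; i++) {
        int d = widths[i], c = secded_bits(d);
        printf("%2d data bits: %d check bits, %.1f%% overhead\n",
               d, c, 100.0 * c / d);
    }
    return 0;   /* 8 -> 5 (62.5%), 32 -> 7 (21.9%), 64 -> 8 (12.5%) */
}

SEC alone needs 4 bits per byte (the 50% figure); adding double-error
detection makes it 5. For 32-bit units it is 7 bits (~20%), and the
standard 64+8 layout is the 12.5% case.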

I did not look closely at the RAM of our Alphas, but my impression was
that they used pretty standard PC stuff. Just looked up information
for our first Alpha generation, which used an AlphaPC64 board
<https://www.manualslib.com/manual/667049/Digital-Equipment-Alphapc64.html?page=9#manual>:

|The AlphaPC64 memory subsystem supports DRAM memory arrays of 16MB to
|512MB with a 128-bit data bus. The memory is contained in two banks of
|four commodity single inline memory modules (SIMMs). Each SIMM is 36
|bits wide, with 32 data bits, 1 parity bit, and 3 unused bits with
|70-ns or less access.

So they did not go to the complications I imagined, but they also did
not give us ECC (it was not a DEC machine, so I guess the ECC
requirement did not apply).

>From the text below, it would seem
>that they used a different ECC applied to 32 bit words,
>which implies they did not used standard 72 bit DRAM DIMMs
>(which in turn would have hit them on main memory cost).

At least the AlphaPC64 was apparently too early for DIMMs.

>> Funnily, they introduced byte stores before they introduced an
>> implementation with a write-back D-cache.
>
>And according to DEC benchmarks, code got faster and smaller
>so it would seem that many of the assumptions used to
>justify leaving out byte/word were not correct.
>
>Maybe later implementations switched to using 72 bit DIMMs
>so it was doing the RMW for 32-bit dwords anyway.

Looking at <https://www.omnistep.com/~advantag/matrix.htm>, the
AlphaPC164 used parity FPM (fast page mode) DRAM in the form of
SIMMs, while the AlphaPC164LX used ECC SDRAM DIMMs.

My design for byte stores in the 21164A (EV56) would be: if the cache
line is not in L1, load it into L1 (and have byte parity there), and
the store waits until the cache line is there (but the common case is
that there is no need to wait). Then replace the stored byte(s), and
write the cache line (or at least 64-bit parcels) into the L2,
generating ECC on the way. When a cache line is evicted from L2, you
have the 64-bit parcels ready, including ECC. So essentially the
write-through D-cache provides the RMW functionality necessary for
ECC.

What they apparently did in the AlphaPC164 is a memory controller that
worked with 32-bit parcels on the DRAM side and therefore threw the
ECC away (or checked it) and used parity instead on writing to DRAM;
on reading from DRAM the memory controller checked the parity and then
faked the ECC data.

On the AlphaPC164LX they changed to 64-bit ECC SDRAM DIMMs, which is
better aligned to (my guess of) the L2 representation of the data, and
therefore they now used ECC functionality in DRAM.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<jwvwnilui61.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23176&group=comp.arch#23176

 by: Stefan Monnier - Thu, 27 Jan 2022 18:52 UTC

> and recomputing SECDED ECC. From the text below, it would seem
> that they used a different ECC applied to 32 bit words,
> which implies they did not used standard 72 bit DRAM DIMMs
> (which in turn would have hit them on main memory cost).

IIRC they use 64bit-bundle ECCs in the DRAM, and only supported the
32bit granularity in cache accesses, not at the DRAM interface.

Stefan

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<ssuqde$cek$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23177&group=comp.arch#23177

 by: Ivan Godard - Thu, 27 Jan 2022 19:05 UTC

On 1/27/2022 7:24 AM, EricP wrote:
> Anton Ertl wrote:
>> MitchAlsup <MitchAlsup@aol.com> writes:
>>> On Wednesday, January 26, 2022 at 12:13:38 PM UTC-6, Thomas Koenig
>>> wrote:
>>>> EricP <ThatWould...@thevillage.com> schrieb:
>>>>> They left out the byte and word load and stores, supposedly because
>>>>> that would require the byte shifter network on the critical path.
>>
>> The story I have heard is that they required ECC for write-back caches
>> (and parity for write-through caches); while the first implementations
>> had write-through D-caches, they designed the architecture for
>> implementations with write-back D-caches, and there byte stores would
>> have cost either ECC on the byte level (50% overhead compared to ~20%
>> for ECC on 32-bit units) or implementing stores by performing a
>> read-modify-write of a larger unit.
>
> That seems reasonable too, except Alpha has 32 bit stores and,
> assuming they used the standard off-the-shelf 64+8 bit ECC DRAM,
> those dword stores would require a read-modify-write memory cycle
> and recomputing SECDED ECC. From the text below, it would seem
> that they used a different ECC applied to 32 bit words,
> which implies they did not used standard 72 bit DRAM DIMMs
> (which in turn would have hit them on main memory cost).
>
> Richard Sites, credited as an Alpha co-architect, gives reasons in
> "Alpha AXP Architecture" 1993 and supports both views:
>
> "The byte load/store instructions and unaligned accesses found in
> some RISC architectures can be a performance bottleneck. They require
> an extra byte shifter in the speed-critical load and store paths,
> and they force a hard choice in fast cache design."
> ...
> "On a previous project involving a MIPS implementation, we found the
> shifter for the load-left/load-right instructions to be a direct
> cycle-time bottleneck. Also, the VAX 8700 implementation (circa 1986)
> removed the byte shifter in the load/store hardware in favor of a faster
> microcycle, with 2 cycles for a byte load and 6 cycles for an unaligned
> 32-bit access. This decision achieved a net performance gain.
> Our experience encouraged us to avoid byte load/store.
>
> An additional problem with byte stores is that an implementer may easily
> choose only two of the three design features: fast write-back cache,
> single-bit error correction code (ECC), or byte stores.
>
> Byte stores are straightforward in simple byte parity write-through cache
> implementations. Except for the expensive design of four or five ECC bits
> for every eight bits of data, a byte store to a fast ECC write-back cache
> involves
> 1. Reading an entire cache word
> 2. Checking the ECC bits and correcting any single bit error
> 3. Modifying the byte
> 4. Calculating the new ECC bits
> 5. Writing the entire cache word
>
> This read-modify-write sequence requires hidden sequencer hardware and
> hidden state to hold the cache word temporarily. The sequencer tends to
> slow down ordinary full-cache-word stores. The need for byte stores tends
> to ripple throughout the memory subsystem design, making each piece a
> little more complicated and a little slower. With non-replicated
> hidden state, it is difficult to issue another byte store until the
> first one finishes."
>
>> Funnily, they introduced byte stores before they introduced an
>> implementation with a write-back D-cache.
>
> And according to DEC benchmarks, code got faster and smaller
> so it would seem that many of the assumptions used to
> justify leaving out byte/word were not correct.
>
> Maybe later implementations switched to using 72 bit DIMMs
> so it was doing the RMW for 32-bit dwords anyway.
>
>> I guess that these days, with a relatively large write-combining store
>> buffer, the RMW cost is acceptable.
>
> x86 has its virtually tagged write combine buffers.
> I don't know if anyone else does.

<sticks up hand>

falls out of cache-in-virtual

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<ssvo55$a3i$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23188&group=comp.arch#23188

 by: Brett - Fri, 28 Jan 2022 03:32 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Wednesday, January 26, 2022 at 7:38:07 PM UTC-6, gg...@yahoo.com wrote:
>> EricP <ThatWould...@thevillage.com> wrote:
>>> Anton Ertl wrote:
>>>> EricP <ThatWould...@thevillage.com> writes:
>>>>> Alpha didn't have SEXT/ZEXT sign/zero extend instructions
>>>>> so it would have required a pair of shifts.
>>>>
>>>> Alpha uses addl for sign extension and some zap instruction for zero
>>>> extension. If they had not added addl, they could have added an sext
>>>> instruction instead.
>>>
>>> There were lots of instructions they could have had, but they seemed
>>> almost obsessive that the R in RISC means reduced instruction _count_.
>>>
>>> They left out the byte and word load and stores, supposedly because
>>> that would require the byte shifter network on the critical path.
>>> I think that was a disastrous decision that contributed greatly
>>> to porting difficulties and lack of wide market acceptance.
>>> Once it is established in people's minds that it is difficult to work with,
>>> or more trouble than it's worth, it is tough to come back from.
>>> In effect, they created a market barrier to themselves.
>>>
>>> And when they finally did add byte and word load and store,
>>> load byte and word only did zero extend not sign extend
>>> supposedly because they did not want to put the sign extension
>>> logic into the load critical path. But still, WTF!
> <
>> Loads are so variable that they should have bit the bullet and supported 2
>> cycle signed loads in addition to 1 cycle unsigned loads. RISC religion
>> stupidity.
> <
> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?

I rounded up to integer cycle counts, the same as two instructions (load and
extend), except now as one instruction, saving instruction cache. So you can
crack it, or split the load path at the extender unit. Perhaps only one of
the two load units would support extension if you split the load. If that
load port is full you can do a late crack.

>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
>>> a bit position would not have affected the cycle time.
>
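
For reference, the extension idioms being discussed, in portable C rather
than Alpha assembly (the shift pair stands in for the missing sign-extending
loads; the mask is the zap-style zero extension; the 32-bit case is what the
sign-extending addl with a zero operand gives you, per Anton's note quoted
above):

#include <stdint.h>
#include <stdio.h>

static int64_t sext8_shiftpair(uint64_t x)
{
    return (int64_t)(x << 56) >> 56;    /* shift left, then arithmetic right */
}

static uint64_t zext8_mask(uint64_t x)
{
    return x & 0xFF;                    /* keep byte 0, clear the rest */
}

static int64_t sext32(uint64_t x)
{
    return (int64_t)(int32_t)x;         /* 32-bit result sign-extended to 64 */
}

int main(void)
{
    uint64_t v = 0xF0;                  /* 0xF0 is -16 as a signed byte */
    printf("%lld %llu %lld\n", (long long)sext8_shiftpair(v),
           (unsigned long long)zext8_mask(v), (long long)sext32(0xFFFFFFFFull));
    return 0;                           /* prints -16 240 -1 */
}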

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<DdXIJ.50253$gX.47600@fx40.iad>

https://www.novabbs.com/devel/article-flat.php?id=23198&group=comp.arch#23198

 by: EricP - Fri, 28 Jan 2022 19:11 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> Anton Ertl wrote:
>>> MitchAlsup <MitchAlsup@aol.com> writes:
>>>> On Wednesday, January 26, 2022 at 12:13:38 PM UTC-6, Thomas Koenig wrote:
>>>>> EricP <ThatWould...@thevillage.com> schrieb:
>>>>>> They left out the byte and word load and stores, supposedly because
>>>>>> that would require the byte shifter network on the critical path.
>>> The story I have heard is that they required ECC for write-back caches
>>> (and parity for write-through caches); while the first implementations
>>> had write-through D-caches, they designed the architecture for
>>> implementations with write-back D-caches, and there byte stores would
>>> have cost either ECC on the byte level (50% overhead compared to ~20%
>>> for ECC on 32-bit units) or implementing stores by performing a
>>> read-modify-write of a larger unit.
>> That seems reasonable too, except Alpha has 32 bit stores and,
>> assuming they used the standard off-the-shelf 64+8 bit ECC DRAM,
>> those dword stores would require a read-modify-write memory cycle
>> and recomputing SECDED ECC.
>
> I don't know how they implemented it, but they had a write-back L2
> cache on the 21064 and 21164, and already a write-back L1 on the
> 21264. For Alpha without BWX I would then store 32-bit units + 5 bits
> ECC in the write-back cache, and when a cache line is evicted from the
> write-back cache, check and correct the ECC of these units and
> generate ECC for 64-bit units, resulting in stuff that fits in
> standard ECC DRAM.

I was trying to figure out what kind of cache design could handle
aligned dword merges into caches lines but not handle byte merges,
thus justifying the original decision to elide byte/word load/store.
After all, if they have the cache line sitting right there,
what's the big deal about merging a byte into it.

I realized overnight that the L1 cache on the 21064 model must be
read allocate - it only loads a line into cache on read.
So stores can't depend on the line being present to merge bytes into,
so byte/word writes could require an extra read cycle.

Looking at the 21064 Hardware Reference Manual, a 32-bit aligned
dword store that misses the cache would write-thru
to perform just the necessary aligned dword write to DRAM,
with the parity or 7-bit ECC generated per dword by the processor.
There is also a 4 entry 32-byte write combine circular buffer.

Whole 32-byte write lines are transferred to cache in one or two
128-bit bus cycles, with 4 parity or 7 ECC bits per dword,
plus a 4-bit mask indicating which dwords are valid.
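
A toy C model of that write-combine buffer (entry count and line size from
the HRM description above; everything else invented for illustration): an
aligned dword store either merges into an open 32-byte entry or claims a new
one, and the per-dword valid mask is what accompanies the line when it is
transferred.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define WCB_ENTRIES 4
#define LINE_BYTES  32
#define DWORDS      (LINE_BYTES / 4)

struct wcb_entry {
    int      in_use;
    uint64_t line_addr;      /* 32-byte-aligned address */
    uint32_t dword[DWORDS];
    uint8_t  valid_mask;     /* one bit per dword; goes out as two 4-bit
                                halves with the two 128-bit bus cycles */
};

static struct wcb_entry wcb[WCB_ENTRIES];

static void wcb_store32(uint64_t addr, uint32_t data)
{
    uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
    int idx = (int)((addr >> 2) & (DWORDS - 1));

    for (int i = 0; i < WCB_ENTRIES; i++) {
        if (wcb[i].in_use && wcb[i].line_addr == line) {
            wcb[i].dword[idx] = data;                  /* merge into open entry */
            wcb[i].valid_mask |= (uint8_t)(1u << idx);
            return;
        }
    }
    for (int i = 0; i < WCB_ENTRIES; i++) {
        if (!wcb[i].in_use) {                          /* claim a free entry */
            memset(&wcb[i], 0, sizeof wcb[i]);
            wcb[i].in_use = 1;
            wcb[i].line_addr = line;
            wcb[i].dword[idx] = data;
            wcb[i].valid_mask = (uint8_t)(1u << idx);
            return;
        }
    }
    /* a real buffer would drain the oldest entry here and retry */
}

int main(void)
{
    wcb_store32(0x1000, 0x11111111);
    wcb_store32(0x1004, 0x22222222);                   /* merges into same entry */
    printf("line 0x1000 valid mask = 0x%02x\n", (unsigned)wcb[0].valid_mask);
    return 0;                                          /* prints 0x03 */
}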

If this had used a write-allocate write-thru cache or write-back
then I think any advantage of dropping byte/word instruction disappears.

> I did not look closely at the RAM of our Alphas, but my impression was
> that they used pretty standard PC stuff. Just looked up information
> for our first Alpha generation, which used an AlphaPC64 board
> <https://www.manualslib.com/manual/667049/Digital-Equipment-Alphapc64.html?page=9#manual>:
>
> |The AlphaPC64 memory subsystem supports DRAM memory arrays of 16MB to
> |512MB with a 128-bit data bus. The memory is contained in two banks of
> |four commodity single inline memory modules (SIMMs). Each SIMM is 36
> |bits wide, with 32 data bits, 1 parity bit, and 3 unused bits with
> |70-ns or less access.
>
> So they did not go to the complications I imagined, but they also did
> not give us ECC (it was not a DEC machine, so I guess the ECC
> requirement did not apply).
>
>>From the text below, it would seem
>> that they used a different ECC applied to 32 bit words,
>> which implies they did not used standard 72 bit DRAM DIMMs
>> (which in turn would have hit them on main memory cost).
>
> At least the AlphaPC64 was apparently too early for DIMMs.
>
>>> Funnily, they introduced byte stores before they introduced an
>>> implementation with a write-back D-cache.
>> And according to DEC benchmarks, code got faster and smaller
>> so it would seem that many of the assumptions used to
>> justify leaving out byte/word were not correct.
>>
>> Maybe later implementations switched to using 72 bit DIMMs
>> so it was doing the RMW for 32-bit dwords anyway.
>
> Looking at <https://www.omnistep.com/~advantag/matrix.htm>, the
> AlphaPC164 used parity FPM (IIRC fast path memory) DRAM in the form of
> SIMMs, while the AlphaPC164LX used ECC SDRAM DIMMs.
>
> My design for byte stores in the 21164A (EV56) would be: if the cache
> line is not in L1, load it into L1 (and have byte parity there), and
> the store waits until the cache line is there (but the common case is
> that there is no need to wait). Then replace the stored byte(s), and
> write the cache line (or at least 64-bit parcels) into the L2,
> generating ECC on the way. When a cache line is evicted from L2, you
> have the 64-bit parcels ready, including ECC. So essentially the
> write-through D-cache provides the RMW functionality necessary for
> ECC.

Yes, it works because you are using a write-allocate cache
so you always have data to merge bytes into.

There is also the extra logic for handling misaligned load/store
which I would put mostly in the LSQ (because it may straddle pages).
So what the cache sees from LSQ are one or two aligned 64-bit load/store
physical addresses with an 8 bit byte select mask.
The cache takes care of merging selected bytes into lines.
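
For what it's worth, that merge is only a few lines when written out; a C
sketch of "aligned 64-bit write under an 8-bit byte-select mask" (purely
illustrative, names invented):

#include <stdint.h>
#include <stdio.h>

/* Bit i of byte_mask selects byte lane i of the aligned qword. */
static uint64_t merge_bytes(uint64_t old_q, uint64_t new_q, uint8_t byte_mask)
{
    uint64_t lanes = 0;
    for (int i = 0; i < 8; i++)
        if (byte_mask & (1u << i))
            lanes |= 0xFFull << (8 * i);
    return (old_q & ~lanes) | (new_q & lanes);
}

int main(void)
{
    uint64_t line_q = 0x1122334455667788ull;
    /* a byte store of 0xAB at byte offset 3 of this qword */
    printf("%016llx\n",
           (unsigned long long)merge_bytes(line_q, 0xABull << 24, 0x08));
    return 0;                                   /* 11223344ab667788 */
}

A store that straddles two qwords (or two pages) is then just two such
merges, which is the one-or-two-accesses case above.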

> What they apparently did in the AlphaPC164 is a memory controller that
> worked with 32-bit parcels on the DRAM side and therefore threw the
> ECC away (or checked it) and used parity instead on writing to DRAM;
> on reading from DRAM the memory controller checked the parity and then
> faked the ECC data.

Yes, looks that way. Actually they say they only store 1 parity bit
per dword so it looks like they merge 4 parity to 1, store that and
leave 3 bits unused. For some unknown reason.

> On the AlphaPC164LX they changed to 64-bit ECC SDRAM DIMMs, wich is
> better aligned to (my guess of) the L2 representation of the data, and
> therefore they now used ECC functionality in DRAM.
>
> - anton

Their choice of doing 7 bits ECC per 32 bit dword doubles the overhead
and makes it incompatible with standard 72 bit SIMMs.
But it does allow them to sidestep the whole issue of merging
dword stores into qwords.

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<2022Jan28.231944@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23204&group=comp.arch#23204

 by: Anton Ertl - Fri, 28 Jan 2022 22:19 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
>If this had used a write-allocate write-thru cache or write-back
>then I think any advantage of dropping byte/word instruction disappears.

For write-back it still means either ECC on the byte level or reading
the rest of the unit to produce the combined ECC.

Write allocation was not that common in the early 1990s. E.g., the
first generation K6-2 had write-back, but without write allocation; a
later revision, the K6-2 CXT, added (or enabled) write allocation (in
1998).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<st27or$gvk$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23206&group=comp.arch#23206

 by: Ivan Godard - Sat, 29 Jan 2022 02:11 UTC

On 1/28/2022 2:19 PM, Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> If this had used a write-allocate write-thru cache or write-back
>> then I think any advantage of dropping byte/word instruction disappears.
>
> For write-back it still means either ECC on the byte level or reading
> the rest of the unit to produce the combined ECC.
>
> Write allocation was not that common in the early 1990s. E.g., the
> first generation K6-2 had write-back, but without write allocation; a
> later revision, the K6-2 CXT, added (or enabled) write allocation (in
> 1998).
>
> - anton

@Mitch - what cache policy is my66 using?

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<488d6998-13fc-4062-8663-8f97b146a24dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23207&group=comp.arch#23207

 by: MitchAlsup - Sat, 29 Jan 2022 03:45 UTC

On Friday, January 28, 2022 at 8:11:42 PM UTC-6, Ivan Godard wrote:
> On 1/28/2022 2:19 PM, Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> If this had used a write-allocate write-thru cache or write-back
> >> then I think any advantage of dropping byte/word instruction disappears.
> >
> > For write-back it still means either ECC on the byte level or reading
> > the rest of the unit to produce the combined ECC.
> >
> > Write allocation was not that common in the early 1990s. E.g., the
> > first generation K6-2 had write-back, but without write allocation; a
> > later revision, the K6-2 CXT, added (or enabled) write allocation (in
> > 1998).
> >
> > - anton
> @Mitch - what cache policy is my66 using?
<
Officially "every implementation gets to make its own choices", but TLBs
and table-walk accelerators are coherent. {Like Mill, there is no supervisor
state or instructions. If the MMU tables allow access you can "do it".
Certain pages accessed by HW have the TLB marked RWE=000 and Present=1
LDs, STs, and Fetches take page faults, ENTER and EXTRACT can access
data on the safe stack, HW can access machine and thread configuration
information,... And all control registers are memory mapped. }
<
However, I am, in general, a fan of exclusive caches at least at the L1 and
L2 levels. DRAM->L1 modified in L1->L2 modified in L2->DRAM.
<
Beyond the data routing, I prefer write allocate caches, using byte parity
at the L1 level and SECDED ECC at L2; configured for write-back operation.
However, the ST buffers would hold the data so it appears to have been written
while waiting on the line to arrive.
<
Back in 1993 the phrase was "If it is spelled with a D it gets ECC" D=DRAM.
At this current level of technology, "If it is spelled xRAM it should get ECC".
On a GBOoO implementation I am contemplating, entire cache lines are
brought into L0 (My conditional Cache == Memory reorder Buffer), played
around with for a while, then sent back to L1 with ECC--the same ECC used
throughout the entire fabric {DRAM, device RAM, PCIe message protection,
system interconnect,...}. The CC is fully associative and can deal with an
entire execution window of associativity. This puts it in a place to make
better routing decisions--such as the location in the L1 it came from was
displaced, so when you write it back, write it to L2 instead of L1. This takes
pressure off the amount of associativity needed in the L1 caches.
<
We had situations in MATRIX300 (back in 1991) where "where" the write
back data got written changed every other clock cycle in the CC; oscillating
between L1 and DRAM (no L2 back then, but Hitachi pseudo static DRAM
was only 5 cycles of bus latency at 100 MHz)
<
One of the things I learned in doing the Samsung GPU design was to use
the multi-layer metal now available as bus wires. There is no particular
reason that the 512-bits of a cache line take more than 1 beat from flip-
flop to flip-flop (and at double data rate so 256 wires in and 256 wires out.)
Also, if you have/get all 512 bits in a single cycle, it is a lot more feasible
to use a really protective ECC 3EC5ED on a per 64-bit DoubleWord.
<
In my small 1.3-wide implementation, I am using 24KB 3-way set caches because
I can fit 3 tags and states in a single SRAM word with 46-bit physical address.
My larger designs will likely be 4-way L1s sized to balance L1 perf with L2 latency.
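
Assuming 64-byte lines (an assumption, not stated above), the tag arithmetic
works out; a quick check in C:

#include <stdio.h>

static int log2i(int x) { int n = 0; while ((1 << n) < x) n++; return n; }

int main(void)
{
    const int cache_bytes = 24 * 1024, ways = 3, line_bytes = 64, pa_bits = 46;
    int sets     = cache_bytes / ways / line_bytes;            /* 128 */
    int tag_bits = pa_bits - log2i(sets) - log2i(line_bytes);  /* 46 - 7 - 6 = 33 */
    printf("%d sets, %d-bit tags, %d bits for 3 tags + state per SRAM word\n",
           sets, tag_bits, ways * tag_bits);
    return 0;   /* 128 sets, 33-bit tags, 99 tag bits leaves room for state */
}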

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<h6xJJ.33518$N31.21257@fx45.iad>

https://www.novabbs.com/devel/article-flat.php?id=23219&group=comp.arch#23219

 by: EricP - Sun, 30 Jan 2022 14:14 UTC

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>> If this had used a write-allocate write-thru cache or write-back
>> then I think any advantage of dropping byte/word instruction disappears.
>
> For write-back it still means either ECC on the byte level or reading
> the rest of the unit to produce the combined ECC.

Right, exactly. And byte level ECC means custom SIMMs and we don't
want that. So ECC implies 64 bit quadword writes so we can use standard
64+8 SIMMs, which implies reading the whole unit for dword writes,
which requires reading the whole 32 byte cache line.

(It could read part of a cache line, but that would inefficiently
use the DRAM RAS cycle which is the bulk of its access time,
and a partial bus transfer would waste much bus bandwidth.)

It would be wasteful to read the cache line into the cpu in order to do
that merge, write the changed line back, and then toss the cache line.
So for efficiency the cache retains the line, and that is write allocate.

And since it's got the whole line, there is little extra cost difference
to support byte/word accesses (misaligned accesses still do have
an extra cost).

To summarize, potential support for ECC memory and desire for efficiency
basically forces cache to write allocate, and means there is no reason
to elide the byte/word access instructions as it gains little or nothing.
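
Once the line is resident, the per-store sequence is the one quoted from
Sites earlier in the thread; a C sketch with the ECC code itself stubbed out
(illustration only, not a real SECDED implementation):

#include <stdint.h>
#include <stdio.h>

static uint8_t ecc_encode(uint64_t d)
{
    uint8_t e = 0;                       /* placeholder: XOR of the bytes,   */
    for (int i = 0; i < 8; i++)          /* standing in for real 64+8 SECDED */
        e ^= (uint8_t)(d >> (8 * i));
    return e;
}

static uint64_t ecc_check_correct(uint64_t d, uint8_t e)
{
    (void)e;                             /* placeholder: a real implementation
                                            would correct a single-bit error */
    return d;
}

/* Byte store into an ECC-protected cache word. */
static void byte_store(uint64_t *word, uint8_t *ecc, int offset, uint8_t value)
{
    uint64_t w = *word;                          /* 1. read the whole word    */
    w = ecc_check_correct(w, *ecc);              /* 2. check/correct ECC      */
    w &= ~(0xFFull << (8 * offset));             /* 3. modify the byte        */
    w |= (uint64_t)value << (8 * offset);
    *ecc = ecc_encode(w);                        /* 4. compute new check bits */
    *word = w;                                   /* 5. write the whole word   */
}

int main(void)
{
    uint64_t w = 0x1122334455667788ull;
    uint8_t  e = ecc_encode(w);
    byte_store(&w, &e, 0, 0xAA);
    printf("%016llx ecc=%02x\n", (unsigned long long)w, (unsigned)e);
    return 0;
}

The hidden sequencer and temporary state Sites warns about are the hardware
cost of steps 1 through 5; write allocate is what guarantees step 1 has
something to read.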

> Write allocation was not that common in the early 1990s. E.g., the
> first generation K6-2 had write-back, but without write allocation; a
> later revision, the K6-2 CXT, added (or enabled) write allocation (in
> 1998).
>
> - anton

Yes and those were mostly PC's which (a) already had byte/word instructions
and (b) very few had even parity memory. Doing individual byte writes
is less efficient but does not cost extra logic in the DRAM controller
provided SIMMs allow individual byte writes to be enabled,
and a write combine buffer helps with the efficiency.
So write-thru cache with read allocate is the least complex design
that supports the system functional requirements.
But add potential ECC to a PC and it changes that calculation.

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<jwvpmo9qgcb.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=23223&group=comp.arch#23223

 by: Stefan Monnier - Sun, 30 Jan 2022 17:38 UTC

> I was trying to figure out what kind of cache design could handle
> aligned dword merges into caches lines but not handle byte merges,
> thus justifying the original decision to elide byte/word load/store.

The original Alpha announcements insisted heavily on the architecture
being designed for the long term. So whether they could implement it
cheaply enough on the first implementation was definitely not the only
requirement for a feature to make the cut.

Based on the quote sent earlier from one of the designers, I think it's
pretty clear that they had had difficulty making byte-operations cheap
in the past (for a VAX CPU), so they apparently decided to resist the
temptation to add byte operations until they were proved to be really worth
the potential trouble. Same for precise exceptions.

Stefan

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<9ce2e83b-3370-495c-a4a1-e4074327e04dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23227&group=comp.arch#23227

 by: MitchAlsup - Sun, 30 Jan 2022 18:33 UTC

On Sunday, January 30, 2022 at 8:17:52 AM UTC-6, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> If this had used a write-allocate write-thru cache or write-back
> >> then I think any advantage of dropping byte/word instruction disappears.
> >
> > For write-back it still means either ECC on the byte level or reading
> > the rest of the unit to produce the combined ECC.
> Right, exactly. And byte level ECC means custom SIMMs and we don't
> want that. So ECC implies 64 bit quadword writes so we can use standard
> 64+8 SIMMs, which implies reading the whole unit for dword writes,
> which requires reading the whole 32 byte cache line.
>
> (It could read part of a cache line, but that would inefficiently
> use the DRAM RAS cycle which is the bulk of its access time,
> and a partial bus transfer would waste much bus bandwidth.)
>
> It would be wasteful to read the cache line into the cpu in order to do
> that merge, write the changed line back, and then toss the cache line.
> So for efficiency the cache retains the line, and that is write allocate.
>
> And since its got the whole line, there is little extra cost difference
> to support byte/word accesses (misaligned accesses still do have
> an extra cost).
>
> To summarize, potential support for ECC memory and desire for efficiency
> basically forces cache to write allocate, and means there is no reason
> to elide the byte/word access instructions as it gains little or nothing.
<
But, if the Memory+DRAM Controller has the ability to repair ECC read
errors from the DRAM accesses, you can write through on a miss and
let the M+D controller do its thing.
<
I am NOT suggesting that this is the proper course of events (I like
Write Back Caches) but it remains an option for certain scenarios
{Boot before caches are enabled,...}
<
> > Write allocation was not that common in the early 1990s. E.g., the
> > first generation K6-2 had write-back, but without write allocation; a
> > later revision, the K6-2 CXT, added (or enabled) write allocation (in
> > 1998).
> >
> > - anton
> Yes and those were mostly PC's which (a) already had byte/word instructions
> and (b) very few had even parity memory. Doing individual byte writes
> is less efficient but does not cost extra logic in the DRAM controller
> provided SIMMs allow individual byte writes to be enabled,
> and a write combine buffer helps with the efficiency.
> So write-thru cache with read allocate is the least complex design
> that supports the system functional requirements.
> But add potential ECC to a PC and it changes that calculation.

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<T0TJJ.23588$jb1.13706@fx46.iad>

https://www.novabbs.com/devel/article-flat.php?id=23243&group=comp.arch#23243

 by: EricP - Mon, 31 Jan 2022 15:13 UTC

MitchAlsup wrote:
> On Sunday, January 30, 2022 at 8:17:52 AM UTC-6, EricP wrote:
>> Anton Ertl wrote:
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> If this had used a write-allocate write-thru cache or write-back
>>>> then I think any advantage of dropping byte/word instruction disappears.
>>> For write-back it still means either ECC on the byte level or reading
>>> the rest of the unit to produce the combined ECC.
>> Right, exactly. And byte level ECC means custom SIMMs and we don't
>> want that. So ECC implies 64 bit quadword writes so we can use standard
>> 64+8 SIMMs, which implies reading the whole unit for dword writes,
>> which requires reading the whole 32 byte cache line.
>>
>> (It could read part of a cache line, but that would inefficiently
>> use the DRAM RAS cycle which is the bulk of its access time,
>> and a partial bus transfer would waste much bus bandwidth.)
>>
>> It would be wasteful to read the cache line into the cpu in order to do
>> that merge, write the changed line back, and then toss the cache line.
>> So for efficiency the cache retains the line, and that is write allocate.
>>
>> And since its got the whole line, there is little extra cost difference
>> to support byte/word accesses (misaligned accesses still do have
>> an extra cost).
>>
>> To summarize, potential support for ECC memory and desire for efficiency
>> basically forces cache to write allocate, and means there is no reason
>> to elide the byte/word access instructions as it gains little or nothing.
> <
> But, if the Memory+DRAM Controller have the ability to repair ECC read
> errors from the DRAM accesses, you can write through on a miss and
> let the M+D controller do its thing.

Actually I've always thought the best place for ECC was on the
DRAM chip itself so that error scavenging integrates with refresh.
When the refresh cycle reads a row, it would keep a copy and spend
some internal cycles checking each field of the 2kb or 4kb row.
Of course this requires a completely different DRAM interface,
and historically DRAM processes don't mix well with logic,
and package pin out was traditionally a limiting factor.

Other than that, one needs the refresh, scavenge, and normal access logic,
pending op queues, open page scheduler, and ECC all in the same place.
And since the memory controller is integrated on the processor chip,
that is where it all winds up.

That avoids crossing more clock domains and avoids going through more
metastability synchronizers. Also it allows very wide internal buses.

> I am NOT suggesting that this is the proper course of events (I like
> Write Back Caches) but it remains an option for certain scenarios
> {Boot before caches are enabled,...}

If I understand your scenario correctly, this would only partially
utilize the bus between the (disabled) cache and the memory controller,
moving 8 bytes in a 16- or 32-byte packet, which is fine for boot.
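
As a rough sketch of the read-modify-write merge being described above
(the ecc8() below is only a toy stand-in for a real 64+8 SECDED code;
the shape of the merge is the point):

#include <stdint.h>

static uint8_t ecc8(uint64_t d)          /* toy stand-in for 64+8 SECDED */
{
    uint8_t e = 0;
    for (int i = 0; i < 8; i++)
        e ^= (uint8_t)(d >> (8 * i));
    return e;
}

static void store_byte(uint64_t *mem, uint8_t *ecc,
                       uint64_t addr, uint8_t val)
{
    uint64_t u      = addr >> 3;                 /* 64-bit ECC unit       */
    unsigned shift  = (unsigned)(addr & 7) * 8;  /* byte lane in the unit */
    uint64_t merged = (mem[u] & ~(0xFFull << shift))  /* read whole unit  */
                    | ((uint64_t)val << shift);       /* merge the byte   */
    mem[u] = merged;                             /* write back the unit   */
    ecc[u] = ecc8(merged);                       /* recompute its ECC     */
}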

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<stf6qk$5bs$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23261&group=comp.arch#23261

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
Date: Thu, 3 Feb 2022 00:15:21 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 53
Message-ID: <stf6qk$5bs$1@dont-email.me>
References: <sso6aq$37b$1@newsreader4.netcologne.de>
<UPXHJ.6202$9O.4300@fx12.iad>
<2022Jan25.235313@mips.complang.tuwien.ac.at>
<5NdIJ.15873$mS1.13257@fx10.iad>
<ssst1s$pk9$1@dont-email.me>
<c7537b99-c3f7-45ad-a40f-2c38904b2a4cn@googlegroups.com>
<ssvo55$a3i$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 3 Feb 2022 00:15:21 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="038a3f35258ef44ad59c4ed08da60de2";
logging-data="5500"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/v+ANSNPowKA7hxxyyjPIh"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:Tnj8NUurwNMBJzM5jEtfklDtrh0=
sha1:RKpU+t/QWCSRezMdD9prVbJI2oQ=
 by: Brett - Thu, 3 Feb 2022 00:15 UTC

Brett <ggtgp@yahoo.com> wrote:
> MitchAlsup <MitchAlsup@aol.com> wrote:
>> On Wednesday, January 26, 2022 at 7:38:07 PM UTC-6, gg...@yahoo.com wrote:
>>> EricP <ThatWould...@thevillage.com> wrote:
>>>> Anton Ertl wrote:
>>>>> EricP <ThatWould...@thevillage.com> writes:
>>>>>> Alpha didn't have SEXT/ZEXT sign/zero extend instructions
>>>>>> so it would have required a pair of shifts.
>>>>>
>>>>> Alpha uses addl for sign extension and some zap instruction for zero
>>>>> extension. If they had not added addl, they could have added an sext
>>>>> instruction instead.
>>>>
>>>> There were lots of instructions they could have had, but they seemed
>>>> almost obsessive that the R in RISC means reduced instruction _count_.
>>>>
>>>> They left out the byte and word load and stores, supposedly because
>>>> that would require the byte shifter network on the critical path.
>>>> I think that was a disastrous decision that contributed greatly
>>>> to porting difficulties and lack of wide market acceptance.
>>>> Once it established in peoples minds that it is difficult to work with
>>>> or more trouble than its worth, it is tough to come back from.
>>>> In effect, they created a market barrier to themselves.
>>>>
>>>> And when they finally did add byte and word load and store,
>>>> load byte and word only did zero extend not sign extend
>>>> supposedly because they did not want to put the sign extension
>>>> logic into the load critical path. But still, WTF!
>> <
>>> Loads are so variable that they should have bit the bullet and supported 2
>>> cycle signed loads in addition to 1 cycle unsigned loads. RISC religion
>>> stupidity.
>> <
>> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
>> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?
>
> I rounded up to integer cycle counts, same as two instructions- load and
> extend, except now as one instruction saving instruction cache. So you can
> crack it, or split the load path at the extender unit. Perhaps only one of
> the two load units would support extension if you split the load. If that
> load port is full you can do a late crack.

Just realized that loads come back out of order and you have no control
over which load path is used, so you would split both load paths. This
causes some issues with needing more write ports to the ALU, but you
could queue these loads instead.

Is any of this reasonable, or am I too far out of my league?

>>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
>>>> a bit position would not have affected the cycle time.

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<6ba7592a-33a6-4c6b-ae7e-9d804c43c31fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23262&group=comp.arch#23262

X-Received: by 2002:a05:620a:3181:: with SMTP id bi1mr21960892qkb.691.1643847915854;
Wed, 02 Feb 2022 16:25:15 -0800 (PST)
X-Received: by 2002:a05:6808:689:: with SMTP id k9mr6197849oig.281.1643847915574;
Wed, 02 Feb 2022 16:25:15 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 2 Feb 2022 16:25:15 -0800 (PST)
In-Reply-To: <stf6qk$5bs$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:80d5:8b42:92ed:f97b;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:80d5:8b42:92ed:f97b
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad>
<2022Jan25.235313@mips.complang.tuwien.ac.at> <5NdIJ.15873$mS1.13257@fx10.iad>
<ssst1s$pk9$1@dont-email.me> <c7537b99-c3f7-45ad-a40f-2c38904b2a4cn@googlegroups.com>
<ssvo55$a3i$1@dont-email.me> <stf6qk$5bs$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6ba7592a-33a6-4c6b-ae7e-9d804c43c31fn@googlegroups.com>
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 03 Feb 2022 00:25:15 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 29
 by: MitchAlsup - Thu, 3 Feb 2022 00:25 UTC

On Wednesday, February 2, 2022 at 6:15:25 PM UTC-6, gg...@yahoo.com wrote:
> Brett <gg...@yahoo.com> wrote:
> > MitchAlsup <Mitch...@aol.com> wrote:

> >> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
> >> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?
> >
> > I rounded up to integer cycle counts, same as two instructions- load and
> > extend, except now as one instruction saving instruction cache. So you can
> > crack it, or split the load path at the extender unit. Perhaps only one of
> > the two load units would support extension if you split the load. If that
> > load port is full you can do a late crack.
<
> Just realized that loads come back out of order and you have no control
> over which load path used, so you would split both load paths. This causes
> some issues with more write ports to the ALU. But you could queue these
> loads instead.
>
> Is any of this reasonable, or am I too far out of my league?
<
For the same reason you do not split FADD into 3 instructions, you should
not split LDs into 2 instructions {to say nothing of the code density problems
you will face when making this choice.}
<
In order to split a LD into a Fetch and a Decompose instruction, you need
the LD instruction to create a bit extraction specifier that gets passed to
the decompose instruction, and this adds to register write port pressure.
<
> >>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
> >>>> a bit position would not have affected the cycle time.
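
A sketch of what such SEXT/ZEXT-from-a-bit-position instructions would
compute (shifts are just the easiest way to write it in C; hardware would
use a mask/mux network, and the arithmetic right shift of a negative value
assumes the usual two's-complement behaviour):

#include <stdint.h>

static int64_t sext(uint64_t x, unsigned b)   /* sign extend from bit b */
{
    unsigned s = 63u - b;
    return (int64_t)(x << s) >> s;            /* replicate bit b upward */
}

static uint64_t zext(uint64_t x, unsigned b)  /* zero extend from bit b */
{
    unsigned s = 63u - b;
    return (x << s) >> s;                     /* clear bits above bit b */
}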

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<stk5c8$oga$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23281&group=comp.arch#23281

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
Date: Fri, 4 Feb 2022 21:21:13 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <stk5c8$oga$1@dont-email.me>
References: <sso6aq$37b$1@newsreader4.netcologne.de>
<UPXHJ.6202$9O.4300@fx12.iad>
<2022Jan25.235313@mips.complang.tuwien.ac.at>
<5NdIJ.15873$mS1.13257@fx10.iad>
<ssst1s$pk9$1@dont-email.me>
<c7537b99-c3f7-45ad-a40f-2c38904b2a4cn@googlegroups.com>
<ssvo55$a3i$1@dont-email.me>
<stf6qk$5bs$1@dont-email.me>
<6ba7592a-33a6-4c6b-ae7e-9d804c43c31fn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 4 Feb 2022 21:21:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="530527cc50553f594de4d7083533974c";
logging-data="25098"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/BFx6zbNkIOlZMm/deHjEa"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:mhy/z31hx3lQMHGouqMfAbcS9w8=
sha1:2vh9NRSE8mZmei1fUL+5yBfGGTA=
 by: Brett - Fri, 4 Feb 2022 21:21 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Wednesday, February 2, 2022 at 6:15:25 PM UTC-6, gg...@yahoo.com wrote:
>> Brett <gg...@yahoo.com> wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>
>>>> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
>>>> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?
>>>
>>> I rounded up to integer cycle counts, same as two instructions- load and
>>> extend, except now as one instruction saving instruction cache. So you can
>>> crack it, or split the load path at the extender unit. Perhaps only one of
>>> the two load units would support extension if you split the load. If that
>>> load port is full you can do a late crack.
> <
>> Just realized that loads come back out of order and you have no control
>> over which load path used, so you would split both load paths. This causes
>> some issues with more write ports to the ALU. But you could queue these
>> loads instead.
>>
>> Is any of this reasonable, or am I too far out of my league?
> <
> For the same reason you do not split FADD into 3 instructions, you should
> not split LDs into 2 instructions {to say nothing of the code density problems
> you will face when making this choice.}

Yes, that is why I support signed loads; it saves an instruction.

> In order to split a LD into a Fetch and a Decompose instruction, you need
> the LD instruction to create a bit extraction specifier that gets passed to
> the decompose instruction, and this adds to register write port pressure.

I assume part of the reason Intel has 3-cycle L1 access is sign extension.
There may not be much benefit in also allowing unsigned access in 2 cycles
at the same time?

POWER has some fancy extract instructions; load versions of these
instructions could also make use of the bit extraction specifier in the
load unit.

>>>>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
>>>>>> a bit position would not have affected the cycle time.

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<d03fb26e-3719-486d-b1e9-2ac4dccddfc6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23286&group=comp.arch#23286

X-Received: by 2002:a05:6214:d0c:: with SMTP id 12mr3693563qvh.93.1644015926997;
Fri, 04 Feb 2022 15:05:26 -0800 (PST)
X-Received: by 2002:a05:6870:d396:: with SMTP id k22mr1311158oag.327.1644015926810;
Fri, 04 Feb 2022 15:05:26 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 4 Feb 2022 15:05:26 -0800 (PST)
In-Reply-To: <stk5c8$oga$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:302e:7005:d193:7915;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:302e:7005:d193:7915
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad>
<2022Jan25.235313@mips.complang.tuwien.ac.at> <5NdIJ.15873$mS1.13257@fx10.iad>
<ssst1s$pk9$1@dont-email.me> <c7537b99-c3f7-45ad-a40f-2c38904b2a4cn@googlegroups.com>
<ssvo55$a3i$1@dont-email.me> <stf6qk$5bs$1@dont-email.me> <6ba7592a-33a6-4c6b-ae7e-9d804c43c31fn@googlegroups.com>
<stk5c8$oga$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d03fb26e-3719-486d-b1e9-2ac4dccddfc6n@googlegroups.com>
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 04 Feb 2022 23:05:26 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 47
 by: MitchAlsup - Fri, 4 Feb 2022 23:05 UTC

On Friday, February 4, 2022 at 3:21:16 PM UTC-6, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > On Wednesday, February 2, 2022 at 6:15:25 PM UTC-6, gg...@yahoo.com wrote:
> >> Brett <gg...@yahoo.com> wrote:
> >>> MitchAlsup <Mitch...@aol.com> wrote:
> >
> >>>> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
> >>>> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?
> >>>
> >>> I rounded up to integer cycle counts, same as two instructions- load and
> >>> extend, except now as one instruction saving instruction cache. So you can
> >>> crack it, or split the load path at the extender unit. Perhaps only one of
> >>> the two load units would support extension if you split the load. If that
> >>> load port is full you can do a late crack.
> > <
> >> Just realized that loads come back out of order and you have no control
> >> over which load path used, so you would split both load paths. This causes
> >> some issues with more write ports to the ALU. But you could queue these
> >> loads instead.
> >>
> >> Is any of this reasonable, or am I too far out of my league?
> > <
> > For the same reason you do not split FADD into 3 instructions, you should
> > not split LDs into 2 instructions {to say nothing of the code density problems
> > you will face when making this choice.}
> Yes, that is why I support signed loads, saves an instruction.
> > In order to split a LD into a Fetch and a Decompose instruction, you need
> > the LD instruction to create a bit extraction specifier that gets passed to
> > the decompose instruction, and this adds to register write port pressure.

> I assume part of the reason Intel has 3 cycle L1 access is sign extension.
<
I would label the problem as "byte extraction"* of which sign extension is a
minor complication.
<
(*) or Load Alignment.
<
> There may not be much benefit from allowing unsigned access in 2 cycles at
> the same time?
<
When you look into passing properly sized and aligned data without going through
the Byte Extractor, you complicate the memory/cache pipeline design.
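
A sketch of the load-alignment ("byte extraction") step in question:
select a 1/2/4/8-byte field out of a 64-bit beat using the low address
bits (little-endian lanes and an access that does not cross the beat are
assumed; sign extension, the "minor complication", would then just
replicate the top bit of the selected field):

#include <stdint.h>

static uint64_t extract_field(uint64_t beat, unsigned addr, unsigned size)
{
    unsigned shift = (addr & 7u) * 8u;       /* which byte lane          */
    unsigned bits  = size * 8u;              /* field width: 8..64 bits  */
    uint64_t v     = beat >> shift;          /* the crossbar/mux step    */
    return bits == 64 ? v : v & ((1ull << bits) - 1);
}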
>
> POWER has some fancy extract instructions, load versions of these
> instructions could also make use of the bit extraction specifier on the
> load unit.
> >>>>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
> >>>>>> a bit position would not have affected the cycle time.

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<stl8u5$db0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23290&group=comp.arch#23290

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
Date: Sat, 5 Feb 2022 07:28:06 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <stl8u5$db0$1@dont-email.me>
References: <sso6aq$37b$1@newsreader4.netcologne.de>
<UPXHJ.6202$9O.4300@fx12.iad>
<2022Jan25.235313@mips.complang.tuwien.ac.at>
<5NdIJ.15873$mS1.13257@fx10.iad>
<ssst1s$pk9$1@dont-email.me>
<c7537b99-c3f7-45ad-a40f-2c38904b2a4cn@googlegroups.com>
<ssvo55$a3i$1@dont-email.me>
<stf6qk$5bs$1@dont-email.me>
<6ba7592a-33a6-4c6b-ae7e-9d804c43c31fn@googlegroups.com>
<stk5c8$oga$1@dont-email.me>
<d03fb26e-3719-486d-b1e9-2ac4dccddfc6n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 5 Feb 2022 07:28:06 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="47f5ff748bfdbe9356228569b173b42e";
logging-data="13664"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX181oMlBkRwGamUfq/7vjlQA"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:LxXUlD6+M1C2bmwyyFH+er+tG0A=
sha1:OVL6dzu5x9Qg72B+9jZr1o4yOzY=
 by: Brett - Sat, 5 Feb 2022 07:28 UTC

MitchAlsup <MitchAlsup@aol.com> wrote:
> On Friday, February 4, 2022 at 3:21:16 PM UTC-6, gg...@yahoo.com wrote:
>> MitchAlsup <Mitch...@aol.com> wrote:
>>> On Wednesday, February 2, 2022 at 6:15:25 PM UTC-6, gg...@yahoo.com wrote:
>>>> Brett <gg...@yahoo.com> wrote:
>>>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>
>>>>>> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
>>>>>> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?
>>>>>
>>>>> I rounded up to integer cycle counts, same as two instructions- load and
>>>>> extend, except now as one instruction saving instruction cache. So you can
>>>>> crack it, or split the load path at the extender unit. Perhaps only one of
>>>>> the two load units would support extension if you split the load. If that
>>>>> load port is full you can do a late crack.
>>> <
>>>> Just realized that loads come back out of order and you have no control
>>>> over which load path used, so you would split both load paths. This causes
>>>> some issues with more write ports to the ALU. But you could queue these
>>>> loads instead.
>>>>
>>>> Is any of this reasonable, or am I too far out of my league?
>>> <
>>> For the same reason you do not split FADD into 3 instructions, you should
>>> not split LDs into 2 instructions {to say nothing of the code density problems
>>> you will face when making this choice.}
>> Yes, that is why I support signed loads, saves an instruction.
>>> In order to split a LD into a Fetch and a Decompose instruction, you need
>>> the LD instruction to create a bit extraction specifier that gets passed to
>>> the decompose instruction, and this adds to register write port pressure.
>
>> I assume part of the reason Intel has 3 cycle L1 access is sign extension.
> <
> I would label the problem as "byte extraction"* of which sign extension is a
> minor complication.
> <
> (*) or Load Alignment.

Sign extension is two gate delays; extraction is half a dozen or so, plus
lots of crossing wires. Clearly extraction is the harder, limiting task.

>> There may not be much benefit from allowing unsigned access in 2 cycles at
>> the same time?
> <
> When you look into passing properly sized and aligned data without going through
> the Byte Extractor, you complicate the memory/cache pipeline design.

How about the Apple M1, which I hear does two loads a cycle from cache, but
pulls full lines and can support many more than two results a cycle from
the cache?

I love this idea: a cheap way to get enough loads for a six-wide design,
and it saves power.

I would guess this costs a gate delay and 50% more transistors?

>> POWER has some fancy extract instructions, load versions of these
>> instructions could also make use of the bit extraction specifier on the
>> load unit.
>>>>>>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
>>>>>>>> a bit position would not have affected the cycle time.
>

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<f108750f-3fab-40aa-8a65-2eb3149ccd82n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23292&group=comp.arch#23292

X-Received: by 2002:a05:6214:400f:: with SMTP id kd15mr5824861qvb.69.1644080136835;
Sat, 05 Feb 2022 08:55:36 -0800 (PST)
X-Received: by 2002:a9d:d10:: with SMTP id 16mr1559819oti.142.1644080136559;
Sat, 05 Feb 2022 08:55:36 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 5 Feb 2022 08:55:36 -0800 (PST)
In-Reply-To: <stl8u5$db0$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:60d3:1a7f:3d93:7e73;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:60d3:1a7f:3d93:7e73
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad>
<2022Jan25.235313@mips.complang.tuwien.ac.at> <5NdIJ.15873$mS1.13257@fx10.iad>
<ssst1s$pk9$1@dont-email.me> <c7537b99-c3f7-45ad-a40f-2c38904b2a4cn@googlegroups.com>
<ssvo55$a3i$1@dont-email.me> <stf6qk$5bs$1@dont-email.me> <6ba7592a-33a6-4c6b-ae7e-9d804c43c31fn@googlegroups.com>
<stk5c8$oga$1@dont-email.me> <d03fb26e-3719-486d-b1e9-2ac4dccddfc6n@googlegroups.com>
<stl8u5$db0$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f108750f-3fab-40aa-8a65-2eb3149ccd82n@googlegroups.com>
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 05 Feb 2022 16:55:36 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 70
 by: MitchAlsup - Sat, 5 Feb 2022 16:55 UTC

On Saturday, February 5, 2022 at 1:28:10 AM UTC-6, gg...@yahoo.com wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > On Friday, February 4, 2022 at 3:21:16 PM UTC-6, gg...@yahoo.com wrote:
> >> MitchAlsup <Mitch...@aol.com> wrote:
> >>> On Wednesday, February 2, 2022 at 6:15:25 PM UTC-6, gg...@yahoo.com wrote:
> >>>> Brett <gg...@yahoo.com> wrote:
> >>>>> MitchAlsup <Mitch...@aol.com> wrote:
> >>>
> >>>>>> How do you do this when an ADD takes 3/4 of a cycle and an SRAM
> >>>>>> access takes 1+1/8 cycle, and the sign extension aligner takes 3/8 cycle ?
> >>>>>
> >>>>> I rounded up to integer cycle counts, same as two instructions- load and
> >>>>> extend, except now as one instruction saving instruction cache. So you can
> >>>>> crack it, or split the load path at the extender unit. Perhaps only one of
> >>>>> the two load units would support extension if you split the load. If that
> >>>>> load port is full you can do a late crack.
> >>> <
> >>>> Just realized that loads come back out of order and you have no control
> >>>> over which load path used, so you would split both load paths. This causes
> >>>> some issues with more write ports to the ALU. But you could queue these
> >>>> loads instead.
> >>>>
> >>>> Is any of this reasonable, or am I too far out of my league?
> >>> <
> >>> For the same reason you do not split FADD into 3 instructions, you should
> >>> not split LDs into 2 instructions {to say nothing of the code density problems
> >>> you will face when making this choice.}
> >> Yes, that is why I support signed loads, saves an instruction.
> >>> In order to split a LD into a Fetch and a Decompose instruction, you need
> >>> the LD instruction to create a bit extraction specifier that gets passed to
> >>> the decompose instruction, and this adds to register write port pressure.
> >
> >> I assume part of the reason Intel has 3 cycle L1 access is sign extension.
> > <
> > I would label the problem as "byte extraction"* of which sign extension is a
> > minor complication.
> > <
> > (*) or Load Alignment.
<
> Sign extension is two gate delays, extraction is half a dozen or so and
<
It takes 2 gate delays to buffer up the bit selected as the sign so it can be
multiplexed into the result (1:9 multiplexer is 2 gates of delay plus wire).
<
> lots of crossing wires. Clearly extraction is the harder and limiting task.
<
Wire delay and the fan-out to drive the wires cost more than the gate delays.
<
<
> >> There may not be much benefit from allowing unsigned access in 2 cycles at
> >> the same time?
> > <
> > When you look into passing properly sized and aligned data without going through
> > the Byte Extractor, you complicate the memory/cache pipeline design.
<
> How about the Apple M1 which I hear does two loads a cycle from cache, but
> pulls full lines and can support many more than two results a cycle from
> cache.
<
Athlon was doing 2 LDs per cycle to different banks of its cache in 1999.
>
> I love this idea, a cheap way to get enough loads for a six wide design,
> and saves power.
>
> I would guess this costs a gate delay and 50% more transistors?
> >> POWER has some fancy extract instructions, load versions of these
> >> instructions could also make use of the bit extraction specifier on the
> >> load unit.
> >>>>>>>> Anyway, SEXT and ZEXT to sign or zero extend a register from
> >>>>>>>> a bit position would not have affected the cycle time.
> >

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<t10mvq$4oe$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24299&group=comp.arch#24299

Path: i2pn2.org!i2pn.org!aioe.org!NZ87pNe1TKxNDknVl4tZhw.user.46.165.242.91.POSTED!not-for-mail
From: antis...@math.uni.wroc.pl
Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
Date: Fri, 18 Mar 2022 01:24:10 -0000 (UTC)
Organization: Aioe.org NNTP Server
Message-ID: <t10mvq$4oe$1@gioia.aioe.org>
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad> <ssplm7$1sv$1@newsreader4.netcologne.de> <sspnd8$3d6$1@newsreader4.netcologne.de> <tJZHJ.10626$8Q.353@fx19.iad> <ssqrr7$ptr$1@newsreader4.netcologne.de> <0cf5023d-3458-46d2-ad3d-fa0e6ecb18dfn@googlegroups.com>
Injection-Info: gioia.aioe.org; logging-data="4878"; posting-host="NZ87pNe1TKxNDknVl4tZhw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: tin/2.4.5-20201224 ("Glen Albyn") (Linux/5.10.0-9-amd64 (x86_64))
Cancel-Lock: sha1:gcl2RGIcke2TYizzE+QWPWnhHcE=
X-Notice: Filtered by postfilter v. 0.9.2
 by: antis...@math.uni.wroc.pl - Fri, 18 Mar 2022 01:24 UTC

Quadibloc <jsavard@ecn.ab.ca> wrote:
> On Wednesday, January 26, 2022 at 12:05:13 AM UTC-7, Thomas Koenig wrote:
>
> > I especially like the sentence
> >
> > DEC's main advocate on the IEEE p754 Committee was a Mathematician
> > and Numerical Analyst Dr. Mary H. Payne. She was experienced,
> > fully competent,
> >
> > and misled by DEC's hardware engineers
> >
> > when they assured her that Gradual Underflow was unimplementable
> > as the default (no trap) of any computer arithmetic aspiring to
> > high performance.
>
> That is true, but, although I may be mistaken, it appeared to me that
> while one certainly _could_ have a high-performance floating-point
> implementation which fully supported IEEE 754 gradual underflow,
>
> one would have to do it by representing floating-point numbers in an
> "internal form" when they're in registers, like the way the 8087 did it,
> where the internal form was like an old-style plain floating-point format,
> but with a range larger than the architectural floating-point format, so
> that it covered all the gradual underflow territory.
>
> And, of course, if you do it that way, you can't properly and accurately
> trap when a register-to-register calculation underflows, because that isn't
> easily visible any longer. Instead, you find out when you store the result
> in memory.

Well, "internal format" really means one extra exponent bit.
Detecting underflow with such format does not present extra
difficulty. Real problem is that one has to properly round
result so that it does not have more accuracy than allowed
by denormal representation. AFAICS rounding after addition
does not present problem. But rounding after multiplication
of division may be problematic, basically instead of rounding
in fixed binary position one need rounding at variable position.

> So if you _exclude_ the idea of doing calculations on an internal form
> of numbers, because it's problematic, then indeed gradual underflow will
> obstruct high performance.

Well, there is a need for extra hardware in the data path. Depending
on the organization of the FPU, almost all of the hardware needed for
denormals may already be there. Probably the worst case is an FPU with
a separate load/store unit, adder and multiplier. When operating
internally on normalized numbers, such an FPU needs an extra exponent
bit, a shifter in the load/store unit (not needed otherwise), and extra
rounding circuitry after multiplication. This may increase load/store
latency and multiplication latency. AFAICS allowing denormals
internally means slightly modified arithmetic units, to avoid inserting
the hidden bit for denormals. For multiplication and division there is
again the problem of rounding, plus the need to shift (to get the
denormal representation).

Either case does not look like a very big problem. OTOH the extra
gates may lower the clock frequency by a few percent or add an extra
clock of latency.

I find denormals of almost no use, so even a low extra cost seems too
high. But other folks may prefer a different tradeoff.
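
As a worked illustration of that variable rounding position (binary64
assumed, the exponents below are just examples): the further a result
falls below the normal range, the fewer fraction bits the denormal
encoding keeps, so the rounding point moves with the exponent.

#include <stdio.h>

/* fraction bits kept below the leading bit of a result with
   unbiased exponent e, once it is encoded as a binary64 denormal */
static int kept_fraction_bits(int e)
{
    int shortfall = -1022 - e;         /* distance below the normal range */
    if (shortfall <= 0) return 52;     /* normal: fixed rounding position */
    if (shortfall > 52) return 0;      /* at or below the smallest denorm */
    return 52 - shortfall;             /* denormal: variable position     */
}

int main(void)
{
    int examples[] = { -1020, -1030, -1050, -1074 };
    for (int i = 0; i < 4; i++)
        printf("exponent %5d keeps %2d fraction bits\n",
               examples[i], kept_fraction_bits(examples[i]));
    return 0;
}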

--
Waldek Hebisch

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<t11h9a$g5v$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24300&group=comp.arch#24300

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!aioe.org!EhtdJS5E9ITDZpJm3Uerlg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
Date: Fri, 18 Mar 2022 09:52:56 +0100
Organization: Aioe.org NNTP Server
Message-ID: <t11h9a$g5v$1@gioia.aioe.org>
References: <sso6aq$37b$1@newsreader4.netcologne.de>
<UPXHJ.6202$9O.4300@fx12.iad> <ssplm7$1sv$1@newsreader4.netcologne.de>
<sspnd8$3d6$1@newsreader4.netcologne.de> <tJZHJ.10626$8Q.353@fx19.iad>
<ssqrr7$ptr$1@newsreader4.netcologne.de>
<0cf5023d-3458-46d2-ad3d-fa0e6ecb18dfn@googlegroups.com>
<t10mvq$4oe$1@gioia.aioe.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="16575"; posting-host="EhtdJS5E9ITDZpJm3Uerlg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Fri, 18 Mar 2022 08:52 UTC

antispam@math.uni.wroc.pl wrote:
> Quadibloc <jsavard@ecn.ab.ca> wrote:
>> On Wednesday, January 26, 2022 at 12:05:13 AM UTC-7, Thomas Koenig wrote:
>>
>>> I especially like the sentence
>>>
>>> DEC's main advocate on the IEEE p754 Committee was a Mathematician
>>> and Numerical Analyst Dr. Mary H. Payne. She was experienced,
>>> fully competent,
>>>
>>> and misled by DEC's hardware engineers
>>>
>>> when they assured her that Gradual Underflow was unimplementable
>>> as the default (no trap) of any computer arithmetic aspiring to
>>> high performance.
>>
>> That is true, but, although I may be mistaken, it appeared to me that
>> while one certainly _could_ have a high-performance floating-point
>> implementation which fully supported IEEE 754 gradual underflow,
>>
>> one would have to do it by representing floating-point numbers in an
>> "internal form" when they're in registers, like the way the 8087 did it,
>> where the internal form was like an old-style plain floating-point format,
>> but with a range larger than the architectural floating-point format, so
>> that it covered all the gradual underflow territory.
>>
>> And, of course, if you do it that way, you can't properly and accurately
>> trap when a register-to-register calculation underflows, because that isn't
>> easily visible any longer. Instead, you find out when you store the result
>> in memory.
>
> Well, "internal format" really means one extra exponent bit.
> Detecting underflow with such format does not present extra
> difficulty. Real problem is that one has to properly round
> result so that it does not have more accuracy than allowed
> by denormal representation. AFAICS rounding after addition
> does not present problem. But rounding after multiplication
> of division may be problematic, basically instead of rounding
> in fixed binary position one need rounding at variable position.
>
>> So if you _exclude_ the idea of doing calculations on an internal form
>> of numbers, because it's problematic, then indeed gradual underflow will
>> obstruct high performance.
>
> Well, there is need for extra hardware in data path. Depending
> on organization of FPU almost all hardware needed for denormals
> may be already there. Probably worst case is FPU with separate
> load/store unit, adder and multiplier. When operating internally
> on normalized numbers such FPU needs extra exponenent bit,
> shifter in load/store unit (not needed otherwise), and extra
> rounding circuitry after multiplication. This may increase
> load/store latency and multiplication latency. AFAICS allowing
> denormals internaly means slightly modified arithemtic units
> to avoid inserting hideden bit for denormals. For multiplication
> and division again there is problem of rounding plus need to
> shift (to get denormal representation).
>
> Either case does not look like a very bing problem. OTOH
> extra gates may lower clock frequentcy by few percent or
> add extra clock of latency.
>
> I find denormals of almost no use, so even low extra cost
> seem too high. But other folks may prefer different
> tradeof.

I seldom _need_ subnormals myself, but I have been told that there exist
solution solvers/zero finders that fail to stabilize (you end up
flipping back & forth between two values) unless you have subnormals.

That, plus the fact (which you mention above) that any FMAC-supporting
fp unit already has all the required circuitry to let you skip
input normalization, means that I, like Mitch Alsup, strongly believe all
fp units should support subnormal values at zero cycle cost.
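
A small concrete case of that failure mode (illustrative values only):
the property a "stop when the difference is exactly zero" test relies on
is that x != y implies x - y != 0, and that holds only with gradual
underflow.

#include <stdio.h>

int main(void)
{
    double x = 3.0e-308;   /* both are normal binary64 values, just above  */
    double y = 2.5e-308;   /* the smallest normal (~2.225e-308)            */
    double d = x - y;      /* 5e-309 is representable only as a subnormal; */
                           /* flush-to-zero would make it exactly 0        */
    printf("x != y : %d\n", x != y);
    printf("x - y  = %g\n", d);
    return 0;
}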

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<2e234d6c-be42-4df7-a8e7-3345e22e6acdn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24302&group=comp.arch#24302

X-Received: by 2002:a05:620a:410b:b0:67d:d59c:13b8 with SMTP id j11-20020a05620a410b00b0067dd59c13b8mr6025884qko.449.1647613748436;
Fri, 18 Mar 2022 07:29:08 -0700 (PDT)
X-Received: by 2002:a05:6870:5829:b0:c8:9f42:f919 with SMTP id
r41-20020a056870582900b000c89f42f919mr3793349oap.54.1647613748108; Fri, 18
Mar 2022 07:29:08 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Mar 2022 07:29:07 -0700 (PDT)
In-Reply-To: <t10mvq$4oe$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:bc9d:792b:1cbb:a18b;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:bc9d:792b:1cbb:a18b
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad>
<ssplm7$1sv$1@newsreader4.netcologne.de> <sspnd8$3d6$1@newsreader4.netcologne.de>
<tJZHJ.10626$8Q.353@fx19.iad> <ssqrr7$ptr$1@newsreader4.netcologne.de>
<0cf5023d-3458-46d2-ad3d-fa0e6ecb18dfn@googlegroups.com> <t10mvq$4oe$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2e234d6c-be42-4df7-a8e7-3345e22e6acdn@googlegroups.com>
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 18 Mar 2022 14:29:08 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 82
 by: MitchAlsup - Fri, 18 Mar 2022 14:29 UTC

On Thursday, March 17, 2022 at 8:24:14 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> Quadibloc <jsa...@ecn.ab.ca> wrote:
> > On Wednesday, January 26, 2022 at 12:05:13 AM UTC-7, Thomas Koenig wrote:
> >
> > > I especially like the sentence
> > >
> > > DEC's main advocate on the IEEE p754 Committee was a Mathematician
> > > and Numerical Analyst Dr. Mary H. Payne. She was experienced,
> > > fully competent,
> > >
> > > and misled by DEC's hardware engineers
> > >
> > > when they assured her that Gradual Underflow was unimplementable
> > > as the default (no trap) of any computer arithmetic aspiring to
> > > high performance.
> >
> > That is true, but, although I may be mistaken, it appeared to me that
> > while one certainly _could_ have a high-performance floating-point
> > implementation which fully supported IEEE 754 gradual underflow,
> >
> > one would have to do it by representing floating-point numbers in an
> > "internal form" when they're in registers, like the way the 8087 did it,
> > where the internal form was like an old-style plain floating-point format,
> > but with a range larger than the architectural floating-point format, so
> > that it covered all the gradual underflow territory.
> >
> > And, of course, if you do it that way, you can't properly and accurately
> > trap when a register-to-register calculation underflows, because that isn't
> > easily visible any longer. Instead, you find out when you store the result
> > in memory.
>
> Well, "internal format" really means one extra exponent bit.
> Detecting underflow with such format does not present extra
> difficulty. Real problem is that one has to properly round
> result so that it does not have more accuracy than allowed
> by denormal representation. AFAICS rounding after addition
> does not present problem. But rounding after multiplication
> of division may be problematic, basically instead of rounding
> in fixed binary position one need rounding at variable position.
>
> > So if you _exclude_ the idea of doing calculations on an internal form
> > of numbers, because it's problematic, then indeed gradual underflow will
> > obstruct high performance.
>
> Well, there is need for extra hardware in data path. Depending
> on organization of FPU almost all hardware needed for denormals
> may be already there. Probably worst case is FPU with separate
> load/store unit, adder and multiplier. When operating internally
> on normalized numbers such FPU needs extra exponenent bit,
> shifter in load/store unit (not needed otherwise), and extra
> rounding circuitry after multiplication.
<
Once your FPU design is centered around FMAC, the added circuitry
is about 1.2 gates per bit of accumulator width and 1 gate delay of
added latency.
<
The FMUL unit in Opteron was ~60 gates of delay.
The FADD unit of Opteron was ~50 gates of delay.
With a cycle time of 16 gates of logic per cycle.
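(Taking the quoted numbers at face value, as a back-of-the-envelope check:
60 gates / 16 gates per cycle ~= 3.75, i.e. a 4-cycle FMUL pipeline, and
50 / 16 ~= 3.1, a 3-4 cycle FADD, so one extra gate of delay is only about
1/16 of a cycle.)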
<
<
> This may increase
> load/store latency and multiplication latency. AFAICS allowing
> denormals internaly means slightly modified arithemtic units
> to avoid inserting hideden bit for denormals. For multiplication
> and division again there is problem of rounding plus need to
> shift (to get denormal representation).
>
> Either case does not look like a very bing problem. OTOH
> extra gates may lower clock frequentcy by few percent or
> add extra clock of latency.
>
> I find denormals of almost no use, so even low extra cost
> seem too high. But other folks may prefer different
> tradeof.
<
That was my position in 1985. I have since been edumacated to the
fact that denorms are not "expensive" and that it is easier to obey
the std than to hand-wave around not needing to fully comply.
<
>
> --
> Waldek Hebisch

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<1cb8bb2d-4e59-43f3-9992-ef658ec5ecden@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24304&group=comp.arch#24304

X-Received: by 2002:a05:622a:181:b0:2e1:e70a:ec2a with SMTP id s1-20020a05622a018100b002e1e70aec2amr7930023qtw.42.1647617253300;
Fri, 18 Mar 2022 08:27:33 -0700 (PDT)
X-Received: by 2002:a9d:5e15:0:b0:5b2:5125:fd09 with SMTP id
d21-20020a9d5e15000000b005b25125fd09mr3438618oti.129.1647617253104; Fri, 18
Mar 2022 08:27:33 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Mar 2022 08:27:32 -0700 (PDT)
In-Reply-To: <t11h9a$g5v$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:9c02:dcf9:eab4:49a;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:9c02:dcf9:eab4:49a
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad>
<ssplm7$1sv$1@newsreader4.netcologne.de> <sspnd8$3d6$1@newsreader4.netcologne.de>
<tJZHJ.10626$8Q.353@fx19.iad> <ssqrr7$ptr$1@newsreader4.netcologne.de>
<0cf5023d-3458-46d2-ad3d-fa0e6ecb18dfn@googlegroups.com> <t10mvq$4oe$1@gioia.aioe.org>
<t11h9a$g5v$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1cb8bb2d-4e59-43f3-9992-ef658ec5ecden@googlegroups.com>
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: already5...@yahoo.com (Michael S)
Injection-Date: Fri, 18 Mar 2022 15:27:33 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 89
 by: Michael S - Fri, 18 Mar 2022 15:27 UTC

On Friday, March 18, 2022 at 10:53:03 AM UTC+2, Terje Mathisen wrote:
> anti...@math.uni.wroc.pl wrote:
> > Quadibloc <jsa...@ecn.ab.ca> wrote:
> >> On Wednesday, January 26, 2022 at 12:05:13 AM UTC-7, Thomas Koenig wrote:
> >>
> >>> I especially like the sentence
> >>>
> >>> DEC's main advocate on the IEEE p754 Committee was a Mathematician
> >>> and Numerical Analyst Dr. Mary H. Payne. She was experienced,
> >>> fully competent,
> >>>
> >>> and misled by DEC's hardware engineers
> >>>
> >>> when they assured her that Gradual Underflow was unimplementable
> >>> as the default (no trap) of any computer arithmetic aspiring to
> >>> high performance.
> >>
> >> That is true, but, although I may be mistaken, it appeared to me that
> >> while one certainly _could_ have a high-performance floating-point
> >> implementation which fully supported IEEE 754 gradual underflow,
> >>
> >> one would have to do it by representing floating-point numbers in an
> >> "internal form" when they're in registers, like the way the 8087 did it,
> >> where the internal form was like an old-style plain floating-point format,
> >> but with a range larger than the architectural floating-point format, so
> >> that it covered all the gradual underflow territory.
> >>
> >> And, of course, if you do it that way, you can't properly and accurately
> >> trap when a register-to-register calculation underflows, because that isn't
> >> easily visible any longer. Instead, you find out when you store the result
> >> in memory.
> >
> > Well, "internal format" really means one extra exponent bit.
> > Detecting underflow with such format does not present extra
> > difficulty. Real problem is that one has to properly round
> > result so that it does not have more accuracy than allowed
> > by denormal representation. AFAICS rounding after addition
> > does not present problem. But rounding after multiplication
> > of division may be problematic, basically instead of rounding
> > in fixed binary position one need rounding at variable position.
> >
> >> So if you _exclude_ the idea of doing calculations on an internal form
> >> of numbers, because it's problematic, then indeed gradual underflow will
> >> obstruct high performance.
> >
> > Well, there is need for extra hardware in data path. Depending
> > on organization of FPU almost all hardware needed for denormals
> > may be already there. Probably worst case is FPU with separate
> > load/store unit, adder and multiplier. When operating internally
> > on normalized numbers such FPU needs extra exponenent bit,
> > shifter in load/store unit (not needed otherwise), and extra
> > rounding circuitry after multiplication. This may increase
> > load/store latency and multiplication latency. AFAICS allowing
> > denormals internaly means slightly modified arithemtic units
> > to avoid inserting hideden bit for denormals. For multiplication
> > and division again there is problem of rounding plus need to
> > shift (to get denormal representation).
> >
> > Either case does not look like a very bing problem. OTOH
> > extra gates may lower clock frequentcy by few percent or
> > add extra clock of latency.
> >
> > I find denormals of almost no use, so even low extra cost
> > seem too high. But other folks may prefer different
> > tradeof.
> I seldom _need_ subnormal myself, but I have been told that there exists
> solution solvers/zero finders that fail to stabilize (you end up
> flipping back & forth between two values) unless you have subnormals.
>

That's what I expect for a "chord & tangent" solver, the one I was taught back in high
school, even before we were "officially" taught derivatives in math class.
Probably the same applies to the pure tangent (i.e. Newton) method, but here I'm less sure.

> That, plus the fact (which you mention above) that any FMAC-supporting
> fp unit already have all the required circuitry to allow you to skip
> input normalization, means that I like Mitch Alsup strongly believe all
> fp units should support subnormal values at zero cycle cost.

There are different scenarios for one or more denormal inputs or outputs.
In some of them a microtrap could be the best practical solution.
In particular, I am thinking about designs where the latency of a regular ("normal") FADD is lower than the latency of FMUL and of FMADD.
Mitch does not see a point in such designs, but I do see it.

> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<t12aif$182v$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=24305&group=comp.arch#24305

Path: i2pn2.org!i2pn.org!aioe.org!EhtdJS5E9ITDZpJm3Uerlg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
Date: Fri, 18 Mar 2022 17:04:29 +0100
Organization: Aioe.org NNTP Server
Message-ID: <t12aif$182v$1@gioia.aioe.org>
References: <sso6aq$37b$1@newsreader4.netcologne.de>
<UPXHJ.6202$9O.4300@fx12.iad> <ssplm7$1sv$1@newsreader4.netcologne.de>
<sspnd8$3d6$1@newsreader4.netcologne.de> <tJZHJ.10626$8Q.353@fx19.iad>
<ssqrr7$ptr$1@newsreader4.netcologne.de>
<0cf5023d-3458-46d2-ad3d-fa0e6ecb18dfn@googlegroups.com>
<t10mvq$4oe$1@gioia.aioe.org> <t11h9a$g5v$1@gioia.aioe.org>
<1cb8bb2d-4e59-43f3-9992-ef658ec5ecden@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="41055"; posting-host="EhtdJS5E9ITDZpJm3Uerlg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.11
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Fri, 18 Mar 2022 16:04 UTC

Michael S wrote:
> On Friday, March 18, 2022 at 10:53:03 AM UTC+2, Terje Mathisen wrote:
>> I seldom _need_ subnormal myself, but I have been told that there exists
>> solution solvers/zero finders that fail to stabilize (you end up
>> flipping back & forth between two values) unless you have subnormals.
>>
>
> That's what I expect for "chord&tangent" solver, the one I was taught back in high school
> even before we "officially" were taught derivatives in the math class.
> Probably, the same applies to pure tangent (i.e. Newton) method, but here I'm less sure.
>
>> That, plus the fact (which you mention above) that any FMAC-supporting
>> fp unit already have all the required circuitry to allow you to skip
>> input normalization, means that I like Mitch Alsup strongly believe all
>> fp units should support subnormal values at zero cycle cost.
>
> There are different scenarios for one or more denormal input or output.
> In some of them microtrap could be the best practical solution.
> In particular, I am thinking about designs where latency of regular ("normal") FADD is lower than latency of FMUL and of FMADD.
> Mitch does not see a point in such designs, but I do see it.

After spending 2016-2019 on the IEEE 754 update working group I strongly
support Mitch's viewpoint here: just optimize the heck out of FMAC and
make that the basic building block for all your operations, including
FDIV/FSQRT and all the transcendentals.

The possibility of getting a pure FADD one cycle faster, but only when
none of the potentially many parallel SIMD operations hits a subnormal
input or output, fails to excite me.
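
A sketch of what "FMAC as the basic building block" looks like for
division (illustrative only: the float divide stands in for whatever
low-precision reciprocal estimate the hardware provides, and the fixed
iteration count is not tuned):

#include <math.h>
#include <stdio.h>

static double recip(double d)
{
    double r = 1.0f / (float)d;       /* stand-in for a hardware
                                         low-precision estimate    */
    for (int i = 0; i < 3; i++) {     /* Newton-Raphson refinement */
        double e = fma(-d, r, 1.0);   /* e = 1 - d*r, one rounding */
        r = fma(r, e, r);             /* r = r + r*e               */
    }
    return r;
}

int main(void)
{
    printf("1/3 ~= %.17g\n", recip(3.0));
    return 0;
}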

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<ea3fec28-7635-450a-afa9-ad3d93baef97n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24307&group=comp.arch#24307

X-Received: by 2002:a0c:e6c5:0:b0:42c:d5f:7e4c with SMTP id l5-20020a0ce6c5000000b0042c0d5f7e4cmr7855125qvn.93.1647629032485;
Fri, 18 Mar 2022 11:43:52 -0700 (PDT)
X-Received: by 2002:a05:6808:152b:b0:2ec:f48f:8120 with SMTP id
u43-20020a056808152b00b002ecf48f8120mr5175928oiw.58.1647629032102; Fri, 18
Mar 2022 11:43:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 18 Mar 2022 11:43:51 -0700 (PDT)
In-Reply-To: <1cb8bb2d-4e59-43f3-9992-ef658ec5ecden@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:dd3d:e99a:37f7:e0f7;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:dd3d:e99a:37f7:e0f7
References: <sso6aq$37b$1@newsreader4.netcologne.de> <UPXHJ.6202$9O.4300@fx12.iad>
<ssplm7$1sv$1@newsreader4.netcologne.de> <sspnd8$3d6$1@newsreader4.netcologne.de>
<tJZHJ.10626$8Q.353@fx19.iad> <ssqrr7$ptr$1@newsreader4.netcologne.de>
<0cf5023d-3458-46d2-ad3d-fa0e6ecb18dfn@googlegroups.com> <t10mvq$4oe$1@gioia.aioe.org>
<t11h9a$g5v$1@gioia.aioe.org> <1cb8bb2d-4e59-43f3-9992-ef658ec5ecden@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <ea3fec28-7635-450a-afa9-ad3d93baef97n@googlegroups.com>
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 18 Mar 2022 18:43:52 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 114
 by: MitchAlsup - Fri, 18 Mar 2022 18:43 UTC

On Friday, March 18, 2022 at 10:27:35 AM UTC-5, Michael S wrote:
> On Friday, March 18, 2022 at 10:53:03 AM UTC+2, Terje Mathisen wrote:
> > anti...@math.uni.wroc.pl wrote:
> > > Quadibloc <jsa...@ecn.ab.ca> wrote:
> > >> On Wednesday, January 26, 2022 at 12:05:13 AM UTC-7, Thomas Koenig wrote:
> > >>
> > >>> I especially like the sentence
> > >>>
> > >>> DEC's main advocate on the IEEE p754 Committee was a Mathematician
> > >>> and Numerical Analyst Dr. Mary H. Payne. She was experienced,
> > >>> fully competent,
> > >>>
> > >>> and misled by DEC's hardware engineers
> > >>>
> > >>> when they assured her that Gradual Underflow was unimplementable
> > >>> as the default (no trap) of any computer arithmetic aspiring to
> > >>> high performance.
> > >>
> > >> That is true, but, although I may be mistaken, it appeared to me that
> > >> while one certainly _could_ have a high-performance floating-point
> > >> implementation which fully supported IEEE 754 gradual underflow,
> > >>
> > >> one would have to do it by representing floating-point numbers in an
> > >> "internal form" when they're in registers, like the way the 8087 did it,
> > >> where the internal form was like an old-style plain floating-point format,
> > >> but with a range larger than the architectural floating-point format, so
> > >> that it covered all the gradual underflow territory.
> > >>
> > >> And, of course, if you do it that way, you can't properly and accurately
> > >> trap when a register-to-register calculation underflows, because that isn't
> > >> easily visible any longer. Instead, you find out when you store the result
> > >> in memory.
> > >
> > > Well, "internal format" really means one extra exponent bit.
> > > Detecting underflow with such format does not present extra
> > > difficulty. Real problem is that one has to properly round
> > > result so that it does not have more accuracy than allowed
> > > by denormal representation. AFAICS rounding after addition
> > > does not present problem. But rounding after multiplication
> > > of division may be problematic, basically instead of rounding
> > > in fixed binary position one need rounding at variable position.
> > >
> > >> So if you _exclude_ the idea of doing calculations on an internal form
> > >> of numbers, because it's problematic, then indeed gradual underflow will
> > >> obstruct high performance.
> > >
> > > Well, there is need for extra hardware in data path. Depending
> > > on organization of FPU almost all hardware needed for denormals
> > > may be already there. Probably worst case is FPU with separate
> > > load/store unit, adder and multiplier. When operating internally
> > > on normalized numbers such FPU needs extra exponenent bit,
> > > shifter in load/store unit (not needed otherwise), and extra
> > > rounding circuitry after multiplication. This may increase
> > > load/store latency and multiplication latency. AFAICS allowing
> > > denormals internaly means slightly modified arithemtic units
> > > to avoid inserting hideden bit for denormals. For multiplication
> > > and division again there is problem of rounding plus need to
> > > shift (to get denormal representation).
> > >
> > > Either case does not look like a very bing problem. OTOH
> > > extra gates may lower clock frequentcy by few percent or
> > > add extra clock of latency.
> > >
> > > I find denormals of almost no use, so even low extra cost
> > > seem too high. But other folks may prefer different
> > > tradeof.
> > I seldom _need_ subnormal myself, but I have been told that there exists
> > solution solvers/zero finders that fail to stabilize (you end up
> > flipping back & forth between two values) unless you have subnormals.
> >
> That's what I expect for "chord&tangent" solver, the one I was taught back in high school
> even before we "officially" were taught derivatives in the math class.
> Probably, the same applies to pure tangent (i.e. Newton) method, but here I'm less sure.
> > That, plus the fact (which you mention above) that any FMAC-supporting
> > fp unit already have all the required circuitry to allow you to skip
> > input normalization, means that I like Mitch Alsup strongly believe all
> > fp units should support subnormal values at zero cycle cost.
<
> There are different scenarios for one or more denormal input or output.
> In some of them microtrap could be the best practical solution.
> In particular, I am thinking about designs where latency of regular ("normal")
> FADD is lower than latency of FMUL and of FMADD.
<
For both FADD and FMUL and FMAC and for the numerator of FDIV, a denorm
as input simply does not have to generate the hidden bit::
<
if( operand.exponent = 0 )
then 0.fraction;
else 1.fraction;
<
For the output of FADD and FMUL and FMAC and FDIV, you inject a synthetic
1-bit at the position the hidden bit would have been in the largest possible denorm.
This prevents the normalizer from shifting too many positions to the left and gives
you the proper denormalized result.
{Note: you insert the synthetic 1-bit only with the value going into the find-first
circuit, not the value going into the shifter the find-first controls.}
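
In C terms, for raw binary64 operand bits, the input-side rule amounts to
the following (a sketch of the input side only; the synthetic-bit trick
for the normalizer is not shown):

#include <stdint.h>

static uint64_t significand(uint64_t bits)   /* raw binary64 bits */
{
    uint64_t frac = bits & ((1ull << 52) - 1);
    uint64_t exp  = (bits >> 52) & 0x7FF;
    return exp ? frac | (1ull << 52)         /* normal:   1.fraction */
               : frac;                       /* denormal: 0.fraction */
}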
<
> Mitch does not see a point in such designs, but I do see it.
<
Correction: Mitch sees no value in not fully obeying IEEE 754-2019, because
the cost of fully obeying it is so small.
<
25 years ago I was in the same camp you seem to be in today. I was shown
various circuitry tricks that made the denorm problem vanish into a cost so
small I would be foolish not to do the complete package.
<
It costs far less to satisfy the denorm problem in the FPU than it costs to solve
the precise exception problem when taking exceptions. Far, far less. Yet you
would never accept doing without the latter.
<
> > Terje
> >
> > --
> > - <Terje.Mathisen at tmsw.no>
> > "almost all programming can be viewed as an exercise in caching"

Re: Why separate 32-bit arithmetic on a 64-bit architecture?

<3d95c40a-41c8-48db-a983-98a2fc066023n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=24309&group=comp.arch#24309

Newsgroups: comp.arch
Subject: Re: Why separate 32-bit arithmetic on a 64-bit architecture?
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Fri, 18 Mar 2022 20:16 UTC

On Friday, March 18, 2022 at 12:43:53 PM UTC-6, MitchAlsup wrote:

> It costs far less to satisfy the denorm problem in the FPU than it costs to solve
> the precise exception problem when taking exceptions. Far, far less. Yet you
> would never accept doing without the latter.

There is cost, and then there is value.

People mistakenly believe that solving the denormals problem in
IEEE 754 floating-point has a significant cost, and so they're
against bothering, because computers got along fine for years
without denormals.

But exceptions were always precise before IBM came up with
the Tomasulo algorithm and the Model 91. (Actually, not quite:
the 6600 had imprecise interrupts too.) So this was a _change_
that deprived people of something they were used to and had no
idea how to do without.

Initially, the Model 91 didn't have precise exceptions; the manual
explained that there was this limitation of the machine compared
to others.

Incidentally, on a page about the history of this computer, I saw
this quote from a memo at SLAC: "The Model 91 designers have
provided a switch that puts the machine into non-overlapped
mode, in which it runs at about Model 75 speed but with precise
interrupts. This switch will be important for program debugging,
and it can be set and reset by programming."

This was cheaper than providing precise interrupts and out-of-order
operation at the same time; a similar feature on modern machines
might be useful as a Spectre mitigation: let only the untrusted
code run slower, since trusted code doesn't need it.

But the Model 91 was a usable computer as it was. So it is
possible to have a computer capable of doing useful work, with
I/O done by interrupts, even if the interrupts are imprecise.

Actually, _that_ makes sense. Interrupts to service I/O should be
totally invisible to the programs that are running concurrently,
so they shouldn't perturb those programs in any way. So it doesn't matter
whether the interrupt happens after an instruction or in the middle of
a complicated pipeline state, since as far as the running user-mode
programs are concerned, the interrupt could have been serviced by
another core.

It's when _traps_ are imprecise, like a divide-by-zero error, that it's
a pain; then the program has no choice but to fail, whereas with
precise traps the handler could apply a customized
fixup and set the program running again.

John Savard
