

devel / comp.arch / Re: The Gigabyte Era

Subject (Author)
* The Gigabyte Era (Quadibloc)
`* Re: The Gigabyte Era (Chris M. Thomasson)
 `* Re: The Gigabyte Era (MitchAlsup)
  +* Re: The Gigabyte Era (Quadibloc)
  |`* Re: The Gigabyte Era (Stephen Fuld)
  | `* Re: The Gigabyte Era (Michael S)
  |  `* Re: The Gigabyte Era (Stephen Fuld)
  |   `* Re: The Gigabyte Era (Michael S)
  |    `* Re: The Gigabyte Era (Stephen Fuld)
  |     `* Re: The Gigabyte Era (Michael S)
  |      `* Re: The Gigabyte Era (Stephen Fuld)
  |       `* Re: The Gigabyte Era (Michael S)
  |        `* Re: The Gigabyte Era (Stephen Fuld)
  |         +- Re: The Gigabyte Era (Stephen Fuld)
  |         +- Re: The Gigabyte Era (MitchAlsup)
  |         `* Re: The Gigabyte Era (Michael S)
  |          `- Re: The Gigabyte Era (MitchAlsup)
  +- Re: The Gigabyte Era (Thomas Koenig)
  +* Re: The Gigabyte Era (Chris M. Thomasson)
  |+* Re: The Gigabyte Era (Quadibloc)
  ||`* Re: The Gigabyte Era (MitchAlsup)
  || `- Re: The Gigabyte Era (Quadibloc)
  |+* Re: The Gigabyte Era (Thomas Koenig)
  ||+- Re: The Gigabyte Era (Stephen Fuld)
  ||`* Re: The Gigabyte Era (MitchAlsup)
  || `* Re: The Gigabyte Era (Quadibloc)
  ||  +* Re: The Gigabyte Era (Terje Mathisen)
  ||  |`* Re: The Gigabyte Era (Chris M. Thomasson)
  ||  | `* Re: The Gigabyte Era (MitchAlsup)
  ||  |  `- Re: The Gigabyte Era (Chris M. Thomasson)
  ||  `- Re: The Gigabyte Era (Thomas Koenig)
  |`- Re: The Gigabyte Era (Anton Ertl)
  +- Re: The Gigabyte Era (Michael S)
  `- Re: The Gigabyte Era (mac)

Re: The Gigabyte Era

<s9vvjq$r1i$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17624&group=comp.arch#17624

 by: Stephen Fuld - Fri, 11 Jun 2021 15:31 UTC

On 6/11/2021 4:05 AM, Michael S wrote:
> On Thursday, June 10, 2021 at 10:10:50 PM UTC+3, Stephen Fuld wrote:
>> On 6/10/2021 11:53 AM, Michael S wrote:
>>> On Thursday, June 10, 2021 at 8:28:10 PM UTC+3, Stephen Fuld wrote:
>>>> On 6/10/2021 9:55 AM, Michael S wrote:
>>>>> On Thursday, June 10, 2021 at 5:55:47 PM UTC+3, Stephen Fuld wrote:
>>>>>> On 6/10/2021 1:07 AM, Michael S wrote:
>>>>>>> On Wednesday, June 9, 2021 at 2:21:09 PM UTC+3, Stephen Fuld wrote:
>>>>>>>> On 6/8/2021 11:25 PM, Quadibloc wrote:
>>>>>>>>> On Tuesday, June 8, 2021 at 6:20:57 PM UTC-6, MitchAlsup wrote:
>>>>>>>>>
>>>>>>>>>> CPU architects (like me) have advocated for DRAM caches since caches
>>>>>>>>>> got ECC. Test and marketing guys wave red flags at the people with money.
>>>>>>>>>
>>>>>>>>> I remember that at one point IBM made a chip with an L3 cache made out of DRAM.
>>>>>>>>> This, of course, made it slower, but it allowed it to be bigger - but IIRC, only 2x bigger.
>>>>>>>>>
>>>>>>>>> I can't remember the name of the chip.
>>>>>>>> Power7. It used eDRAM. Other chips use or used eDRAM.
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/EDRAM
>>>>>>>>
>>>>>>>>
>>>>>>>> I am surprised it isn't more popular.
>>>>>>>>
>>>>>>>
>>>>>>> There is a good reason for that.
>>>>>>> Good DRAM takes good capacitors. IBM knew how to integrate good capacitors into logic process, other manufacturers didn't.
>>>>>> The Wikipedia article says several Intel chips used it.
>>>>>
>>>>> It's not the first time when Wikipedia is wrong.
>>>> True, although they are usually right. But it did seem a little odd, so
>>>> I did some more research. Note that the article says the e-DRAM was
>>>> used in the chips with IRIS graphics. So I looked at
>>>>
>>>> https://en.wikipedia.org/wiki/Intel_Graphics_Technology#cite_note-anand-iris-6
>>>>
>>>> and it said the graphics processor used e-DRAM, with a reference to an
>>>> Anandtech article that says the same thing
>>>>
>>>> https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2
>>>>
>>>> I have absolutely no direct knowledge, one way or the other, but if
>>>> Wikipedia is wrong, they at least have a pretty good reference.
>>>
>>> Anand explains it well. Read the 3rd page.
>>> "Despite its name, the eDRAM silicon is actually separate from the main microprocessor die - it’s simply housed on the same package."
>> Yes. I hadn't read past the first section. :-( But it is a, albeit
>> modified, Intel logic process, not a DRAM process. See the quote from
>> the article
>>
>>> As we’ve been talking about for a while now, the highest end Haswell graphics configuration includes 128MB of eDRAM on-package. The eDRAM itself is a custom design by Intel and it’s built on a variant of Intel’s P1271 22nm SoC process (not P1270, the CPU process). Intel needed a set of low leakage 22nm transistors rather than the ability to drive very high frequencies which is why it’s using the mobile SoC 22nm process variant here.
>>
>> The process change didn't seem to be about the capacitors, but I am way
>> beyond my depth here.
>
> The process they used for Crystal Well eDRAM die is not fast enough for 3+GHz CPU.

I certainly agree with that.

> It's probably different now, but back 5-6 years ago "mobile process" meant "optimized for reduced leakage and generally for low static power consumption at big price in increased dynamic power consumption".

Seems very reasonable.

>
> To be clear, I don't say that other silicon manufacturers don't know, how to do eDRAM. They do know, foundries even have eDRAM in their libraries.
> But all those non-IBM eDRAMs are not good enough to serve as L3 cache on the same die with 4-5 GHz CPU cores. Even less so, to serve as L2 cache, like in IBM z15.

OK, I see what you are saying. To paraphrase, anyone can do eDRAM, but
IBM does it enough better than anyone else to make it fast enough to be
used as an L3, much less an L2 cache. Is that a reasonable restatement?

Regarding L3 caches, certainly "fast enough" is a little too concise.
Obviously eDRAM will be slower than SRAM, but the idea is that the
larger sized caches allowed by the smaller cells increases the hit rate
enough to compensate for the slower access time. I take your comments
to mean that, except for IBM, that tradeoff doesn't work out. Is that fair?
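
In rough numbers (all of them assumed, just to show the shape of that
tradeoff), the usual average-memory-access-time arithmetic looks like this:

# AMAT = hit_time + miss_rate * miss_penalty, in CPU cycles.
# All figures below are assumptions for illustration, not measured parts.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

MEM_PENALTY = 300                       # assumed cycles to memory on an L3 miss

sram_l3 = amat(40, 0.20, MEM_PENALTY)   # smaller but faster SRAM L3 -> 100
edram_l3 = amat(55, 0.12, MEM_PENALTY)  # larger but slower eDRAM L3  -> 91

The eDRAM L3 wins only if the extra capacity buys enough hit rate; if the
miss rate improvement is small, the slower hit time dominates.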

Since you are far more knowledgeable than I am in this area, let me ask
a question. One of the advantages of eDRAM over "commodity DRAM" is
that one could, at least in theory, configure the bank sizes to meet
your needs. Given that the communication is all on chip, so number of
wires isn't a big issue, couldn't you configure the row size to be equal
to the cache line size such that once you get the row into the buffers,
you can transfer the whole row in one clock, essentially eliminating the
whole CAS part of the access time and transferring a whole cache line at
once instead of the eight clocks typically used for external DRAM? Does
this work?
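
To put assumed numbers on this (the "eight clocks" comes from moving a
64-byte line over a 64-bit external bus; the latencies here are
illustrative, not from any real part's datasheet):

LINE_BYTES = 64
tRCD, tCAS = 14, 14              # assumed row-activate and column-access latencies
beats = LINE_BYTES // 8          # 64-bit bus -> 8 transfer clocks per line
external = tRCD + tCAS + beats   # conventional external DRAM: 36 clocks

# Hypothetical eDRAM with row size == line size and a line-wide on-chip bus:
# once the row is latched in the sense amps, the line moves in one clock.
on_chip = tRCD + 1               # 15 clocks
print(external, on_chip)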

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: The Gigabyte Era

<sa00cj$u7a$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=17625&group=comp.arch#17625

 by: Stephen Fuld - Fri, 11 Jun 2021 15:44 UTC

On 6/11/2021 8:31 AM, Stephen Fuld wrote:
[snip -- full quote of the previous post]
> OK, I see what you are saying.  To paraphrase, anyone can do eDRAM, but
> IBM does it enough better than anyone else to make it fast enough to be
> used as an L3, much less an L2 cache.

Sorry, I got the logic messed up. Should be "even an L2 cache."

>  Is that a reasonable restatement?
[snip -- remainder of the quoted post]

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: The Gigabyte Era

<4a6b7551-cbb9-47ce-8e61-0d6d0e179df1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17627&group=comp.arch#17627

 by: MitchAlsup - Fri, 11 Jun 2021 16:32 UTC

On Friday, June 11, 2021 at 10:31:08 AM UTC-5, Stephen Fuld wrote:
> On 6/11/2021 4:05 AM, Michael S wrote:
> > On Thursday, June 10, 2021 at 10:10:50 PM UTC+3, Stephen Fuld wrote:
> >> On 6/10/2021 11:53 AM, Michael S wrote:

> >
> > To be clear, I don't say that other silicon manufacturers don't know, how to do eDRAM. They do know, foundries even have eDRAM in their libraries.
> > But all those non-IBM eDRAMs are not good enough to serve as L3 cache on the same die with 4-5 GHz CPU cores. Even less so, to serve as L2 cache, like in IBM z15.
<
> OK, I see what you are saying. To paraphrase, anyone can do eDRAM, but
> IBM does it enough better than anyone else to make it fast enough to be
> used as an L3, much less an L2 cache. Is that a reasonable restatement?
>
> Regarding L3 caches, certainly "fast enough" is a little too concise.
> Obviously eDRAM will be slower than SRAM, but the idea is that the
> larger sized caches allowed by the smaller cells increases the hit rate
> enough to compensate for the slower access time. I take your comments
> to mean that, except for IBM, that tradeoff doesn't work out. Is that fair?
<
The problem is that L2 and L3 architectures are designed for SRAM, where
there is no penalty for hitting the same macro every clock. DRAM has an
acceptably fast access time from address to data at the sense amps, but
after this the stored data must be refreshed and the bank put back into a
controlled precharge condition. This takes 3-5 more cycles.
<
So, in order to utilize an L2-L3 made of DRAM, you need to have multiple
outstanding requests to different partitions simultaneously, rather than
strip mining a line from the L2-L3 and then another.
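
A minimal sketch of that point, with assumed timings (4 cycles from
address to sense amps, 4 more for restore/precharge; not any real
part's numbers -- only the bank pattern changes between the two runs):

ACCESS, RESTORE = 4, 4

def last_data_cycle(banks_hit):
    free = {}                          # bank -> cycle it becomes usable again
    done = 0
    for t, b in enumerate(banks_hit):  # one new request arrives each cycle
        start = max(t, free.get(b, 0))
        free[b] = start + ACCESS + RESTORE
        done = start + ACCESS
    return done

print(last_data_cycle([0] * 8))    # strip mining one bank: 60 cycles
print(last_data_cycle(range(8)))   # spread over 8 banks:   11 cycles
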
>
> Since you are far more knowledgeable than I am in this area, let me ask
> a question. One of the advantages of eDRAM over "commodity DRAM" is
> that one could, at least in theory, configure the bank sizes to meet
> your needs. Given that the communication is all on chip, so number of
> wires isn't a big issue, couldn't you configure the row size to be equal
> to the cache line size such that once you get the row into the buffers,
> you can transfer the whole row in one clock, essentially eliminating the
> whole CAS part of the access time and transferring a whole cache line at
> once instead of the eight clocks typically used for external DRAM? Does
> this work?
<
Commodity DRAMs use 1024-4096 bits per bank read, so if you are willing
to make your line size 128B-512B then yes, you could do this. But these
DRAMs allow you to access "all the bits" by sending multiple CAS requests
and strip mining the data through the pins.
<
See my patent 5,367,494 for a better organization--a more appropriate
organization for a block of DRAM on a die.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: The Gigabyte Era

<sa0ecm$fa9$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=17635&group=comp.arch#17635

 by: Chris M. Thomasson - Fri, 11 Jun 2021 19:43 UTC

On 6/11/2021 6:03 AM, Terje Mathisen wrote:
> Quadibloc wrote:
>> On Thursday, June 10, 2021 at 1:59:19 PM UTC-6, MitchAlsup wrote:
>>> On Thursday, June 10, 2021 at 1:26:57 AM UTC-5, Thomas Koenig wrote:
>>
>>>> It's certainly an idea. I could see a vector processor integrated
>>>> tightly with the RAM so it can bypass all of the caches etc and
>>>> only the end result of some calculation needs to reach the CPU.
>>>>
>>>> However, it hasn't caught on, and there are several reasons why this
>>>> could be the case.
>>> <
>>> FFTs cannot be performed on such an architecture. FFTs have the
>>> aspect that sooner (Decimation in time) or later (decimation in
>>> frequency) you are touching memory that is very far apart. That is
>>> it is not suitable for bringing the compute to memory architecture.
>>
>> Yes.
>>
>> I would have used *matrix multiplications* as my example, and once
>> again you have to touch memory far apart to do them.
>>
>> That's not to say that processing in memory integrated with DRAM
>> isn't of _some_ use; *other* kinds of dumb calculations that don't
>> require accessing far-apart memory could at least be done massively.
>>
>> However, my concept of doing processing in memory is based on not
>> just one, but *two*, historic failures in the field of computing.
>>
>> The Illiac IV and the PDP-8/S.
>>
>> The Illiac IV was SIMD instead of MIMD, and that was why it failed;
>> unlike modern "MIMD" computing, it was constructed in such a way
>> that the cost savings from making it SIMD instead of MIMD were
>> modest - even though the loss in usability and power was large.
>>
>> The PDP-8/S had a serial ALU, and that made it slow.
>>
>> The idea is:
>>
>> You have to make the DRAM chips 64 bits wide to do calculations on
>> 64 bit values. That's unfortunate. However, since today's DRAM only
>> works at maximum efficiency when you pump out 16 consecutive values
>> at a time... you could leave the pins at 4 bits wide, and just
>> re-organize
>> how values are stored in memory.
>>
>> The downside is now that you _always_ have to read the memory in the
>> maximum efficiency value, and can never read just one value if that's
>> all you need.
>>
>> Each DRAM chip gets one, or a bunch of, seriously minimal CPUs.
>>
>> These consist of:
>>
>> A 64-bit shift register or three...
>>
>> A serial ALU that is controlled from an external pin of the DRAM chip,
>>
>> Storage for a handful of logic flags.
>>
>> So this adds a _minimal_ amount of circuitry to the DRAM chip, on the
>> basis that usually most systems will seldom if ever use this capability,
>> so providing it can't threaten cost-competitiveness.
>>
>> When it is used...
>>
>> The central CPU steps a range of memory through a calculation. Say it
>> wants
>> to replace the 64-bit floating-point numbers in an enormous swath of
>> memory
>> with their sines.
>>
>> It steps the serial ALUs through the calculation. One shift and add at
>> a time.
>>
>> Sometimes, there will be "test" instructions that set a logic flag,
>> and sometimes
>> code sequences will be predicated on one of those logic flags. (The
>> Illiac IV
>> did that kind of stuff.)
>>
>> You can actually manage to do 64-bit floating arithmetic with a
>> single-bit ALU.
>> It just takes time. But if you have thousands of them all working at
>> once, it's
>> not too bad.
>
> What would the power efficiency be of such a system, i.e. compared to
> streaming the same data past a GPU array?

If it can get 60fps rendering a hardcore scene, it should be okay.
Shadertoy is a nice friend to have:

https://www.shadertoy.com/view/Ms2SD1

https://www.shadertoy.com/view/Nt2GRW

;^)
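
As a toy model of the bit-serial ALU idea quoted above (illustrative
Python, not any real design): one full adder consumes the operands
LSB-first, one bit per "clock", so a 64-bit add takes 64 clocks -- slow
per ALU, but cheap enough to replicate by the thousands.

def serial_add(a, b, width=64):
    carry, result = 0, 0
    for i in range(width):                       # one clock per bit position
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                      # full-adder sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))  # full-adder carry out
        result |= s << i
    return result

assert serial_add(123456789, 987654321) == 123456789 + 987654321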

Re: The Gigabyte Era

<5cb624aa-c9d3-4d74-a74c-28b94b9d075an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17636&group=comp.arch#17636

 by: MitchAlsup - Fri, 11 Jun 2021 20:34 UTC

On Friday, June 11, 2021 at 2:43:23 PM UTC-5, Chris M. Thomasson wrote:
> On 6/11/2021 6:03 AM, Terje Mathisen wrote:
[snip -- the quoted thread, unchanged from the post above]
> > > What would the power efficiency be of such a system, i.e. compared to
> > > streaming the same data past a GPU array?
<
> If it can get 60fps rendering a hardcore scene, it should be okay.
<
<
Well, GPUs tend to have embarrassingly large amounts of parallelism, so maybe
1,000,000,000 1-bit CPUs at 1 GHz would be sufficient.
<
> Shadertoy is a nice friend to have:
>
> https://www.shadertoy.com/view/Ms2SD1
>
> https://www.shadertoy.com/view/Nt2GRW
>
> ;^)

Re: The Gigabyte Era

<f4a94c1a-a255-45c7-92ea-fdf74fe4e717n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17657&group=comp.arch#17657

 by: Michael S - Sat, 12 Jun 2021 19:12 UTC

On Friday, June 11, 2021 at 6:31:08 PM UTC+3, Stephen Fuld wrote:
[snip -- the quoted thread, unchanged from the posts above]
> Since you are far more knowledgeable than I am in this area, let me ask
> a question.

My knowledge is at the "blackbox" level, interfaces and parameters visible to outside world, rather than internal operations.

> One of the advantages of eDRAM over "commodity DRAM" is
> that one could, at least in theory, configure the bank sizes to meet
> your needs. Given that the communication is all on chip, so number of
> wires isn't a big issue, couldn't you configure the row size to be equal
> to the cache line size such that once you get the row into the buffers,
> you can transfer the whole row in one clock, essentially eliminating the
> whole CAS part of the access time and transferring a whole cache line at
> once instead of the eight clocks typically used for external DRAM? Does
> this work?

I think that more than, say, 8K rows per bank wouldn't work.
But a lot of small banks, say 8K rows * 128B * 256 banks for a total size of 256 MB, does not seem unreasonable if one is willing to sacrifice a bit of density (more banks == more sense amplifiers).
I wouldn't be surprised if that's how things were done in POWER9, except that the total was only 120 MB.
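
The capacity arithmetic checks out:

rows, row_bytes, banks = 8 * 1024, 128, 256
print(rows * row_bytes * banks // 2**20, "MB")   # -> 256 MB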

> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: The Gigabyte Era

<01adb7cd-8899-4ef0-a6a5-c0822dda4128n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=17660&group=comp.arch#17660

 by: MitchAlsup - Sat, 12 Jun 2021 20:49 UTC

On Saturday, June 12, 2021 at 2:12:40 PM UTC-5, Michael S wrote:
> On Friday, June 11, 2021 at 6:31:08 PM UTC+3, Stephen Fuld wrote:
[snip -- the quoted thread, unchanged from the posts above]
> > Since you are far more knowledgeable than I am in this area, let me ask
> > a question.
<
> My knowledge is at the "blackbox" level, interfaces and parameters visible to outside world, rather than internal operations.
<
> > [snip -- the row-size question, quoted in full upthread]
<
> I think that more than, say, 8K rows per bank wouldn't work.
<
At equivalent word line speed a DRAM can have 2× as many bits on the word line
as an SRAM.
<
Modern fast SRAMs have no more than 128 bits on a word line and no more than
64 words on a bit line (otherwise the leakage of all the unselected cells can
be sufficient to drown out the driving cell).
<
DRAM cells have a bit less than ½ the bit line drive capability.
<
So the modern SRAM macro will have 1KB-2KB, and so will the DRAM macros.
<
Let us postulate that the DRAM cell is ¼ the size of the SRAM cell and that the
SRAM overhead is 20% of the array size. This means that three (3) DRAM macros
are the size of an SRAM macro.
<
> But a lot of small banks, say 8K rows * 128B * 256 banks for total size of 256 MB, does not seem unreasonable if one is willing to sacrifice a bit of density (more banks==more sense amplifiers).
<
Yes, lots of banks (maybe as many as 128 or 256) to keep the "same bank" problem to
a minimum. {Somebody prior to the DRAM cache should have already filtered out any
"I need the same line as him" conflicts.}
<
> I wouldn't be surprised if that's how things were done in POWER9, except that the total was only 120 MB.
> > --
> > - Stephen Fuld
> > (e-mail address disguised to prevent spam)
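
A sketch of the bank spreading described in the post above (assuming plain
power-of-two interleaving on the line address; real designs may also hash
in upper address bits):

LINE_BYTES, BANKS = 128, 256

def bank_of(addr):
    # Consecutive lines land in consecutive banks, so a streaming access
    # pattern does not revisit a bank until BANKS lines later.
    return (addr // LINE_BYTES) % BANKS

print([bank_of(a) for a in range(0, 5 * LINE_BYTES, LINE_BYTES)])  # [0, 1, 2, 3, 4]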

Re: The Gigabyte Era

<779048982.646247684.946437.acolvin-efunct.com@news.eternal-september.org>

https://www.novabbs.com/devel/article-flat.php?id=18055&group=comp.arch#18055

 by: mac - Thu, 24 Jun 2021 17:23 UTC

> CPU architects (like me) have advocated for DRAM caches since caches
> got ECC. Test and marketing guys wave red flags at the people with money.

But DRAM _is_ only a cache for disk. At least in COMA.

Re: The Gigabyte Era

<sb332i$bm0$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=18060&group=comp.arch#18060

 by: Chris M. Thomasson - Thu, 24 Jun 2021 23:04 UTC

On 6/11/2021 1:34 PM, MitchAlsup wrote:
> On Friday, June 11, 2021 at 2:43:23 PM UTC-5, Chris M. Thomasson wrote:
>> On 6/11/2021 6:03 AM, Terje Mathisen wrote:
[snip -- the quoted thread, unchanged from the posts above]
>>> What would the power efficiency be of such a system, i.e. compared to
>>> streaming the same data past a GPU array?
> <
>> If it can get 60fps rendering a hardcore scene, it should be okay.
> <
> <
> Well, GPUs tend to have embarrassingly large amounts of parallelism, so maybe
> 1,000,000,000 1-bit CPUs at 1 GHz would be sufficient.

That would allow for a 31622^2 resolution to have a CPU per pixel. Nice!

> <
>> Shadertoy is a nice friend to have:
>>
>> https://www.shadertoy.com/view/Ms2SD1
>>
>> https://www.shadertoy.com/view/Nt2GRW
>>
>> ;^)
