devel / comp.arch / Re: Reconsidering Variable-Length Operands

Re: Reconsidering Variable-Length Operands

<t88cia$pvr$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25810&group=comp.arch#25810

Newsgroups: comp.arch
 by: Thomas Koenig - Mon, 13 Jun 2022 22:09 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

> For a lot of crypto work, the data arrays are actually quite small,
> making the overhead of a coprocessor much more significant. I'd rather
> dedicate one of my 16 cores to do the AES-256 crypto, along with the key
> handling, authentication etc.

POWER has both - instructions for an AES round (one round per cycle
throughput) and a coprocessor.

I suppose it might be possible to push through more than one AES
round per cycle, but that would of course cost more silicon area.

Re: Reconsidering Variable-Length Operands

<e3a0e4cf-01a9-4460-bfd6-8df4db9965e9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25811&group=comp.arch#25811

 by: Michael S - Mon, 13 Jun 2022 22:22 UTC

On Tuesday, June 14, 2022 at 1:09:17 AM UTC+3, Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> > For a lot of crypto work, the data arrays are actually quite small,
> > making the overhead of a coprocessor much more significant. I'd rather
> > dedicate one of my 16 cores to do the AES-256 crypto, along with the key
> > handling, authentication etc.
> POWER has both - instructions for an AES round (one round per cycle
> throughput) and a coprocessor.
>
> I suppose it might be possible to push through more than one AES
> round per cycle, but that would of course cost more silicon area.

One round per cycle - throughput or latency?
1/cycle throughput - easy, but there are modes (e.g. CBC encrypt) where
it's not too useful. On the other hand, in other modes it's useful,
but only with non-trivial effort on the SW side.
1-cycle latency - very, very hard. At frequencies typical for POWER
processors, likely impossible.

Re: Reconsidering Variable-Length Operands

<dfa64ff6-96cd-4e3f-bccf-a4f0b10cb764n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25812&group=comp.arch#25812

 by: MitchAlsup - Mon, 13 Jun 2022 22:25 UTC

On Monday, June 13, 2022 at 4:44:17 PM UTC-5, Ivan Godard wrote:
> On 6/13/2022 1:02 PM, MitchAlsup wrote:
> > On Monday, June 13, 2022 at 12:48:49 PM UTC-5, EricP wrote:
> >> Stephen Fuld wrote:
> >>> On 6/11/2022 3:12 PM, MitchAlsup wrote:

> >> I think you want a HW FORK instruction in here someplace
> >> which passes certain registers read only to the AP.
> > <
> > Send Message and Relinquish
> > and
> > Send Message and Continue
> > <
> > Both of these send an ABI full of arguments to a "method"
> > {A method is a means of identifying a thread to run where you may not
> > need to know its name or executing environment--a thread running
> > under a different Guest OS, for example.}
> >>
> >> A later HW JOIN would wait for and sync with the AP,
> >> and transfer back results to registers indicated with a bitmask.
> > <
> > Return Message
<
> That's a co-routine model, and it works so long as the information flow
> is visit-like and the duration is long enough to pay the overhead. If
> all you have is a visit, then everything looks like a visit. I cheer for
> your efforts toward a fast visit; it will help, but it's not a general
> solution. I expect trouble in the dispatch selector, and nightmares in
> debugging.
<
Send Message and Relinquish takes on the order of 20-ish cycles
from the sender until control transfer arrives at the service provider. {And
multiple threads can target the same servicer at essentially the
same time without bothering to check. Services are performed in
arrival order.}
<
Return Message takes another 20-ish cycles from returner to
originating sender.
<
Both of these 20-ish numbers are AFTER the core has wandered
through its cache hierarchy and gotten to the interconnect fabric.
I have seen designs where this additive cost would be as low as 5 and
seen designs where it was greater than 20.

Re: Reconsidering Variable-Length Operands

<691f9b31-76e2-454f-af34-4231c0cdeb8cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25813&group=comp.arch#25813

 by: MitchAlsup - Mon, 13 Jun 2022 22:27 UTC

On Monday, June 13, 2022 at 5:01:44 PM UTC-5, EricP wrote:
> Ivan Godard wrote:
> > On 6/13/2022 9:12 AM, EricP wrote:
> >> Terje Mathisen wrote:
> >> SIMD might have trouble dealing with a vector of structs
> >> with fields of different types and sizes.
> >> I suspect VVM would handle this naturally.
> >
> > Narrowing (or widening) of intermediate results seems difficult because
> > the number of lanes in the bypass changes on the fly. A widen needs to
> > buffer while the following stages of the pipe are double-pumped, while a
> > narrow needs to stall while earlier stages are. Of course, the whole
> > pipe could run at the lane count of the widest data and just ignore the
> > wastage, just as we ignore idle FUs that a code doesn't use.
<
> I didn't pay much attention to previous discussions about lanes
> as I was mostly interested in the VVM alias mapper and its
> forwarding mechanism, so apologies if this was already discussed.
>
> But it seems to me that this "packing problem" is straightforwardly
> handled by separating the operands into Int and FP streams,
> then just packing the operands together into packets
> based on the next higher naturally aligned field.
> When a packet is full or the loop ends, dispatch the operand packet
> to a VVM Reservation station.
<
I do not foresee any problem performing widening or narrowing in
VVM implementations.
>
> The question is how to do inter-packet or scalar operand forwarding.
> And scheduling, specifically knowing when all the operands in a packet
> can launch together, and when they need to break apart because of
> FU resource limitations, or to avoid dependency deadlock.

Re: Reconsidering Variable-Length Operands

<t899gd$91v$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=25815&group=comp.arch#25815

 by: Thomas Koenig - Tue, 14 Jun 2022 06:23 UTC

Michael S <already5chosen@yahoo.com> schrieb:
> On Tuesday, June 14, 2022 at 1:09:17 AM UTC+3, Thomas Koenig wrote:
>> Terje Mathisen <terje.m...@tmsw.no> schrieb:
>> > For a lot of crypto work, the data arrays are actually quite small,
>> > making the overhead of a coprocessor much more significant. I'd rather
>> > dedicate one of my 16 cores to do the AES-256 crypto, along with the key
>> > handling, authentication etc.
>> POWER has both - instructions for an AES round (one round per cycle
>> throughput) and a coprocessor.
>>
>> I suppose it might be possible to push through more than one AES
>> round per cycle, but that would of course cost more silicon area.
>
> One round per cycle - throughput or latency?

Throughput. Of course, a complete AES-256 encryption has 14 rounds, so
throughput for AES-256 on POWER via vcipher is limited to one 128-bit
block every 14 cycles.
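A back-of-the-envelope sketch of that limit; the 4 GHz clock below is an illustrative assumption, not a POWER specification, and the model assumes the 1-round-per-cycle figure above:

```python
ROUNDS = 14        # AES-256 has 14 rounds
BLOCK_BITS = 128   # AES block size

def aes256_throughput_gbps(clock_hz, blocks_in_flight=1):
    # With one round issued per cycle, a single dependent block chain
    # finishes one block every ROUNDS cycles; interleaving independent
    # blocks hides the per-block round chain.
    cycles_per_block = ROUNDS / min(blocks_in_flight, ROUNDS)
    return BLOCK_BITS * clock_hz / cycles_per_block / 1e9

print(aes256_throughput_gbps(4e9))      # one dependent stream: ~36.6 Gbit/s
print(aes256_throughput_gbps(4e9, 14))  # fully interleaved: 512 Gbit/s
```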

Each round xors five values together, with values from a lookup table.

One thing that might be limiting is register encoding - having a
256-bit key plus 128 bits each of plaintext and ciphertext would
need to address eight 64-bit registers at once. That doesn't fit
lightly with POWER's instructions :-)

> 1/cycle throughput - easy, but there are modes (e.g. CBC encrypt) where
> it's not too useful.

I meant per round, not per complete encryption.

>On the other hand, in other modes it's useful,
> but only with non-trivial effort on SW part.
> 1 cycle latency - very very hard. At frequencies typical for POWER processors -
> likely impossible.

Agreed.

Re: Reconsidering Variable-Length Operands

<987a25cc-4005-4d35-940d-f8519316995cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25817&group=comp.arch#25817

 by: Michael S - Tue, 14 Jun 2022 08:37 UTC

On Tuesday, June 14, 2022 at 9:23:12 AM UTC+3, Thomas Koenig wrote:
> Michael S <already...@yahoo.com> schrieb:
> > On Tuesday, June 14, 2022 at 1:09:17 AM UTC+3, Thomas Koenig wrote:
> >> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> >> > For a lot of crypto work, the data arrays are actually quite small,
> >> > making the overhead of a coprocessor much more significant. I'd rather
> >> > dedicate one of my 16 cores to do the AES-256 crypto, along with the key
> >> > handling, authentication etc.
> >> POWER has both - instructions for an AES round (one round per cycle
> >> throughput) and a coprocessor.
> >>
> >> I suppose it might be possible to push through more than one AES
> >> round per cycle, but that would of course cost more silicon area.
> >
> > One round per cycle - throughput or latency?
> Throughput. Of course, a complete AES has 14 rounds, so throughput
> for AES-256 on POWER via vcipher is limited to one 128-bit word
> every 14 cycles.
>
> Each round xors five values together, with values from a lookup table.

xors, shuffles of octets, shuffles (permutations) of bits within octets,
something else... I used to know this stuff (I implemented it in an FPGA
twice), but by now I only remember the principles, not the details.

>
> One thing that might be limiting is register encoding - having a
> 256-bit key plus 128 bits each of plaintext and ciphertext would
> need to address eight 64-bit registers at once. That doesn't fit
> lightly with POWER's instructions :-)

I am not sure that I understand your concerns about encoding.
Each individual round has only 2 inputs - the result of the previous round
and a "round key", which is a pre-calculated product of the key and the
index of the round. If your architecture has enough 128-bit registers, then
you hold all round keys (14 in the case of AES-256) in registers all the
time. If it does not, then you hold as many as fit and load the rest
dynamically from memory (cache). The latter case is more complicated, but
on superscalar HW not necessarily slower.
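That two-input dependency structure can be sketched as follows; `toy_round` is a hypothetical stand-in, not AES - only its shape (previous state plus a precomputed round key) matters here:

```python
ROUNDS = 14  # AES-256

def toy_round(state, round_key):
    # Hypothetical stand-in for SubBytes/ShiftRows/MixColumns/AddRoundKey;
    # an odd multiplier keeps each round a bijection on 128-bit values.
    return ((state ^ round_key) * 0x9E3779B97F4A7C15) & ((1 << 128) - 1)

def encrypt_block(plaintext, round_keys):
    state = plaintext
    for rk in round_keys:  # strictly serial: round i consumes round i-1's output
        state = toy_round(state, rk)
    return state

# Round keys are derived once from the 256-bit key, then reused for every
# block -- so they can stay resident in registers (or spill to cache).
round_keys = [i * 0x0123456789ABCDEF for i in range(ROUNDS)]  # stand-in schedule
ciphertext = encrypt_block(0xDEADBEEF, round_keys)
```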

> > 1/cycle throughput - easy, but there are modes (e.g. CBC encrypt) where
> > it's not too useful.
> I meant per round, not per complete encryption.

Sure. I looked in the manual. On POWER9, latency = 6 cycles.
The problem with CBC-mode encryption is that not only does each round depend
on the previous one, but each 128-bit block of the message depends on the
result of encrypting the previous block. So you can't encrypt blocks in
parallel, making the whole process effectively latency-bound.
Of course, in theory, you can achieve good throughput by encrypting several
messages in parallel, but in practice that is extremely inconvenient.
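The asymmetry is easy to see in code. Here `E`/`D` are a toy invertible permutation standing in for the block cipher (NOT real AES, and not secure): CBC encryption must walk the blocks in order, while CBC decryption has all of its inputs up front.

```python
MASK = (1 << 128) - 1

def E(x):
    # Toy stand-in for block-cipher encryption (odd multiplier => invertible)
    return (x * 0x10001 + 0x9E37) & MASK

def D(y):
    # Inverse of E: 0x10001 is odd, hence invertible mod 2**128
    inv = pow(0x10001, -1, 1 << 128)
    return ((y - 0x9E37) * inv) & MASK

def cbc_encrypt(blocks, iv):
    out, prev = [], iv
    for p in blocks:  # serial: needs the previous ciphertext block
        prev = E(p ^ prev)
        out.append(prev)
    return out

def cbc_decrypt(blocks, iv):
    # Every block's inputs (ciphertext i and ciphertext i-1) already exist,
    # so this loop could run fully in parallel: throughput-bound, not
    # latency-bound.
    prevs = [iv] + blocks[:-1]
    return [D(c) ^ prev for c, prev in zip(blocks, prevs)]
```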

On POWER, I'd guess, the recommended solution is to fill the gaps with
multiple HW threads. But that only works on servers or in benchmarks, not
in real-world client use cases.
Another solution, as you mentioned, is an AES accelerator, which appears to
be specifically crafted for CBC encrypt mode. But using an accelerator is
also less convenient than using a SW library, and at least on POWER9 the
accelerator appears to be attached through DMA, so setup overhead is likely
hundreds or thousands of clock cycles. And the throughput of the accelerator
(on POWER9) is not particularly great - 6.4 Gbit/s for AES-256.
For comparison, nicer operations, e.g. CBC decrypt, done in SW on a core
running at 3.5 GHz could approach 30 Gbit/s.
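As a sanity check on that comparison (my arithmetic on the figures above, not a measurement):

```python
sw_gbps = 30.0    # claimed SW CBC-decrypt rate from the comparison above
clock_ghz = 3.5   # assumed core clock from the comparison above

bits_per_cycle = sw_gbps / clock_ghz     # ~8.57 bits of ciphertext per cycle
cycles_per_block = 128 / bits_per_cycle  # ~14.9 cycles per 128-bit block,
                                         # i.e. roughly one AES round per
                                         # cycle sustained over 14 rounds
print(bits_per_cycle, cycles_per_block)
```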

> >On the other hand, in other modes it's useful,
> > but only with non-trivial effort on SW part.
> > 1 cycle latency - very very hard. At frequencies typical for POWER processors -
> > likely impossible.
> Agreed.

Still, 6 cycles is not the best possible latency at [relatively modest]
POWER9 frequencies. Maybe, when they designed the unit, they were hoping
that POWER9 would be clocked higher, similarly to POWER8. Or maybe it just
wasn't considered important.

Re: Reconsidering Variable-Length Operands

<g6fhahtmbpm5gsemar0624pq8url2ek93g@4ax.com>

https://www.novabbs.com/devel/article-flat.php?id=25826&group=comp.arch#25826

 by: George Neuner - Tue, 14 Jun 2022 17:02 UTC

On Mon, 13 Jun 2022 14:33:27 -0700, Ivan Godard
<ivan@millcomputing.com> wrote:

> :
>there's an intellectual tractability problem: imperative time lines seem
>to be difficult enough for the next programmer off a bus, have we any
>hope to get the world to write in Lisp?

In many ways, Python appears to be a poor implementation of Lisp.

YMMV,
George

Re: Reconsidering Variable-Length Operands

<e503b5c7-9474-4afe-a155-ec0b3c7b52fen@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25827&group=comp.arch#25827

 by: MitchAlsup - Tue, 14 Jun 2022 17:28 UTC

On Monday, June 13, 2022 at 4:33:30 PM UTC-5, Ivan Godard wrote:

> there's an intellectual tractability problem: imperative time lines seem
> to be difficult enough for the next programmer off a bus, have we any
> hope to get the world to write in Lisp?
<
If you run your C program through the preprocessor, you do get "lots of
infuriating small parentheses".

Re: Reconsidering Variable-Length Operands

<165525862436.16705.4251872628076337042@media.vsta.org>


https://www.novabbs.com/devel/article-flat.php?id=25847&group=comp.arch#25847

 by: Andy Valencia - Wed, 15 Jun 2022 02:03 UTC

George Neuner <gneuner2@comcast.net> writes:
> In many ways, Python appears to be a poor implementation of Lisp.

My memory is that until, hmmm, Python 2.7 I think, this:

l = [...]
sum = 0
for x in xrange(len(l)):
    sum += l[x]

Was an O(n^2) operation (lists were basically car/cdr constructs).
2.7 switched them to a more obvious data structure. One saw very
aggressive use of tuples and generators to avoid subscripting
what--from the outside--looked like arrays.

I don't know about a "poor implementation", but it seems clear
that awareness of Lisp was present during Python's development.

(I wonder about generators and if/how they cribbed from Icon.)

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html

Re: Reconsidering Variable-Length Operands

<t8c16a$ad$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25848&group=comp.arch#25848

 by: David Brown - Wed, 15 Jun 2022 07:19 UTC

On 15/06/2022 04:03, Andy Valencia wrote:
> George Neuner <gneuner2@comcast.net> writes:
>> In many ways, Python appears to be a poor implementation of Lisp.
>
> My memory is until, hmmm, Python 2.7 I think, this:
>
> l = [...]
> sum = 0
> for x in xrange(len(l)):
>     sum += l[x]
>
> Was an O(n^2) operation (lists were basically car/cdr constructs).
> 2.7 switched them to a more obvious data structure. One saw very
> aggressive use of tuples and generators to avoid subscripting
> what--from the outside--looked like arrays.
>
> I don't know about a "poor implementation", but it seems clear
> that awareness of Lisp was present during Python's development.
>
> (I wonder about generators and if/how they cribbed from Icon.)
>

How would that compare to the way you would write this in Python, rather
than what looks like a translation of some other language (C, perhaps?)
into Python?

In Python, you'd write :

l = [...]
s = sum(l)

If it is more complex than simple addition handled by the built-in "sum"
function, so that you need a loop, you'd write :

l = [...]
s = 0
for x in l :
    s += x

or you'd use a functional programming style (in Python 3, "reduce" lives
in functools) :

from functools import reduce
s = reduce(lambda a, b : a + b, l)

Efficiency in Python is primarily about writing expressions that can be
handled by the C runtime and library - thus for big numerical
calculations, you use numpy (fast C libraries wrapped in Python
convenience) rather than raw Python code.

You can certainly criticise the speed of Python in many ways.
(Apparently the upcoming Python 3.11 can be up to twice as fast as 3.10,
which also means that even 3.10 is at least twice as slow as it needed
to be.) But it only makes sense to look at the speed of code written in
the appropriate way for any given language.
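David's point about idiomatic style can be made concrete; a minimal sketch (the list contents are invented for illustration) showing the three styles computing the same result:

```python
from functools import reduce

l = [3, 1, 4, 1, 5, 9, 2, 6]

# Idiomatic: the built-in sum() does the loop in C.
s_builtin = sum(l)

# Explicit loop: same answer, one interpreted iteration per element.
s_loop = 0
for x in l:
    s_loop += x

# Functional style: reduce() lives in functools in Python 3.
s_reduce = reduce(lambda a, b: a + b, l)
```

All three produce the same value; the difference being pointed at is where the per-element work happens: in C for sum(), in the interpreter for the other two.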

Re: Reconsidering Variable-Length Operands

<t8deb2$v7s$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25855&group=comp.arch#25855

 by: Stephen Fuld - Wed, 15 Jun 2022 20:10 UTC

On 6/13/2022 2:33 PM, Ivan Godard wrote:
> On 6/13/2022 12:32 PM, Stephen Fuld wrote:
>> On 6/13/2022 12:01 PM, Ivan Godard wrote:

Big snip - In doing the snip, I may have messed up the attributions. If
so, I apologize.

>>> As  with all async units. What happens if the AP is used by an
>>> interrupt handler, inadvertently or not? By a second thread? We know
>>> how to deal with that for unshared i/o devices - at too great an
>>> overhead for a crypto  unit. Or we can save the state, like we do
>>> regs - and mess up call and task switch. Or stall until the unit
>>> completes, and have livelock and response time problems.
>>>
>>> The only solution found so far has been to push it all off to software.
>>
>> That is one of the things that, I hope, the hardware queuing mechanism
>> that Mitch alluded to some time ago in connection with I/O starts will
>> take care of, at least most of the time.
>>
>> My idea of an overview of how it would work is essentially, a "Start
>> Attached Processor" instruction would activate the AP if it is
>> available, else put an entry on the APs HW queue and then return to
>> executing instructions.  The SAP instruction acts similarly to other
>> long running instructions (Mitch's example was a load miss that had to
>> go to DRAM).  Execution can continue until the results of the
>> instruction are needed.  Eventually, execution will stall when no more
>> instructions that don't need the results are available.  This all fits
>> into the hardware, but, as I wrote in another post, I see some
>> problems with software handling, especially with long running
>> operations like encrypting a long string.  But essentially, most of
>> your questions are answered "Just like any other long running
>> instruction."
>>
>> Clearly, there are lots of things to work out, but I think this
>> approach is worth exploring.
>>
>>
>
> The question is "how long is long?"  We can, with complications,
> tolerate times in the hundreds of cycles by using an instruction model.
> We can, with complications, tolerate times in the billions of cycles by
> using an i/o device model. What then for times in the thousands and
> millions of cycles?
>
> Uncanny valley.

I think that is pretty much correct. If Mitch can indeed deliver a
message passing mechanism (including messages to/from hardware
components) with the about 20 cycles latency he is aiming for, I think
that will fix the problem certainly for the millions of cycles, and even
for most of the thousands. We will see.
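The start-now, stall-on-use model sketched above can be illustrated with a toy simulation (the class and method names here are invented for illustration, not taken from any real ISA):

```python
from collections import deque

class ToyAttachedProcessor:
    """Toy model of the scheme above: a 'Start Attached Processor'
    request queues immediately; the requester only stalls when it
    first asks for the result."""

    def __init__(self):
        self.queue = deque()   # pending (ticket, work) entries
        self.done = {}         # ticket -> completed result
        self.next_ticket = 0

    def start(self, work):
        """Analogue of the SAP instruction: enqueue and return at once."""
        ticket = self.next_ticket
        self.next_ticket += 1
        self.queue.append((ticket, work))
        return ticket

    def result(self, ticket):
        """Analogue of the consuming instruction: 'stall' by draining
        the queue until this ticket's work has completed."""
        while ticket not in self.done:
            t, work = self.queue.popleft()
            self.done[t] = work()   # the attached processor runs here
        return self.done[ticket]

ap = ToyAttachedProcessor()
t1 = ap.start(lambda: 2 + 3)     # issue, keep executing other work
t2 = ap.start(lambda: 10 * 10)   # second request queues behind it
r2 = ap.result(t2)               # first use of a result: stall/drain
r1 = ap.result(t1)               # already done by now: no stall
```

The analogy: start() is the SAP instruction (issue and keep going), and result() is the first consuming instruction, which is where any stall is paid.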

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Reconsidering Variable-Length Operands

<tlt5ug$1d3ld$2@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=29175&group=comp.arch#29175

 by: Paul A. Clayton - Sat, 26 Nov 2022 13:56 UTC

MitchAlsup wrote:
[snip]
> In todays fabrication technology, I am more in the mood for a 512-bit
> "bus" to memory (cache line in a single beat).

This would seem to be impractical for an interface to conventional
DRAM. (Presumably this is another annoyance with the price-per-bit
tyranny. GDDR seems to allow wider chip interfaces. HBM seems to
use narrower channel interfaces between the interface chip and the
host while being very wide between DRAM chips and the interface
chip — it is interesting that the substantial area loss from
through silicon vias is acceptable for the bandwidth gains but
similar area costs are not acceptable for latency reduction.) DDR5
DIMMs even split the 64-bit data interface into two channels!

> Heck, once you get 16-32
> cores on a die you are going to need this kind of BW.

Do wider channels really help bandwidth that much compared to more
narrower channels given high thread count (and out-of-order [and
perhaps prefetching] providing some memory-level parallelism)?
Caches would seem to reduce miss rates such that multiple threads
would be generating many misses at the same time (i.e., misses
seem likely to be phased, and the phases for different threads are
unlikely to overlap — good for localized bandwidth demand, bad for
using threads to increase MLP).

I may have mentioned at some point the concept of a wide interface
with multiple phases in an intra-block non-uniform-latency cache.
This would be applying width pipelining to cache access; the first
chunk (possibly predicted critical chunk placed closest to
expected user) arrives faster and later accesses can use those
interface wires for their first chunk in the next cycle. Higher
latency chunks might be the same size or be larger (depending on
physical design tradeoffs — e.g., the first chunks might be in
lower density, lower latency SRAM arrays and favor only 64-bit
width), but each beat a full cache block could be transferred
(high bandwidth).

I am skeptical that such intra-block NUCA makes sense. The concept
is sufficiently obvious that if it was worthwhile it would already
have been implemented. (I assume that such an implementation would
have been detected even if never documented.) Even block-based
NUCA does not seem to be generally implemented.

I mention this mainly to point out that a cache block wide
interface does not have to transfer the entire cache block in a
single beat — width pipelining is possible. Yet I would also love
for my mentioning such to inspire an actual implementation.☺
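The width-pipelined transfer described above can be sketched as a toy timing model (all latencies and widths here are invented for illustration):

```python
def chunk_arrival_times(block_bits=512, chunk_bits=128,
                        first_latency=2, beat=1):
    """Toy width-pipelined cache read: the critical chunk arrives after
    first_latency cycles, and each remaining chunk of the block follows
    on a successive beat over the same interface wires."""
    chunks = block_bits // chunk_bits
    return [first_latency + i * beat for i in range(chunks)]

times = chunk_arrival_times()   # [2, 3, 4, 5]
```

The consumer of the critical chunk can restart early, while a full block still crosses the interface every beat once the pipeline fills.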

Re: Reconsidering Variable-Length Operands

<2022Nov26.171219@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=29176&group=comp.arch#29176

 by: Anton Ertl - Sat, 26 Nov 2022 16:12 UTC

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>MitchAlsup wrote:
>[snip]
>> In todays fabrication technology, I am more in the mood for a 512-bit
>> "bus" to memory (cache line in a single beat).
>
>This would seem to be impractical for an interface to conventional
>DRAM.

I don't know if Mitch Alsup is thinking about the interface to DRAM,
or to outer cache levels.

For DRAM, it would certainly be impractical, because the difference
between DRAM cycle time and transfer speeds has led to burst lengths
of at least 16 beats for DDR5 (8 for DDR4 and DDR3, 4 for DDR2, 2 for
DDR). In order to serve 64-byte cache lines in 16 beats, DDR5 has
narrowed each channel to 32 bits. You don't get any real speedup with
a 512-bit channel with burst lengths of 1 (i.e., SDR RAM), because you
now have to wait for the DRAM cycle to complete before you can perform
the next access.
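The burst arithmetic above is easy to check; a small sketch using the burst lengths just quoted:

```python
def channel_bits(line_bytes, burst_length):
    """Channel width needed so one minimum-length burst delivers
    exactly one cache line."""
    return line_bytes * 8 // burst_length

ddr5 = channel_bits(64, 16)   # DDR5: 16 beats -> 32-bit channels
ddr4 = channel_bits(64, 8)    # DDR4: 8 beats  -> 64-bit channels
```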

But I am sure that Mitch Alsup knows this and he was thinking about
interfaces to outer cache levels.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Reconsidering Variable-Length Operands

<jTrgL.138800$fg35.137747@fx10.iad>


https://www.novabbs.com/devel/article-flat.php?id=29177&group=comp.arch#29177

 by: Scott Lurndal - Sat, 26 Nov 2022 17:18 UTC

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>MitchAlsup wrote:
>[snip]
>> In todays fabrication technology, I am more in the mood for a 512-bit
>> "bus" to memory (cache line in a single beat).
>
>This would seem to be impractical for an interface to conventional
>DRAM. (Presumably this is another annoyance with the price-per-bit
>tyranny. GDDR seems to allow wider chip interfaces. HBM seems to
>use narrower channel interfaces between the interface chip and the
>host while being very wide between DRAM chips and the interface
>chip — it is interesting that the substantial area loss from
>through silicon vias is acceptable for the bandwidth gains but
>similar area costs are not acceptable for latency reduction.) DDR5
>DIMMs even split the 64-bit data interface into two channels!

Most modern multicore CPUs use a mesh structure in place of a bus.
The ARM CMN-600, for example, has two 256-bit data channels, one
for each direction, and supports up to 16 memory controllers.

https://developer.arm.com/documentation/100180/0302/?lang=en

Striping the physical address space across multiple memory controllers
allows greater bandwidth from memory to the LLC/SLC, which is distributed
across points in the mesh.
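A minimal sketch of such striping (the interleave granularity and controller count here are illustrative assumptions, not the CMN-600's actual hashing):

```python
def controller_for(addr, line_bytes=64, n_controllers=16):
    """Toy striping policy: interleave consecutive cache lines across
    successive memory controllers."""
    return (addr // line_bytes) % n_controllers

# Four consecutive 64-byte lines land on four different controllers,
# so a streaming read spreads its bandwidth across the mesh.
targets = [controller_for(a) for a in range(0, 4 * 64, 64)]
```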

Other processor vendor interconnect implementations may have wider
data channels (e.g. 512 bits) (Intel Mesh or AMD InfinityFabric,
although it's not clear that Intel has completely abandoned the
ring structured bus or crossbar switch).

https://ieeexplore.ieee.org/document/6275442?reload=true

Typical internal busses to PCIe or on-board I/O controllers are already 256,
512 (or more) bits wide in modern SoCs (consider the bandwidth requirements
for a DPU managing multiple 100/200/400 Gb/sec ethernet ports at line rate, for example).

Re: Reconsidering Variable-Length Operands

<5594d3d6-f9bf-45bc-90da-94d9e85f55a2n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=29179&group=comp.arch#29179

 by: MitchAlsup - Sat, 26 Nov 2022 19:33 UTC

On Saturday, November 26, 2022 at 10:22:39 AM UTC-6, Anton Ertl wrote:
> "Paul A. Clayton" <paaron...@gmail.com> writes:
> >MitchAlsup wrote:
> >[snip]
> >> In todays fabrication technology, I am more in the mood for a 512-bit
> >> "bus" to memory (cache line in a single beat).
> >
> >This would seem to be impractical for an interface to conventional
> >DRAM.
> I don't know if Mitch Alsup is thinking about the interface to DRAM,
> or to outer cache levels.
>
> For DRAM, it would certainly impractical, because the difference
> between DRAM cycle time and transfer speeds has led to burst lengths
> of at least 16 beats for DDR5 (8 for DDR4 and DDR3, 4 for DDR2, 2 for
> DDR). In order to serve 64-byte cache lines in 16 beats, DDR5 has
> narrowed each channel to 32 bits. You don't get any real speedup with
> a 512-bit channel with burst lengths of 1 (i.e., SDR RAM), because you
> now have to wait for the DRAM cycle to complete before you can perform
> the next access.
<
Direct copy and paste from Wikipedia:
<
HBM memory bus is very wide in comparison to other DRAM memories
such as DDR4 or GDDR5. An HBM stack of four DRAM dies (4‑Hi) has two
128‑bit channels per die for a total of 8 channels and a width of 1024 bits
in total. A graphics card/GPU with four 4‑Hi HBM stacks would therefore
have a memory bus with a width of 4096 bits. In comparison, the bus width
of GDDR memories is 32 bits, with 16 channels for a graphics card with a
512‑bit memory interface.[12] HBM supports up to 4 GB per package.
>
> But I am sure that Mitch Alsup knows this and he was thinking about
> interfaces to outer cache levels.
>
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
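The widths quoted from Wikipedia above are straightforward to sanity-check:

```python
def hbm_stack_width(dies=4, channels_per_die=2, channel_bits=128):
    """Bus width of one HBM stack, using the figures quoted above."""
    return dies * channels_per_die * channel_bits

stack = hbm_stack_width()   # one 4-Hi stack: 1024 bits
card = 4 * stack            # four stacks on a GPU: 4096 bits
```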

Re: Reconsidering Variable-Length Operands

<tltqdq$1en60$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=29180&group=comp.arch#29180

 by: Paul A. Clayton - Sat, 26 Nov 2022 19:46 UTC

Anton Ertl wrote:
> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>> MitchAlsup wrote:
>> [snip]
>>> In todays fabrication technology, I am more in the mood for a 512-bit
>>> "bus" to memory (cache line in a single beat).
>>
>> This would seem to be impractical for an interface to conventional
>> DRAM.
>
> I don't know if Mitch Alsup is thinking about the interface to DRAM,
> or to outer cache levels.
>
> For DRAM, it would certainly be impractical, because the difference
> between DRAM cycle time and transfer speeds has led to burst lengths
> of at least 16 beats for DDR5 (8 for DDR4 and DDR3, 4 for DDR2, 2 for
> DDR). In order to serve 64-byte cache lines in 16 beats, DDR5 has
> narrowed each channel to 32 bits. You don't get any real speedup with
> a 512-bit channel with burst lengths of 1 (i.e., SDR RAM), because you
> now have to wait for the DRAM cycle to complete before you can perform
> the next access.
>
> But I am sure that Mitch Alsup knows this and he was thinking about
> interfaces to outer cache levels.

I am also sure he knows that, but I am not certain if he meant
that he would prefer a wide interface to memory even if he knew it
would not happen, similarly to how he would prefer a greater
emphasis on latency for DRAM.

I do not think the cycle time is physically required to be as long
as it is (other than for cost-per-bit optimization). Obviously
designing each chip as if it was two chips would allow halving the
cycle time when addressing different "chips" and I rather suspect
that with sufficient power accessing different banks within a chip
would not be problematic. Even the single internal array cycle
time is presumably a cost-driven choice.

Mitch Alsup has mentioned the latency achieved for an embedded
DRAM design he worked on. IBM has used embedded DRAM for outer
levels of cache in POWER and zArch. These designs included not
merely latency improvements but "cycle time" improvements (though
I seem to recall IBM's design had cycle time limits much greater
than SRAM).

(I think it is kind of sad that one cannot get, e.g., 1 GiB of low
latency DRAM and use it effectively. Effective use would probably
require early L3 cache miss detection/prediction, which would be
somewhat challenging I suspect. With more integrated memory — on
board or even in-package — there should be opportunities not only
for increasing bandwidth but also for reducing latency.)

Re: Reconsidering Variable-Length Operands

<9b7ed466-344f-4312-9a68-fc539f58f00cn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=29181&group=comp.arch#29181

 by: MitchAlsup - Sat, 26 Nov 2022 20:48 UTC

On Saturday, November 26, 2022 at 1:46:06 PM UTC-6, Paul A. Clayton wrote:
> Anton Ertl wrote:
> > "Paul A. Clayton" <paaron...@gmail.com> writes:
> >> MitchAlsup wrote:

> > But I am sure that Mitch Alsup knows this and he was thinking about
> > interfaces to outer cache levels.
<
> I am also sure he knows that, but I am not certain if he meant
> that he would prefer a wide interface to memory even if he knew it
> would not happen, similarly to how he would prefer a greater
> emphasis on latency for DRAM.
<
It depends on what your technology allows you to do. At one end
of the scale is 1 wire operating at optical frequencies; at the other
end is a number of wires equal to a cache line. You can't go lower
than 1 wire, and you don't need more than 1 cache line per cycle
per step in the interconnect.
>
> I do not think the cycle time is physically required to be as long
> as it is (other than for cost-per-bit optimization). Obviously
> designing each chip as if it was two chips would allow halving the
<
The dominant delay is not logic but wires; therefore you get about
71% of the delay if you split the chip in half: SQRT( ½ ) ≈ 0.71.
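The 71% figure follows from simple scaling; a quick check, assuming delay is proportional to wire length and wire length to the square root of die area:

```python
import math

def relative_wire_delay(area_fraction):
    """Wire delay relative to a full-size die, assuming delay scales
    with wire length and length with the square root of area."""
    return math.sqrt(area_fraction)

half_chip = relative_wire_delay(0.5)   # about 0.707, i.e. ~71%
```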
<
> cycle time when addressing different "chips" and I rather suspect
> that with sufficient power accessing different banks within a chip
> would not be problematic. Even the single internal array cycle
> time is presumably a cost-driven choice.
>
> Mitch Alsup has mentioned the latency achieved for an embedded
> DRAM design he worked on. IBM has used embedded DRAM for outer
> levels of cache in POWER and zArch. These designs included not
> merely latency improvements but "cycle time" improvements (though
> I seem to recall IBM's design had cycle time limits much greater
> than SRAM).
<
I built a DRAM macro--including layout and SPICE modeling. Using
the same sense amplifiers as I used for my SRAM design, the access
time for DRAM was only a few picoseconds longer than SRAM, and
the DRAM was at least 6× more dense. Cycle time was 3× longer, but
access time was not usefully different.
<
What most people don't realize is that most of the delay in accessing
DRAM has to do with intra-chip wire delay, not cell access. Another large
component is waiting for Power and Ground to stabilize after an ACTivate
or PREcharge; so one can (again) use those sensitive analog sense amps.
>
> (I think it is kind of sad that one cannot get, e.g., 1 GiB of low
> latency DRAM and use it effectively. Effective use would probably
> require early L3 cache miss detection/prediction, which would be
> somewhat challenging I suspect. With more integrated memory — on
> board or even in-package — there should be opportunities not only
> for increasing bandwidth but also for reducing latency.)
<
DRAM manufacturers spend billions ($20B+) to put up fabs to manufacture
DRAMs. DRAM suppliers put up $30M factories to build the packages those
DRAM chips go in; the cost of the chip and the cost of the package are
roughly equal in true volume manufacturing.
<
Would you be willing to pay 50% more for a DRAM with 2× as many data
pins ? {Yeah, me neither}.

Re: Reconsidering Variable-Length Operands

<2022Nov27.003336@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=29183&group=comp.arch#29183

 by: Anton Ertl - Sat, 26 Nov 2022 23:33 UTC

"Paul A. Clayton" <paaronclayton@gmail.com> writes:
>I do not think the cycle time is physically required to be as long
>as it is

I think it is. The cycle time is determined by the time for reading
the contents of the DRAM cells, amplifying the results, then writing
the reconstructed content of the bank back.

>(other than for cost-per-bit optimization). Obviously
>designing each chip as if it was two chips would allow halving the
>cycle time when addressing different "chips"

Yes, you can perform n reads per DRAM cycle with n channels; people
are doing that with independent channel contents, but yes, if you are
prepared to get only half the usable memory for the same money, you
can write the same stuff to two channels at the same time, and read
from both channels at the same time. Not sure what this is intended to
achieve, though.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Reconsidering Variable-Length Operands

<24a2d66d-4a19-4d00-88b6-46af85b0d6a2n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=29185&group=comp.arch#29185

 by: Michael S - Sun, 27 Nov 2022 00:20 UTC

On Sunday, November 27, 2022 at 1:47:58 AM UTC+2, Anton Ertl wrote:
> "Paul A. Clayton" <paaron...@gmail.com> writes:
> >I do not think the cycle time is physically required to be as long
> >as it is
> I think it is. The cycle time is determined by the time for reading
> the contents of the DRAM cells, amplifying the results, then writing
> the reconstructed content of the bank back.

That's true for Row Miss and even more so for Row Conflict.
But not for Row Hit, which is hopefully the most common case.
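
The Hit/Miss/Conflict distinction can be sketched as a toy open-page
controller model. The cycle counts below are illustrative placeholders,
not numbers from any datasheet:

```python
# Toy open-page DRAM timing model; latencies are illustrative
# controller-cycle counts, not vendor timing parameters.
T_CAS = 4   # column read on an already-open row
T_RCD = 4   # ACTivate-to-read delay
T_RP  = 4   # PREcharge delay before a new ACTivate

def access_latency(open_row, requested_row):
    """Return the latency of one access given the currently open row."""
    if open_row == requested_row:
        return T_CAS                    # Row Hit: no ACT, no PRE needed
    if open_row is None:
        return T_RCD + T_CAS            # Row Miss: ACT, then read
    return T_RP + T_RCD + T_CAS         # Row Conflict: PRE, ACT, read

print(access_latency(7, 7), access_latency(None, 7), access_latency(3, 7))
```

Only the Conflict case pays the full read-amplify-writeback cycle before
a new row can even be activated; a Hit never touches the cells at all.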

> >(other than for cost-per-bit optimization). Obviously
> >designing each chip as if it was two chips would allow halving the
> >cycle time when addressing different "chips"
> Yes, you can perform n reads per DRAM cycle with n channels; people
> are doing that with independent channel contents, but yes, if you are
> prepared to get only half the usable memory for the same money, you
> can write the same stuff to two channels at the same time, and read
> from both channels at the same time. Not sure what this is intended to
> achieve, though.
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Reconsidering Variable-Length Operands

<be83ae79-3b72-4c4b-afb0-7867d62160d3n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=29187&group=comp.arch#29187

 by: Michael S - Sun, 27 Nov 2022 00:33 UTC

On Sunday, November 27, 2022 at 2:20:54 AM UTC+2, Michael S wrote:
> On Sunday, November 27, 2022 at 1:47:58 AM UTC+2, Anton Ertl wrote:
> > "Paul A. Clayton" <paaron...@gmail.com> writes:
> > >I do not think the cycle time is physically required to be as long
> > >as it is
> > I think it is. The cycle time is determined by the time for reading
> > the contents of the DRAM cells, amplifying the results, then writing
> > the reconstructed content of the bank back.
> That's true for Row Miss and even more so for Row Conflict.
> But not for Row Hit, which is hopefully the most common case.

Oh, I answered before thinking.
No, it's not true even for Row Miss/Conflict.
A Row Miss today is ~13 ns, while the cycle time of the control machine
within a DRAM chip is typically 2.5 ns. So even today a Row Miss takes
multiple cycles, which clearly demonstrates that the two are in
principle not related to each other.
But I am not sure that Paul is correct that the 2.5 ns is due to
optimization for cost per bit. More likely it's due to optimization for
several things, but primarily for power and for simplicity of the DRAM
chip's interface. A faster control machine would likely require one
more supply voltage.
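
(A quick sanity check of the ratio implied above, using the same two
figures from the text, ~13 ns Row Miss vs. a 2.5 ns internal cycle:)

```python
# Figures from the post: ~13 ns Row Miss latency vs. a typical
# 2.5 ns cycle time for the DRAM chip's internal control machine.
row_miss_ns = 13.0
control_cycle_ns = 2.5

cycles = row_miss_ns / control_cycle_ns
print(f"Row Miss spans about {cycles:.1f} control-machine cycles")
```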

> > >(other than for cost-per-bit optimization). Obviously
> > >designing each chip as if it was two chips would allow halving the
> > >cycle time when addressing different "chips"
> > Yes, you can perform n reads per DRAM cycle with n channels; people
> > are doing that with independent channel contents, but yes, if you are
> > prepared to get only half the usable memory for the same money, you
> > can write the same stuff to two channels at the same time, and read
> > from both channels at the same time. Not sure what this is intended to
> > achieve, though.
> > - anton
> > --
> > 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> > Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Re: Reconsidering Variable-Length Operands

<tnl5iu$3ngpd$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=29590&group=comp.arch#29590

 by: Paul A. Clayton - Sat, 17 Dec 2022 19:33 UTC

MitchAlsup wrote:
> On Saturday, November 26, 2022 at 1:46:06 PM UTC-6, Paul A. Clayton wrote:
[snip]
>> I do not think the cycle time is physically required to be as long
>> as it is (other than for cost-per-bit optimization). Obviously
>> designing each chip as if it was two chips would allow halving the
> <
> The dominant delay is not logic but wires; therefore, if you split the
> chip in half you only reduce the delay to 71%: SQRT( ½ )

My statement was the result of a misunderstanding. I had thought the
cycle time referenced was for the chip and not the specific DRAM
array(s) accessed. I only meant that clearly two different DRAM
chips could be accessed with minimal delay between accesses since
they are completely independent. Accesses to the same array but a
different row — what was intended by Anton Ertl — would have a
longer cycle time.

I was not thinking about latency of internal array/sense amplifier
row to external pads, which seems to be what your statement concerns.
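
(The 71% figure quoted above follows from a simple scaling argument: if
delay is dominated by wire length and wire length scales as the square
root of the area reached, then halving the area scales the delay by √½.
A minimal check, with no absolute units assumed:)

```python
import math

def wire_delay_scale(area_fraction):
    # Wire length ~ sqrt(area); delay assumed proportional to length
    # (repeatered RC wires). Purely a scaling argument, no absolute units.
    return math.sqrt(area_fraction)

print(f"half the chip -> {wire_delay_scale(0.5):.1%} of the wire delay")
```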

(This does bring up a possibility I may have mentioned before:
width-pipelined access from multiple arrays at varying distances
from the output pads, possibly with variation in row size (wire
delay and sense-amplifier sensitivity) or even capacitor sizing.
This would exploit the narrower transmission to hide the latency
of more distant or otherwise higher-latency arrays. Accessing more
arrays would obviously cost more energy and would amplify Row
Hammer effects.)

[snip]
> I built a DRAM macro--including layout and SPICE modeling. Using
> the same sense amplifiers as I used for my SRAM design, the access
> time for DRAM was only a few picoseconds longer than SRAM and
> the DRAM was at least 6× more dense. Cycle time was 3× but access
> time was not usefully different.

Obviously higher density would reduce wire delay at a larger scale
*and* facilitate more on-chip capacity for which higher bandwidth
would be less expensive.

> What most people don't realize is that most of the delay in accessing
> DRAM has to do with intra-chip wire delay, not cell access. Another large
> component is waiting for Power and Ground to stabilize after an ACTivate
> or PREcharge; so one can (again) use those sensitive analog sense amps.

There is also latency on the other end: checking three levels of
cache, possibly multiple on-chip-network hops to the memory
controller, memory controller delay, and probably other factors.
Some of these could perhaps be reduced (cache miss prediction or
hit filtering, on-chip NUMA [possibly even sending on-chip data to
a processing node closer to the memory interface, overlapping that
latency with the memory access latency???]). Chip-to-chip latency
is also non-zero.

>> (I think it is kind of sad that one cannot get, e.g., 1 GiB of low
>> latency DRAM and use it effectively. Effective use would probably
>> require early L3 cache miss detection/prediction, which would be
>> somewhat challenging I suspect. With more integrated memory — on
>> board or even in-package — there should be opportunities not only
>> for increasing bandwidth but also for reducing latency.)
>
> DRAM manufacturers spend billions ($20B+) to put up fabs to manufacture
> DRAMs. DRAM suppliers put up $30M factories to build the packages those
> DRAM chips go in; the cost of the chip and the cost of the package are
> roughly equal in true volume manufacturing.
>
> Would you be willing to pay 50% more for a DRAM with 2× as many data
> pins ? {Yeah, me neither}.

Given that High Bandwidth Memory has a significant market, at
least some users are willing to pay for higher *bandwidth*.
Perhaps a market might be built for lower latency.

A 50% per-bit price premium might not be horrifying for a
half-latency memory with 10% the capacity of main memory. If
memory were half the system cost, this would increase system cost
by less than 10%.
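
(Working that arithmetic out explicitly — all three inputs are the
hypothetical numbers from the paragraph above:)

```python
premium = 0.50        # +50% price per bit for the low-latency DRAM
fast_fraction = 0.10  # fraction of main-memory capacity it replaces
memory_share = 0.50   # memory's share of total system cost

# Added cost as a fraction of total system cost.
system_increase = memory_share * fast_fraction * premium
print(f"system cost increase: {system_increase:.1%}")
```

That comes to 2.5%, comfortably under the 10% bound in the text.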

(This also raises the question of whether a "small low latency"
memory could be sufficiently useful. The effectiveness would
depend on the workload, software optimization, *and* hardware
optimization. Using it as an L4 cache has the danger of
increasing average access time if miss detection adds much
latency. Using it as a fast memory presents utilization issues
from page-sized allocations and other factors.)
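
(The L4 danger can be made concrete with a standard average-memory-
access-time comparison. The latencies below are illustrative
placeholders, not measurements: a miss pays the L4 lookup *plus* the
DRAM access, so a low hit rate makes things worse than no L4 at all.)

```python
def amat_with_l4(hit_rate, l4_ns, dram_ns):
    # Misses pay the L4 lookup latency in addition to DRAM latency.
    return hit_rate * l4_ns + (1.0 - hit_rate) * (l4_ns + dram_ns)

DRAM_NS = 80.0  # baseline memory latency (illustrative)
L4_NS = 45.0    # the hypothetical low-latency memory used as a cache

for hit in (0.2, 0.5, 0.8):
    print(f"hit rate {hit:.0%}: AMAT {amat_with_l4(hit, L4_NS, DRAM_NS):.0f} ns"
          f" vs. {DRAM_NS:.0f} ns without L4")
```

With these placeholder numbers the L4 only wins once the hit rate is
fairly high; at 20% hits the average is worse than going straight to DRAM.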

I would probably not be willing to pay that premium because my
current computer is fast enough for writing, content display (even
most PDFs — 2000-page ISA manuals can be slow — and heavily
javascript inflicted web pages), and light gaming. I also tend to
be a little stingy on personal spending — partially financial
constraints, partially ethics, and partially self-conceit issues.
I also tend to fall prey to the false emphasis on low
unit/acquisition cost versus value and total cost, to middle
option bias, and to other fallacies as well as decision phobia.

Some organizations are able to make somewhat more rational and
information-based decisions.

Since such memory would presumably be packaged with the CPU, some
of the packaging-cost versus pad-count considerations might be
reduced. (The packaging cost would be higher, but Apple is already
integrating DRAM into the processor package [I think].)
