devel / comp.arch / Re: RISC-V vs. Aarch64

Subject - Author
* RISC-V vs. Aarch64 - Anton Ertl
+* Re: RISC-V vs. Aarch64 - MitchAlsup
|+* Re: RISC-V vs. Aarch64 - Anton Ertl
||`* Re: RISC-V vs. Aarch64 - MitchAlsup
|| +- Re: RISC-V vs. Aarch64 - BGB
|| `- Re: RISC-V vs. Aarch64 - Anton Ertl
|+* Re: RISC-V vs. Aarch64 - Ivan Godard
||+- Re: RISC-V vs. Aarch64 - robf...@gmail.com
||+- Re: RISC-V vs. Aarch64 - MitchAlsup
||`* Re: RISC-V vs. Aarch64 - Quadibloc
|| `* Re: RISC-V vs. Aarch64 - Quadibloc
||  `- Re: RISC-V vs. Aarch64 - Quadibloc
|+* Re: RISC-V vs. Aarch64 - Marcus
||+- Re: RISC-V vs. Aarch64 - BGB
||`* Re: RISC-V vs. Aarch64 - MitchAlsup
|| +- Re: RISC-V vs. Aarch64 - BGB
|| `- Re: RISC-V vs. Aarch64 - Ivan Godard
|`- Re: RISC-V vs. Aarch64 - MitchAlsup
`* Re: RISC-V vs. Aarch64 - BGB
 +* Re: RISC-V vs. Aarch64 - MitchAlsup
 |+- Re: RISC-V vs. Aarch64 - MitchAlsup
 |+* Re: RISC-V vs. Aarch64 - Thomas Koenig
 ||+* Re: RISC-V vs. Aarch64 - Ivan Godard
 |||`* Re: RISC-V vs. Aarch64 - EricP
 ||| `- Re: RISC-V vs. Aarch64 - Ivan Godard
 ||+* Re: RISC-V vs. Aarch64 - MitchAlsup
 |||`* Re: RISC-V vs. Aarch64 - Ivan Godard
 ||| `* Re: RISC-V vs. Aarch64 - MitchAlsup
 |||  `* Re: RISC-V vs. Aarch64 - Ivan Godard
 |||   `* Re: RISC-V vs. Aarch64 - MitchAlsup
 |||    `- Re: RISC-V vs. Aarch64 - Marcus
 ||`* Re: RISC-V vs. Aarch64 - BGB
 || `- Re: RISC-V vs. Aarch64 - MitchAlsup
 |+* Re: RISC-V vs. Aarch64 - BGB
 ||`* Re: RISC-V vs. Aarch64 - MitchAlsup
 || `- Re: RISC-V vs. Aarch64 - Thomas Koenig
 |`* Re: RISC-V vs. Aarch64 - Marcus
 | `* Re: RISC-V vs. Aarch64 - EricP
 |  +* Re: RISC-V vs. Aarch64 - Marcus
 |  |+* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  ||+* Re: RISC-V vs. Aarch64 - Niklas Holsti
 |  |||+* Re: RISC-V vs. Aarch64 - Bill Findlay
 |  ||||`- Re: RISC-V vs. Aarch64 - MitchAlsup
 |  |||`- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  ||`- Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  |+* Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  ||+* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  |||`- Re: RISC-V vs. Aarch64 - BGB
 |  ||+* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  |||`* Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  ||| `- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  ||`* Re: RISC-V vs. Aarch64 - Marcus
 |  || +* Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  || |`* Re: RISC-V vs. Aarch64 - aph
 |  || | +- Re: RISC-V vs. Aarch64 - Michael S
 |  || | `* Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  || |  `* Re: RISC-V vs. Aarch64 - robf...@gmail.com
 |  || |   +* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |   |`- Re: RISC-V vs. Aarch64 - Tim Rentsch
 |  || |   `* Re: RISC-V vs. Aarch64 - Terje Mathisen
 |  || |    `* Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  || |     `* Re: RISC-V vs. Aarch64 - Marcus
 |  || |      `* Re: RISC-V vs. Aarch64 - Guillaume
 |  || |       `* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        +- Re: RISC-V vs. Aarch64 - Marcus
 |  || |        +* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |`* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        | `* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |  `* Re: RISC-V vs. Aarch64 - Thomas Koenig
 |  || |        |   `* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |    `* Re: RISC-V vs. Aarch64 - EricP
 |  || |        |     +* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        |     |`* Re: RISC-V vs. Aarch64 - EricP
 |  || |        |     | `- Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        |     `* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |      `* Re: RISC-V vs. Aarch64 - EricP
 |  || |        |       +- Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        |       `* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |        +* Re: RISC-V vs. Aarch64 - Brett
 |  || |        |        |+* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        |        ||`- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |        |`- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |        `* Re: RISC-V vs. Aarch64 - Stephen Fuld
 |  || |        |         `* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |          +* Re: RISC-V vs. Aarch64 - Stefan Monnier
 |  || |        |          |`- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |          +* Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        |          |`* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |          | `- Re: RISC-V vs. Aarch64 - MitchAlsup
 |  || |        |          +* Re: RISC-V vs. Aarch64 - Stephen Fuld
 |  || |        |          |`- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |          `* Re: RISC-V vs. Aarch64 - EricP
 |  || |        |           +* Re: RISC-V vs. Aarch64 - EricP
 |  || |        |           |`* Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || |        |           | `* The type of Mill's belt's slots - Stefan Monnier
 |  || |        |           |  +- Re: The type of Mill's belt's slots - MitchAlsup
 |  || |        |           |  `* Re: The type of Mill's belt's slots - Ivan Godard
 |  || |        |           |   `* Re: The type of Mill's belt's slots - Stefan Monnier
 |  || |        |           |    `* Re: The type of Mill's belt's slots - Ivan Godard
 |  || |        |           |     +* Re: The type of Mill's belt's slots - Stefan Monnier
 |  || |        |           |     |`* Re: The type of Mill's belt's slots - Ivan Godard
 |  || |        |           |     `* Re: The type of Mill's belt's slots - MitchAlsup
 |  || |        |           `- Re: RISC-V vs. Aarch64 - Ivan Godard
 |  || `* MRISC32 vectorization (was: RISC-V vs. Aarch64) - Thomas Koenig
 |  |`* Re: RISC-V vs. Aarch64 - Terje Mathisen
 |  `- Re: RISC-V vs. Aarch64 - Quadibloc
 +* Re: RISC-V vs. Aarch64 - Anton Ertl
 `- Re: RISC-V vs. Aarch64 - aph

Re: RISC-V vs. Aarch64

<sruqs6$q1j$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22976&group=comp.arch#22976

 by: Ivan Godard - Sat, 15 Jan 2022 15:56 UTC

On 1/15/2022 2:01 AM, Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>> Thomas Koenig wrote:
>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>
>>>> Where ISAs really fall down is parsing a bit stream: grab a dynamic
>>>> number of bits off the front of a bit stream, advancing the stream; word
>>>> boundaries are not significant. The problem is that HW provides word
>>>> streams (loop/load) and mapping that to a bit stream is nasty. The logic
>>>> is the same as mapping a line stream into a (VL) instruction stream in
>>>> the decoder's instruction buffer, but how to represent that in an ISA?
>>>
>>> The same way that a vector instruction would be represented?
>>>
>>> Vectors could be made to operate on sub-word quantities such as bytes,
>>> with microarchitectural SIMD underneath.
>>>
>>> I have to confess I do not know much of how compression and
>>> decompression algorithms work. What are the operations that need
>>> to be done with the chunk of bits that is grabbed?
>>>
>> That varies a _lot_! Absolute worst case in my experience is the CABAC
>> option for h264: Content-Adaptive-Binary-Arithmetic-Coding
>>
>> Here you normally extract single bits from the input stream, then you
>> immediately branch on the value of that bit (by definition completely
>> unpredictable, right?) to two separate code paths that really cannot be
>> combined into a single branchless/predicated stream.
>
> Interesting. You can do it in hardware in parallel, then select
> the result. I would assume that the workloads of the two branches
> are roughly the same?
>
> A general purpose CPU could do something similar. A superscalar
> in-order architecture could distribute the work between its
> pipelines. If the ISA has a way to specify separate execution
> units, so that (in SSA form)
>
> if (condition) {
> foo_1 = some work;
> }
> else {
> foo_2 = some other work;
> }
> foo = PHI (foo_1, foo_2);
>
> this could be implemented efficiently even on a superscalar
> in-order machine.
>
> I know this can be done with Mitch's shadow modifier for a limited
> number of instructions, and I also suspect that I am just describing
> a feature of the Mill.

It's a feature of any ISA that can predicate all non-idempotent
instructions of both branches. Mitch's projected predicates can
predicate any instruction; Mill only offers predication for instructions
that can have side effects, and ensures that the ISA has few of those
(in practice only loads and stores and control flow, because all the
potentially-excepting arithmetic instructions produce idempotent NaRs).

However, just because an ISA lets you merge the then and else without a
branch doesn't mean that you *should* merge them. Past an ISA-specific
code size, the cost of processing the instructions you won't execute
exceeds the savings from avoiding the branch.
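The tradeoff Ivan describes can be sketched in plain C. A short diamond can be if-converted: compute both arms unconditionally and select the live result, trading a possible mispredict for some wasted work. The `then_work`/`else_work` functions below are invented for the illustration:

```c
#include <assert.h>

/* Hypothetical "work" for the two arms of the diamond. */
static int then_work(int x) { return x * 3 + 1; }
static int else_work(int x) { return x / 2; }

/* Branchy version: only one arm executes, but an unpredictable
 * condition costs a potential branch mispredict. */
int with_branch(int cond, int x)
{
    if (cond)
        return then_work(x);
    else
        return else_work(x);
}

/* If-converted version: both arms execute unconditionally, then a
 * select (the PHI in SSA terms) picks the live result.  No branch,
 * but the dead arm's instructions still occupy issue slots. */
int if_converted(int cond, int x)
{
    int foo_1 = then_work(x);    /* then-clause */
    int foo_2 = else_work(x);    /* else-clause */
    return cond ? foo_1 : foo_2; /* compilers often emit cmov/csel here */
}
```

Which version wins depends on the ISA-specific break-even size Ivan mentions: once the arms grow past it, the cost of issuing the never-used arm exceeds the mispredict savings.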

Re: RISC-V vs. Aarch64

<207f5ece-bfda-4bcd-838d-a4eb30208f77n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22978&group=comp.arch#22978

 by: MitchAlsup - Sat, 15 Jan 2022 16:42 UTC

On Saturday, January 15, 2022 at 4:01:09 AM UTC-6, Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> > Thomas Koenig wrote:
> >> Ivan Godard <iv...@millcomputing.com> schrieb:
> >>
> >>> Where ISAs really fall down is parsing a bit stream: grab a dynamic
> >>> number of bits off the front of a bit stream, advancing the stream; word
> >>> boundaries are not significant. The problem is that HW provides word
> >>> streams (loop/load) and mapping that to a bit stream is nasty. The logic
> >>> is the same as mapping a line stream into a (VL) instruction stream in
> >>> the decoder's instruction buffer, but how to represent that in an ISA?
> >>
> >> The same way that a vector instruction would be represented?
> >>
> >> Vectors could be made to operate on sub-word quantities such as bytes,
> >> with microarchitectural SIMD underneath.
> >>
> >> I have to confess I do not know much of how compression and
> >> decompression algorithms work. What are the operations that need
> >> to be done with the chunk of bits that is grabbed?
> >>
> > That varies a _lot_! Absolute worst case in my experience is the CABAC
> > option for h264: Content-Adaptive-Binary-Arithmetic-Coding
> >
> > Here you normally extract single bits from the input stream, then you
> > immediately branch on the value of that bit (by definition completely
> > unpredictable, right?) to two separate code paths that really cannot be
> > combined into a single branchless/predicated stream.
> Interesting. You can do it in hardware in parallel, then select
> the result. I would assume that the workloads of the two branches
> are roughly the same?
>
> A general purpose CPU could do something similar. A superscalar
> in-order architecture could distribute the work between its
> pipelines. If the ISA has a way to specify separate execution
> units, so that (in SSA form)
>
> if (condition) {
> foo_1 = some work;
> }
> else {
> foo_2 = some other work;
> }
> foo = PHI (foo_1, foo_2);
>
> this could be implemented efficiently even on a superscalar
> in-order machine.
>
> I know this can be done with Mitch's shadow modifier for a limited
> number of instructions,
<
My 66000 can predicate up to 8 instructions. Those 8 can be placed in the
then-clause or in the else-clause; so those 8 instructions are the sum of
both clauses.
<
Predication makes use of the momentum of the instruction fetch-decode
stream and pipelines.
<
> and I also suspect that I am just describing
> a feature of the Mill.
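Mitch's shadow can be modelled abstractly: up to 8 following instructions are split between the two clauses, and the condition selects which group takes effect while the whole shadow flows through fetch/decode. The interface below is invented for illustration; the source says nothing about My 66000's actual encoding:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*op_fn)(void);

static int acc;
static void add5(void) { acc += 5; }
static void dbl(void)  { acc *= 2; }

/* Toy model of a predicate shadow: n_then + n_else ops (at most 8
 * total, per Mitch's description) are split between the then-clause
 * and the else-clause; the condition picks which group executes.
 * All 8 slots still flow through the pipe, keeping the fetch/decode
 * stream's momentum. */
static void predicate_shadow(int cond,
                             op_fn *then_ops, size_t n_then,
                             op_fn *else_ops, size_t n_else)
{
    assert(n_then + n_else <= 8);  /* the shadow covers at most 8 ops */
    op_fn *ops = cond ? then_ops : else_ops;
    size_t n   = cond ? n_then   : n_else;
    for (size_t i = 0; i < n; i++)
        ops[i]();                  /* only the selected clause takes effect */
}
```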

Re: RISC-V vs. Aarch64

<jwvy23gj3lo.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=22981&group=comp.arch#22981

 by: Stefan Monnier - Sat, 15 Jan 2022 17:47 UTC

> A general purpose CPU could do something similar. A superscalar
> in-order architecture could distribute the work between its
> pipelines. If the ISA has a way to specify separate execution
> units, so that (in SSA form)
>
> if (condition) {
> foo_1 = some work;
> }
> else {
> foo_2 = some other work;
> }
> foo = PHI (foo_1, foo_2);
>
> this could be implemented efficiently even on a superscalar
> in-order machine.

Eagerly running both sides of the branches is "easy" only if both
branches are simple enough (using predication for example). If they
include function calls and/or loops you really enter into the
multithreading realm.

Stefan

Re: RISC-V vs. Aarch64

<e163737e-c461-4d8f-9fc9-0f4376a7fbffn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=22982&group=comp.arch#22982

 by: MitchAlsup - Sat, 15 Jan 2022 17:55 UTC

On Saturday, January 15, 2022 at 11:47:21 AM UTC-6, Stefan Monnier wrote:
> > A general purpose CPU could do something similar. A superscalar
> > in-order architecture could distribute the work between its
> > pipelines. If the ISA has a way to specify separate execution
> > units, so that (in SSA form)
> >
> > if (condition) {
> > foo_1 = some work;
> > }
> > else {
> > foo_2 = some other work;
> > }
> > foo = PHI (foo_1, foo_2);
> >
> > this could be implemented efficiently even on a superscalar
> > in-order machine.
<
> Eagerly running both sides of the branches is "easy" only if both
> branches are simple enough (using predication for example). If they
> include function calls and/or loops you really enter into the
> multithreading realm.
<
Running both sides and throwing one side away is a waste of power
{the vast majority of the time}.
>
>
> Stefan

Re: RISC-V vs. Aarch64

<srv1so$bnl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22983&group=comp.arch#22983

 by: Ivan Godard - Sat, 15 Jan 2022 17:56 UTC

On 1/15/2022 9:47 AM, Stefan Monnier wrote:
>> A general purpose CPU could do something similar. A superscalar
>> in-order architecture could distribute the work between its
>> pipelines. If the ISA has a way to specify separate execution
>> units, so that (in SSA form)
>>
>> if (condition) {
>> foo_1 = some work;
>> }
>> else {
>> foo_2 = some other work;
>> }
>> foo = PHI (foo_1, foo_2);
>>
>> this could be implemented efficiently even on a superscalar
>> in-order machine.
>
> Eagerly running both sides of the branches is "easy" only if both
> branches are simple enough (using predication for example). If they
> include function calls and/or loops you really enter into the
> multithreading realm.
>
>
> Stefan

Yes. NYF.

Re: RISC-V vs. Aarch64

<srv53i$oq5$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=22985&group=comp.arch#22985

 by: Terje Mathisen - Sat, 15 Jan 2022 18:51 UTC

Thomas Koenig wrote:
> Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
>> Thomas Koenig wrote:
>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>
>>>> Where ISAs really fall down is parsing a bit stream: grab a dynamic
>>>> number of bits off the front of a bit stream, advancing the stream; word
>>>> boundaries are not significant. The problem is that HW provides word
>>>> streams (loop/load) and mapping that to a bit stream is nasty. The logic
>>>> is the same as mapping a line stream into a (VL) instruction stream in
>>>> the decoder's instruction buffer, but how to represent that in an ISA?
>>>
>>> The same way that a vector instruction would be represented?
>>>
>>> Vectors could be made to operate on sub-word quantities such as bytes,
>>> with microarchitectural SIMD underneath.
>>>
>>> I have to confess I do not know much of how compression and
>>> decompression algorithms work. What are the operations that need
>>> to be done with the chunk of bits that is grabbed?
>>>
>> That varies a _lot_! Absolute worst case in my experience is the CABAC
>> option for h264: Content-Adaptive-Binary-Arithmetic-Coding
>>
>> Here you normally extract single bits from the input stream, then you
>> immediately branch on the value of that bit (by definition completely
>> unpredictable, right?) to two separate code paths that really cannot be
>> combined into a single branchless/predicated stream.
>
> Interesting. You can do it in hardware in parallel, then select

Not needed.

> the result. I would assume that the workloads of the two branches
> are roughly the same?

You would be wrong: Those two paths can be completely different both in
size/time & complexity.

A HW decoder only has to keep up with 40 Mbit/s, so with a ~GHz clock it
has all the time in the world to figure out which state machine to enter
now. I.e. I can't see any reason/need to predict/run-ahead/predicate
parallel blocks.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
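The bit-at-a-time extraction Terje describes can be sketched as a simple bit reader over a byte stream, with each extracted bit immediately steering a data-dependent branch. A real CABAC decoder also maintains and renormalizes an arithmetic-coding range, which is omitted here; all names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf;  /* underlying byte (word) stream */
    size_t len;          /* length in bytes */
    size_t bitpos;       /* next bit to read, MSB-first */
} bitreader;

/* Grab one bit off the front of the stream, advancing it.
 * Word boundaries are invisible to the caller. */
static int get_bit(bitreader *br)
{
    if (br->bitpos >= br->len * 8)
        return -1;                  /* stream exhausted */
    int bit = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
    br->bitpos++;
    return bit;
}

/* The decode loop shape Terje describes: every extracted bit
 * immediately steers an (unpredictable) branch between two paths. */
static int count_ones(bitreader *br)
{
    int ones = 0, bit;
    while ((bit = get_bit(br)) >= 0) {
        if (bit)
            ones++;  /* one state-machine path */
        /* else: the other, entirely different, path in a real codec */
    }
    return ones;
}
```

In a real decoder the two paths do completely different amounts of work, which is Terje's point about why they cannot be merged into one predicated stream.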

Re: RISC-V vs. Aarch64

<ss4g91$hvs$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=22990&group=comp.arch#22990

 by: Stephen Fuld - Mon, 17 Jan 2022 19:32 UTC

On 1/5/2022 10:40 PM, Ivan Godard wrote:
> On 1/5/2022 3:13 PM, EricP wrote:
>> Ivan Godard wrote:
>>> On 1/4/2022 9:48 AM, EricP wrote:
>>>> Ivan Godard wrote:
>>>>> On 1/4/2022 12:39 AM, Thomas Koenig wrote:
>>>>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>>>>
>>>>>>> Perhaps you haven't noticed me saying: *the belt is not physically a
>>>>>>> shift register*.
>>>>>>
>>>>>> It's usually implemented as a circular buffer, correct?
>>>>>
>>>>> Not at all.
>>>>>
>>>>> Computed values are left where they were produced - FU's output
>>>>> latches, for example - just as in a forwarding bypass network. Move
>>>>> only happens if the location is needed by some other computation,
>>>>> and then only to an adjacent location, also on the bypass network -
>>>>> which the way issue happens guarantees is free.
>>>>
>>>> Moving results out of the way is what makes this work.
>>>> If you only have one adder FU and you get a bunch of add instructions
>>>> in a row, then you need to stash older results in other registers
>>>> and this becomes the equivalent of a delayed register writeback.
>>>> But I think you need a crossbar to accomplish this while a
>>>> traditional approach just needs some small number of result buses.
>>>>
>>>>
>>>
>>> Still not :-)
>>>
>>> Our FU slots can (and typically do) support operations of different
>>> natural latencies (pipe lengths). Each slot can accept one op per
>>> cycle, so if the latencies differ you can get more than one result
>>> retiring in the same cycle. Consequently the FUs have one result FF
>>> per supported latency.
>>>
>>> If an op of latency N retires in cycle C to FF#N, necessarily the
>>> following cycle C+1 the FF#N+1 is free (think about it).
>>> Consequently, the FU's FFs are daisy chained so that each cycle FF#N
>>> is moved to FF#N+1  and every result always is retiring to a known
>>> free FF; the set of FFs are right next to each other so the move is
>>> trivial.
>>
>> This sounds like it has a belt's worth of FF for each FU.
>> And some of them are shift registers?
>> I'm a bit confused.
>
> There can be (and usually are) several FUs per slot, forming in effect a
> "superFU". There is only one set of output FFs per slot, one per
> latency. These are daisy chained. I suppose that you can think of the
> output FF daisy as being a shift register, and it could be done that
> way, but it also could be done by simply rotating which FF is considered
> which latency, thereby replacing a physical data move with a
> result-to-FF fanout. That's a HW design choice; IANAHWG.

I thought I had at least a basic understanding of how you actually
implement the belt, but Eric's post made me question that, and I don't
think you addressed his exact question. So, suppose you have exactly
one add FU, and that it has a latency of 1. My understanding is that
there is therefore one set of FFs at the add FU's output. Suppose the
belt length is 16, though this isn't particularly critical.
Now suppose you have an instruction stream with say 4-5 consecutive
adds. I understand that this may take several instructions, as you only
have one add FU. After the first add, the result is in that FU's output
FFs, and has some belt position. After the second add, since you want
the FU's output FFs to hold the result, but they still have the result
from the first add, you need some place to hold the first result. I
gather this is somewhere in the spiller. Now let's keep going to the
third add. Are there more spots in the spiller? If so, there must be
some limit, although it is probably configurable. What happens if the
number of consecutive adds exceeds that limit?

Anyway, I hope you can see my confusion and will clarify it.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: RISC-V vs. Aarch64

<ss4ktr$dvv$1@dont-email.me>

 by: Ivan Godard - Mon, 17 Jan 2022 20:52 UTC

On 1/17/2022 11:32 AM, Stephen Fuld wrote:
> [snip]
>
> I thought I had at least a basic understanding of how you actually
> implement the belt, but Eric's post made me question that, and I don't
> think you addressed his exact question.  So, suppose you have exactly
> one add FU, and that it has a latency of 1. My understanding is that
> there is therefore one set of FFs at the add FU's output. Suppose the
> belt length is 16, though this isn't particularly critical.
> Now suppose you have an instruction stream with say 4-5 consecutive
> adds.  I understand that this may take several instructions, as you only
> have one add FU.  After the first add, the result is in that FU's output
> FFs, and has some belt position.  After the second add, since you want
> the FU's output FFs to hold the result, but they still have the result
> from the first add, you need some place to hold the first result.  I
> gather this is somewhere in the spiller.  Now let's keep going to the
> third add.  Are there more spots in the spiller?  If so, there must be
> some limit, although it is probably configurable.  What happens if the
> number of consecutive adds exceeds that limit?
>
> Anyway, I hope you can see my confusion and will clarify it.

Looks like the confusion is the difference between FU and slot. A FU is
a unit of computation - an ALU or FPU for example. A slot is a unit of
encoding. All slots have at least one FU attached, but they can (and
usually do) have several. The only requirement is that decode be able
to supply to the slot everything that any of the attached FUs needs
for any of its supported instructions.

The result FFs are per-slot, not per-FU. There is one FF per latency
supported by any instruction of any FU attached to that slot, plus one
for any intermediate latency that happens to have no instructions using
it. The FFs are logically chained to make a shift register, although
for power reasons it may use a different implementation than a physical
shift register.

Your example with only one latency is unrealistic; typical arithmetic
(exu) slots have several FUs with an assortment of latencies: an ALU, a
FPU, an integer multiplier, an assortment of 2-cycle wide-data
operations, and so on. If it should happen that a slot has only one or a
few latencies then the member config may add one or more extra FFs so as
to lengthen the FF shift register, to make it more likely that a value
will die before it gets to the end of the SR.

A live result that gets to the end is moved to the spiller buffers,
which are just FFs, almost as if the spiller were just another slot. The
spiller buffers are able to accept one value from each slot each cycle,
because the slot SR can only have one value falling off the SR per
cycle. The spiller buffers get reused when a value they have goes dead;
the logic is essentially the same as is used in an OOO genreg machine to
assign physical registers, though the amount of logic is much less than
that required to assign several hundred genregs because the maximum live
population is bounded by the belt length.
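
That buffer-reuse logic can be sketched as a simple free list bounded by
the belt length (a toy model; the names and structure are mine, not Mill
internals):

```python
# Toy free-list allocator for spiller buffers (illustrative sketch,
# not Mill RTL). Because the live population is bounded by the belt
# length, a free list of belt-length buffers can never run dry.
BELT_LEN = 16

class SpillerBuffers:
    def __init__(self, nbufs=BELT_LEN):
        self.free = list(range(nbufs))  # indices of unused buffers
        self.held = {}                  # buffer index -> value

    def accept(self, value):
        # A live value has fallen off the end of a slot's FF chain.
        if not self.free:
            raise RuntimeError("spiller stall: no buffer free")
        buf = self.free.pop()
        self.held[buf] = value
        return buf

    def release(self, buf):
        # The value in `buf` went dead (fell off the belt); reuse it.
        del self.held[buf]
        self.free.append(buf)
```

Configuring fewer than belt-length buffers just turns the "no buffer
free" path into a stall, which is the cost-versus-stall tradeoff
mentioned above.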

The number of slots, the slot FU (and instruction) populations, the
length of the per-slot SR, the number of spiller buffers, and so on are
all member config parameters, balanced for market, expected usage,
power, and other constraints at chip design time.

To take your example of an empty belt and a string of adds: a+b+c+d+e,
all latency one. Say there is only one exu slot, and it has a 1-cycle
ALU, a 3-cycle multiplier, and a filler lat-2 FF even though the slot
has no lat-2 instructions. The belt is assumed to be longer than 5, so
no intermediate results die in the example.

In the first cycle:
cycle 1: (a+b)->lat1
cycle 2: lat1->lat2; (a+b)+c->lat1
cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1

I emphasize that only the moves from the end of the SR (lat3 here) to
the spiller are physical and cost power.
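
A minimal Python model of one slot's three latency FFs makes that
concrete (a sketch under the assumptions of the example; variable names
are mine). Only the value leaving the end of the chain for the spiller
is counted as a physical move:

```python
# One exu slot with FFs lat1..lat3 daisy-chained into the spiller.
# The intra-chain "moves" could instead be done by rotating which FF
# is considered which latency; only the end-of-chain move into the
# spiller is inherently a physical data move.
def run_slot(results, sr_depth=3):
    lat = [None] * sr_depth   # lat[0] = lat1 ... lat[-1] = end of SR
    spiller = []
    physical_moves = 0
    for r in results:
        if lat[-1] is not None:      # live value falls off the SR
            spiller.append(lat[-1])
            physical_moves += 1
        lat = [r] + lat[:-1]         # shift the chain, retire to lat1
    return lat, spiller, physical_moves
```

Five back-to-back one-cycle results r1..r5 leave r5, r4, r3 in the
chain and spill only r1 and r2 - the two sX/sY moves.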

The number of spiller buffers is constrained by the length of the belt -
there can never be more live values than that. However, there may be
fewer at hardware option. The spiller must be able to hold an arbitrary
amount of data because the spiller is used to save the belt across
calls. If it runs out of buffers then values can move to an internal
SRAM; when it runs out of SRAM values are written to the spiller stack
in memory. These overflow conditions can saturate the spiller bandwidth,
causing stalls. The spiller buffer and SRAM sizes are configured to
balance stalls vs chip cost.

Does this help?

Re: RISC-V vs. Aarch64

<jwvk0eycbsz.fsf-monnier+comp.arch@gnu.org>

 by: Stefan Monnier - Mon, 17 Jan 2022 21:19 UTC

> Your example with only one latency is unrealistic; typical arithmetic (exu)
> slots have several FUs with an assortment of latencies: an ALU, a FPU, an
> integer multiplier, an assortment of 2-cycle wide-data operations, and so
> on. If it should happen that a slot has only one or a few latencies then the
> member config may add one or more extra FFs so as to lengthen the FF shift
> register, to make it more likely that a value will die before it gets to the
> end of the SR.

IIUC your crossbar has a number of inputs equal to

nb-of-slots * nb-of-latencies + spiller (+ immediates?)

[ Modulo the fact that different slots may actually have different
latencies. ]

So adding latencies (well, adding extra FFs beyond the ones required by
the actual ops's latencies) might save power by avoiding moving the data
to the spiller, but it also increases the input size of the crossbar, so
there's a tradeoff which is why you have the spiller (rather than
extend all slots to have the same number of latencies as belt
positions).

Also you say the number of FFs in the spiller can be bounded by the
number of belt positions, but is it even possible to reach this bound?
I guess a tighter bound might be something like

belt-size - nb-of-latencies

right? And I guess if you can somehow make sure the code always spreads
its work among the slots this could be reduced closer to the ideal

belt-size - nb-of-slots * nb-of-latencies

tho I don't know how practical this is nor how close you can hope to get
to that.

Stefan

Re: RISC-V vs. Aarch64

<ss4qk5$si5$1@dont-email.me>

 by: Ivan Godard - Mon, 17 Jan 2022 22:29 UTC

On 1/17/2022 1:19 PM, Stefan Monnier wrote:
>> Your example with only one latency is unrealistic; typical arithmetic (exu)
>> slots have several FUs with an assortment of latencies: an ALU, a FPU, an
>> integer multiplier, an assortment of 2-cycle wide-data operations, and so
>> on. If it should happen that a slot has only one or a few latencies then the
>> member config may add one or more extra FFs so as to lengthen the FF shift
>> register, to make it more likely that a value will die before it gets to the
>> end of the SR.
>
> IIUC your crossbar has a number of inputs equal to
>
> nb-of-slots * nb-of-latencies + spiller (+ immediates?)
>
> [ Modulo the fact that different slots may actually have different
> latencies. ]
>
> So adding latencies (well, adding extra FFs beyond the ones required by
> the actual ops's latencies) might save power by avoiding moving the data
> to the spiller, but it also increases the input size of the crossbar, so
> there's a tradeoff which is why you have the spiller (rather than
> extend all slots to have the same number of latencies as belt
> positions).
>
> Also you say the number of FFs in the spiller can be bounded by the
> number of belt positions, but is it even possible to reach this bound?
> I guess a tighter bound might be something like
>
> belt-size - nb-of-latencies
>
> right? And I guess if you can somehow make sure the code always spreads
> its work among the slots this could be reduced closer to the ideal
>
> belt-size - nb-of-slots * nb-of-latencies
>
> tho I don't know how practical this is nor how close you can hope to get
> to that.
>
>
> Stefan

Yep, it's all a game of tradeoffs, as in any HW design.

The spiller bound is (beltsize). You could (for a belt-16) have 16
unrelated drops, then enough drop-less cycles for all of them to migrate
to the spiller. The next drop (in the slot FFs) would kill one of the
spiller buffers, but that would eventually be filled when the 17th drop
made it to the spiller. Of course that is a very artificial situation in
real code, so an actual config can reduce the number of spiller buffers
if it judges that the rate of spiller stalls is acceptable.
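
That worst case can be checked mechanically (a toy model assuming a
3-deep FF chain per slot and a belt of 16; the parameters are
illustrative, not a real Mill member config):

```python
# Show that the spiller-occupancy bound is reachable: N unrelated
# drops, then enough drop-less cycles for every live value to migrate
# out of the slot FFs into the spiller (toy model, not Mill RTL).
SR_DEPTH = 3

def spiller_occupancy(drops, idle_cycles):
    ffs = [None] * SR_DEPTH
    spiller = []
    schedule = [f"d{i}" for i in range(drops)] + [None] * idle_cycles
    for new in schedule:
        if ffs[-1] is not None:      # live value falls off the chain
            spiller.append(ffs[-1])
        ffs = [new] + ffs[:-1]
    return len(spiller)
```

After 16 drops plus 3 idle cycles every live value sits in the spiller,
so a config needs the full 16 buffers to avoid a stall in this
(admittedly artificial) pattern.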

The number of inputs to the crossbar is also logically bounded by the
belt size. However, the belt is scattered over what amounts to a bypass
network, so the number of physical inputs is (as you suggest) the sum of
the slot latency FFs, the spiller buffers, and some other sources such
as the load retire stations and the scratchpad. There are still a lot
fewer than on a genreg machine with similar performance - Silver (belt
16) has ~30
sources IIRC.

It helps that we have a patented cascaded crossbar. All the lat1 FFs go
direct to the xbar, but the rest first go through an N->3 pre-bar and the
three join with the lat1s in a (slot+3)->inputs post-bar. Because all
the sources for the pre-bar are lat2 or greater we push the pre-bar gate
cost into the sources.

This works because no instruction ever has an exact integral latency -
that 3-cycle multiplier is really 2-8/14ths. So there is 6/14th of a
cycle to do its pre-bar, so the crossbar actually takes only the
post-bar time, and the post-bar has only 20-25% of the overall sources
and can be fast. Sure, the odd long-latency instruction may be natively
so close to the cycle boundary that it has to push its pre-bar into
another cycle, but so what? Them's the tradeoffs :-)
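
As a rough back-of-envelope (all counts below are assumptions for
illustration, not actual Mill Silver figures): with ~30 total sources
and a few lat1 slot outputs going direct, the cascade leaves the
timing-critical post-bar with only a handful of inputs:

```python
# Illustrative source counts for the cascaded crossbar scheme
# (assumed numbers, chosen only to show the shape of the tradeoff).
total_sources = 30   # all slot latency FFs + spiller buffers +
                     # load retire stations + scratchpad
lat1_direct = 4      # assumed lat1 FFs feeding the post-bar directly
prebar_outputs = 3   # the N->3 pre-bar narrows everything else

postbar_inputs = lat1_direct + prebar_outputs   # the (slot+3)->inputs bar
fraction = postbar_inputs / total_sources       # share of sources the
                                                # fast post-bar must handle
```

With these assumed counts the post-bar sees 7 of 30 sources, roughly a
quarter of them, which is the kind of reduction that lets the post-bar
be fast.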

Re: RISC-V vs. Aarch64

<2a4e2b74-9d00-4e2e-97a3-dccad1167d0cn@googlegroups.com>

 by: MitchAlsup - Mon, 17 Jan 2022 22:36 UTC

On Monday, January 17, 2022 at 2:52:14 PM UTC-6, Ivan Godard wrote:

>
> In the first cycle:
> cycle 1: (a+b)->lat1
> cycle 2: lat1->lat2; (a+b)+c->lat1
> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1
<
Why is cycle 4 not::
<
cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)+e->lat1
<
The first 3 instructions perform an ADD why not cycle 4 ?

Re: RISC-V vs. Aarch64

<ss4tp6$j8v$1@dont-email.me>

 by: Ivan Godard - Mon, 17 Jan 2022 23:23 UTC

On 1/17/2022 2:36 PM, MitchAlsup wrote:
> On Monday, January 17, 2022 at 2:52:14 PM UTC-6, Ivan Godard wrote:
>
>>
>> In the first cycle:
>> cycle 1: (a+b)->lat1
>> cycle 2: lat1->lat2; (a+b)+c->lat1
>> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
>> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
>> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1
> <
> Why is cycle 4 not::
> <
> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)+e->lat1
> <
> The first 3 instructions perform an ADD why not cycle 4 ?

Because I'm a bad proofreader?

In the first cycle:
cycle 1: (a+b)->lat1
cycle 2: lat1->lat2; (a+b)+c->lat1
cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)+e->lat1
cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)+f->lat1

Re: RISC-V vs. Aarch64

<df3193c7-a0db-4d04-9ebc-2e72f550e05bn@googlegroups.com>

 by: MitchAlsup - Mon, 17 Jan 2022 23:26 UTC

On Monday, January 17, 2022 at 5:23:21 PM UTC-6, Ivan Godard wrote:
> On 1/17/2022 2:36 PM, MitchAlsup wrote:
> > On Monday, January 17, 2022 at 2:52:14 PM UTC-6, Ivan Godard wrote:
> >
> >>
> >> In the first cycle:
> >> cycle 1: (a+b)->lat1
> >> cycle 2: lat1->lat2; (a+b)+c->lat1
> >> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
> >> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
> >> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1
> > <
> > Why is cycle 4 not::
> > <
> > cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)+e->lat1
> > <
> > The first 3 instructions perform an ADD why not cycle 4 ?
> Because I'm a bad proofreader?
<
Are you surprised someone actually read what you wrote?
<
> In the first cycle:
> cycle 1: (a+b)->lat1
> cycle 2: lat1->lat2; (a+b)+c->lat1
> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)+e->lat1
> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)+f->lat1

Re: RISC-V vs. Aarch64

<ss5hj9$lhm$1@dont-email.me>

 by: Stephen Fuld - Tue, 18 Jan 2022 05:01 UTC

On 1/17/2022 12:52 PM, Ivan Godard wrote:
> [snip]
>>> the output FF daisy as being a shift register, and it could be done
>>> that way, but it also could be done by simply rotating which FF is
>>> considered which latency, thereby replacing a physical data move with
>>> a result-to-FF fanout. That's a HW design choice; IANAHWG.
>>
>> I thought I had at least a basic understanding of how you actually
>> implement the belt, but Eric's post made me question that, and I don't
>> think you addressed his exact question.  So, suppose you have exactly
>> one add FU, and that it has a latency of 1. My understanding is that
>> there is therefore one set of FFs at the add FU's output. Suppose the
>> belt length is 16, though this isn't particularly critical.
>> Now suppose you have an instruction stream with say 4-5 consecutive
>> adds.  I understand that this may take several instructions, as you
>> only have one add FU.  After the first add, the result is in that FUs
>> output FFs, and has a some belt position.  After the second add, since
>> you want the FUs output FFs to hold the result, but they still have
>> the result from the first add, you need some place to hold the first
>> result.  I gather this is somewhere in the spiller.  Now let's keep
>> going to the third add.  Are there more spots in the spiller?  If so,
>> there must be some limit, although it is probably configurable.  What
>> happens if the number of consecutive adds exceeds that limit.
>>
>> Anyway, I hope you can see my confusion and will clarity it.
>
> Looks like the confusion is the difference between FU and slot.

After reading through your response, I agree - that was my confusion.
See more below.

> A FU is
> a unit of computation - an ALU or FPU for example. A slot is a unit of
> encoding. All slots have at least one FU attached, but they can (and
> usually do) have several. The only requirement is that decode be able to
> supply to the slot everything that any of the attached FUs needs to do
> for any of its supported instructions.

So since a slot can encode one operation, of any type supported on that
slot, at a time, doesn't that lead to lots of "extra" functional units?
I.e., if you have two slots, each able to encode an add or a multiply,
I think you have four FUs, of which only two can be used in any given
instruction. Or am I still confused?

> The result FFs are per-slot, not per-FU.

Got it.

> There is one FF per latency
> supported by any instruction of any FU attached to that slot, and for
> any intermediate latency that happens not to have any instructions that
> use it. The FFs are logically chained to make a shift register, although
> for power reasons it may use a different implementation than a physical
> shift register.

OK. I understand.

> Your example with only one latency is unrealistic; typical arithmetic
> (exu) slots have several FUs with an assortment of latencies: an ALU, a
> FPU, an integer multiplier, an assortment of 2-cycle wide-data
> operations, and so on. If it should happen that a slot has only one or a
> few latencies then the member config may add one or more extra FFs so as
> to lengthen the FF shift register, to make it more likely that a value
> will die before it gets to the end of the SR.

OK. So each slot has a sort of addressable FIFO of FFs. It is
addressable in that each has its own belt position identifier, but it is
a FIFO in that adding a new entry "pushes" down the other entries. Is
that correct?

snip a lot of useful exposition.

> Does this help?

Yes, very much so. Thanks.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: RISC-V vs. Aarch64

<UZCFJ.4610$uP.4480@fx16.iad>


https://www.novabbs.com/devel/article-flat.php?id=22998&group=comp.arch#22998

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx16.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad> <sql2cm$3h7$1@dont-email.me> <sql73d$6es$2@newsreader4.netcologne.de> <sqmj5j$s31$1@dont-email.me> <sqmmso$446$2@newsreader4.netcologne.de> <gs2dnRZj-ucyZ1P8nZ2dnUU78YfNnZ2d@supernews.com> <sqpd0i$spj$1@newsreader4.netcologne.de> <650c822a-3776-4ea9-aa72-5a6b19bdcabbn@googlegroups.com> <sqpocs$1so3$1@gioia.aioe.org> <sqpqbm$7qo$1@newsreader4.netcologne.de> <sqq3ce$c4n$2@dont-email.me> <sqssff$a9j$1@gioia.aioe.org> <077afaee-009e-4860-be45-61106126934bn@googlegroups.com> <squhht$79u$1@dont-email.me> <bb6d49bb-a676-44bd-9a6d-29386d429454n@googlegroups.com> <sr0vhm$c4u$1@dont-email.me> <sr114i$1qc$1@newsreader4.netcologne.de> <sr1dca$70e$1@dont-email.me> <kM%AJ.186634$np6.183460@fx46.iad> <sr2gf6$64u$1@dont-email.me> <7DpBJ.254731$3q9.63673@fx47.iad> <sr62tb$u2o$1@dont-email.me> <ss4g91$hvs$1@dont-email.me> <ss4ktr$dvv$1@dont-email.me>
In-Reply-To: <ss4ktr$dvv$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 196
Message-ID: <UZCFJ.4610$uP.4480@fx16.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 18 Jan 2022 17:42:12 UTC
Date: Tue, 18 Jan 2022 12:40:30 -0500
X-Received-Bytes: 11512
X-Original-Bytes: 11460
 by: EricP - Tue, 18 Jan 2022 17:40 UTC

Ivan Godard wrote:
> On 1/17/2022 11:32 AM, Stephen Fuld wrote:
>> On 1/5/2022 10:40 PM, Ivan Godard wrote:
>>> On 1/5/2022 3:13 PM, EricP wrote:
>>>> Ivan Godard wrote:
>>>>> On 1/4/2022 9:48 AM, EricP wrote:
>>>>>> Ivan Godard wrote:
>>>>>>> On 1/4/2022 12:39 AM, Thomas Koenig wrote:
>>>>>>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>>>>>>
>>>>>>>>> Perhaps you haven't noticed me saying: *the belt is not
>>>>>>>>> physically a
>>>>>>>>> shift register*.
>>>>>>>>
>>>>>>>> It's usually implemented as a circular buffer, correct?
>>>>>>>
>>>>>>> Not at all.
>>>>>>>
>>>>>>> Computed values are left where they were produced - FU's output
>>>>>>> latches, for example - just as in a forwarding bypass network.
>>>>>>> Move only happens if the location is needed by some other
>>>>>>> computation, and then only to an adjacent location, also on the
>>>>>>> bypass network - which the way issue happens guarantees is free.
>>>>>>
>>>>>> Moving results out of the way is what makes this work.
>>>>>> If you only have one adder FU and you get a bunch of add instructions
>>>>>> in a row, then you need to stash older results in other registers
>>>>>> and this becomes the equivalent of a delayed register writeback.
>>>>>> But I think you need a crossbar to accomplish this while a
>>>>>> traditional approach just needs some small number of result buses.
>>>>>>
>>>>>>
>>>>>
>>>>> Still not :-)
>>>>>
>>>>> Our FU slots can (and typically do) support operations of different
>>>>> natural latencies (pipe lengths). Each slot can accept one op per
>>>>> cycle, so if the latencies differ you can get more than one result
>>>>> retiring in the same cycle. Consequently the FUs have one result FF
>>>>> per supported latency.
>>>>>
>>>>> If a op of latency N retires in cycle C to FF#N, necessarily the
>>>>> following cycle C+1 the FF#N+1 is free (think about it).
>>>>> Consequently, the FU's FFs are daisy chained so that each cycle
>>>>> FF#N is moved to FF#N+1 and every result always is retiring to a
>>>>> known free FF; the set of FFs are right next to each other so the
>>>>> move is trivial.
>>>>
>>>> This sounds like it has a belt's worth of FF for each FU.
>>>> And some of them are shift registers?
>>>> I'm a bit confused.
>>>
>>> There can be (and usually are) several FUs per slot, forming in
>>> effect a "superFU". There is only one set of output FFs per slot, one
>>> per latency. These are daisy chained. I suppose that you can think of
>>> the output FF daisy as being a shift register, and it could be done
>>> that way, but it also could be done by simply rotating which FF is
>>> considered which latency, thereby replacing a physical data move with
>>> a result-to-FF fanout. That's a HW design choice; IANAHWG.
>>
>> I thought I had at least a basic understanding of how you actually
>> implement the belt, but Eric's post made me question that, and I don't
>> think you addressed his exact question.

Yeah, I didn't understand either. I found the patent for it (below) and
was looking at it for insights, but I wanted to get back to something else
I was looking at, Load-Store Queues, so I was going to return to it later.

>> So, suppose you have exactly
>> one add FU, and that it has a latency of 1. My understanding is that
>> there is therefore one set of FFs at the add FU's output. Suppose the
>> belt length is 16, though this isn't particularly critical.
>> Now suppose you have an instruction stream with say 4-5 consecutive
>> adds. I understand that this may take several instructions, as you
>> only have one add FU. After the first add, the result is in that FUs
>> output FFs, and has a some belt position. After the second add, since
>> you want the FUs output FFs to hold the result, but they still have
>> the result from the first add, you need some place to hold the first
>> result. I gather this is somewhere in the spiller. Now let's keep
>> going to the third add. Are there more spots in the spiller? If so,
>> there must be some limit, although it is probably configurable. What
>> happens if the number of consecutive adds exceeds that limit.
>>
>> Anyway, I hope you can see my confusion and will clarity it.
>
> Looks like the confusion is the difference between FU and slot. A FU is
> a unit of computation - an ALU or FPU for example. A slot is a unit of
> encoding. All slots have at least one FU attached, but they can (and
> usually do) have several. The only requirement is that decode be able to
> supply to the slot everything that any of the attached FUs needs to do
> for any of its supported instructions.

This is still confusing to me.
It may be because the term "latency" seems to be used in two ways:
- latency to mean slot offset: slot[0], slot[1]
- each HW FU has a latency and a throughput (delay before the next
operation can start): ALU latency 1 throughput 1, int-MUL lat 4 thru 1,
int-DIV lat 16 thru 16.

> The result FFs are per-slot, not per-FU. There is one FF per latency
> supported by any instruction of any FU attached to that slot, and for
> any intermediate latency that happens not to have any instructions that
> use it. The FFs are logically chained to make a shift register, although
> for power reasons it may use a different implementation than a physical
> shift register.

The Belt is a logical shift register - all FUs drop their results
into slot[-1], which then shifts right into slot[0].
There is no OoO, so if a slot[-1] result isn't ready the belt shift stalls.

If the belt is length 16, then for any mix of instructions there are
only 16 logically valid results in the belt.
If I have 16 adds in a row, then FU.Alu must have a
16-FF shift register to hold those results.
If there is a second ALU, then another 16-FF shift register for that.

Only 16 of those slots can hold valid values, but from your
description we'd need 32 FFs and 32 xbar operand-pair inputs,
and 2 xbar FU operand-pair outputs.
If we toss in an IMUL and an IDIV, then 64 FFs and xbar inputs,
with 4 xbar operand-pair outputs.
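
The resource arithmetic EricP is doing here, spelled out as a one-liner
(this only restates his assumption that every FU carries its own
belt-length shift register of result FFs):

```python
# FF count under EricP's assumption: one belt-length shift register per FU.
belt_len = 16
fus = ["ALU1", "ALU2", "IMUL", "IDIV"]   # the FU mix from his example
ffs = belt_len * len(fus)
print(ffs)   # 64 result FFs, though only 16 can hold live belt values
```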

From your description this would require the HW FF resources of an
OoO core with 64 physical registers, but derives none of the benefit.

(This is why I assumed that Belt has a second xbar to route results
in my prior little diagram - it eliminates this slots*FU expansion.)

> Your example with only one latency is unrealistic; typical arithmetic
> (exu) slots have several FUs with an assortment of latencies: an ALU, a
> FPU, an integer multiplier, an assortment of 2-cycle wide-data
> operations, and so on. If it should happen that a slot has only one or a
> few latencies then the member config may add one or more extra FFs so as
> to lengthen the FF shift register, to make it more likely that a value
> will die before it gets to the end of the SR.

I don't understand this (maybe it's the word latency again).
Each FU *MUST* have a belt-length SR of FFs just in case that many instructions
show up. It could get 16 IDIVs in a row that eventually drop to slot[0:15].

> A live result that gets to the end is moved to the spiller buffers,
> which are just FFs, almost as if the spiller were just another slot. The
> spiller buffers are able to accept one value from each slot each cycle,
> because the slot SR can only have one value falling off the SR per
> cycle. The spiller buffers get reused when a value they have goes dead;
> the logic is essentially the same as is used in an OOO genreg machine to
> assign physical registers, though the amount of logic is much less than
> that required to assign several hundred genregs because the maximum live
> population is bounded by the belt length.
>
> The number of slots, slot FU (and instruction) populations, the length
> of the per slot SR, the number of spiller buffers and so on are all
> member config parameters, that is balanced for market, expected usage,
> power and other constraints at chip design time.
>
> To take your example of an empty belt and a string of adds: a+b+c+d+e,
> all latency one. Say there is only one exu slot, and it has a 1-cycle
> ALU, a 3-cycle multiplier, and a filler lat-2 FF even though the slot
> has no lat-2 instructions. The belt is assumed to be longer than 5, so
> no intermediate results die in the example.
>
> In the first cycle:
> cycle 1: (a+b)->lat1
> cycle 2: lat1->lat2; (a+b)+c->lat1
> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1
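
Ivan's cycle-by-cycle trace above can be checked with a toy simulation of
one slot's FF daisy chain: one FF per latency, everything shifts one step
per cycle, and a live value falling off the end goes to a spiller buffer.
The class and names below are illustrative assumptions, not the real Mill
hardware; the five results r1..r5 stand in for his chain of adds.

```python
# Toy model of a per-slot result-FF daisy chain with a spiller.
class Slot:
    def __init__(self, depth=3):
        self.ffs = [None] * depth   # ffs[0] = lat1 ... ffs[-1] = lat3
        self.spiller = []           # live values that fell off the chain

    def cycle(self, new_result=None):
        if self.ffs[-1] is not None:
            self.spiller.append(self.ffs[-1])    # lat3 -> sX, sY, ...
        self.ffs = [new_result] + self.ffs[:-1]  # latN -> latN+1 shift

slot = Slot()
for r in ["r1", "r2", "r3", "r4", "r5"]:  # five successive 1-cycle results
    slot.cycle(r)

print(slot.ffs)      # ['r5', 'r4', 'r3'] - newest results still in the FFs
print(slot.spiller)  # ['r1', 'r2']       - spilled on cycles 4 and 5
```

As in the trace, the first spill happens on the fourth drop (lat3 -> sX)
and the second on the fifth (lat3 -> sY).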


Re: RISC-V vs. Aarch64

<bSDFJ.268310$1d1.64158@fx99.iad>


https://www.novabbs.com/devel/article-flat.php?id=22999&group=comp.arch#22999

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!border1.nntp.dca1.giganews.com!nntp.giganews.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx99.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
References: <2021Dec24.163843@mips.complang.tuwien.ac.at> <sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad> <sql2cm$3h7$1@dont-email.me> <sql73d$6es$2@newsreader4.netcologne.de> <sqmj5j$s31$1@dont-email.me> <sqmmso$446$2@newsreader4.netcologne.de> <gs2dnRZj-ucyZ1P8nZ2dnUU78YfNnZ2d@supernews.com> <sqpd0i$spj$1@newsreader4.netcologne.de> <650c822a-3776-4ea9-aa72-5a6b19bdcabbn@googlegroups.com> <sqpocs$1so3$1@gioia.aioe.org> <sqpqbm$7qo$1@newsreader4.netcologne.de> <sqq3ce$c4n$2@dont-email.me> <sqssff$a9j$1@gioia.aioe.org> <077afaee-009e-4860-be45-61106126934bn@googlegroups.com> <squhht$79u$1@dont-email.me> <bb6d49bb-a676-44bd-9a6d-29386d429454n@googlegroups.com> <sr0vhm$c4u$1@dont-email.me> <sr114i$1qc$1@newsreader4.netcologne.de> <sr1dca$70e$1@dont-email.me> <kM%AJ.186634$np6.183460@fx46.iad> <sr2gf6$64u$1@dont-email.me> <7DpBJ.254731$3q9.63673@fx47.iad> <sr62tb$u2o$1@dont-email.me> <ss4g91$hvs$1@dont-email.me> <ss4ktr$dvv$1@dont-email.me> <UZCFJ.4610$uP.4480@fx16.iad>
In-Reply-To: <UZCFJ.4610$uP.4480@fx16.iad>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 67
Message-ID: <bSDFJ.268310$1d1.64158@fx99.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 18 Jan 2022 18:42:15 UTC
Date: Tue, 18 Jan 2022 13:42:05 -0500
X-Received-Bytes: 4092
X-Original-Bytes: 4041
 by: EricP - Tue, 18 Jan 2022 18:42 UTC

EricP wrote:
> Ivan Godard wrote:
>>
>> To take your example of an empty belt and a string of adds: a+b+c+d+e,
>> all latency one. Say there is only one exu slot, and it has a 1-cycle
>> ALU, a 3-cycle multiplier, and a filler lat-2 FF even though the slot
>> has no lat-2 instructions. The belt is assumed to be longer than 5, so
>> no intermediate results die in the example.
>>
>> In the first cycle:
>> cycle 1: (a+b)->lat1
>> cycle 2: lat1->lat2; (a+b)+c->lat1
>> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
>> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
>> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1
>
> This is the easy case with HW (lat,thru) of (1,1) to slot[0:4].
>
> Toss in a second ALU plus (lat,thru) of an IMUL (4,1) and IDIV (16,16).
> In each case the Belt slot inputs to those FU are all valid
> so all launch at once but drop to different slot offsets.

There are multiple things being juggled here.
There is the (latency, throughput) of each type of FU.
There can be multiple FUs of each type.
To take advantage of Mill's wide decode we want to get
as many concurrent, pipelined operations going at once.
Then the results are ordered as they drop onto the belt,
and if multiple results drop in the same clock there must be
some kind of predictable ordering into the belt slots
so the compiler knows where things land.

So just to walk through an example to see if I get the basics....

At time T1 we start IMUL1 which drops its result at T4.
There are also 3 ADDs and only one FU.Alu.
ADD1 also starts at T1, ADD2 and ADD3 at T2 and T3,
which drop at T2, T3 and T4.
At T2 ADD1 finishes, so it goes to slot[0].
At T3 ADD2 goes to slot[0] and ADD1 shifts to slot[1].
At T4 ADD3 and IMUL1 finish at the same time; let's say the rule is
first to start drops first, so ADD1 and ADD2 shift to slot[3] and slot[2],
and IMUL1 and ADD3 drop to slot[1] and slot[0] respectively.

So we wind up with a belt like:

slot[0]  slot[1]  slot[2]  slot[3]
ADD3     IMUL1    ADD2     ADD1

If we had started a second concurrent, pipelined IMUL2 at T2
the result belt looks like:

slot[0]  slot[1]  slot[2]  slot[3]  slot[4]
IMUL2    ADD3     IMUL1    ADD2     ADD1

But in terms of FU result FF it looks like

FU.Alu    FU.IMul          XBAR
 ADD3      IMUL2      =>------------
 ADD2      IMUL1      =>
 ADD1                 =>------------
                        |  |    |  |
                        v  v    v  v
                      FU.Alu   FU.IMul
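
The drop ordering in this walkthrough can be checked with a small script.
The start/drop cycles are taken from the example above, and the
first-to-start-drops-first tie-break is the rule assumed in the post, not
a documented Mill rule:

```python
# Belt drop-ordering sketch for the walkthrough's five operations.
ops = [  # (name, start_cycle, drop_cycle)
    ("ADD1",  1, 2),
    ("ADD2",  2, 3),
    ("ADD3",  3, 4),
    ("IMUL1", 1, 4),   # drops in the same cycle as ADD3
    ("IMUL2", 2, 5),
]

# Of results dropping in the same cycle, the first to *start* drops
# first, so it ends up deeper in the belt.  Python's sort is stable,
# and the (drop, start) key encodes exactly that rule.
drop_order = sorted(ops, key=lambda op: (op[2], op[1]))

# slot[0] is the newest drop, so the belt is the drop order reversed.
belt = [name for name, _, _ in reversed(drop_order)]
print(belt)   # ['IMUL2', 'ADD3', 'IMUL1', 'ADD2', 'ADD1']
```

This reproduces the slot[0..4] contents shown in the second table above.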

Re: RISC-V vs. Aarch64

<ss74j4$ke7$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23000&group=comp.arch#23000

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Tue, 18 Jan 2022 11:31:47 -0800
Organization: A noiseless patient Spider
Lines: 161
Message-ID: <ss74j4$ke7$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad>
<sql2cm$3h7$1@dont-email.me> <sql73d$6es$2@newsreader4.netcologne.de>
<sqmj5j$s31$1@dont-email.me> <sqmmso$446$2@newsreader4.netcologne.de>
<gs2dnRZj-ucyZ1P8nZ2dnUU78YfNnZ2d@supernews.com>
<sqpd0i$spj$1@newsreader4.netcologne.de>
<650c822a-3776-4ea9-aa72-5a6b19bdcabbn@googlegroups.com>
<sqpocs$1so3$1@gioia.aioe.org> <sqpqbm$7qo$1@newsreader4.netcologne.de>
<sqq3ce$c4n$2@dont-email.me> <sqssff$a9j$1@gioia.aioe.org>
<077afaee-009e-4860-be45-61106126934bn@googlegroups.com>
<squhht$79u$1@dont-email.me>
<bb6d49bb-a676-44bd-9a6d-29386d429454n@googlegroups.com>
<sr0vhm$c4u$1@dont-email.me> <sr114i$1qc$1@newsreader4.netcologne.de>
<sr1dca$70e$1@dont-email.me> <kM%AJ.186634$np6.183460@fx46.iad>
<sr2gf6$64u$1@dont-email.me> <7DpBJ.254731$3q9.63673@fx47.iad>
<sr62tb$u2o$1@dont-email.me> <ss4g91$hvs$1@dont-email.me>
<ss4ktr$dvv$1@dont-email.me> <ss5hj9$lhm$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Jan 2022 19:31:48 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="1124bcc9c296a3aac6f7fa093f4e8742";
logging-data="20935"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19+M89eV7H3cj4QXUWV8YP8"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:swWoe9HZ+NanbbLFsdDKoPQ+X1U=
In-Reply-To: <ss5hj9$lhm$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Tue, 18 Jan 2022 19:31 UTC

On 1/17/2022 9:01 PM, Stephen Fuld wrote:
> On 1/17/2022 12:52 PM, Ivan Godard wrote:
>> On 1/17/2022 11:32 AM, Stephen Fuld wrote:
>>> On 1/5/2022 10:40 PM, Ivan Godard wrote:
>>>> On 1/5/2022 3:13 PM, EricP wrote:
>>>>> Ivan Godard wrote:
>>>>>> On 1/4/2022 9:48 AM, EricP wrote:
>>>>>>> Ivan Godard wrote:
>>>>>>>> On 1/4/2022 12:39 AM, Thomas Koenig wrote:
>>>>>>>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>>>>>>>
>>>>>>>>>> Perhaps you haven't noticed me saying: *the belt is not
>>>>>>>>>> physically a
>>>>>>>>>> shift register*.
>>>>>>>>>
>>>>>>>>> It's usually implemented as a circular buffer, correct?
>>>>>>>>
>>>>>>>> Not at all.
>>>>>>>>
>>>>>>>> Computed values are left where they were produced - FU's output
>>>>>>>> latches, for example - just as in a forwarding bypass network.
>>>>>>>> Move only happens if the location is needed by some other
>>>>>>>> computation, and then only to an adjacent location, also on the
>>>>>>>> bypass network - which the way issue happens guarantees is free.
>>>>>>>
>>>>>>> Moving results out of the way is what makes this work.
>>>>>>> If you only have one adder FU and you get a bunch of add
>>>>>>> instructions
>>>>>>> in a row, then you need to stash older results in other registers
>>>>>>> and this becomes the equivalent of a delayed register writeback.
>>>>>>> But I think you need a crossbar to accomplish this while a
>>>>>>> traditional approach just needs some small number of result buses.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Still not :-)
>>>>>>
>>>>>> Our FU slots can (and typically do) support operations of
>>>>>> different natural latencies (pipe lengths). Each slot can accept
>>>>>> one op per cycle, so if the latencies differ you can get more than
>>>>>> one result retiring in the same cycle. Consequently the FUs have
>>>>>> one result FF per supported latency.
>>>>>>
>>>>>> If a op of latency N retires in cycle C to FF#N, necessarily the
>>>>>> following cycle C+1 the FF#N+1 is free (think about it).
>>>>>> Consequently, the FU's FFs are daisy chained so that each cycle
>>>>>> FF#N is moved to FF#N+1  and every result always is retiring to a
>>>>>> known free FF; the set of FFs are right next to each other so the
>>>>>> move is trivial.
>>>>>
>>>>> This sounds like it has a belt's worth of FF for each FU.
>>>>> And some of them are shift registers?
>>>>> I'm a bit confused.
>>>>
>>>> There can be (and usually are) several FUs per slot, forming in
>>>> effect a "superFU". There is only one set of output FFs per slot,
>>>> one per latency. These are daisy chained. I suppose that you can
>>>> think of the output FF daisy as being a shift register, and it could
>>>> be done that way, but it also could be done by simply rotating which
>>>> FF is considered which latency, thereby replacing a physical data
>>>> move with a result-to-FF fanout. That's a HW design choice; IANAHWG.
>>>
>>> I thought I had at least a basic understanding of how you actually
>>> implement the belt, but Eric's post made me question that, and I
>>> don't think you addressed his exact question.  So, suppose you have
>>> exactly one add FU, and that it has a latency of 1. My understanding
>>> is that there is therefore one set of FFs at the add FU's output.
>>> Suppose the belt length is 16, though this isn't particularly critical.
>>> Now suppose you have an instruction stream with say 4-5 consecutive
>>> adds.  I understand that this may take several instructions, as you
>>> only have one add FU.  After the first add, the result is in that FUs
>>> output FFs, and has a some belt position.  After the second add,
>>> since you want the FUs output FFs to hold the result, but they still
>>> have the result from the first add, you need some place to hold the
>>> first result.  I gather this is somewhere in the spiller.  Now let's
>>> keep going to the third add.  Are there more spots in the spiller?
>>> If so, there must be some limit, although it is probably
>>> configurable.  What happens if the number of consecutive adds exceeds
>>> that limit.
>>>
>>> Anyway, I hope you can see my confusion and will clarity it.
>>
>> Looks like the confusion is the difference between FU and slot.
>
> After reading through your response, I agree - that was my confusion.
> See more below.
>
>> A FU is a unit of computation - an ALU or FPU for example. A slot is a
>> unit of encoding. All slots have at least one FU attached, but they
>> can (and usually do) have several. The only requirement is that decode
>> be able to supply to the slot everything that any of the attached FUs
>> needs to do for any of its supported instructions.
>
> So since a slot can encode one operation, of any type supported on that
> slot, at a time, doesn't that lead to lots of "extra" functional units.
>  i.e. if you have two slots, each able to encode an add or a multiply,
> I think you have four FUs, of which only two can be used in any given
> instruction.  Or am I still confused?
>

You are right; the density of FUs on slots means that many FUs can be
unused in any given cycle. It's all a game of configurations - scatter
the FUs and more can be used concurrently, but the crossbar gets bigger.

BTW, I agree with Mitch that a modern ALU is so small it's almost free,
so our current configs tend to put an ALU in every exu slot, and then
scatter the more expensive FUs more sparsely.

Also, remember that Mill is a wide issue machine with bundle encoding,
and the bundles have six (really four as two are data) independently
encoded blocks. Only the exu block (all two-in, one-or-two-out slots and
all the compute instructions) has results that need the FF
chains. Flow block (memory and control flow) and writer block (pure
sinks) have no results, and reader block (pure sources) has special
handling because some of the drops come from decode rather than from
pseudo-FUs.

>
>> The result FFs are per-slot, not per-FU.
>
> Got it.
>
>> There is one FF per latency supported by any instruction of any FU
>> attached to that slot, and for any intermediate latency that happens
>> not to have any instructions that use it. The FFs are logically
>> chained to make a shift register, although for power reasons it may
>> use a different implementation than a physical shift register.
>
> OK.  I understand.
>
>
>> Your example with only one latency is unrealistic; typical arithmetic
>> (exu) slots have several FUs with an assortment of latencies: an ALU,
>> a FPU, an integer multiplier, an assortment of 2-cycle wide-data
>> operations, and so on. If it should happen that a slot has only one or
>> a few latencies then the member config may add one or more extra FFs
>> so as to lengthen the FF shift register, to make it more likely that a
>> value will die before it gets to the end of the SR.
>
> OK.  So each slot has a sort of addressable FIFO of FFs.  It is
> addressable in that each has its own belt position identifier, but it is
> a FIFO in that adding a new entry "pushes" down the other entries.  Is
> that correct?

Each FIFO position has its own physical address, but may over time have
any logical belt position number, or none. Decode does the mapping
between belt position and physical address. There are various hardware
clevernesses involved in maintaining that mapping, some filed and some
NYF and all beyond my ken as a software guy.
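
A sketch of that mapping idea: values stay in fixed physical FFs, and
each drop only updates a small belt-position-to-FF map, so nothing is
physically moved. Entirely illustrative - as noted above, the real
mapping hardware is not public:

```python
# Logical belt positions over fixed physical FFs, updated on each drop.
class BeltMap:
    def __init__(self, belt_len=4):
        self.belt_len = belt_len
        self.map = []  # index = belt position, value = physical FF id

    def drop(self, phys_id):
        # A new drop becomes b0; older values' belt numbers grow by one.
        self.map.insert(0, phys_id)
        del self.map[self.belt_len:]  # positions past the end fall off

    def lookup(self, belt_pos):
        return self.map[belt_pos]

m = BeltMap()
for ff in ["ff7", "ff3", "ff9"]:   # three successive drops
    m.drop(ff)
print(m.lookup(0), m.lookup(2))    # ff9 ff7 - only the map changed
```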

>
> snip a lot of useful exposition.
>
>
>> Does this help?
>
> Yes, very much so.  Thanks.
>
>
>

Re: RISC-V vs. Aarch64

<ss77ba$a3t$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23001&group=comp.arch#23001

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Tue, 18 Jan 2022 12:18:49 -0800
Organization: A noiseless patient Spider
Lines: 265
Message-ID: <ss77ba$a3t$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sqkcvk$n97$1@dont-email.me> <RrlzJ.130558$SR4.25229@fx43.iad>
<sql2cm$3h7$1@dont-email.me> <sql73d$6es$2@newsreader4.netcologne.de>
<sqmj5j$s31$1@dont-email.me> <sqmmso$446$2@newsreader4.netcologne.de>
<gs2dnRZj-ucyZ1P8nZ2dnUU78YfNnZ2d@supernews.com>
<sqpd0i$spj$1@newsreader4.netcologne.de>
<650c822a-3776-4ea9-aa72-5a6b19bdcabbn@googlegroups.com>
<sqpocs$1so3$1@gioia.aioe.org> <sqpqbm$7qo$1@newsreader4.netcologne.de>
<sqq3ce$c4n$2@dont-email.me> <sqssff$a9j$1@gioia.aioe.org>
<077afaee-009e-4860-be45-61106126934bn@googlegroups.com>
<squhht$79u$1@dont-email.me>
<bb6d49bb-a676-44bd-9a6d-29386d429454n@googlegroups.com>
<sr0vhm$c4u$1@dont-email.me> <sr114i$1qc$1@newsreader4.netcologne.de>
<sr1dca$70e$1@dont-email.me> <kM%AJ.186634$np6.183460@fx46.iad>
<sr2gf6$64u$1@dont-email.me> <7DpBJ.254731$3q9.63673@fx47.iad>
<sr62tb$u2o$1@dont-email.me> <ss4g91$hvs$1@dont-email.me>
<ss4ktr$dvv$1@dont-email.me> <UZCFJ.4610$uP.4480@fx16.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Jan 2022 20:18:51 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="1124bcc9c296a3aac6f7fa093f4e8742";
logging-data="10365"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+8ZtMPg/NChXbHq3crGgHD"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.5.0
Cancel-Lock: sha1:bi5XWHK3T5bF1GNm4T9Os8JewDw=
In-Reply-To: <UZCFJ.4610$uP.4480@fx16.iad>
Content-Language: en-US
 by: Ivan Godard - Tue, 18 Jan 2022 20:18 UTC

On 1/18/2022 9:40 AM, EricP wrote:
> Ivan Godard wrote:
>> On 1/17/2022 11:32 AM, Stephen Fuld wrote:
>>> On 1/5/2022 10:40 PM, Ivan Godard wrote:
>>>> On 1/5/2022 3:13 PM, EricP wrote:
>>>>> Ivan Godard wrote:
>>>>>> On 1/4/2022 9:48 AM, EricP wrote:
>>>>>>> Ivan Godard wrote:
>>>>>>>> On 1/4/2022 12:39 AM, Thomas Koenig wrote:
>>>>>>>>> Ivan Godard <ivan@millcomputing.com> schrieb:
>>>>>>>>>
>>>>>>>>>> Perhaps you haven't noticed me saying: *the belt is not
>>>>>>>>>> physically a
>>>>>>>>>> shift register*.
>>>>>>>>>
>>>>>>>>> It's usually implemented as a circular buffer, correct?
>>>>>>>>
>>>>>>>> Not at all.
>>>>>>>>
>>>>>>>> Computed values are left where they were produced - FU's output
>>>>>>>> latches, for example - just as in a forwarding bypass network.
>>>>>>>> Move only happens if the location is needed by some other
>>>>>>>> computation, and then only to an adjacent location, also on the
>>>>>>>> bypass network - which the way issue happens guarantees is free.
>>>>>>>
>>>>>>> Moving results out of the way is what makes this work.
>>>>>>> If you only have one adder FU and you get a bunch of add
>>>>>>> instructions
>>>>>>> in a row, then you need to stash older results in other registers
>>>>>>> and this becomes the equivalent of a delayed register writeback.
>>>>>>> But I think you need a crossbar to accomplish this while a
>>>>>>> traditional approach just needs some small number of result buses.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Still not :-)
>>>>>>
>>>>>> Our FU slots can (and typically do) support operations of
>>>>>> different natural latencies (pipe lengths). Each slot can accept
>>>>>> one op per cycle, so if the latencies differ you can get more than
>>>>>> one result retiring in the same cycle. Consequently the FUs have
>>>>>> one result FF per supported latency.
>>>>>>
>>>>>> If an op of latency N retires in cycle C to FF#N, necessarily the
>>>>>> following cycle C+1 the FF#N+1 is free (think about it).
>>>>>> Consequently, the FU's FFs are daisy chained so that each cycle
>>>>>> FF#N is moved to FF#N+1  and every result always is retiring to a
>>>>>> known free FF; the set of FFs are right next to each other so the
>>>>>> move is trivial.
>>>>>
>>>>> This sounds like it has a belt's worth of FF for each FU.
>>>>> And some of them are shift registers?
>>>>> I'm a bit confused.
>>>>
>>>> There can be (and usually are) several FUs per slot, forming in
>>>> effect a "superFU". There is only one set of output FFs per slot,
>>>> one per latency. These are daisy chained. I suppose that you can
>>>> think of the output FF daisy as being a shift register, and it could
>>>> be done that way, but it also could be done by simply rotating which
>>>> FF is considered which latency, thereby replacing a physical data
>>>> move with a result-to-FF fanout. That's a HW design choice; IANAHWG.
>>>
>>> I thought I had at least a basic understanding of how you actually
>>> implement the belt, but Eric's post made me question that, and I
>>> don't think you addressed his exact question.
>
> Yeah, I didn't understand either. I found the patent for it (below) and
> was looking at it for insights, but I wanted to get back to something else
> I was looking at, Load-Store Queues, so I was going to get back again
> later.
>
>>> So, suppose you have exactly one add FU, and that it has a latency of
>>> 1. My understanding is that there is therefore one set of FFs at the
>>> add FU's output. Suppose the belt length is 16, though this isn't
>>> particularly critical.
>>> Now suppose you have an instruction stream with say 4-5 consecutive
>>> adds.  I understand that this may take several instructions, as you
>>> only have one add FU.  After the first add, the result is in that FUs
>>> output FFs, and has a some belt position.  After the second add,
>>> since you want the FUs output FFs to hold the result, but they still
>>> have the result from the first add, you need some place to hold the
>>> first result.  I gather this is somewhere in the spiller.  Now let's
>>> keep going to the third add.  Are there more spots in the spiller?
>>> If so, there must be some limit, although it is probably
>>> configurable.  What happens if the number of consecutive adds exceeds
>>> that limit.
>>>
>>> Anyway, I hope you can see my confusion and will clarify it.
>>
>> Looks like the confusion is the difference between FU and slot. A FU
>> is a unit of computation - an ALU or FPU for example. A slot is a unit
>> of encoding. All slots have at least one FU attached, but they can
>> (and usually do) have several. The only requirement is that decode be
>> able to supply to the slot everything that any of the attached FUs
>> needs to do for any of its supported instructions.
>
> This is still confusing to me.
> It may be because the term "latency" seems to be used two ways:
> - latency to mean slot offset slot[0], slot[1]
> - each HW FU has a latency and throughput (delay to start next operation).
>   ALU latency 1 throughput 1, int-MUL lat 4 thru 1, int-DIV lat 16 thru
> 16.

Latency is the time (in issue cycles) between issue of an instruction
and the retire ("drop") of its result, or the completion of its side
effects. Mill is statically scheduled, so every kind of instruction has
a tape-out-time-defined latency, known to the code scheduler
(specializer) and reflected in the generated code.

FUs do not have latency; individual instructions have latency. It is
common for a FU to support a mix of instructions of different latencies.
An FPU may support a SP FADD in three cycles and a DP FMUL in five. At
issue the FU uses signals from decode to adjust what it does and the
width it does it at. At retire, the FADD will drop its result in the
lat3 slot physical FF, and the FMUL will drop to the lat5 slot physical
FF. The mapping of belt # to physical FF will maintain the correct belt
physical ordering for as long as the drops remain live.

>> The result FFs are per-slot, not per-FU. There is one FF per latency
>> supported by any instruction of any FU attached to that slot, and for
>> any intermediate latency that happens not to have any instructions
>> that use it. The FFs are logically chained to make a shift register,
>> although for power reasons it may use a different implementation than
>> a physical shift register.
>
> The Belt is a logical shift register - all FU's drop their results
> into slot[-1] then shift that right into slot[0].
> There is no OoO so if a slot[-1] result isn't ready the belt shift stalls.
>
> If belt is length 16, then for any mix of instructions there are
> only 16 logically valid results in this belt.
> If I have 16 adds in a row, then FU.Alu must have a
> 16 FF shift register to hold those results.
> If there is a second Alu, then another 16 FF shift register for that.
>
> Only 16 of those slots can hold valid values, but from your
> description we'd need 32 FF and 32 xbar operand pair inputs,
> and 2 xbar FU operand pair outputs.
> If we toss in an IMUL and IDIV then 64 FF and xbar inputs,
> with 4 xbar operand pair outputs.
>
> From your description this would require the HW FF resources of an
> OoO core with 64 physical registers, but derives none of the benefits
> of OoO.
>
> (This is why I assumed that Belt has a second xbar to route results
> in my prior little diagram - it eliminates this slots*FU expansion.)

Your description would be true iff the FF chain were the only residence
of belt values; that's why we don't do it that way :-)

Because a slot can support instructions of a mix of latencies, only one
of which can issue in a given cycle, if the instructions are issued with
the right timing you can get more than one retiring in the same cycle.
For example, if there were lat1, lat2, and lat3 supported, and in
consecutive cycles you issued ilat3 - ilat2 - ilat1 in that order, then
the following cycle would have three retires simultaneously, and so
necessarily (at least) three result FFs.
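That timing can be sketched as a toy model (Python; the `ilatN` names follow the example above, everything else is illustrative, not Mill internals):

```python
# Toy model of one Mill slot whose instructions span latencies 1..3.
# Issuing ilat3, then ilat2, then ilat1 in consecutive cycles makes
# all three results retire in the same cycle, which is why the slot
# needs (at least) three result FFs.

def simulate(issues):
    """issues[i] is the latency of the op issued in cycle i (or None).
    Returns a map: retire cycle -> ops retiring that cycle."""
    retires = {}
    for cycle, lat in enumerate(issues):
        if lat is not None:
            retires.setdefault(cycle + lat, []).append(f"ilat{lat}")
    return retires

# cycle 0: ilat3, cycle 1: ilat2, cycle 2: ilat1
print(simulate([3, 2, 1]))  # {3: ['ilat3', 'ilat2', 'ilat1']}
```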


Re: RISC-V vs. Aarch64


https://www.novabbs.com/devel/article-flat.php?id=23002&group=comp.arch#23002

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Tue, 18 Jan 2022 12:36:04 -0800
Message-ID: <ss78bk$hm6$1@dont-email.me>
 by: Ivan Godard - Tue, 18 Jan 2022 20:36 UTC

On 1/18/2022 10:42 AM, EricP wrote:
> EricP wrote:
>> Ivan Godard wrote:
>>>
>>> To take your example of an empty belt and a string of adds:
>>> a+b+c+d+e, all latency one. Say there is only one exu slot, and it
>>> has a 1-cycle ALU, a 3-cycle multiplier, and a filler lat-2 FF even
>>> though the slot has no lat-2 instructions. The belt is assumed to be
>>> longer than 5, so no intermediate results die in the example.
>>>
>>> In the first cycle:
>>> cycle 1: (a+b)->lat1
>>> cycle 2: lat1->lat2; (a+b)+c->lat1
>>> cycle 3: lat2->lat3; lat1->lat2; (a+b+c)+d->lat1
>>> cycle 4: lat3->sX; lat2->lat3; lat1->lat2; (a+b+c+d)->lat1
>>> cycle 5: (sX); lat3->sY; lat2->lat3; lat1->lat2; (a+b+c+d+e)->lat1
>>
>> This is the easy case with HW (lat,thru) of (1,1) to slot[0:4].
>>
>> Toss in a second ALU plus (lat,thru) of an IMUL (4,1) and IDIV (16,16).
>> In each case the Belt slot inputs to those FU are all valid
>> so all launch at once but drop to different slot offsets.
>
> There are multiple things being juggled here.
> There is the (latency,throughput) of each type of FU.

Latency is per-instruction-kind, not per FU; FUs can support
instruction kinds with different latencies. This is especially common
with arguments of different widths. Mill does not promote everything to
some common "register width", but instead works like SIMD lanes and
executes in the natural width of the data. A Mill FMUL may be lat3 in w
width, lat4 in d width, and lat5 in q width. A legacy ISA may do the
same using different instructions (FMUL vs. DMUL vs ALPHABET_SOUP), but
Mill has the data encode its own width.
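As a sketch of the static lookup the specializer might perform (the FMUL w/d/q latencies follow the example above; the ADD rows and all names are invented for illustration):

```python
# Hypothetical static latency table keyed by (instruction, width).
# The FMUL rows follow the w/d/q example above; the ADD rows and all
# names are invented. The specializer consults such a table at
# compile time -- nothing about latency is resolved dynamically.
LATENCY = {
    ("fmul", "w"): 3,  # word (single) width
    ("fmul", "d"): 4,  # double width
    ("fmul", "q"): 5,  # quad width
    ("add",  "w"): 1,
    ("add",  "d"): 1,
}

def retire_cycle(issue_cycle, op, width):
    """Retire cycle of an op, known statically from its kind and width."""
    return issue_cycle + LATENCY[(op, width)]

print(retire_cycle(0, "fmul", "q"))  # 5
```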

> There can be multiple FU of each type.
> To take advantage of Mill's wide decode we want to get
> as many concurrent, pipelined operations going at once.
> Then it orders the results as they drop onto the belt,
> and if multiple results drop at the same clock then
> some kind of predictable ordering into the belt slots
> so the compiler knows where things appear to land.

Yes

> So just to walk through an example to see if I get the basics....
>
> At time T1 we start IMUL1 which drops its result at T4.
> There are also 3 ADDs and only one FU.Alu.
> ADD1 also starts at T1, ADD2 and ADD3  at T2 and T3,
> which drop at T2, T3 and T4.

A slot can only issue one instruction per cycle, so it is not possible
for the IMUL1 and ADD1 to start in the same cycle in the same slot.

> At T2 ADD1 finishes so goes to slot[0].

Slots are encoding notions naming collections of FUs that share input
paths and output FF sets. You will get mighty confused if you use "slot"
for physical FF or for belt position number :-)

> At T3 Add2 goes to slot[0], ADD1 shifts to slot[1].
> At T4 ADD3 and IMUL1 finish at the same time, and lets say the rule is
> first to start drops first, so ADD1, ADD2 shift to slot[3], slot[2]
> and IMUL1 and ADD3 drop to slot[1] and slot[0] respectively.
>
> So we wind up with a belt like:
>
>   slot[0] slot[1] slot[2] slot[3]
>   ADD3    IMUL1   ADD2    ADD1
>
> If we had started a second concurrent, pipelined IMUL2 at T2
> the result belt looks like:
>
>   slot[0] slot[1] slot[2] slot[3] slot[4]
>   IMUL2   ADD3    IMUL1   ADD2    ADD1
>
> But in terms of FU result FF it looks like
>
>   FU.Alu  FU.IMul     XBAR
>   ADD3    IMUL2    =>------------
>   ADD2    IMUL1    =>
>   ADD1             =>------------
>                       |  |   |  |
>                       v  v   v  v
>                      FU.Alu FU.IMul

s/slot/belt#/ and it makes more sense. The actual ordering (you are
right that the hardware and compiler must agree) is member specific and
more complicated to deal with things like phasing, non-FU values, and
control flow.
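Eric's walkthrough, under his assumed first-to-issue-drops-first rule (the real ordering is member-specific, as noted, and ADD/IMUL are taken to issue from separate slots), can be checked with a toy belt model:

```python
from collections import deque

# Toy logical belt. Each op drops at issue + latency; when several
# drop in the same cycle we apply Eric's assumed rule (first issued
# drops first, so it lands further from belt[0]). The real rule is
# Mill-member specific.
def run(ops, belt_len=8):
    """ops: list of (name, issue_cycle, latency) tuples."""
    belt = deque(maxlen=belt_len)  # belt[0] is the newest drop
    drops = {}
    for name, issue, lat in ops:
        drops.setdefault(issue + lat, []).append((issue, name))
    for cycle in sorted(drops):
        for _, name in sorted(drops[cycle]):  # earlier issue drops first
            belt.appendleft(name)
    return list(belt)

ops = [("ADD1", 1, 1), ("IMUL1", 1, 3), ("ADD2", 2, 1), ("ADD3", 3, 1)]
print(run(ops))  # ['ADD3', 'IMUL1', 'ADD2', 'ADD1'], as in the walkthrough
```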

The type of Mill's belt's slots


https://www.novabbs.com/devel/article-flat.php?id=23003&group=comp.arch#23003

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: The type of Mill's belt's slots
Date: Tue, 18 Jan 2022 15:58:36 -0500
Message-ID: <jwvo848n4ud.fsf-monnier+comp.arch@gnu.org>
 by: Stefan Monnier - Tue, 18 Jan 2022 20:58 UTC

Ivan Godard [2022-01-18 12:36:04] wrote:
[...]
> vs. DMUL vs ALPHABET_SOUP), but Mill has the data encode its own width.

That reminds me: where (physically) is this type information kept?
It seems to me that keeping it with the data itself could be a problem
because it makes it available only rather late.
Do you try and do "type inference/propagation" in the decoder and keep
track of types there (or in the belt's logical-to-physical mapping
maybe), or is it really kept alongside the data and is hence only made
available to the next instruction when the data itself is
made available (so e.g. we don't know the latency of a MUL until we get
its operands)?

Stefan

Re: The type of Mill's belt's slots


https://www.novabbs.com/devel/article-flat.php?id=23005&group=comp.arch#23005

Newsgroups: comp.arch
Date: Tue, 18 Jan 2022 14:55:50 -0800 (PST)
Message-ID: <e58b2ec8-e6b0-422e-b657-7d66af821d74n@googlegroups.com>
Subject: Re: The type of Mill's belt's slots
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 18 Jan 2022 22:55 UTC

On Tuesday, January 18, 2022 at 2:58:39 PM UTC-6, Stefan Monnier wrote:
> Ivan Godard [2022-01-18 12:36:04] wrote:
> [...]
> > vs. DMUL vs ALPHABET_SOUP), but Mill has the data encode its own width.
> That reminds me: where (physically) is this type information kept?
> It seems to me that keeping it with the data itself could be a problem
> because it makes it available only rather late.
<
This is the kind of stuff one keeps with the instruction (bits).
<
> Do you try and do "type inference/propagation" in the decoder and keep
> track of types there (or in the belt's logical-to-physical mapping
<
MILL is software scheduled, so this is all known at compile time.
<
> maybe), or is it really kept alongside the data and is hence only made
> available to the next instruction when the data itself is
> made available (so e.g. we don't know the latency of a MUL until we get
> its operands)?
>
>
> Stefan

Re: The type of Mill's belt's slots


https://www.novabbs.com/devel/article-flat.php?id=23006&group=comp.arch#23006

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: The type of Mill's belt's slots
Date: Tue, 18 Jan 2022 15:31:53 -0800
Message-ID: <ss7ila$obl$1@dont-email.me>
 by: Ivan Godard - Tue, 18 Jan 2022 23:31 UTC

On 1/18/2022 12:58 PM, Stefan Monnier wrote:
> Ivan Godard [2022-01-18 12:36:04] wrote:
> [...]
>> vs. DMUL vs ALPHABET_SOUP), but Mill has the data encode its own width.
>
> That reminds me: where (physically) is this type information kept?
>> It seems to me that keeping it with the data itself could be a problem
> because it makes it available only rather late.
> Do you try and do "type inference/propagation" in the decoder and keep
> track of types there (or in the belt's logical-to-physical mapping
> maybe), or is it really kept alongside the data and is hence only made
> available to the next instruction when the data itself is
> made available (so e.g. we don't know the latency of a MUL until we get
> its operands)?
>
>
> Stefan

It is both kept with the data and projected by the decoder/mapper.
Execution follows the projected width; the actual (from the data) is
matched against the expected and faults on a mismatch.
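A minimal sketch of that projected-vs-actual check (all names invented, not Mill internals):

```python
# Sketch of the projected-vs-actual width check (all names invented).
# The decoder/mapper projects each operand's width statically; the
# operand also carries a width tag with its data. A disagreement
# means a specializer bug, a hardware fault, or a tampered binary.
class WidthFault(Exception):
    pass

def read_operand(projected_width, operand):
    actual = operand["width"]  # tag carried alongside the data
    if actual != projected_width:
        raise WidthFault(f"expected width {projected_width}, got {actual}")
    return operand["value"]

print(read_operand(32, {"value": 7, "width": 32}))  # 7
```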

Re: The type of Mill's belt's slots


https://www.novabbs.com/devel/article-flat.php?id=23007&group=comp.arch#23007

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: The type of Mill's belt's slots
Date: Tue, 18 Jan 2022 22:35:46 -0500
Message-ID: <jwv7dawmmb7.fsf-monnier+comp.arch@gnu.org>
 by: Stefan Monnier - Wed, 19 Jan 2022 03:35 UTC

> It is both kept with the data and projected by the decoder/mapper. Execution
> follows the projected width; the actual (from the data) is matched against
> the expected and faults on a mismatch.

Hmm... how/when would such a mismatch occur?
Do you have instructions whose output width is data-dependent?
Where is it useful/used? Do you then have instructions to dynamically
test the width of some belt slot?

Dynamically typed machine language?

Stefan

Re: The type of Mill's belt's slots


https://www.novabbs.com/devel/article-flat.php?id=23008&group=comp.arch#23008

From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: The type of Mill's belt's slots
Date: Tue, 18 Jan 2022 20:49:16 -0800
Message-ID: <ss858d$m0a$1@dont-email.me>
 by: Ivan Godard - Wed, 19 Jan 2022 04:49 UTC

On 1/18/2022 7:35 PM, Stefan Monnier wrote:
>> It is both kept with the data and projected by the decoder/mapper. Execution
>> follows the projected width; the actual (from the data) is matched against
>> the expected and faults on a mismatch.
>
> Hmm... how/when would such a mismatch occur?
> Do you have instructions whose output width is data-dependent?
> Where is it useful/used? Do you then have instructions to dynamically
> test the width of some belt slot?
>
> Dynamically typed machine language?
>
>
> Stefan

Width mismatch can arise from software (specializer bug) or hardware
(gamma ray) error, or from an attack that generates or modifies
binaries, before or after execution begins. The Mill design cares *very*
much about RAS issues and sanity-checks everything it can.

Most instructions that have results have data-dependent output widths.
Usually the result width is the same as the argument widths, but all
instructions that can overflow (including nearly all arithmetic
instructions) have a variant that produces a double-width result.
Obviously the WIDEN and NARROW instructions change the width, and there
are a few odd-balls.

The belt evaporates too quickly to do much useful with it dynamically. A
debugger or exception handler that needs to dynamically interpret a belt
can call a few nested functions to cause it to be pushed into the
spiller and thence to DRAM, where software can see the widths in the
spill-format data.

Re: The type of Mill's belt's slots


https://www.novabbs.com/devel/article-flat.php?id=23013&group=comp.arch#23013

From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: The type of Mill's belt's slots
Date: Wed, 19 Jan 2022 10:36:28 -0500
Message-ID: <jwv7davagpd.fsf-monnier+comp.arch@gnu.org>
 by: Stefan Monnier - Wed, 19 Jan 2022 15:36 UTC

Ivan Godard [2022-01-18 20:49:16] wrote:
> Width mismatch can arise from software (specializer bug)

That would be a case like adding a 32-bit int and a 64-bit int?
These can be detected in the decoder/mapper part, right?

> or hardware (gamma ray) error,

Sanity checks, like ECC, are good, yes.

> Most instructions that have results have data-dependent output widths.
> Usually the result width is the same as the argument widths, but all
> instructions that can overflow (including nearly all arithmetic
> instructions) have a variant that produces a double-width result.

But the width is not data-dependent, in the sense that it depends only
on the input widths, not the input values. So the decoder/mapper can
always correctly predict the output width (and it can also correctly
predict mismatches like adding a 32-bit to a 64-bit arg); it never
needs to guess, right?
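That static prediction might be sketched like this (a toy width-inference rule; the "x" suffix marking the hypothetical double-width overflow variant is invented for this sketch):

```python
# Toy width-inference rule for the point above: result width is a
# function of input widths only, never input values, so the decoder
# can compute it statically. The "x" suffix marking the hypothetical
# double-width (overflow) variant is invented for this sketch.
def output_width(op, w1, w2):
    if w1 != w2:
        raise ValueError(f"width mismatch: {w1} vs {w2}")  # statically detectable
    return 2 * w1 if op.endswith("x") else w1

print(output_width("add", 32, 32))   # 32
print(output_width("addx", 32, 32))  # 64
```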

IOW, when you said:

> the actual (from the data) is matched against the expected and faults
> on a mismatch.

This is only needed to detect hardware errors.

Stefan
