chained multi-issue reg-renaming in the same clock cycle: is it possible?

<72173b50-2325-4746-8628-d9cd0f24680an@googlegroups.com>

Newsgroups: comp.arch
Date: Sat, 15 Apr 2023 01:29:58 -0700 (PDT)
Message-ID: <72173b50-2325-4746-8628-d9cd0f24680an@googlegroups.com>
Subject: chained multi-issue reg-renaming in the same clock cycle: is it possible?
From: luke.lei...@gmail.com (luke.l...@gmail.com)

by: luke.l...@gmail.com - Sat, 15 Apr 2023 08:29 UTC

an interesting question has come up where i don't know
if it's even possible (in reasonable gate propagation time
given how large Dependency Matrices can get).

the scenario is:

* a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
* each instruction (for the sake of simplicity) is Scalar
* the multi-issue width is 12 to 16

the desire here is to obviously fit *four* simultaneous
copies of that loop into the same clock cycle, but to
do so requires *THREE* Write-after-Write register
renames in a row (four on the next clock cycle if one
is still "in the system").

my question is: is this even possible, without having to
go to massive-wide register operations as a work-around?

my current understanding of how WaW reg-rename
works is that it would be perfectly possible (without
huge gate cascades) to do at least one reg-renaming
in the same clock cycle, but that only gets 50%
utilisation.

has anyone encountered or attempted this before, or
have any good ideas? i know VVM auto-vectorisation
would solve this because the LOOP construct marks
the LD-COMPUTE-ST loop-counter, load, and store
registers such that hazards can be dropped on those.

any other ideas?

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<Y%y_L.110710$qpNc.31718@fx03.iad>

by: EricP - Sat, 15 Apr 2023 15:01 UTC

luke.l...@gmail.com wrote:
> an interesting question has come up where i don't know
> if it's even possible (in reasonable gate propagation time
> given how large Dependency Matrices can get).
>
> the scenario is:
>
> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> * each instruction (for the sake of simplicity) is Scalar
> * the multi-issue width is 12 to 16
>
> the desire here is to obviously fit *four* simultaneous
> copies of that loop into the same clock cycle, but to
> do so requires *THREE* Write-after-Write register
> renames in a row (four on the next clock cycle if one
> is still "in the system").
>
> my question is: is this even possible, without having to
> go to massive-wide register operations as a work-around?
>
> my current understanding of how WaW reg-rename
> works is that it would be perfectly possible (without
> huge gate cascades) to do at least one reg-renaming
> in the same clock cycle, but that only gets 50%
> utilisation.
>
> has anyone encountered or attempted this before, or
> have any good ideas? i know VVM auto-vectorisation
> would solve this because the LOOP construct marks
> the LD-COMPUTE-ST loop-counter, load, and store
> registers such that hazards can be dropped on those.
>
> any other ideas?
>
> l.

I'm not sure what you are asking as rename is a separate
stage from execute.

Does "three write-after-write register renames" mean
three concurrent back-to-back result forwardings,
whereby results are forwarded from uOp to uOp, triggering an
immediate wake-up & launch of dest uOps with no intervening clocks?

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<2023Apr15.165917@mips.complang.tuwien.ac.at>

by: Anton Ertl - Sat, 15 Apr 2023 14:59 UTC

"luke.l...@gmail.com" <luke.leighton@gmail.com> writes:
>* a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
>* each instruction (for the sake of simplicity) is Scalar
>* the multi-issue width is 12 to 16
>
>the desire here is to obviously fit *four* simultaneous
>copies of that loop into the same clock cycle, but to
>do so requires *THREE* Write-after-Write register
>renames in a row (four on the next clock cycle if one
>is still "in the system").
>
>my question is: is this even possible, without having to
>go to massive-wide register operations as a work-around?

Issue width 12 has not been demonstrated in a register-renaming CPU
yet. The Zen 4 register renamer can process 6 macro-ops per cycle,
and Intel's register renamer can process 6 macro-ops IIRC. The
numbers I remember for ARM and Apple (but not sure which structure
they refer to) are in the 7-9 range.

>my current understanding of how WaW reg-rename
>works is that it would be perfectly possible (without
>huge gate cascades) to do at least one reg-renaming
>in the same clock cycle, but that only gets 50%
>utilisation.

I never heard of or experienced any limitations in that area. A
slightly different, but probably related problem: Both Intel and AMD
manage to process a chain of 6 *dependent* mov reg,reg instructions in
a cycle into nothing in the renamer.
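A rough software model of how a renamer can make a chain of dependent register moves "disappear" is sketched below. It is only an illustration of the general move-elimination idea, not a description of the actual Intel or AMD hardware: the function name `rename_moves`, the register names, and the table layout are all assumptions, and the reference counting a real design needs before freeing physical registers is left out.

```python
def rename_moves(moves, rat):
    """Rename a batch of `mov dst, src` ops against a register alias table.

    Each move is eliminated: dst is simply pointed at whatever physical
    register currently holds src, so nothing is sent to an execution unit.
    """
    rat = dict(rat)  # work on a copy; the caller's table is untouched
    for dst, src in moves:  # program order
        rat[dst] = rat[src]
    return rat


# Hypothetical starting state: r0..r7 map to p0..p7.
rat = {f"r{i}": f"p{i}" for i in range(8)}
# Six dependent moves: r1 <- r0, r2 <- r1, ..., r6 <- r5.
chain = [(f"r{i + 1}", f"r{i}") for i in range(6)]
new_rat = rename_moves(chain, rat)
print(new_rat)
```

After the batch, r0 through r6 all alias the same physical register, which is why the whole chain costs no execution slots in this model.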

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<5d5c6894-db33-4d9c-a8be-1389c03fd719n@googlegroups.com>

by: luke.l...@gmail.com - Sat, 15 Apr 2023 16:41 UTC

On Saturday, April 15, 2023 at 4:02:22 PM UTC+1, EricP wrote:

> I'm not sure what you are asking as rename is a separate
> stage from execute.

indeed. it's usually performed just after the Decode phase.

> Does "three write-after-write register renames" mean
> three concurrent back-to-back result forwardings,
> whereby results are forwarded from uOp to uOp triggering an
> immediate wake-up & launch of dest uOps with no intervening clocks?

no.

three write-after-write register renames need (in the scenario
i am envisioning) to occur right there, right then, just after
Decode, having been detected as part of the Hazard Dependency
Matrix setup.

in one issue-batch:

op1       | op2         | op3      | op4       | op5         | op6
LD r0, r1 | ADD r0,r0,1 | ST r0,r1 | LD r0, r1 | ADD r0,r0,1 | ST r0,r1

those need to *not* stall just because r0 is loaded, calculated
and stored twice in the same clock cycle.

a *single* reg-rename (on r0 and r1) is perfectly reasonably achievable.

but *multiple* reg-renames in the same clock cycle, to avoid the
Write-after-Write Hazards? that's what i don't know is possible.
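The six-op batch above can be checked for write-after-write hazards inside a single issue group with a small software model. This is a minimal sketch under stated assumptions: the tuple layout and the name `waw_pairs` are made up for illustration, and `ST r0,r1` is read as "store r0 to the address in r1", so ST writes no register.

```python
def waw_pairs(batch):
    """Return (earlier_index, later_index, reg) for each WaW hazard in a batch.

    batch is a list of (mnemonic, dest_reg_or_None, source_regs) in program
    order. Only the nearest earlier writer of each register is reported.
    """
    last_writer = {}  # arch reg -> index of the most recent writer so far
    pairs = []
    for i, (_name, dst, _srcs) in enumerate(batch):
        if dst is None:  # stores and branches write no register
            continue
        if dst in last_writer:
            pairs.append((last_writer[dst], i, dst))
        last_writer[dst] = i
    return pairs


batch = [
    ("LD", "r0", ("r1",)),
    ("ADD", "r0", ("r0",)),
    ("ST", None, ("r0", "r1")),
    ("LD", "r0", ("r1",)),
    ("ADD", "r0", ("r0",)),
    ("ST", None, ("r0", "r1")),
]
hazards = waw_pairs(batch)
print(hazards)  # [(0, 1, 'r0'), (1, 3, 'r0'), (3, 4, 'r0')]
```

In this model the two loop copies already produce three chained WaW hazards on r0, each of which needs its own rename if the batch is to issue in one cycle.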

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<c6650342-d0ff-49d0-8283-ae5e197bf3c9n@googlegroups.com>

by: luke.l...@gmail.com - Sat, 15 Apr 2023 16:45 UTC

On Saturday, April 15, 2023 at 4:13:43 PM UTC+1, Anton Ertl wrote:

> Issue width 12 has not been demonstrated in a register-renaming CPU
> yet.

the M1 has 8, and sustains 100% throughput.

> The Zen 4 register renamer can process 6 macro-ops per cycle,
> and Intel's register renamer can process 6 macro-ops IIRC. The
> numbers I remember for ARM and Apple (but not sure which structure
> they refer to) are in the 7-9 range.

good to know. ah, i know: inline loop-unrolled assembler would
hit a reg-renamer pretty hard even on 8-wide OoO issue.

> >my current understanding of how WaW reg-rename
> >works is that it would be perfectly possible (without
> >huge gate cascades) to do at least one reg-renaming
> >in the same clock cycle, but that only gets 50%
> >utilisation.
> I never heard of or experienced any limitations in that area. A
> slightly different, but probably related problem: Both Intel and AMD
> manage to process a chain of 6 *dependent* mov reg,reg instructions in
> a cycle into nothing in the renamer.

this would not surprise me: they would be forced to, because the
number of regs in x86 is so small. i heard that Intel actually does
RISC micro-coding at the back-end, translating x86 into an internal
RISC ISA.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<5ed8a491-20ff-4eb5-9301-ddcb0e1d03f4n@googlegroups.com>

by: luke.l...@gmail.com - Sat, 15 Apr 2023 16:50 UTC

On Saturday, April 15, 2023 at 4:02:22 PM UTC+1, EricP wrote:

> Does "three write-after-write register renames" mean
> three concurrent back-to-back result forwardings,
> whereby results are forwarded from uOp to uOp triggering an
> immediate wake-up & launch of dest uOps with no intervening clocks?

a much simpler example:

QTY 6 of loop-unrolled repetitions of "ADD r0, r0, 1" on an 8-wide
multi-issue OoO system. these should all be renamed:

1st ADD reads r0, stores in r0.1 (temporary, Reservation Station)
2nd ADD reads r0.1, stores in r0.2
....
6th ADD reads r0.5, stores in r0

but although the ADDs themselves should be pipeline-chained,
actually getting them *into* the Reservation Stations at the
Decode Phase (and moving on to the next batch of 8
fetched instructions) should take *one* clock cycle, not 4 or 5.
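The r0.1, r0.2, ... naming in the example above can be reproduced with a short sequential model. It is a sketch only: the function name `rename_batch_versioned` and the (dest, sources) tuple layout are assumptions, and the loop runs one op at a time, whereas the whole point of the question is whether hardware can produce the same result for every op in a single clock.

```python
def rename_batch_versioned(ops):
    """Rename one issue batch using temporary versioned names.

    ops is a list of (dest, sources) in program order. Every write to a
    register gets a temporary name "reg.N", except the last write in the
    batch, which lands in the architectural register itself.
    """
    last_writer = {dst: i for i, (dst, _srcs) in enumerate(ops)}
    current = {}  # arch reg -> name holding its newest value in this batch
    version = {}  # arch reg -> number of temporaries handed out so far
    renamed = []
    for i, (dst, srcs) in enumerate(ops):
        new_srcs = tuple(current.get(s, s) for s in srcs)
        if last_writer[dst] == i:
            new_dst = dst
        else:
            version[dst] = version.get(dst, 0) + 1
            new_dst = f"{dst}.{version[dst]}"
        current[dst] = new_dst
        renamed.append((new_dst, new_srcs))
    return renamed


# Six copies of "ADD r0, r0, 1" (the immediate is omitted from the model).
adds = [("r0", ("r0",))] * 6
for dst, srcs in rename_batch_versioned(adds):
    print(f"ADD reads {srcs[0]}, stores in {dst}")
```

The first line printed is "ADD reads r0, stores in r0.1" and the last is "ADD reads r0.5, stores in r0", matching the hand-written list.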

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<jjB_L.2326801$vBI8.1177138@fx15.iad>

by: Scott Lurndal - Sat, 15 Apr 2023 17:39 UTC

"luke.l...@gmail.com" <luke.leighton@gmail.com> writes:
>On Saturday, April 15, 2023 at 4:13:43 PM UTC+1, Anton Ertl wrote:
>
>> Issue width 12 has not been demonstrated in a register-renaming CPU
>> yet.
>
>the M1 has 8, and sustains 100% throughput.
>
>> The Zen 4 register renamer can process 6 macro-ops per cycle,
>> and Intel's register renamer can process 6 macro-ops IIRC. The
>> numbers I remember for ARM and Apple (but not sure which structure
>> they refer to) are in the 7-9 range.
>
>good to know. ah, i know: inline loop-unrolled assembler would
>hit a reg-renamer pretty hard even on 8-wide OoO issue.
>
>> >my current understanding of how WaW reg-rename
>> >works is that it would be perfectly possible (without
>> >huge gate cascades) to do at least one reg-renaming
>> >in the same clock cycle, but that only gets 50%
>> >utilisation.
>> I never heard of or experienced any limitations in that area. A
>> slightly different, but probably related problem: Both Intel and AMD
>> manage to process a chain of 6 *dependent* mov reg,reg instructions in
>> a cycle into nothing in the renamer.
>
>this would not surprise me, they would be forced to because the
>number of regs in x86 is so small,

There are quite a few more in x86_64.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<0845acba-2ac4-4e9e-b237-6585a461e31an@googlegroups.com>

by: luke.l...@gmail.com - Sun, 16 Apr 2023 21:53 UTC

On Sunday, April 16, 2023 at 8:03:36 PM UTC+1, EricP wrote:

> One answer is brute force: each parallel Rename lane propagates its
> changes to higher lanes. Each higher lane checks if any of its source
> Architecture Register Numbers (ARN) is a dest ARN of a lower lane,
> and if so use that lower lane's new dest rename assignment.

all the information about which are reads and which writes being easy to decode

> Lane-1 checks if its source ARN's equal lane-0 dest ARN,
> Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
> The gate cost is 1+2+3... = (N^2+N)/2 = O(N^2) in the number of lanes,
> and the max propagation delay is serial across the rename lanes.

ahh... because it's transitive, it's serial?

instruction 2 depends on instruction 1
instruction 3 depends on 2 and 1
....
....
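The brute-force scheme quoted above can be modelled in a few lines. This is a hedged sketch, not EricP's design: the function name `rename_lanes`, the free-list handling, and the (dest, src1, src2) tuple layout are assumptions, and the nested loops stand in for comparators that hardware would evaluate in parallel. The comparison counter is only there to show the quadratic growth.

```python
def rename_lanes(ops, rat, free_list):
    """Rename one batch of (dest, src1, src2) ARN tuples, one op per lane.

    Each lane compares its source ARNs with the dest ARN of every lower
    lane; the nearest matching lower lane supplies the new physical name,
    otherwise the rename table (rat) is read.
    """
    new_dst = list(free_list[: len(ops)])  # one fresh physical reg per lane
    compares = 0
    renamed = []
    for lane, (_dst, *srcs) in enumerate(ops):
        phys_srcs = []
        for src in srcs:
            phys = rat[src]  # default: value comes from before this batch
            for lower in range(lane):
                compares += 1
                if ops[lower][0] == src:
                    phys = new_dst[lower]  # later matches override earlier
            phys_srcs.append(phys)
        renamed.append((new_dst[lane], tuple(phys_srcs)))
    new_rat = dict(rat)
    for lane, (dst, *_rest) in enumerate(ops):
        new_rat[dst] = new_dst[lane]  # the last writer in the batch wins
    return renamed, new_rat, compares


# Four chained "ADD r0, r0, r1" ops in one batch (hypothetical registers).
ops = [("r0", "r0", "r1")] * 4
renamed, new_rat, compares = rename_lanes(
    ops, {"r0": "p0", "r1": "p1"}, ["p8", "p9", "p10", "p11"]
)
print(renamed, new_rat["r0"], compares)
```

With four lanes and two sources each, the model performs 2 * (0 + 1 + 2 + 3) = 12 comparisons, and each lane's first source resolves to the physical register allocated by the lane directly below it.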

> Assuming the instructions have two source and one dest ARN,
> and assuming the rename state table is an ARN-indexed SRAM,
> and assuming a classic design of ROB + commit Architecture Register File,
> then the rename table needs 3R1W ports for each rename lane
> plus write ports to track each result write and ARN commit.

which isn't massively expensive or unachievable.

> And bag full of logic ensures that the rename state table is updated
> with correct value for renaming multiple dest ARN's in the same clock.
> (Rename state table checkpoint and restore cost extra.)

awesome.

eric i really appreciate your insights. thank you.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u1hrjq$2mtdg$1@dont-email.me>

by: BGB - Sun, 16 Apr 2023 22:07 UTC

On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
> an interesting question has come up where i don't know
> if it's even possible (in reasonable gate propagation time
> given how large Dependency Matrices can get).
>
> the scenario is:
>
> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> * each instruction (for the sake of simplicity) is Scalar
> * the multi-issue width is 12 to 16
>

Possible: Probably...
But clock speed is gonna suck...

Not sure if/how OoO machines avoid the issue where increasing width
leads to massively expanding cost.

IOW: Why my BJX2 core is 3-wide, but why I effectively shelved ideas for
trying to go wider for the time being:
1=>2 or 2=>3 are arguably modest jumps;
But, 3=>4 or beyond, stuff starts to get a bit unreasonable.

As far as I know, there is no real good way to avoid the x^2 cost-curve
of widening the register file and similar.

> the desire here is to obviously fit *four* simultaneous
> copies of that loop into the same clock cycle, but to
> do so requires *THREE* Write-after-Write register
> renames in a row (four on the next clock cycle if one
> is still "in the system").
>
> my question is: is this even possible, without having to
> go to massive-wide register operations as a work-around?
>
> my current understanding of how WaW reg-rename
> works is that it would be perfectly possible (without
> huge gate cascades) to do at least one reg-renaming
> in the same clock cycle, but that only gets 50%
> utilisation.
>
> has anyone encountered or attempted this before, or
> have any good ideas? i know VVM auto-vectorisation
> would solve this because the LOOP construct marks
> the LD-COMPUTE-ST loop-counter, load, and store
> registers such that hazards can be dropped on those.
>
> any other ideas?
>

If you have a lot of registers, and leave all the scheduling up to the
compiler, one does not need register renaming.

Making the compiler "not suck" is difficult though.

I can often get pretty large speedups by writing things in ASM in my
case; contrast with some other ISA's where ASM doesn't accomplish much.

My past (small scale) experiments with trying to generate code for ARM
yielded atrociously bad performance (so, there is still some "secret
sauce" that seems to be missing in my case).

But, it could be that these are "two sides to the same coin".

I am not aware of any big "smoking gun" issue for why my code generators
suck so bad, seemingly it is all "death by 1000 paper cuts" style issues.

But, in other news:
Had worked on hacking my FM Synth module design to do PCM mixing;
Modified my MIDI backend to effectively do wavetable music synthesis (in
a hacky way), but theoretically could also now play S3M music and
similar in hardware (assuming my Verilog tweaks for this work, still
untested).

Ended up mostly going with 11025 Hz A-Law for the wavetable patches for
now, but seemingly this does affect the quality of the sound some
(better than 8kHz; 16kHz would have left the wavetable "a bit large").

In this case, ended up putting the wavetable in a package based on the
Doom WAD (IWAD) format; with "SND_xxxx" for the wavetable patches, and
"PATCHIDX" for a combined index of audio patches (as a single big table
covering all of the MIDI instruments). Design was a bit "quick and
dirty". The WAD lumps are then copied into "audio RAM".

Not entirely clear as of yet (from the API design sense) how it will
accommodate something like an S3M player. Same basic MIDI
command-structure would work, but would need a "good" way for the client
program to submit customized instrument patches.

Though, possibly would use a combined instrument number space, say:
0..127: Normal MIDI Instruments
128..255: MIDI Percussive Instruments
256+: User-defined.
With patches submitted as waveform data (likely WAVEFORMATEX + data),
and instruments as additional entries into the patch table. May need to
be tied to a context.
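The combined instrument number space described above could be decoded along these lines. This is only a sketch of the numbering as stated in the post; the function name `classify_instrument` and the returned labels are made up for illustration and are not part of any existing API.

```python
def classify_instrument(number):
    """Map a combined instrument number onto the three proposed ranges."""
    if number < 0:
        raise ValueError("instrument numbers are non-negative")
    if number <= 127:
        return "midi"        # 0..127: normal MIDI instruments
    if number <= 255:
        return "percussion"  # 128..255: MIDI percussive instruments
    return "user"            # 256+: user-defined patches


for n in (0, 127, 128, 255, 256):
    print(n, classify_instrument(n))
```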

The wavetable MIDI isn't entirely a win though vs the OPL-like FM
synthesis (the FM synthesis is "more authentic" to the original sound of
the Doom music and similar).

Another possible option would be allowing the program to submit
instruments in bulk form as a WAD image, but this would be a bit tacky
and require a program to compose a WAD in-RAM for something like an S3M
player.

Otherwise, was faced with an annoyance that seemingly hardly anyone
sells ~ 505nm LEDs in a 5mm form factor (465nm, 525nm, and combined
465nm+525nm LEDs would not work for what I want them for; but some
combined-color LEDs would be useful as well).

Mostly considering trying to do a "more empirical" test for whether
or not I can in fact see 505nm and 465nm+525nm as different colors...

Conventional wisdom is that they should look like the same color (as
opposed to two separate "cyan" and an "essence-of-gray" style colors).

( Was mostly looking on Amazon, but no one on there seems to be selling
the types of LEDs I would want for this... Would seemingly need to go
through the hassle of buying LEDs from DigiKey or similar... ).

Well, either that or I am gradually slipping into insanity, one of the
two...

Can mostly ignore 590nm and 570nm, as while these colors of LED are
seemingly more common, I don't seem to notice anything in these areas
which seems worth investigating (there are no "unusual" colors between
green and red, only up near the blue end of the color range where stuff
diverges).

But, even then, even if I do see it, it may not matter if no one else
sees the difference.

Like, a kind of seemingly "incredibly moot" type of insanity...

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<d448bd70-8620-43eb-bf53-0cde6d286f78n@googlegroups.com>


by: luke.l...@gmail.com - Sun, 16 Apr 2023 22:52 UTC

On Sunday, April 16, 2023 at 11:07:58 PM UTC+1, BGB wrote:
> On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
> > * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> > * each instruction (for the sake of simplicity) is Scalar
> > * the multi-issue width is 12 to 16
> >
> Possible: Probably...
> But clock speed is gonna suck...

one of the requirements is "not sucking" :)

> As far as I know, there is no real good way to avoid the x^2 cost-curve
> of widening the register file and similar.

striping. QTY 4 of separate regfiles, every 4th register has
direct access to every 4th regfile, and *indirect* access
via either a cyclic shift register or multiplexer bus.

issue may also be similarly striped, i saw a paper on
tomasulo algorithm reserving every modulo-4 ROB entry
for every modulo-4 instruction issued.
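
a minimal sketch of the striping idea, assuming register n lives in regfile n mod 4 and issue lane k is "home" to regfile k mod 4. the lane-to-bank assignment and the function names are assumptions for illustration, not something stated above.

```python
NUM_BANKS = 4  # "QTY 4 of separate regfiles" from the post

def bank_and_row(reg):
    """Return (bank, row) for an architectural register number."""
    return reg % NUM_BANKS, reg // NUM_BANKS

def access_kind(lane, reg):
    """'direct' if the lane's home bank holds reg, otherwise 'indirect'
    (i.e. via the cyclic shift register or multiplexer bus)."""
    bank, _ = bank_and_row(reg)
    return "direct" if lane % NUM_BANKS == bank else "indirect"
```

each bank then only needs ports for its own lanes plus the shared indirect path, rather than every lane having ports on one monolithic regfile.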

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<2023Apr17.092300@mips.complang.tuwien.ac.at>


by: Anton Ertl - Mon, 17 Apr 2023 07:23 UTC

"luke.l...@gmail.com" <luke.leighton@gmail.com> writes:
>On Sunday, April 16, 2023 at 8:03:36 PM UTC+1, EricP wrote:
>
>> One answer is brute force: each parallel Rename lane propagates its
>> changes to higher lanes. Each higher lane checks if any of its source
>> Architecture Register Numbers (ARN) is a dest ARN of a lower lane,
>> and if so use that lower lane's new dest rename assignment.
>
>all the information about which are reads and which writes being easy to decode
>
>> Lane-1 checks if its source ARN's equal lane-0 dest ARN,
>> Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
>> The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,
>> and the max propagation delay is serial across the rename lanes.
>
>ahh... because it's transitive, it's serial?

Only if you implement it so. You can also implement it as a tree
(O(ln N) time complexity):

overwrite(arn, instructions) returns (flag, prn)
{ if |instructions|=1 then // base case
    if instructions[0] writes to arn then
      return yes, instructions[0].prn
    else
      return no, dontcare
  else
    instructions1, instructions2 = split(instructions)
    flag1, prn1 = overwrite(arn,instructions1) || // parallel
    flag2, prn2 = overwrite(arn,instructions2)
    if flag2 then
      return flag2, prn2
    else
      return flag1, prn1
}
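
The pseudocode above can be rendered as a runnable software model. This is only a sketch: the Instr record (dest, prn) is a hypothetical stand-in for whatever the decoder delivers, and the recursion here runs sequentially, whereas in hardware the two halves would be evaluated in parallel to get the logarithmic depth.

```python
from collections import namedtuple

# Hypothetical record: dest is the ARN written (None if no write), prn its new name.
Instr = namedtuple("Instr", ["dest", "prn"])

def overwrite(arn, instructions):
    """Return (flag, prn): whether any instruction writes arn, and the
    physical name assigned by the youngest such writer."""
    if not instructions:  # lane 0 has no older instructions to check
        return (False, None)
    if len(instructions) == 1:  # base case
        ins = instructions[0]
        return (True, ins.prn) if ins.dest == arn else (False, None)
    mid = len(instructions) // 2
    # The two halves are independent of each other.
    flag1, prn1 = overwrite(arn, instructions[:mid])
    flag2, prn2 = overwrite(arn, instructions[mid:])
    # The later (younger) half wins when it also writes arn.
    return (flag2, prn2) if flag2 else (flag1, prn1)
```

To rename a source of the instruction in lane k, one would call overwrite(src_arn, group[:k]) and fall back to the rename table when the flag is false.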

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<8EX_L.453479$Lfzc.18286@fx36.iad>


by: EricP - Sun, 16 Apr 2023 19:02 UTC

luke.l...@gmail.com wrote:
> On Saturday, April 15, 2023 at 4:02:22 PM UTC+1, EricP wrote:
>
>> Does "three write-after-write register renames" mean
>> three concurrent back-to-back result forwarding's,
>> whereby results are forwarded from uOp to uOp triggering an
>> immediate wake-up & launch of dest uOps with no intervening clocks?
>
> a much simpler example:
>
> QTY 6 of loop-unrolled repetitions of "ADD r0, r0, 1" on an 8-wide
> multi-issue OoO system. these should all be renamed:
>
> 1st ADD reads r0, stores in r0.1 (temporary, Reservation Station)
> 2nd ADD reads r0.1, stores in r0.2
> ....
> 6th ADD reads r0.5, stores in r0
>
> but although the ADDs themselves should be pipeline-chained,
> actually getting them *into* the Reservation Stations at the
> Decode Phase (and moving on to the next batch of 8
> fetched instructions) should take *one* clock cycle, not 4 or 5.
>
> l.

One answer is brute force: each parallel Rename lane propagates its
changes to higher lanes. Each higher lane checks if any of its source
Architecture Register Numbers (ARN) is a dest ARN of a lower lane,
and if so use that lower lane's new dest rename assignment.

Lane-1 checks if its source ARN's equal lane-0 dest ARN,
Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,
and the max propagation delay is serial across the rename lanes.

Assuming the instructions have two source and one dest ARN,
and assuming the rename state table is an ARN-indexed SRAM,
and assuming a classic design of ROB + commit Architecture Register File,
then the rename table needs 3R1W ports for each rename lane
plus write ports to track each result write and ARN commit.
And a bag full of logic ensures that the rename state table is updated
with the correct value when renaming multiple dest ARNs in the same clock.
(Rename state table checkpoint and restore cost extra.)
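
A minimal software model of the brute-force scheme, applied to the six chained ADDs from the quoted example. It assumes every instruction has exactly one dest ARN, and the names rename_group, rename_table and free_list are hypothetical. The inner loop over lower lanes stands in for the per-lane comparators; in hardware they all operate in the same clock rather than one after another.

```python
def rename_group(group, rename_table, free_list):
    """Rename one fetch group in program order.

    group: list of (dest_arn, [src_arns]); rename_table: dict ARN -> PRN as it
    stood before the group; free_list: PRNs available for allocation.
    Returns (list of (dest_prn, [src_prns]), number of ARN comparisons made).
    """
    dest_prns = [free_list.pop(0) for _ in group]  # one new name per lane
    renamed = []
    comparisons = 0
    for lane, (dest, srcs) in enumerate(group):
        src_prns = []
        for src in srcs:
            prn = rename_table[src]  # default: state before this group
            for lower in range(lane):  # check every lower lane; highest match wins
                comparisons += 1
                if group[lower][0] == src:
                    prn = dest_prns[lower]
            src_prns.append(prn)
        renamed.append((dest_prns[lane], src_prns))
    for lane, (dest, _) in enumerate(group):  # last writer of each ARN wins
        rename_table[dest] = dest_prns[lane]
    return renamed, comparisons
```

For six "ADD r0, r0, 1" instructions with one register source each, every ADD reads the name produced by the one before it, and the comparison count is 0+1+2+3+4+5 = 15, which shows the quadratic growth with group width.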

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<gLX_L.1361239$gGD7.1324773@fx11.iad>


by: EricP - Sun, 16 Apr 2023 19:10 UTC

EricP wrote:
>
> Lane-1 checks if its source ARN's equal lane-0 dest ARN,
> Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
> The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,

= (N^2+N)/2 = O(N^2) the number of lanes

(the ^ got lost somewhere).

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<3738535a-03c7-4170-b821-9d27f5ae722bn@googlegroups.com>


by: Quadibloc - Tue, 18 Apr 2023 04:46 UTC

On Saturday, April 15, 2023 at 2:30:00 AM UTC-6, luke.l...@gmail.com wrote:

I am not nearly as knowledgeable about such matters
as some of those who have already replied.

However, I think that there would be ways in which this
could be made to work. Register renaming belongs with
instruction decoding, and thus can take place prior to
execution itself.

So you convert to micro-ops that include renaming, and
then do the loop on the cached micro-ops, and then
no cycles are required for rename on the second and subsequent
iterations of the loop.

The problem is, though, if you want to do an iteration in
one cycle, and floating-point arithmetic takes, say, 11 cycles
or so to finish, you're _not_ going to get by with just enough
register renames to let *four* sets of arithmetic operations
take place concurrently.

In an ISA that doesn't have Mitch Alsup's VVM, though, the
machine language to micro-op translator could still note
the existence of a loop, so with short loops, in a conventional
architecture, one _could_ do anything that VVM can do. This
sort of trickery is the sort of thing Mitch Alsup often talks
about as a possibility.

So if, instead of thinking in terms of register renames, one
frames the entire loop in terms of result forwarding, with no
actual register use at all, an optimal sequence of micro-ops
that involves no delays due to WAW hazards should indeed
be at least a theoretical possibility.

John Savard

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<bd0449ce-27f0-4566-96a2-2c168f5b90c2n@googlegroups.com>


by: luke.l...@gmail.com - Tue, 18 Apr 2023 07:49 UTC

On Tuesday, April 18, 2023 at 5:46:06 AM UTC+1, Quadibloc wrote:
> On Saturday, April 15, 2023 at 2:30:00 AM UTC-6, luke.l...@gmail.com wrote:
>
> > the desire here is to obviously fit *four* simultaneous
> > copies of that loop into the same clock cycle, but to
> > do so requires *THREE* Write-after-Write register
> > renames in a row (four on the next clock cycle if one
> > is still "in the system").
>
> I am not nearly as knowledgeable about such matters
> as some of those who have already replied.
>
> However, I think that there would be ways in which this
> could be made to work. Register renaming belongs with
> instruction decoding, and thus can take place prior to
> execution itself.

yes.

> So you convert to micro-ops that include renaming, and
> then do the loop on the cached micro-ops, and then
> no cycles are required for rename on the second and subsequent
> iterations of the loop.

this still leaves the method of detection which Eric kindly pointed
out is achievable (whew).

but yes for SVP64 the loop is already expressed (like x86 REP)
so actual micro-ops are ironically the scalar operations themselves.

thus in SVP64 even on single-issue one instruction may result in
say 32 back-end SIMD operations.

but that takes up 32 registers (64 if 2 operands, 96 if not an overwrite)
and that's what multi-issue fixes: you can use fewer register names and
have multi-issue put many more smaller blocks of processing into
RSes.

> The problem is, though, if you want to do an iteration in
> one cycle, and floating-point arithmetic takes, say, 11 cycles
> or so to finish, you're _not_ going to get by with just enough
> register renames to let *four* sets of arithmetic operations
> take place concurrently.

they need to get into Reservation Stations.
they don't have to complete immediately.
but if they are not in RSes on the same clock there
is no chance of achieving > 1 IPC.

> In an ISA that doesn't have Mitch Alsup's VVM, though, the
> machine language to micro-op translator could still note
> the existence of a loop, so with short loops, in a conventional
> architecture, one _could_ do anything that VVM can do. This
> sort of trickery is the sort of thing Mitch Alsup often talks
> about as a possibility.

SVP64 Looping expresses this possibility in reality but does
so explicitly with an actual instruction equivalent to x86 "REP"
and bringing z80 LDIR and CPIR capability.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<05b56213-50b5-4425-b99e-633ce7e3f712n@googlegroups.com>


by: MitchAlsup - Thu, 20 Apr 2023 21:35 UTC

On Saturday, April 15, 2023 at 3:30:00 AM UTC-5, luke.l...@gmail.com wrote:
> an interesting question has come up where i don't know
> if it's even possible (in reasonable gate propagation time
> given how large Dependency Matrices can get).
>
> the scenario is:
>
> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> * each instruction (for the sake of simplicity) is Scalar
> * the multi-issue width is 12 to 16
>
> the desire here is to obviously fit *four* simultaneous
> copies of that loop into the same clock cycle, but to
> do so requires *THREE* Write-after-Write register
> renames in a row (four on the next clock cycle if one
> is still "in the system").
>
> my question is: is this even possible, without having to
> go to massive-wide register operations as a work-around?
<
Luke:: use the force..........
<
A quick examination indicates this is not even a problem.
Just give each result a name (from the rename pool) {you
have to be able to do this anyway, so it falls out for free.}
<
As for the 3 that get elided, this is a trade-off in the
register rename pool. With a sufficiently big pool, do
nothing and the problem solves itself. With a more
conservative pool, you can detect this in the cycle
after decode {¿issue, insert, queue, sched, execute?}
and return the invisible registers back to the pool.
>
> my current understanding of how WaW reg-rename
> works is that it would be perfectly possible (without
> huge gate cascades) to do at least one reg-renaming
> in the same clock cycle, but that only gets 50%
> utilisation.
<
A) give each destination register a new name.
B) when you have time, return the elided writes to the
......pool.
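
A minimal sketch of steps A and B, under stated assumptions: the pool is modeled as a plain list, the function name is hypothetical, and "elided" is taken to mean a name whose architectural register is written again later in the same group. When such a name can really go back to the pool depends on its consumers having captured the value, which this sketch does not model.

```python
def rename_then_reclaim(dests, free_list):
    """dests: dest ARN of each instruction in the group, in program order.
    Returns (new names in program order, names that were overwritten later
    in the same group and are candidates to return to the pool)."""
    names = [free_list.pop(0) for _ in dests]               # A) new name per dest
    last_writer = {arn: i for i, arn in enumerate(dests)}   # youngest write per ARN
    elided = [names[i] for i, arn in enumerate(dests)       # B) found off the
              if last_writer[arn] != i]                     #    critical path
    return names, elided
```

With four loop copies writing the same register, this hands out four names and reports the first three as elided, matching "the 3 that get elided" above.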
>
> has anyone encountered or attempted this before, or
> have any good ideas? i know VVM auto-vectorisation
> would solve this because the LOOP construct marks
> the LD-COMPUTE-ST loop-counter, load, and store
> registers such that hazards can be dropped on those.
>
> any other ideas?
>
> l.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<a2d66a8f-8f05-4aa4-8adb-8536a829d231n@googlegroups.com>


by: MitchAlsup - Thu, 20 Apr 2023 21:37 UTC

On Sunday, April 16, 2023 at 2:11:12 PM UTC-5, EricP wrote:
> EricP wrote:
> >
> > Lane-1 checks if its source ARN's equal lane-0 dest ARN,
> > Lane-2 checks lane-0, lane-1, Lane-3 checks lane-0, lane-1, lane-2.
> > The gate cost is 1+2+3... = (N2+N)/2 = O(N2) the number of lanes,
> = (N^2+N)/2 = O(N^2) the number of lanes
>
> (the ^ got lost somewhere).
<
Yes, if you try to do it the hard way it is BigO( n^2 ).
<
If you get clever, the check logic is quadratic, but the gate logic
is linear.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u1tg4b$11p8h$1@dont-email.me>


by: BGB - Fri, 21 Apr 2023 08:05 UTC

On 4/16/2023 5:07 PM, BGB wrote:
> On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
>> an interesting question has come up where i don't know
>> if it's even possible (in reasonable gate propagation time
>> given how large Dependency Matrices can get).
>>
>> the scenario is:
>>
>> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
>> * each instruction (for the sake of simplicity) is Scalar
>> * the multi-issue width is 12 to 16
>>
>
> Possible: Probably...
> But clock speed is gonna suck...
>
> Not sure if/how OoO machines avoid the issue where increasing width
> leads to massively expanding cost.
>
>
> IOW: Why my BJX2 core is 3-wide, but why I effectively shelved ideas for
> trying to go wider for the time being:
> 1=>2 or 2=>3 are arguably modest jumps;
> But, 3=>4 or beyond, stuff starts to get a bit unreasonable.
>
> As far as I know, there is no real good way to avoid the x^2 cost-curve
> of widening the register file and similar.
>
>
>> the desire here is to obviously fit *four* simultaneous
>> copies of that loop into the same clock cycle, but to
>> do so requires *THREE* Write-after-Write register
>> renames in a row (four on the next clock cycle if one
>> is still "in the system").
>>
>> my question is: is this even possible, without having to
>> go to massive-wide register operations as a work-around?
>>
>> my current understanding of how WaW reg-rename
>> works is that it would be perfectly possible (without
>> huge gate cascades) to do at least one reg-renaming
>> in the same clock cycle, but that only gets 50%
>> utilisation.
>>
>> has anyone encountered or attempted this before, or
>> have any good ideas? i know VVM auto-vectorisation
>> would solve this because the LOOP construct marks
>> the LD-COMPUTE-ST loop-counter, load, and store
>> registers such that hazards can be dropped on those.
>>
>> any other ideas?
>>
>
> If you have a lot of registers, and leave all the scheduling up to the
> compiler, one does not need register renaming.
>
> Making the compiler "not suck" is difficult though.
>
>
> I can often get pretty large speedups by writing things in ASM in my
> case; contrast with some other ISA's where ASM doesn't accomplish much.
>
> My past (small scale) experiments with trying to generate code for ARM
> yielded atrociously bad performance (so, there is still some "secret
> sauce" that seems to be missing in my case).
>
> But, it could be that these are "two sides to the same coin".
>
>
> I am not aware of any big "smoking gun" issue for why my code generators
> suck so bad, seemingly it is all "death by 1000 paper cuts" style issues.
>

My compiler still sucks...

>
>
> But, in other news:
> Had worked on hacking my FM Synth module design to do PCM mixing;
> Modified my MIDI backend to effectively do wavetable music synthesis (in
> a hacky way), but theoretically could also now play S3M music and
> similar in hardware (assuming my Verilog tweaks for this work, still
> untested).
>
> Ended up mostly going with 11025 Hz A-Law for the wavetable patches for
> now, but seemingly this does affect the quality of the sound some
> (better than 8kHz; 16kHz would have left the wavetable "a bit large").
>
> In this case, ended up putting the wavetable in a package based on the
> Doom WAD (IWAD) format; with "SND_xxxx" for the wavetable patches, and
> "PATCHIDX" for a combined index of audio patches (as a single big table
> covering all of the MIDI instruments). Design was a bit "quick and
> dirty". The WAD lumps are then copied into "audio RAM".
>
>
> Not entirely clear as of yet (from the API design sense) how it will
> accommodate something like an S3M player. Same basic MIDI
> command-structure would work, but would need a "good" way for the client
> program to submit customized instrument patches.
>
> Though, possibly would use a combined instrument number space, say:
> 0..127: Normal MIDI Instruments
> 128..255: MIDI Percussive Instruments
> 256+: User-defined.
> With patches submitted as waveform data (likely WAVEFORMATEX + data),
> and instruments as additional entries into the patch table. May need to
> be tied to a context.
>
> The wavetable MIDI isn't entirely a win though vs the OPL-like FM
> synthesis (the FM synthesis is "more authentic" to the original sound of
> the Doom music and similar).
>
> Another possible option would be allowing the program to submit
> instruments in bulk form as a WAD image, but this would be a bit tacky
> and require a program to compose a WAD in-RAM for something like an S3M
> player.
>

API issues not resolved.

Did end up remaking a "quick and dirty" set of replacement patches to
hopefully avoid copyright issues from using patches that were apparently
derived from roughly 30-year-old "Gravis UltraSound" patches.

My improvised replacement set doesn't exactly sound "good", but is
basically usable as a "proof of concept" for testing the wavetable approach.

Have also ended up adding some helper instructions for A-Law encoding,
in the form of a 4x Binary16 to 4x A-Law converter instruction (reducing
the computational cost vs doing it using normal integer math),
effectively using the SIMD converter ops and FPU for most of the
heavy-lifting.

This was intended to reduce the CPU cost of sending PCM audio data to
the sound device (which accepts audio data in A-Law form rather than PCM).
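
For reference, a sketch of the continuous A-Law companding curve (A = 87.6) on samples normalized to [-1, 1]. This only illustrates the shape of the curve; it does not reproduce the 8-bit A-Law code format or the Binary16 converter instruction described above, and the function names are made up.

```python
import math

A = 87.6  # standard A-law compression parameter

def alaw_compress(x):
    """Compress x in [-1, 1] with the continuous A-law curve; result in [-1, 1]."""
    ax = min(abs(x), 1.0)
    if ax < 1.0 / A:
        y = A * ax / (1.0 + math.log(A))                    # linear segment near zero
    else:
        y = (1.0 + math.log(A * ax)) / (1.0 + math.log(A))  # logarithmic segment
    return math.copysign(y, x)

def alaw_expand(y):
    """Inverse of alaw_compress."""
    ay = min(abs(y), 1.0)
    if ay < 1.0 / (1.0 + math.log(A)):
        x = ay * (1.0 + math.log(A)) / A
    else:
        x = math.exp(ay * (1.0 + math.log(A)) - 1.0) / A
    return math.copysign(x, y)
```

Quiet samples get boosted before quantization, which is why 8 bits of A-Law hold up better than 8 bits of linear PCM for audio.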

Also went and added a few color-cell ops as well:

* RGB5MINMAX Rm, Rn
Takes 4x RGB555 colors, and selects the minimum and maximum colors
according to luma.

* RGB5CCENC Rm, Ro, Rn
Classifies each RGB555 pixel from Rm according to the Min and Max given
in Ro, adding the selector output bits to Rn (while also shifting Rn
right by 8 bits).

These could be used to hopefully make 640x400 mode more practical by
speeding up the color-cell encoding process. Though, there are limits,
as redrawing a 640x400 framebuffer, etc, will still take a significant
amount of memory bandwidth.

But, if I could do a 640x400 "GUI" with a redraw speed of more than a
few frames per second, that would be good (still not entirely sure how
early 90s PCs made 640x480 practical; as they would have likely faced
similar performance constraints).

Ended up basically handling luma by bit-shuffling, say:
G4 G3 R4 B4 - G2 R3 B3 G1 - R2 B2 G0 R1

Mostly because bit-shuffling and comparing a slightly larger value has
less latency than trying to "actually calculate the luma", and still
gives a similar result for relative comparisons.
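
A minimal software sketch of the shuffled-key idea follows. The RGB555 layout (R in bits 14..10, G in 9..5, B in 4..0), the names `luma_key` and `rgb5_minmax`, and the use of a plain list for the four pixels are assumptions for illustration; the real instructions' register packing is not modeled.

```python
# Bit order of the comparison key as given above, most significant first.
KEY_ORDER = ["G4", "G3", "R4", "B4", "G2", "R3", "B3", "G1", "R2", "B2", "G0", "R1"]

def rgb555_fields(px):
    """Split an RGB555 pixel into channels (assumed layout: R 14..10, G 9..5, B 4..0)."""
    return {"R": (px >> 10) & 0x1F, "G": (px >> 5) & 0x1F, "B": px & 0x1F}

def luma_key(px):
    """Build the 12-bit pseudo-luma key by interleaving channel bits."""
    chan = rgb555_fields(px)
    key = 0
    for name in KEY_ORDER:
        bit = (chan[name[0]] >> int(name[1])) & 1
        key = (key << 1) | bit
    return key

def rgb5_minmax(pixels):
    """Return (darkest, brightest) of the given pixels according to the key."""
    return min(pixels, key=luma_key), max(pixels, key=luma_key)

if __name__ == "__main__":
    cell = [0x7FFF, 0x0000, 0x03E0, 0x001F]   # white, black, green, blue
    print([hex(luma_key(p)) for p in cell])
    print([hex(p) for p in rgb5_minmax(cell)])
```

Because green bits sit above red bits, which sit above blue bits at each level, pure green compares brighter than pure red, and pure red brighter than pure blue, which is the ordering a real luma calculation would also give.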

>
>
>
> Otherwise, was faced with an annoyance that seemingly hardly anyone
> sells ~ 505nm LEDs in a 5mm form factor (465nm, 525nm, and combined
> 465nm+525nm LEDs would not work for what I want them for; but some
> combined-color LEDs would be useful as well).
>
> Mostly consider wanting to try to do a "more empirical" test for whether
> or not I can in-fact see 505nm and 465nm+525nm as different colors...
>
> Conventional wisdom is that they should look like the same color (as
> opposed to two separate "cyan" and an "essence-of-gray" style colors).
>

After getting a few LEDs, initial results were not what was expected:
505nm and 525nm were "almost" the same color, both "obviously green".
The 505nm LEDs were a slightly brighter green than the 525nm LEDs.
But, both appeared green to me, and nearly the same color.
However, 465nm was a match for the mystery color.
The 465nm LEDs were sold as "blue", but they are not blue...
They are a color that people often call blue, but it is not blue.

I have now ordered some different 465nm LEDs (for contrast), some 400nm
LEDs, and some 440nm LEDs.

Claimed descriptions were:
465nm, "blue" but not really (it is the imposter...);
440nm, "royal blue" or "pink" (WTF? *1);
400nm, UV / invisible(?) (*2).

*1: Not sure how this works, when "pink" is not even a color on the rainbow.

More so, as apparently 440nm is what is used for the blue in monitors,
and presumably they would not have done this if it were "pink".

In any case, whatever monitors use seems to match the "higher" blue,
and not the "blue" LEDs. I am not sure exactly what is going on (it is
possible the LEDs were not really 465nm, so I ordered some different
465nm blue LEDs in case that was the issue).

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<2a3dbdc4-bab7-463a-aeb8-0d6733d60102n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31759&group=comp.arch#31759

by: luke.l...@gmail.com - Fri, 21 Apr 2023 17:36 UTC

On Thursday, April 20, 2023 at 10:36:01 PM UTC+1, MitchAlsup wrote:

> Luke:: use the force..........

you can imagine what happened at stonyhurst boarding school
when Star Wars first came out...

> As for the 3 that get elided, this is a trade-off in the
> register rename pool. With a sufficiently big pool, do
> nothing and the problem solves itself. With a more
> conservative pool, you can detect this in the cycle
> after decode {¿issue, insert, queue, sched, execute?}
> and return the invisible registers back to the pool.

slight change of topic for a moment..

i remember you said a couple of weeks back that you managed to eliminate
multi-ported CAMs by aligning ROB numbers with Scoreboard-esque
Function Unit numbers. does that basically... well, the 1:1 relationship
eliminates even the need for the CAM, but in essence the number
of ROB row entries *equals* the number of Reservation Stations.
is that right?

if so, i can then foresee a new spanner-in-the-works: the allocation
of ROB entries in Tomasulo is an incremental affair, rotating round
robin, but if ROBs are tied to RSes then finding "first free RS" is a bit
more complex, needing to skip over multiple entries based on
Function Unit type.

i don't think however what you and Eric have kindly outlined would
be impacted by the above, i.e. Multi-Issue Tomasulo *and* reg-renaming
should work perfectly well?

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<571deecb-b242-435f-9de7-d393c5248b69n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31760&group=comp.arch#31760

by: MitchAlsup - Fri, 21 Apr 2023 18:19 UTC

On Friday, April 21, 2023 at 12:36:37 PM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, April 20, 2023 at 10:36:01 PM UTC+1, MitchAlsup wrote:
>
> > Luke:: use the force..........
>
> you can imagine what happened at stonyhurst boarding school
> when Star Wars first came out...
> > As for the 3 that get elided, this is a trade-off in the
> > register rename pool. With a sufficiently big pool, do
> > nothing and the problem solves itself. With a more
> > conservative pool, you can detect this in the cycle
> > after decode {¿issue, insert, queue, sched, execute?}
> > and return the invisible registers back to the pool.
<
> slight change of topic for a moment..
>
> i remember you said a couple weeks back you managed to eliminate
> multi-ported CAMs by aligning ROB numbers with Scoreboard-esque
> Function Unit numbers: does that basically well, the 1:1 relationship
> eliminates even the need for the CAM, but in essence the number
> of ROB row entries *equals* the number of Reservation Stations
> in effect, is that right?
<
If the operand waiting entry in the RS knows which function unit delivers
the result, then you place a multiplexer on the tag busses, and now, this
RS-operand entry only watches 1 result tag bus.
<
So the number of tag busses remains the same, but the number of CAM
comparators is 1 per operand.
<
You can go farther:
<
If the RS operand entry knows which FU and which number, then the
comparisons can be done with an AND gate--just like in a dependency
matrix; eliminating the CAM.
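
The three steps can be modeled in software roughly as below. This is a behavioral sketch only, with assumed names (`wakeup_cam`, `wakeup_muxed`, `wakeup_onehot`, `SLOTS_PER_FU`) and an assumed tag encoding of (FU index, slot number); it says nothing about timing or area.

```python
SLOTS_PER_FU = 4  # assumed number of in-flight results per function unit

# buses[fu] is None (idle) or the (fu, slot) tag that FU broadcasts this cycle.

def wakeup_cam(waiting, buses):
    """Full CAM: compare the awaited tag against every result tag bus."""
    return any(bus == waiting for bus in buses)

def wakeup_muxed(waiting, buses):
    """Select the one bus of the producing FU, then use a single comparator."""
    fu, slot = waiting
    selected = buses[fu]
    return selected is not None and selected[1] == slot

def onehot(fu, slot):
    """One bit per (FU, slot) pair, as in a dependency matrix row."""
    return 1 << (fu * SLOTS_PER_FU + slot)

def wakeup_onehot(waiting_mask, buses):
    """No comparator: AND the stored one-hot mask with the 'done' lines."""
    done_mask = 0
    for bus in buses:
        if bus is not None:
            done_mask |= onehot(*bus)
    return (waiting_mask & done_mask) != 0

if __name__ == "__main__":
    buses = [None, (1, 2), (2, 0)]
    waiting = (1, 2)
    print(wakeup_cam(waiting, buses),
          wakeup_muxed(waiting, buses),
          wakeup_onehot(onehot(*waiting), buses))
```

All three give the same answer; what changes is the work per waiting operand: one comparison per bus, one comparison total, or a single AND.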
>
> if so, i can then foresee a new spanner-in-the-works: the allocation
> of ROB entries in Tomasulo is an incremental affair, rotating round
> robin, but if ROBs are tied to RSes then finding "first free RS" is a bit
> more complex, needing to skip over multiple entries based on
> Function Unit type.
<
Let me explain it in terms of a physical register file::
<
given a number of function units and a number of physical registers,
consider dividing the physical registers into the number of pools of
the function units. Now, each function unit has a small pool of rename
registers. So, once you decide the FU, you have a rename register sitting
there waiting to be consumed. Presto--complete parallelism.
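
A minimal sketch of that partitioning, assuming equal-sized pools, at most one instruction per function unit per cycle, and made-up names (`make_pools`, `rename_group`, the FU labels):

```python
from collections import deque

def make_pools(num_phys_regs, fu_names):
    """Split the physical register numbers into one free list per function unit."""
    per_fu = num_phys_regs // len(fu_names)
    return {
        fu: deque(range(i * per_fu, (i + 1) * per_fu))
        for i, fu in enumerate(fu_names)
    }

def rename_group(pools, fu_choices):
    """Give each instruction in an issue group a destination register.

    Each allocation touches only its own FU's pool, so there is no chain
    from one instruction's allocation to the next.
    """
    if len(set(fu_choices)) != len(fu_choices):
        raise ValueError("sketch assumes one instruction per FU per cycle")
    return [pools[fu].popleft() for fu in fu_choices]

if __name__ == "__main__":
    pools = make_pools(16, ["alu0", "alu1", "mem", "br"])
    print(rename_group(pools, ["mem", "alu0"]))   # registers from two pools
    print(rename_group(pools, ["mem"]))           # next free register in "mem"
```

Returning a register to the pool it came from (on retire or on elision) is left out here.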
>
> i don't think however what you and Eric have kindly outlined would
> be impacted by the above, i.e. Multi-Issue Tomasulo *and* reg-renaming
> should work perfectly well?
<
I had no problem with it in 1991......
>
> l.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u1ulg6$2rrai$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=31761&group=comp.arch#31761

by: BGB - Fri, 21 Apr 2023 18:43 UTC

On 4/16/2023 5:52 PM, luke.l...@gmail.com wrote:
> On Sunday, April 16, 2023 at 11:07:58 PM UTC+1, BGB wrote:
>> On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
>>> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
>>> * each instruction (for the sake of simplicity) is Scalar
>>> * the multi-issue width is 12 to 16
>>>
>> Possible: Probably...
>> But clock speed is gonna suck...
>
> one of the requirements is "not sucking" :)
>

Not sucking is hard.

In my case, I am sorta managing 50 MHz.

A few of the "big shiny RISC-V cores", ironically, have a hard time
being clocked much over 25 or 30 MHz...

Apparently similar for some of the ARM on FPGA cores as well...

Getting 100 MHz is harder though, as then one is basically on the edge
of things like the latency of adding two numbers; and the general
inability to have single-cycle access in many cases to block-RAM arrays.

reg[127:0] arrA[63:0]; //goes in LUTRAM, it is fine...
reg[127:0] arrB[511:0]; //Oh Crap...
reg[127:0] arrC[8191:0]; //Oh Crap...

So, at 50MHz, both arrA and arrB can be "easily" accessed in a single
cycle, but at 100MHz, only arrA works.

Whereas arrC will generally need 2 or 3 cycles (works OK for L2 cache,
not so good for L1).
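
For scale, the storage behind the three declarations works out as follows (plain arithmetic on the widths and depths above; the helper name `size_kib` is made up):

```python
# (width in bits, depth in entries) for each array declared above.
ARRAYS = {
    "arrA": (128, 64),
    "arrB": (128, 512),
    "arrC": (128, 8192),
}

def size_kib(width_bits, depth):
    """Total storage of a width x depth array, in KiB."""
    return width_bits * depth / 8 / 1024

if __name__ == "__main__":
    for name, (width, depth) in ARRAYS.items():
        print(f"{name}: {size_kib(width, depth):g} KiB")
```

That is 1 KiB, 8 KiB, and 128 KiB respectively.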

In my early testing, the advantage of a bigger L1 cache more than offset
the "loss" in terms of needing to have a lower clock speed.

Hence why in my case, I ended up with a big 3-wide 64-bit core running
at 50MHz, rather than a 1-wide 32-bit core running at 100 MHz.

The 1-wide core might have run Doom better, in theory, apart from the
issue that 2K L1 caches would have killed performance...

Well, either that, or have a longer pipeline with 4 or 5 cycle
memory-access instructions...

Also ironically, with the Spartan and Artix, the clock-speed has a
slight negative correlation with FPGA size at the same speed grade (so,
say, the XC7A100T is slightly slower than the XC7S50, and the XC7A200T
is slightly slower than the XC7A100T).

But, one can fit a whole lot more into an XC7A100T or XC7A200T than in
an XC7S50, so it is mostly a worthwhile tradeoff.

....

>> As far as I know, there is no real good way to avoid the x^2 cost-curve
>> of widening the register file and similar.
>
> striping. QTY 4 of separate regfiles, every 4th register has
> direct access to every 4th regfile, and *indirect* access
> via either a cyclic shift register or multiplexer bus.
>
> issue may also be similarly striped, i saw a paper on
> tomasulo algorithm reserving every modulo-4 ROB entry
> for every modulo-4 instruction issued.
>

I guess maybe this can work for register renaming...
For a more conventional ISA design, this would suck.

When I was designing WEX6W, one idea was to try to split the regfile
into 2x 6R4W, rather than have a monolithic 12R6W register file.

In this case, operations would have been organized (for 4/5/6 bundles):
OP3B | OP3A | OP2B | OP2A | OP1B | OP1A

With A having an affinity for R0..R31 and B having an affinity for
R32..R63, with a penalty for using a register outside the set (the
operations would need to trigger an interlock rather than forwarding).

Bundles would also only be able to handle a single store (per A or B) to
a register outside of the corresponding A|B set.

I ended up mostly not doing this, as cost and complexity were still
getting absurd even without the full forwarding.

Also there was a considered SMT mode that would have split the register
file into 2x 32 registers (so XGPR would not have been usable in SMT
mode). This idea was also dropped.

Even if it were implemented, unclear how it could have been used.

My compiler can't even make effective use of 2 and 3 wide bundles.
My ASM code can mostly make use of 2-wide bundles, but I mostly ended up
writing 2-wide ASM because 3-wide ASM is much more of a pain to write
and mentally manage the instruction scheduling (more so as 3-wide is
much more subject to "which instructions are allowed in which
combinations" issues).

Much wider would likely be "nearly impossible".

As noted, BGBCC mostly uses a strategy like:
Emit code like it was a "Plain Ol' RISC";
Shuffle instructions after the fact;
Bundle any instructions which have ended up in the right combinations.

Though, this strategy only really tends to get an average bundle length
of around 1.20 to 1.35 instructions, which is "kinda suck"...
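
The final "bundle whatever lines up" step can be sketched as a greedy pass like the one below. This is not BGBCC's code: the instruction form `(dest, sources)`, the names `clashes` and `bundle`, and the conservative hazard rules are assumptions, and the per-lane "which instructions are allowed in which combinations" rules are ignored.

```python
def clashes(earlier, later):
    """True if 'later' cannot share a bundle with 'earlier' (conservative)."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    raw = e_dest is not None and e_dest in l_srcs    # later reads earlier's result
    waw = e_dest is not None and e_dest == l_dest    # both write the same register
    war = l_dest is not None and l_dest in e_srcs    # later overwrites an input
    return raw or waw or war

def bundle(instrs, max_width=3):
    """Greedily pack already-ordered instructions into bundles."""
    bundles, current = [], []
    for ins in instrs:
        blocked = any(clashes(prev, ins) for prev in current)
        if current and (blocked or len(current) >= max_width):
            bundles.append(current)
            current = []
        current.append(ins)
    if current:
        bundles.append(current)
    return bundles

if __name__ == "__main__":
    prog = [
        ("r1", ("r10",)),        # load  r1 <- [r10]
        ("r4", ("r5", "r6")),    # add   r4 <- r5, r6 (independent)
        ("r2", ("r1", "r3")),    # add   r2 <- r1, r3 (needs the load)
        (None, ("r2", "r11")),   # store [r11] <- r2
    ]
    packed = bundle(prog)
    print([len(b) for b in packed], len(prog) / len(packed))
```

With a dependent chain like this, only the one independent instruction pairs up, so the average bundle length stays close to 1 regardless of the maximum width.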

It is also painfully bad at the whole "instruction scheduling" thing, so
ends up basically paying much of the interlock penalty for nearly every
multi-cycle instruction (at least short of manually organizing things in
the C code to try to avoid this; and writing C that kinda resembles ASM).

....

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<9db772b3-faf3-4154-a524-3e76b9ef42d9n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=31765&group=comp.arch#31765

by: robf...@gmail.com - Sat, 22 Apr 2023 08:49 UTC

On Friday, April 21, 2023 at 2:43:22 PM UTC-4, BGB wrote:
> On 4/16/2023 5:52 PM, luke.l...@gmail.com wrote:
> > On Sunday, April 16, 2023 at 11:07:58 PM UTC+1, BGB wrote:
> >> On 4/15/2023 3:29 AM, luke.l...@gmail.com wrote:
> >>> * a very small loop (LD-COMPUTE-ST-BRcond) of 4 ops
> >>> * each instruction (for the sake of simplicity) is Scalar
> >>> * the multi-issue width is 12 to 16
> >>>
> >> Possible: Probably...
> >> But clock speed is gonna suck...
> >
> > one of the requirements is "not sucking" :)
> >
> Not sucking is hard.
>
> In my case, I am sorta managing 50 MHz.
>
> A few of the "big shiny RISC-V cores", ironically, have a hard time
> being clocked much over 25 or 30 MHz...
>
> Apparently similar for some of the ARM on FPGA cores as well...
>
>
> ...
I have not had much luck getting a superscalar to clock over 20 MHz in
an FPGA. One issue is the size of the core slows things down. Even at
20 MHz though performance is like that of a 40 to 50 MHz scalar core.
One nice thing about a lower clock speed is that there are fewer issues
with some of the more complex instructions, because more time is
available per clock. I got an in-order PowerPC clone to work a little
faster than 20 MHz.

Even though it is slow, an FPGA is great for experimenting with things like
register renaming.
One issue with the FPGA is that it handles non-clocked (asynchronous) logic
poorly, so logic that relies on a circuit settling in a loop cannot be
done very well in an FPGA.

For my most recent core I am starting with a simple scalar
sequential machine. It should be able to run at 40 MHz+, but likely
taking 3 or more clocks per instruction. I have not been able to
isolate a signal that is missing and causing about half the core
to be omitted when built.

I think getting a VLIW machine working would be very difficult
to do. Getting the compiler to generate code making use of the
machine would be a significant part of it. RISC with a lot of registers
makes it easier on the compiler.

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u21kfl$3eki2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=31770&group=comp.arch#31770

by: BGB - Sat, 22 Apr 2023 21:44 UTC

On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
> I have not had much luck getting a superscalar to clock over 20 MHz in
> an FPGA. One issue is the size of the core slows things down. Even at
> 20 MHz though performance is like that of a 40 to 50 MHz scalar core.
> One nice thing about a lower clock speed is that there are fewer issues
> with some of the more complex instructions, because more time is
> available per clock. I got an in-order PowerPC clone to work a little
> faster than 20 MHz.
>

Yeah.

There are a few reasons I went with VLIW rather than superscalar...
Both VLIW and an in-order superscalar require similar logic from the
compiler in order to be used efficiently, and the main cost of VLIW here
reduces to losing 1 bit of instruction entropy and some additional logic
in the compiler to detect if/when instructions can run in parallel.

Superscalar effectively requires a big glob of pattern recognition early
in the pipeline, which seems like a roadblock.

Had considered support for superscalar RISC-V by using a lookup to
classify instructions as "valid prefix" and "valid suffix" and also
logic to check for register clashes, and then behaving as if there were
a "virtual WEX bit" based on this.
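
A rough sketch of that pairing check, assuming a simple tuple form `(mnemonic, rd, sources)` for decoded instructions; the contents of the two lookup sets and the name `virtual_wex` are placeholders, not the actual BJX2 rules:

```python
# Hypothetical classification tables; a real design would derive these
# from which pipeline lane can execute which instruction class.
VALID_PREFIX = {"add", "sub", "and", "or", "xor"}
VALID_SUFFIX = {"add", "sub", "and", "or", "xor", "lw", "sw"}

def virtual_wex(first, second):
    """True if 'first' may issue in parallel with the following 'second'."""
    op1, rd1, _srcs1 = first
    op2, rd2, srcs2 = second
    if op1 not in VALID_PREFIX or op2 not in VALID_SUFFIX:
        return False
    if rd1 == "x0":
        rd1 = None                      # writes to x0 create no dependency
    if rd1 is not None and (rd1 in srcs2 or rd1 == rd2):
        return False                    # register clash between the pair
    return True

if __name__ == "__main__":
    a = ("add", "x5", ("x6", "x7"))
    print(virtual_wex(a, ("lw", "x8", ("x9",))))        # independent pair
    print(virtual_wex(a, ("add", "x8", ("x5", "x9"))))  # second reads x5
```

The result plays the role of the WEX bit that an explicitly bundled encoding would carry in the instruction word.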

Hadn't got around to finishing this. RISC-V support on the BJX2 core is
still mostly untested, and will still be limited to in-order operation
for the time being (and, the design of superscalar mechanism would only
be able to give 2-wide operation for 32-bit instructions only).

Running POWER ISA code on the BJX2 pipeline would be a bit more of a
stretch though (I added RISC-V as it was already pretty close to a
direct subset of BJX2 at that point).

Things like "modulo scheduling" could in theory help with a VLIW
machine, but in my experience modulo-scheduling could likely also help
with some big OoO machines as well (faking it manually in the C code
being a moderately effective optimization strategy on x86-64 machines as
well).

Apparently, clang supports this optimization, but this sort of thing is
currently a bit out of scope of what I can currently manage in BGBCC.

Though, have observed that this strategy seems to be counter-productive
on ARM machines (where it seems to often be faster to not try to
manually modulo-schedule the loops). Though, this may depend on the ARM
core (possibly an OoO ARM core might fare better; most of the ones I had
tested on had been in-order superscalar).

Though, I wouldn't expect there to be all that huge of a difference
between AArch64 and BJX2 on this front, where manual modulo-scheduling
is generally effective on BJX2.

But, as noted in my case on BJX2, latency is sort of like:
1-cycle:
Basic converter ops, like sign and zero extension;
MOV reg/reg, imm/reg, ...
...
2-cycle:
Most ALU ops (ADD/SUB/CMPxx/etc);
Some ALU ops could be made 1-cycle, but "worth the cost?".
More complex converter-class instructions ('CONV2').
Many of the FPU and SIMD format converters go here.
3-cycle:
MUL (32-bit only);
"low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
Memory Loads;
The newer RGB5MINMAX and RGB5CCENC instructions;
...


Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<9240105f-fe55-474e-8560-28bc3c298ef6n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=31771&group=comp.arch#31771


X-Received: by 2002:a05:622a:1793:b0:3e3:8172:ff21 with SMTP id s19-20020a05622a179300b003e38172ff21mr3130007qtk.8.1682205092049;
Sat, 22 Apr 2023 16:11:32 -0700 (PDT)
X-Received: by 2002:aca:ef44:0:b0:38d:ef6d:91ee with SMTP id
n65-20020acaef44000000b0038def6d91eemr2609821oih.10.1682205091815; Sat, 22
Apr 2023 16:11:31 -0700 (PDT)
Path: i2pn2.org!rocksolid2!i2pn.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 22 Apr 2023 16:11:31 -0700 (PDT)
In-Reply-To: <u21kfl$3eki2$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:f5f2:17bc:f70f:1190;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:f5f2:17bc:f70f:1190
References: <72173b50-2325-4746-8628-d9cd0f24680an@googlegroups.com>
<u1hrjq$2mtdg$1@dont-email.me> <d448bd70-8620-43eb-bf53-0cde6d286f78n@googlegroups.com>
<u1ulg6$2rrai$1@dont-email.me> <9db772b3-faf3-4154-a524-3e76b9ef42d9n@googlegroups.com>
<u21kfl$3eki2$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9240105f-fe55-474e-8560-28bc3c298ef6n@googlegroups.com>
Subject: Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 22 Apr 2023 23:11:32 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 8132

by: MitchAlsup - Sat, 22 Apr 2023 23:11 UTC

On Saturday, April 22, 2023 at 4:44:25 PM UTC-5, BGB wrote:
> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:

> Yeah.
>
> There are a few reasons I went with VLIW rather than superscalar...
> Both VLIW and an in-order superscalar require similar logic from the
> compiler in order to be used efficiently, and the main cost of VLIW here
> reduces to losing 1 bit of instruction entropy and some additional logic
> in the compiler to detect if/when instructions can run in parallel.
>
> Superscalar effectively requires a big glob of pattern recognition early
> in the pipeline, which seems like a roadblock.
<
It is "such a massive roadblock" that my ISA has to look at <gasp> all
of 6-bits to determine where an instruction gets routed (its function unit)
in unary, one gate later you know if a conflict is present.
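
A small sketch of what "decode to unary, then one gate" can look like, written in Python for readability. The opcode-to-unit mapping and all names below are placeholders (the post does not give the actual assignment), and taking the six bits from the top of a 32-bit word is an assumption.

```python
# HYPOTHETICAL function-unit numbering and opcode assignment.
FU_ALU, FU_MEM, FU_BRANCH, FU_FPU = range(4)
FU_OF_OPCODE = {
    0b000001: FU_ALU,
    0b000010: FU_ALU,
    0b100000: FU_MEM,
    0b110000: FU_BRANCH,
    0b010000: FU_FPU,
}

def major_opcode(insn):
    """Assumption: the 6 bits examined are bits 31:26 of the instruction."""
    return (insn >> 26) & 0x3F

def fu_onehot(insn):
    """Unary (one-hot) encoding of the target function unit.
    Unknown opcodes raise KeyError in this sketch."""
    return 1 << FU_OF_OPCODE[major_opcode(insn)]

def conflicts(insn_a, insn_b):
    """With one-hot routing, a structural conflict is a bitwise AND."""
    return (fu_onehot(insn_a) & fu_onehot(insn_b)) != 0
```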
>
> Had considered support for superscalar RISC-V by using a lookup to
> classify instructions as "valid prefix" and "valid suffix" and also
> logic to check for register clashes, and then behaving as if there were
> a "virtual WEX bit" based on this.
>
> Hadn't got around to finishing this. RISC-V support on the BJX2 core is
> still mostly untested, and will still be limited to in-order operation
> for the time being (and the design of the superscalar mechanism would
> only give 2-wide operation, and only for 32-bit instructions).
>
>
> Running POWER ISA code on the BJX2 pipeline would be a bit more of a
> stretch though (I added RISC-V as it was already pretty close to a
> direct subset of BJX2 at that point).
>
>
>
> Things like "modulo scheduling" could in theory help with a VLIW
> machine, but in my experience modulo-scheduling could likely also help
> with some big OoO machines as well (faking it manually in the C code
> being a moderately effective optimization strategy on x86-64 machines as
> well).
<
Modulo scheduling reduces the number of resources in flight to get
loops-with-recurrences running smoothly. It helps mono-scalar and larger
in-order pipelines, and does no harm to OoO designs whatsoever.
>
> Apparently, clang supports this optimization, but this sort of thing is
> a bit out of scope of what I can currently manage in BGBCC.
>
>
> Though, have observed that this strategy seems to be counter-productive
> on ARM machines (where it seems to often be faster to not try to
> manually modulo-schedule the loops). Though, this may depend on the ARM
> core (possibly an OoO ARM core might fare better; most of the ones I had
> tested on had been in-order superscalar).
>
> Though, I wouldn't expect there to be all that huge of a difference
> between AArch64 and BJX2 on this front, where manual modulo-scheduling
> is generally effective on BJX2.
>
>
>
> But, as noted in my case on BJX2, latency is sort of like:
> 1-cycle:
> Basic converter ops, like sign and zero extension;
> MOV reg/reg, imm/reg, ...
<
Are these the zero calculation-cost "calculations"
from the set {FABS, IMOV, FMOV, FNEG, FcopySign, INVERT} ?
<
> ...
> 2-cycle:
> Most ALU ops (ADD/SUB/CMPxx/etc);
> Some ALU ops could be made 1-cycle, but "worth the cost?".
<
Warning "Will Robinson":: the slope is very slippery; but candidates
are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}
<
> More complex converter-class instructions ('CONV2').
> Many of the FPU and SIMD format converters go here.
> 3-cycle:
> MUL (32-bit only);
> "low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
> Memory Loads;
<
Given that integer ADD is 2-cycles {in your definition of the list ordinality}
I find it interesting that you get::
{ route AGEN adder to SRAMs address decoder,
SRAM access (1-full-cycle)
SRAM route to data-path; tag address compare;
LD align; Set-Selection;
Drive result bus
} in 1 more cycle.
<
> The newer RGB5MINMAX and RGB5CCENC instructions;
> ...
>
> *: The support for Binary32 is optional, and was pulled off by fiddling
> the FPU to "barely" give correct-ish Binary32 results, with the notable
> restriction that it is truncate-only rounding.
>
> RGB5MINMAX was basically:
> Cycle 1: Figure out the RGB555 Y values;
> Cycle 2: Compare and select based on Y values;
> Cycle 3: Deliver output from Cycle 2.
>
> Initial attempt had routed this through the CONV2 path, but it was bad
> for cost and timing, so I had reworked it to share the RGB5CCENC module,
> which also had similar logic (both needed to find Y based on RGB555
> values, ...).
>
> RGB5CCENC was basically:
> Cycle 1:
> Figure out the RGB555 Y values (for pixels);
> Figure out the RGB555 Y values (for Mid, Lo-Sel, Hi-Sel);
> Cycle 2: Compare and generate selector indices based on Y values;
> Cycle 3: Deliver output from Cycle 2.
>
>
> Some longer non-pipelined cases:
> 6-cycle: FADD/FMUL/etc (main FPU)
> 10-cycle: FP-SIMD via main FPU ("high precision").
> 40-cycle: Integer DIVx.L and MODx.L
> 80-cycle: Integer DIVx.Q and MODx.Q, 64-bit MULx.Q, ...
> 120-cycle: FDIV.
> 480-cycle: FSQRT.
<
If I were to take the machine I am designing AND it happened that I
had a calculation unit which could do FADD or FMUL or FMAC in
64-bits and in a 6-cycle pipeline then::
IDIV IMOD is 24-28 cycles
FDIV is 24-26 cycles
SQRT is 27-32 cycles
in a pipeline where::
LD latency is 4 cycles
IMUL is 6 cycles
<
If I stated the latencies appropriate to 4-cycle {FADD, FMUL, FMAC}
IDIV IMOD is 17 cycles
FDIV is 17 cycles
SQRT is 22-cycles
in a pipeline where::
LD latency is 3-cycles
IMUL is 4 cycles.
>
> For integer divide and modulo, it is mostly a toss-up between the ISA
> instruction and "just doing it in software".
>
> For 64-bit integer multiply, doing it in software is still faster.
>
> For floating-point divide, doing it in software is faster, but the
> hardware FDIV is able to give more accurate results (software N-R
> seemingly being unable to correctly converge the last few low-order bits).
<
This is a consequence of not calculating all of the partial product bits
and then summing to a correct result. N-R only converges absolutely
when the integer parts of the arithmetic are correct. Any error here,
which is similar to the uncorrected error of Goldschmidt, prevents
convergence to correctness.
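
The effect can be seen in a toy model. The sketch below runs the reciprocal iteration x' = x*(2 - d*x) twice: once in exact rational arithmetic, where the relative error squares on every step, and once on fixed-point integers where the low bits of each product are thrown away. This is a simplified fixed-point stand-in chosen for illustration; it is not the software routine or the hardware discussed in the thread, and the function names and the 0.75 starting guess are assumptions.

```python
from fractions import Fraction

def nr_exact(d, x0, steps):
    """Newton-Raphson reciprocal of d using exact rational arithmetic."""
    x = Fraction(x0)
    for _ in range(steps):
        x = x * (2 - d * x)
    return x

def nr_truncated(D, frac_bits, steps):
    """Same iteration on fixed-point integers with frac_bits fraction bits.
    D represents d = D / 2**frac_bits, expected to lie in [1, 2).
    Each product keeps only its top bits (the rest are truncated)."""
    one = 1 << frac_bits
    X = 3 << (frac_bits - 2)              # initial guess 0.75
    for _ in range(steps):
        T = (D * X) >> frac_bits          # d*x with low product bits dropped
        X = (X * (2 * one - T)) >> frac_bits
    return X                              # represents X / 2**frac_bits
```

In the exact version, 1 - d*x goes from e to e*e on each step. In the truncated version every step adds a fresh error of up to about one unit in the last place, so the result settles near 1/d but extra iterations cannot be relied on to get the final bit right.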
>

Re: chained multi-issue reg-renaming in the same clock cycle: is it possible?

<u21ub7$3g8ik$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=31772&group=comp.arch#31772


Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: chained multi-issue reg-renaming in the same clock cycle: is it
possible?
Date: Sat, 22 Apr 2023 19:32:35 -0500
Organization: A noiseless patient Spider
Lines: 288
Message-ID: <u21ub7$3g8ik$1@dont-email.me>
References: <72173b50-2325-4746-8628-d9cd0f24680an@googlegroups.com>
<u1hrjq$2mtdg$1@dont-email.me>
<d448bd70-8620-43eb-bf53-0cde6d286f78n@googlegroups.com>
<u1ulg6$2rrai$1@dont-email.me>
<9db772b3-faf3-4154-a524-3e76b9ef42d9n@googlegroups.com>
<u21kfl$3eki2$1@dont-email.me>
<9240105f-fe55-474e-8560-28bc3c298ef6n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 23 Apr 2023 00:32:39 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cbe4ce160b9ab9921eb225295f68de6d";
logging-data="3678804"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Ma+9JERFfaWHcloPXS5cP"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.10.0
Cancel-Lock: sha1:VZFiW/8ilPddDEtBLczl6Ocoo38=
Content-Language: en-US
In-Reply-To: <9240105f-fe55-474e-8560-28bc3c298ef6n@googlegroups.com>

by: BGB - Sun, 23 Apr 2023 00:32 UTC

On 4/22/2023 6:11 PM, MitchAlsup wrote:
> On Saturday, April 22, 2023 at 4:44:25 PM UTC-5, BGB wrote:
>> On 4/22/2023 3:49 AM, robf...@gmail.com wrote:
>
>
>> Yeah.
>>
>> There are a few reasons I went with VLIW rather than superscalar...
>> Both VLIW and an in-order superscalar require similar logic from the
>> compiler in order to be used efficiently, and the main cost of VLIW here
>> reduces to losing 1 bit of instruction entropy and some additional logic
>> in the compiler to detect if/when instructions can run in parallel.
>>
>> Superscalar effectively requires a big glob of pattern recognition early
>> in the pipeline, which seems like a roadblock.
> <
> It is "such a massive roadblock" that my ISA has to look at <gasp> all
> of 6-bits to determine where an instruction gets routed (its function unit)
> in unary, one gate later you know if a conflict is present.

It would require a slightly bigger lookup table for RISC-V or BJX2,
mostly because instructions are not organized by prefix/suffix class;
nor by which function-unit handles them.

Also routing to FU's gets a bit ad-hoc (or subject to vary based on
which features are enabled or disabled), ...

>>
>> Had considered support for superscalar RISC-V by using a lookup to
>> classify instructions as "valid prefix" and "valid suffix" and also
>> logic to check for register clashes, and then behaving as if there were
>> a "virtual WEX bit" based on this.
>>
>> Hadn't got around to finishing this. RISC-V support on the BJX2 core is
>> still mostly untested, and will still be limited to in-order operation
>> for the time being (and the design of the superscalar mechanism would
>> only give 2-wide operation, and only for 32-bit instructions).
>>
>>
>> Running POWER ISA code on the BJX2 pipeline would be a bit more of a
>> stretch though (I added RISC-V as it was already pretty close to a
>> direct subset of BJX2 at that point).
>>
>>
>>
>> Things like "modulo scheduling" could in theory help with a VLIW
>> machine, but in my experience modulo-scheduling could likely also help
>> with some big OoO machines as well (faking it manually in the C code
>> being a moderately effective optimization strategy on x86-64 machines as
>> well).
> <
> Modulo scheduling reduces the number of resources in flight to get
> loops-with-recurrences running smoothly. It helps mono-scalar and larger
> in-order pipelines, and does no harm to OoO designs whatsoever.

Yes.

As noted, it seems to help with both x86-64 and BJX2 in my experience.

Not so much with ARM IME, but less sure as to why.
At least, on ARMv8, theoretically there should not be any detrimental
effect to modulo scheduling the loops.

>>
>> Apparently, clang supports this optimization, but this sort of thing is
>> a bit out of scope of what I can currently manage in BGBCC.
>>
>>
>> Though, have observed that this strategy seems to be counter-productive
>> on ARM machines (where it seems to often be faster to not try to
>> manually modulo-schedule the loops). Though, this may depend on the ARM
>> core (possibly an OoO ARM core might fare better; most of the ones I had
>> tested on had been in-order superscalar).
>>
>> Though, I wouldn't expect there to be all that huge of a difference
>> between AArch64 and BJX2 on this front, where manual modulo-scheduling
>> is generally effective on BJX2.
>>
>>
>>
>> But, as noted in my case on BJX2, latency is sort of like:
>> 1-cycle:
>> Basic converter ops, like sign and zero extension;
>> MOV reg/reg, imm/reg, ...
> <
> Are these the zero calculation-cost "calculations"
> from the set {FABS, IMOV, FMOV, FNEG, FcopySign, INVERT} ?
> <

Yes, FABS and FNEG and similar are also 1-cycle ops.
My list was not exactly exhaustive...

>> ...
>> 2-cycle:
>> Most ALU ops (ADD/SUB/CMPxx/etc);
>> Some ALU ops could be made 1-cycle, but "worth the cost?".
> <
> Warning "Will Robinson":: the slope is very slippery; but candidates
> are from the set {AND, OR, XOR, <<1, <<2, >>1, >>2, ROT <<1, ROT >>1}
> <

Yeah.
32-bit ADDS.L / SUBS.L
AND/OR/XOR
...
Could all be turned into 1-cycle ops.

However, most are not high enough on the ranking to show
significant/obvious benefit from doing so.

It would "maybe" reduce the average interlock penalty from ~ 7% to
around 5% or 6% of the total clock cycles, but I don't expect much more
than this (with most of the rest of the interlock penalty being due to
memory loads).
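
A toy model of where interlock cycles come from, as a sketch only: single-issue, in-order, with a result usable `latency` cycles after issue. The instruction sequence is invented for illustration and the model ignores everything else in a real pipeline (multiple lanes, cache misses, branches), so the percentages above cannot be reproduced from it.

```python
def issue_cycles(program, latency):
    """program: list of (op, dest, sources). Returns (total_cycles, stalls)."""
    ready = {}      # register -> first cycle at which its value can be used
    cycle = 0       # earliest cycle the next instruction could issue
    stalls = 0
    for op, dest, sources in program:
        need = max([ready.get(r, 0) for r in sources], default=0)
        issue = max(cycle, need)
        stalls += issue - cycle           # interlock bubbles for this instruction
        if dest is not None:
            ready[dest] = issue + latency[op]
        cycle = issue + 1
    return cycle, stalls

# Invented example: a load feeding a short dependent chain.
PROGRAM = [
    ("LD",  "r1", ["r0"]),
    ("ADD", "r2", ["r1", "r0"]),
    ("ADD", "r3", ["r2", "r0"]),
    ("MOV", "r4", ["r3"]),
]
```

With latencies LD=3, ADD=2, MOV=1 this sequence stalls for 4 cycles; making ADD 1-cycle removes 2 of them, and the 2 that remain are both behind the load.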

>> More complex converter-class instructions ('CONV2').
>> Many of the FPU and SIMD format converters go here.
>> 3-cycle:
>> MUL (32-bit only);
>> "low-precision" SIMD-FPU ops (Binary16, opt Binary32*);
>> Memory Loads;
> <
> Given that integer ADD is 2-cycles {in your definition of the list ordinality}
> I find it interesting that you get::
> {
> route AGEN adder to SRAMs address decoder,
> SRAM access (1-full-cycle)

Both happen in EX1, L1 SRAM is accessed on 1 clock-edge.
Cache is direct-mapped, so only the low order bits need to reach the
L1's SRAM during this clock cycle.

This can roughly handle a 16K or (maybe) 32K array at 50 MHz; 64K is
basically no-go.

Contrast, the L2 cache uses multiple clock-cycles to access the SRAM
array, so can support a somewhat larger array.
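
For reference, the address split behind a direct-mapped lookup like the L1 described above, as a minimal sketch. The 16-byte line size is an assumption (the post does not state it), as are the names; only the index bits have to reach the SRAM in the first cycle, while the tag compare can happen afterwards.

```python
LINE_BYTES = 16                 # ASSUMED line size, not given in the post
CACHE_BYTES = 16 * 1024         # the 16K case
NUM_LINES = CACHE_BYTES // LINE_BYTES
OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = NUM_LINES.bit_length() - 1

def split_address(addr):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)   # low-order bits: SRAM address
    tag = addr >> (OFFSET_BITS + INDEX_BITS)          # compared after the read
    return tag, index, offset

def is_hit(tags, addr):
    """tags: list of NUM_LINES stored tags (None for an empty line)."""
    tag, index, _ = split_address(addr)
    return tags[index] == tag
```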

> SRAM route to data-path; tag address compare;
> LD align; Set-Selection;

Mostly happens in EX2.
The "check for L1 cache miss and stall pipeline" logic is one of the
major "tight" paths in the core. Anything touching this pathway is
basically a mine-field.

> Drive result bus
> }
> in 1 more cycle.

Final result handling is in EX3, say:
Final sign/zero extension;
Single -> Double (FMOV.S);
Pixel Extraction (LDTEX).

> <

The 2-cycle ADD was originally partly an artifact.
Originally I had not discovered some tricks to make the adder faster,
and a naive 64-bit add has a fairly high latency.

Though, the trick only really helps with larger adders, so I can *also*
support 128-bit ADD in 2 clock cycles (effectively by having both the
Lane 1 and 2 ALUs being able to combine into a larger virtual ALU).
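
The post does not say which trick is meant. One well-known technique that fits the description (it pays off mainly on wide adders) is carry-select: split the operands into chunks, compute each chunk's sum for carry-in 0 and carry-in 1 in parallel, then pick the right one as the carries become known. The sketch below is a behavioural model of that idea under that assumption; the 16-bit chunk size and the function name are illustrative.

```python
def carry_select_add(a, b, width=64, chunk=16):
    """Behavioural model of a carry-select adder. Returns (sum, carry_out)."""
    mask = (1 << chunk) - 1
    result, carry = 0, 0
    for pos in range(0, width, chunk):
        x = (a >> pos) & mask
        y = (b >> pos) & mask
        sum0 = x + y              # speculative result for carry-in 0
        sum1 = x + y + 1          # speculative result for carry-in 1
        chosen = sum1 if carry else sum0   # in hardware: a mux per chunk
        result |= (chosen & mask) << pos
        carry = chosen >> chunk
    return result, carry
```

In hardware all the sum0/sum1 pairs are computed at the same time, so the critical path is one narrow add plus a chain of muxes; the loop here only models the selection order.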

But, I can note that:
ADD, SUB, CMPxx, PADDx, ...
Are all basically routed through the same logic.

But, I can also note that getting the L1 D$ to not blow out timing
constraints is a semi-constant annoyance...

Poke at something, somewhere random, and once again logic in the L1 D$
has decided to start failing timing again...

It would be easier here if I made the pipeline longer so I could do
4-cycle memory loads.

>> The newer RGB5MINMAX and RGB5CCENC instructions;
>> ...
>>
>> *: The support for Binary32 is optional, and was pulled off by fiddling
>> the FPU to "barely" give correct-ish Binary32 results, with the notable
>> restriction that it is truncate-only rounding.
>>
>> RGB5MINMAX was basically:
>> Cycle 1: Figure out the RGB555 Y values;
>> Cycle 2: Compare and select based on Y values;
>> Cycle 3: Deliver output from Cycle 2.
>>
>> Initial attempt had routed this through the CONV2 path, but it was bad
>> for cost and timing, so I had reworked it to share the RGB5CCENC module,
>> which also had similar logic (both needed to find Y based on RGB555
>> values, ...).
>>
>> RGB5CCENC was basically:
>> Cycle 1:
>> Figure out the RGB555 Y values (for pixels);
>> Figure out the RGB555 Y values (for Mid, Lo-Sel, Hi-Sel);
>> Cycle 2: Compare and generate selector indices based on Y values;
>> Cycle 3: Deliver output from Cycle 2.
>>
>>
>> Some longer non-pipelined cases:
>> 6-cycle: FADD/FMUL/etc (main FPU)
>> 10-cycle: FP-SIMD via main FPU ("high precision").
>> 40-cycle: Integer DIVx.L and MODx.L
>> 80-cycle: Integer DIVx.Q and MODx.Q, 64-bit MULx.Q, ...
>> 120-cycle: FDIV.
>> 480-cycle: FSQRT.
> <
> If I were to take the machine I am designing AND it happened that I
> had a calculation unit which could do FADD or FMUL or FMAC in
> 64-bits and in a 6-cycle pipeline then::
> IDIV IMOD is 24-28 cycles
> FDIV is 24-26 cycles
> SQRT is 27-32 cycles
> in a pipeline where::
> LD latency is 4 cycles
> IMUL is 6 cycles

Part of the "magic" that allows 3-cycle IMUL and 6-cycle FMUL is the
DSP48 hard-logic. If not for this, these would not likely be possible.
