Subject                        Author
* Pipeline Registers           robf...@gmail.com
+- Re: Pipeline Registers      EricP
+* Re: Pipeline Registers      MitchAlsup
|`* Re: Pipeline Registers     BGB
| `* Re: Pipeline Registers    MitchAlsup
|  `* Re: Pipeline Registers   BGB
|   +- Re: Pipeline Registers  MitchAlsup
|   +- Re: Pipeline Registers  Ivan Godard
|   `- Re: Pipeline Registers  JimBrakefield
`* Re: Pipeline Registers      EricP
 +- Re: Pipeline Registers     MitchAlsup
 `- Re: Pipeline Registers     Ivan Godard

Pipeline Registers

<1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20239&group=comp.arch#20239

X-Received: by 2002:ac8:5b8d:: with SMTP id a13mr6940208qta.130.1630846806677;
Sun, 05 Sep 2021 06:00:06 -0700 (PDT)
X-Received: by 2002:a9d:3a6:: with SMTP id f35mr7144452otf.144.1630846806472;
Sun, 05 Sep 2021 06:00:06 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 5 Sep 2021 06:00:06 -0700 (PDT)
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1ddc:ad00:c564:b7e3:ecf4:561c;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1ddc:ad00:c564:b7e3:ecf4:561c
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
Subject: Pipeline Registers
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Sun, 05 Sep 2021 13:00:06 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: robf...@gmail.com - Sun, 5 Sep 2021 13:00 UTC

Not sure I can explain this adequately. It would be nice if one could say "get this value from the previous pipeline result" using a register designator. The idea is that the value does not need to be retrieved from or sent to the register file; it exists only as a temporary value in a pipeline register.
$P1 would be one pipeline stage back, $P2 would be two stages back.

ADD $P0,$X3,$X4
ADD $P0,$X5,$X6
ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6

There are many circumstances where the target of instruction #1 is used as a source operand for instruction #2 but is not otherwise needed. Depending on the lifetime of the value, it may not need to be moved to the register file.
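
To make the proposal concrete, here is a minimal Python sketch of the semantics as I read them. It assumes a simple in-order machine that retires one instruction per stage with no stalls, that `$P0` as a destination means "do not write the register file", and that `$Pn` as a source means "the result produced n instructions earlier". The `run` function, the tuple encoding, and the `depth` parameter are all made up for illustration.

```python
from collections import deque

def run(program, regs, depth=4):
    """Run (dst, src_a, src_b) ADDs; '$Pn' reads the result from n instructions back."""
    history = deque(maxlen=depth)  # history[0] holds the most recent result

    def read(name):
        if name.startswith("$P"):
            back = int(name[2:])
            if back < 1 or back > len(history):
                raise ValueError(f"{name} does not name a live pipeline value")
            return history[back - 1]
        return regs[name]

    for dst, src_a, src_b in program:
        result = read(src_a) + read(src_b)
        history.appendleft(result)
        if dst != "$P0":  # $P0 as a destination keeps the value out of the register file
            regs[dst] = result
    return regs

regs = run(
    [("$P0", "$X3", "$X4"), ("$P0", "$X5", "$X6"), ("$X7", "$P1", "$P2")],
    {"$X3": 1, "$X4": 2, "$X5": 3, "$X6": 4},
)
print(regs["$X7"])  # prints 10
```

Only `$X7` is ever written back; the two intermediate sums live solely in the modeled pipeline history.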

Re: Pipeline Registers

<cb4ZI.24053$nR3.17032@fx38.iad>

https://www.novabbs.com/devel/article-flat.php?id=20241&group=comp.arch#20241

Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.dns-netz.com!news.freedyn.net!newsreader4.netcologne.de!news.netcologne.de!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx38.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Pipeline Registers
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
In-Reply-To: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 14
Message-ID: <cb4ZI.24053$nR3.17032@fx38.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sun, 05 Sep 2021 14:08:08 UTC
Date: Sun, 05 Sep 2021 10:07:57 -0400
X-Received-Bytes: 1531
 by: EricP - Sun, 5 Sep 2021 14:07 UTC

robf...@gmail.com wrote:
> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result” using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> $P1 would be one pipeline stage back, $P2 would be two stages back.
>
> ADD $P0,$X3,$X4
> ADD $P0,$X5,$X6
> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
>
> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
>

And if you get an interrupt between the ADDs?
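
A small Python sketch of the concern, assuming the hardware does not save and restore the pipeline registers across the interrupt and that handler instructions occupy pipeline slots like any others. The `resolve` helper and the `<handler insn>` placeholders are hypothetical; the point is only that a positional name like `$P1` is resolved by distance in the dynamic instruction stream.

```python
def resolve(trace, consumer, back):
    """Return the instruction that '$P<back>' names when read by trace[consumer]."""
    return trace[consumer - back]

normal = ["ADD $P0,$X3,$X4", "ADD $P0,$X5,$X6", "ADD $X7,$P1,$P2"]

# Same code, but an interrupt is taken just before the third ADD.
interrupted = normal[:2] + ["<handler insn>", "<handler insn>"] + normal[2:]

print(resolve(normal, 2, 1))       # the second ADD, as intended
print(resolve(interrupted, 4, 1))  # a handler instruction instead
```

In the interrupted trace, both `$P1` and `$P2` name results of the handler, so the sum written to `$X7` is wrong.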

Re: Pipeline Registers

<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20243&group=comp.arch#20243

X-Received: by 2002:ac8:7a98:: with SMTP id x24mr7831310qtr.265.1630862008230;
Sun, 05 Sep 2021 10:13:28 -0700 (PDT)
X-Received: by 2002:a05:6808:14c5:: with SMTP id f5mr5868057oiw.84.1630862008028;
Sun, 05 Sep 2021 10:13:28 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 5 Sep 2021 10:13:27 -0700 (PDT)
In-Reply-To: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com>
Subject: Re: Pipeline Registers
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 05 Sep 2021 17:13:28 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 21
 by: MitchAlsup - Sun, 5 Sep 2021 17:13 UTC

On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com wrote:
> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result” using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> $P1 would be one pipeline stage back, $P2 would be two stages back.
>
> ADD $P0,$X3,$X4
> ADD $P0,$X5,$X6
> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
>
> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
<
Realistically, it is only a few gates of delay harder to compare the register numbers
in the pipeline to the register numbers in the instruction and determine forwarding
indirectly. And this gets rid of the problem EricP mentions.
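
As a rough illustration of forwarding by register-number comparison, here is a Python sketch. It assumes each in-flight result carries its destination register number and that the youngest matching producer wins; `read_operand` and the `in_flight` list are names invented for the example, not any particular design.

```python
def read_operand(src_reg, regfile, in_flight):
    """Pick an operand value.

    in_flight is a list of (dest_reg, value) pairs for results still in the
    pipeline, ordered youngest first. A matching destination number forwards
    the value; otherwise the register file supplies it.
    """
    for dest_reg, value in in_flight:
        if dest_reg == src_reg:  # the comparator: pipeline dest vs. instruction source
            return value
    return regfile[src_reg]

regfile = {3: 10, 4: 20}
in_flight = [(3, 99), (3, 50)]  # two pending writes to register 3, youngest first

print(read_operand(3, regfile, in_flight))  # prints 99 (forwarded)
print(read_operand(4, regfile, in_flight))  # prints 20 (from the register file)
```

Because the match is on architectural register numbers, an interrupt between producer and consumer is harmless: the value is simply found in the register file instead.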

Re: Pipeline Registers

<sh35ap$351$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20245&group=comp.arch#20245

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Pipeline Registers
Date: Sun, 5 Sep 2021 14:19:42 -0500
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <sh35ap$351$1@dont-email.me>
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 5 Sep 2021 19:20:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="268f4fcc03c57321055cefd04f0641e7";
logging-data="3233"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FlCw8ZaunCGAuiBAzzlqE"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:q9GFBCn8fwqjL0FeFdCm7tMGVwk=
In-Reply-To: <d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com>
Content-Language: en-US
X-Mozilla-News-Host: news://news.albasani.net
 by: BGB - Sun, 5 Sep 2021 19:19 UTC

On 9/5/2021 12:13 PM, MitchAlsup wrote:
> On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com wrote:
>> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result” using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
>> $P1 would be one pipeline stage back, $P2 would be two stages back.
>>
>> ADD $P0,$X3,$X4
>> ADD $P0,$X5,$X6
>> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
>>
>> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
> <
> Realistically, It is only a few gates of delay harder to compare the register number
> in the pipeline to the register number in the instructions and determine forwarding
> indirectly. And this gets rid of the problem EricP mentions.
>

Yeah. Conventional forwarding works pretty well, and does not introduce
semantics which are overly brittle (this would break on interrupts, if
one tries to change the width of the machine, ...).

Likewise, "having sufficient registers" means needing to store things
into registers isn't usually too much of an issue.

Granted, forwarding adds cost, but:
This doesn't usually become too big of an issue until the core gets wider.

Cost Multiplier = Nr*Nw*Ns; Where Nr=Number of Read Ports, Nw=Number of
Write Ports, Ns=Number of Execute Stages.

So:
3 x 1 x 3 = 9 forwarders (Scalar core with 3 EX stages).
6 x 3 x 3 = 54 forwarders (Current WEX-3W in BJX2).
12 x 6 x 3 = 216 forwarders (Possible WEX-6W)

Which is part of why I have been having so many thoughts about using
interlocks instead of forwarding for 6W, and implementing the 12R+6W
register-file as essentially 2x 6R+4W (144 forwarders).

There is also an Nr*Nw cost multiplier for register arrays:
At 3R+1W, 3 clones (arrays are cheapest at 1W);
At 6R+3W, 18 clones;
At 12R+6W, 72 clones.
At 2x 6R+4W, 48 clones.
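
The two multipliers above can be checked with a few lines of Python. This is only the arithmetic from the post; the function names and the `(banks, read_ports, write_ports)` tuple layout are made up, and 3 execute stages is assumed throughout.

```python
def forwarders(read_ports, write_ports, ex_stages=3):
    """Forwarding cost multiplier: Nr * Nw * Ns."""
    return read_ports * write_ports * ex_stages

def array_clones(read_ports, write_ports):
    """Register-array cost multiplier: Nr * Nw."""
    return read_ports * write_ports

# name -> (number of register-file halves, read ports, write ports)
configs = {
    "scalar 3R+1W": (1, 3, 1),
    "WEX-3W 6R+3W": (1, 6, 3),
    "WEX-6W 12R+6W": (1, 12, 6),
    "WEX-6W 2x 6R+4W": (2, 6, 4),
}

for name, (banks, nr, nw) in configs.items():
    print(name, banks * forwarders(nr, nw), banks * array_clones(nr, nw))
```

Splitting the 12R+6W file into two 6R+4W halves drops the forwarder count from 216 to 144 and the clone count from 72 to 48.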

Between the 3-wide and 6-wide cases in this scenario, there is also a
transition where it seems to become cheaper to implement the registers
as state machines and flip-flop registers than to use arrays (however,
trying to compare these cases using simplistic cost metrics falls on its
face).

Still some internal debate as to whether or not WEX-6W is even viable
(or if I would be better off just trying to find ways to make dual-core
cheaper, or cost-effectively implement SMT).

...

However, explicit numbering would not help here; it would likely
actually make these sorts of problems significantly worse.

Re: Pipeline Registers

<3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20246&group=comp.arch#20246

X-Received: by 2002:a05:620a:1435:: with SMTP id k21mr8254534qkj.442.1630872052952;
Sun, 05 Sep 2021 13:00:52 -0700 (PDT)
X-Received: by 2002:a9d:609e:: with SMTP id m30mr2818587otj.38.1630872052728;
Sun, 05 Sep 2021 13:00:52 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 5 Sep 2021 13:00:52 -0700 (PDT)
In-Reply-To: <sh35ap$351$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com> <sh35ap$351$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com>
Subject: Re: Pipeline Registers
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 05 Sep 2021 20:00:52 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 87
 by: MitchAlsup - Sun, 5 Sep 2021 20:00 UTC

On Sunday, September 5, 2021 at 2:21:00 PM UTC-5, BGB wrote:
> On 9/5/2021 12:13 PM, MitchAlsup wrote:
> > On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com wrote:
> >> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result” using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> >> $P1 would be one pipeline stage back, $P2 would be two stages back.
> >>
> >> ADD $P0,$X3,$X4
> >> ADD $P0,$X5,$X6
> >> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
> >>
> >> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
> > <
> > Realistically, It is only a few gates of delay harder to compare the register number
> > in the pipeline to the register number in the instructions and determine forwarding
> > indirectly. And this gets rid of the problem EricP mentions.
> >
> Yeah. Conventional forwarding works pretty well, and does not introduce
> semantics which are overly brittle (this would break on interrupts, if
> one tries to change the width of the machine, ...).
>
> Likewise, "having sufficient registers" means needing to store things
> into registers isn't usually too much of an issue.
>
>
> Granted, forwarding adds cost, but:
> This doesn't usually become too big of an issue until the core gets wider.
>
> Cost Multiplier = Nr*Nw*Ns; Where Nr=Number of Read Ports, Nw=Number of
> Write Ports, Ns=Number of Execute Stages.
>
> So:
> 3 x 1 x 3 = 9 forwarders (Scalar core with 3 EX stages).
> 6 x 3 x 3 = 54 forwarders (Current WEX-3W in BJX2).
> 12 x 6 x 3 = 216 forwarders (Possible WEX-6W)
>
> Which is part of why I have been having so many thoughts about using
> interlocks instead of forwarding for 6W, and implementing the 12R+6W
> register-file as essentially 2x 6R+4W (144 forwarders).
>
>
> There is also an Nr*Nw cost multiplier for register arrays:
> Arrays are cheapest at 1W, 3 clones;
> At 6R+3W, 18 clones;
> At 12R+6W, 72 clones.
> At 2x 6R+4W, 48 clones.
<
I have built 6R6W register files. This is about as big as one can build and wire,
using replication for more read ports.
>
>
> Between the 3-wide and 6-wide cases in this scenario, there is also a
> transition where it seems to become cheaper to implement the registers
> as state machines and flip-flop registers than to use arrays (however,
> trying to compare these cases using simplistic cost metrics falls on its
> face).
<
Somewhere in the 3W-6W arena is the transition where the vast majority
of operands arrive from the file versus where the vast majority of operands
arrive from the forwarding logic.
>
> Still some internal debate as to whether or not WEX-6W is even viable
> (or if I would be better off just trying to find ways to make dual-core
> cheaper, or cost-effectively implement SMT).
>
> ...
>
>
> However, explicit numbering would not help here, it would likely
> actually make these sorts of problems significantly worse.
<
The only way explicit numbering would work is if you do it in such a way that
the numbers one uses today hold for eternity.

Re: Pipeline Registers

<sh3kpr$48d$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20249&group=comp.arch#20249

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Pipeline Registers
Date: Sun, 5 Sep 2021 18:43:43 -0500
Organization: A noiseless patient Spider
Lines: 166
Message-ID: <sh3kpr$48d$1@dont-email.me>
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com>
<sh35ap$351$1@dont-email.me>
<3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 5 Sep 2021 23:44:59 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="71911c025cb2d08a7ebe1920a0bb34d6";
logging-data="4365"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/qHBII23deeT5I9ESFL4G3"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:RYC6o54lvtvvCdMexWBo2QQ8EtY=
In-Reply-To: <3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sun, 5 Sep 2021 23:43 UTC

On 9/5/2021 3:00 PM, MitchAlsup wrote:
> On Sunday, September 5, 2021 at 2:21:00 PM UTC-5, BGB wrote:
>> On 9/5/2021 12:13 PM, MitchAlsup wrote:
>>> On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com wrote:
>>>> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result” using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
>>>> $P1 would be one pipeline stage back, $P2 would be two stages back.
>>>>
>>>> ADD $P0,$X3,$X4
>>>> ADD $P0,$X5,$X6
>>>> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
>>>>
>>>> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
>>> <
>>> Realistically, It is only a few gates of delay harder to compare the register number
>>> in the pipeline to the register number in the instructions and determine forwarding
>>> indirectly. And this gets rid of the problem EricP mentions.
>>>
>> Yeah. Conventional forwarding works pretty well, and does not introduce
>> semantics which are overly brittle (this would break on interrupts, if
>> one tries to change the width of the machine, ...).
>>
>> Likewise, "having sufficient registers" means needing to store things
>> into registers isn't usually too much of an issue.
>>
>>
>> Granted, forwarding adds cost, but:
>> This doesn't usually become too big of an issue until the core gets wider.
>>
>> Cost Multiplier = Nr*Nw*Ns; Where Nr=Number of Read Ports, Nw=Number of
>> Write Ports, Ns=Number of Execute Stages.
>>
>> So:
>> 3 x 1 x 3 = 9 forwarders (Scalar core with 3 EX stages).
>> 6 x 3 x 3 = 54 forwarders (Current WEX-3W in BJX2).
>> 12 x 6 x 3 = 216 forwarders (Possible WEX-6W)
>>
>> Which is part of why I have been having so many thoughts about using
>> interlocks instead of forwarding for 6W, and implementing the 12R+6W
>> register-file as essentially 2x 6R+4W (144 forwarders).
>>
>>
>> There is also an Nr*Nw cost multiplier for register arrays:
>> Arrays are cheapest at 1W, 3 clones;
>> At 6R+3W, 18 clones;
>> At 12R+6W, 72 clones.
>> At 2x 6R+4W, 48 clones.
> <
> I have built 6R6W register files. This is about as big as one can build and wire,
> using replication for more read ports.

OK.

Something is going on in any case as far as LUT costs and similar are
concerned. The whole idea is kinda rendered pointless if it ends up too
expensive to fit on the XC7A100 which I am using.

As can be noted, the "4th write port" would differ from the other 3:
It would not be connected directly to the pipeline;
It would instead behave more like a virtual port for the two "halves" of
the register file to be able to signal values from one side to the other.

Though, the actual mechanism for getting the value across is not yet
determined. One possibility is that it triggers an interlock every time
this happens, then copies registers from one RF to the other until no
more crosses remain (possibly with the write ports paired across the
RFs, allowing for up to 3 registers per cycle in each direction).

The underlying primitive is something like:
6-bit address, 2-bit data in;
6-bit address, 2-bit data out.

Where 32 of these can give a 64x 64b collection of registers, but only
between a single source and a single destination.

Then, each write port needs to drive its output to all connected
arrays, and each read port needs to pull its value from any of these
arrays, ...
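
For illustration, a behavioral Python sketch of how 32 such primitives compose into one 64x 64b array with a single write side and a single read side. The class names are hypothetical and this models only the bit slicing, not any timing or the actual FPGA primitive.

```python
class LutRam64x2:
    """One primitive: 64 entries (6-bit address) of 2 bits each."""
    def __init__(self):
        self.cells = [0] * 64

    def write(self, addr, data2):
        self.cells[addr & 0x3F] = data2 & 0x3

    def read(self, addr):
        return self.cells[addr & 0x3F]


class RegArray64x64:
    """32 primitives side by side: slice i holds bits [2*i+1 : 2*i] of every register."""
    def __init__(self):
        self.slices = [LutRam64x2() for _ in range(32)]

    def write(self, addr, value):
        for i, prim in enumerate(self.slices):
            prim.write(addr, (value >> (2 * i)) & 0x3)

    def read(self, addr):
        value = 0
        for i, prim in enumerate(self.slices):
            value |= prim.read(addr) << (2 * i)
        return value


array = RegArray64x64()
array.write(5, 0xDEADBEEFCAFEF00D)
print(hex(array.read(5)))  # prints 0xdeadbeefcafef00d
```

Each additional read or write port needs further copies of this whole structure, which is where the Nr*Nw clone count comes from.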

Starts wondering if there is any sort of way to build a "crossbar"
between the read and write ports.

>>
>>
>> Between the 3-wide and 6-wide cases in this scenario, there is also a
>> transition where it seems to become cheaper to implement the registers
>> as state machines and flip-flop registers than to use arrays (however,
>> trying to compare these cases using simplistic cost metrics falls on its
>> face).
> <
> Somewhere in the 3W-6W arena is the transition where the vast majority
> of operands arrive from the file versus where the vast majority of operands
> arrive from the forwarding logic.

Fair enough. Though, in this case, falling back to interlocks for 6W
bundles if a register crosses an "interleave boundary" or similar isn't
too bad of a tradeoff.

Though it would mean that 6W code would need to be written in a way to
minimize the number of crosses.

Meanwhile, with simple cost heuristics, it seems like the cloned arrays
should always win over flip-flops, but in past experiments, there does
seem to be a transition point, as the cost of flip-flop registers seems
to grow at a slower rate relative to the number of ports (but is more
strongly affected by the number of registers).

E.g., in the flip-flop case, the logic is more like (pseudocode):
  // Next-state mux for one register; earlier write ports take priority.
  regGpr2 <=
    (regIdRn1A==GPR_R2) ? regValRn1A :
    (regIdRn1B==GPR_R2) ? regValRn1B :
    (regIdRn2A==GPR_R2) ? regValRn2A :
    (regIdRn2B==GPR_R2) ? regValRn2B :
    ... (Repeated for every write port)
    regGpr2;
  ... (Repeated for every register) ...

  // Read mux for one read port.
  case(regIdRsA)
    GPR_R0: tRegValRsA = regGpr0;
    GPR_R1: tRegValRsA = regGpr1;
    GPR_R2: tRegValRsA = regGpr2;
    ... (Repeated for every register)
  endcase
  case(regIdRtA)
    GPR_R0: tRegValRtA = regGpr0;
    ...
  endcase
  ... (Repeated for every read port) ...

Don't necessarily want to go this way if it can be avoided.
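
A behavioral Python model of the flip-flop scheme sketched above, for anyone who wants to play with it. It assumes write ports are listed in priority order (first match wins, as in the chained `?:`), and that an idle port is represented by a register id of `None`; `FlopRegFile` and its method names are invented for the example.

```python
class FlopRegFile:
    """Registers as individual flip-flops with per-register write muxes."""
    def __init__(self, num_regs=64):
        self.regs = [0] * num_regs

    def clock(self, writes):
        """Apply one clock edge.

        writes is a list of (reg_id, value) pairs, one per write port, in
        priority order. Every register checks every port, which mirrors the
        per-register comparator chain in the pseudocode.
        """
        next_regs = list(self.regs)
        for reg in range(len(self.regs)):
            for reg_id, value in writes:
                if reg_id == reg:
                    next_regs[reg] = value
                    break  # earlier ports take priority over later ones
        self.regs = next_regs

    def read(self, reg_id):
        """One read port: a plain mux over all registers."""
        return self.regs[reg_id]


rf = FlopRegFile()
rf.clock([(2, 11), (2, 22), (5, 33), (None, 44)])
print(rf.read(2), rf.read(5), rf.read(0))  # prints 11 33 0
```

The nested loop makes the cost structure visible: comparators scale with registers times write ports, and each read port is a full-width mux over every register.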

>>
>> Still some internal debate as to whether or not WEX-6W is even viable
>> (or if I would be better off just trying to find ways to make dual-core
>> cheaper, or cost-effectively implement SMT).
>>
>> ...
>>
>>
>> However, explicit numbering would not help here, it would likely
>> actually make these sorts of problems significantly worse.
> <
> The only way explicit number would work is if you do it in such a way that
> the numbers one uses today hold for eternity.
>

It would mean that any code which uses it would be tied to the specific
width and organization of the pipeline, and thus not binary compatible
with a processor using a different pipeline width, ...

This would kinda suck.

Also, as noted, it could not help with these sorts of issues, as the cost
isn't so much figuring out where the value is forwarded from, but rather
the cost of getting the value moved from point A to B (within a single
clock cycle).

Re: Pipeline Registers

<d60a32cf-9443-41a9-9547-de8f061c5f2en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20250&group=comp.arch#20250

X-Received: by 2002:a37:9e8c:: with SMTP id h134mr8819391qke.366.1630887355977;
Sun, 05 Sep 2021 17:15:55 -0700 (PDT)
X-Received: by 2002:a05:6808:f90:: with SMTP id o16mr6553591oiw.37.1630887355737;
Sun, 05 Sep 2021 17:15:55 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 5 Sep 2021 17:15:55 -0700 (PDT)
In-Reply-To: <sh3kpr$48d$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com> <sh35ap$351$1@dont-email.me>
<3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com> <sh3kpr$48d$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d60a32cf-9443-41a9-9547-de8f061c5f2en@googlegroups.com>
Subject: Re: Pipeline Registers
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 06 Sep 2021 00:15:55 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 213
 by: MitchAlsup - Mon, 6 Sep 2021 00:15 UTC

On Sunday, September 5, 2021 at 6:45:02 PM UTC-5, BGB wrote:
> On 9/5/2021 3:00 PM, MitchAlsup wrote:
> > On Sunday, September 5, 2021 at 2:21:00 PM UTC-5, BGB wrote:
> >> On 9/5/2021 12:13 PM, MitchAlsup wrote:
> >>> On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com wrote:
> >>>> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result” using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> >>>> $P1 would be one pipeline stage back, $P2 would be two stages back.
> >>>>
> >>>> ADD $P0,$X3,$X4
> >>>> ADD $P0,$X5,$X6
> >>>> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
> >>>>
> >>>> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
> >>> <
> >>> Realistically, It is only a few gates of delay harder to compare the register number
> >>> in the pipeline to the register number in the instructions and determine forwarding
> >>> indirectly. And this gets rid of the problem EricP mentions.
> >>>
> >> Yeah. Conventional forwarding works pretty well, and does not introduce
> >> semantics which are overly brittle (this would break on interrupts, if
> >> one tries to change the width of the machine, ...).
> >>
> >> Likewise, "having sufficient registers" means needing to store things
> >> into registers isn't usually too much of an issue.
> >>
> >>
> >> Granted, forwarding adds cost, but:
> >> This doesn't usually become too big of an issue until the core gets wider.
> >>
> >> Cost Multiplier = Nr*Nw*Ns; Where Nr=Number of Read Ports, Nw=Number of
> >> Write Ports, Ns=Number of Execute Stages.
> >>
> >> So:
> >> 3 x 1 x 3 = 9 forwarders (Scalar core with 3 EX stages).
> >> 6 x 3 x 3 = 54 forwarders (Current WEX-3W in BJX2).
> >> 12 x 6 x 3 = 216 forwarders (Possible WEX-6W)
> >>
> >> Which is part of why I have been having so many thoughts about using
> >> interlocks instead of forwarding for 6W, and implementing the 12R+6W
> >> register-file as essentially 2x 6R+4W (144 forwarders).
> >>
> >>
> >> There is also an Nr*Nw cost multiplier for register arrays:
> >> Arrays are cheapest at 1W, 3 clones;
> >> At 6R+3W, 18 clones;
> >> At 12R+6W, 72 clones.
> >> At 2x 6R+4W, 48 clones.
> > <
> > I have built 6R6W register files. This is about as big as one can build and wire,
> > using replication for more read ports.
> OK.
>
> Something is going on in any case as far as LUT costs and similar are
> concerned. The whole idea is kinda rendered pointless if it ends up too
> expensive to fit on the XC7A100 which I am using.
<
When I was building the 6R6W register file, I was using raw transistors.
The file had 12 RW wires in True/Complement form, and each cell had
16 transistors: 12 RW pass gates and 4 for the cross-coupled inverters.
The pass gates and the inverters had to be carefully sized and the
differential voltages on the T/C wires had to be strictly maintained.
This is not something you build with library gates.
>
>
> As can be noted, the "4th write port" would differ from the other 3:
> It would not be connected directly to the pipeline;
> It would instead behave more like a virtual port for the two "halves" of
> the register file to be able to signal values from one side to the other.
<
You could make four (4) fourth ports, one for each ¼ of the file.
>
> Though, the actual mechanism for getting the value across is not yet
> determined. One possibility is that it triggers an interlock every time
> this happens, then copies registers from one RF to the other until no
> more crosses remain (possibly, the write-ports are paired across the
> RF's, allowing for up to 3 registers per cycle in each direction).
>
>
>
> It is like, one has a primitive which is like:
> 6-bit address, 2 bit data in;
> 6-bit address, 2 bit data out.
>
> Where, 32 of these can give a 64x 64b collection of registers, but only
> between a single source and a single destination.
>
> Then, each write point needs to drive its output to all connected
> arrays, and each read-port needs to pull its value from any of these
> arrays, ...
>
>
> Starts wondering if there is any sort of way to build a "crossbar"
> between the read and write ports.
<
Read in the first phase, write in the second phase. Use the wires and decoders
twice per cycle.
> >>
> >>
> >> Between the 3-wide and 6-wide cases in this scenario, there is also a
> >> transition where it seems to become cheaper to implement the registers
> >> as state machines and flip-flop registers than to use arrays (however,
> >> trying to compare these cases using simplistic cost metrics falls on its
> >> face).
> > <
> > Somewhere in the 3W-6W arena is the transition where the vast majority
> > of operands arrive from the file versus where the vast majority of operands
> > arrive from the forwarding logic.
<
> Fair enough. Though, in this case, falling back to interlocks for 6W
> bundles if a register crosses an "interleave boundary or similar" isn't
> too bad of a tradeoff.
<
Reservation stations.
>
> Though it would mean that 6W code would need to be written in a way to
> minimize the number of crosses
>
>
>
> Meanwhile, with simple cost-heuristics, it seems like the cloned arrays
> should always win over flip-flops, but in past experiments, there does
> seem to be a transition point, as the cost of flip-flop registers seems
> to grow at a slower rate relative to the number of ports (but is more
> strongly effected by the number of registers).
>
>
> Eg, in the flip-flop case case, the logic is more like (pseudocode)
> regGpr2 <=
> (regIdRn1A==GPR_R2) ? regValRn1A :
> (regIdRn1B==GPR_R2) ? regValRn1B :
> (regIdRn2A==GPR_R2) ? regValRn2A :
> (regIdRn2B==GPR_R2) ? regValRn2B :
> ... (Repeated for every write port )
> regGpr2;
> ... (Repeated for every register ) ...
>
> case(regIdRsA)
> GPR_R0: tRegValRsA = regGpr0;
> GPR_R1: tRegValRsA = regGpr1;
> GPR_R2: tRegValRsA = regGpr2;
> ... (Repeated for every register )
> end
> case(regIdRtA)
> GPR_R0: tRegValRtA = regGpr0;
> ...
> end
> ... (Repeat for every read port) ...
>
>
> Don't necessarily want to go this way if it can be avoided.
> >>
> >> Still some internal debate as to whether or not WEX-6W is even viable
> >> (or if I would be better off just trying to find ways to make dual-core
> >> cheaper, or cost-effectively implement SMT).
> >>
> >> ...
> >>
> >>
> >> However, explicit numbering would not help here, it would likely
> >> actually make these sorts of problems significantly worse.
> > <
> > The only way explicit number would work is if you do it in such a way that
> > the numbers one uses today hold for eternity.
> >
> It would mean that any code which uses it would be tied to the specific
> width and organization of the pipeline, and thus not binary compatible
> with a processor using a different pipeline width, ...
>
> This would kinda suck.
>
> Also, as noted, it could not help with these sorts of issues, as the cost
> isn't so much figuring out where the value is forwarded from, but rather
> the cost of getting the value moved from point A to B (within a single
> clock cycle).
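
For readers who want to play with the flip-flop scheme quoted above, here is a minimal behavioural sketch in Python. It is not BJX2 code: NUM_REGS, the port ordering, and the mux_cost() heuristic are assumptions made for illustration. It mirrors the quoted structure (a priority chain over write ports for every register, and a full-width mux per read port) and gives a rough feel for why the cost grows with registers times ports rather than with read ports times write ports.

```python
# Behavioural model of a flip-flop register file (illustrative only).
# NUM_REGS and the port layout are assumptions, not taken from BJX2.
NUM_REGS = 64


def ff_write(regs, writes):
    """Return the next-cycle register values.

    writes is an ordered list of (reg_id, value) pairs, one per write
    port; reg_id None means the port is idle. As in the quoted ?: chain,
    the first matching port wins, and no match holds the old value.
    """
    nxt = list(regs)
    for r in range(len(regs)):            # "repeated for every register"
        for reg_id, value in writes:      # "repeated for every write port"
            if reg_id == r:
                nxt[r] = value
                break                     # earlier port has priority
    return nxt


def ff_read(regs, read_ids):
    """One full-width mux per read port ("repeat for every read port")."""
    return [regs[i] for i in read_ids]


def mux_cost(num_regs, n_read, n_write):
    """Very rough mux-input counts (an assumed heuristic, not a LUT estimate)."""
    return {
        "write_mux_inputs": num_regs * n_write,
        "read_mux_inputs": num_regs * n_read,
    }
```

Under this crude heuristic the flip-flop cost is additive in the port counts (registers x (Nr + Nw)), while the cloned-array count is multiplicative (Nr x Nw), which is at least consistent with the transition point described in the quoted text.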

Re: Pipeline Registers

<sh3mnm$due$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20251&group=comp.arch#20251

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Pipeline Registers
Date: Sun, 5 Sep 2021 17:17:57 -0700
Organization: A noiseless patient Spider
Lines: 109
Message-ID: <sh3mnm$due$1@dont-email.me>
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com>
<sh35ap$351$1@dont-email.me>
<3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com>
<sh3kpr$48d$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 6 Sep 2021 00:17:58 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d83eddccc9aacde635becf09a7e9f2e1";
logging-data="14286"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/afJMTyQivbheO57+K8Rfx"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:3G+9YHrLkDnrV5n8zkn1uhzl7mg=
In-Reply-To: <sh3kpr$48d$1@dont-email.me>
Content-Language: en-US
 by: Ivan Godard - Mon, 6 Sep 2021 00:17 UTC

On 9/5/2021 4:43 PM, BGB wrote:
> On 9/5/2021 3:00 PM, MitchAlsup wrote:
>> On Sunday, September 5, 2021 at 2:21:00 PM UTC-5, BGB wrote:
>>> On 9/5/2021 12:13 PM, MitchAlsup wrote:
>>>> On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com
>>>> wrote:
>>>>> Not sure I can explain this adequately. It would be nice if one
>>>>> could say ‘Get this value from the previous pipeline result’ using
>>>>> a register designator. The idea is that the value does not need to
>>>>> be retrieved or sent to the register file. It exists only as a
>>>>> temporary value in a pipeline register.
>>>>> $P1 would be one pipeline stage back, $P2 would be two stages back.
>>>>>
>>>>> ADD $P0,$X3,$X4
>>>>> ADD $P0,$X5,$X6
>>>>> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
>>>>>
>>>>> There are many circumstances where the target of instruction #one
>>>>> is used as a source operand for instruction #2. But then is not
>>>>> otherwise needed. Depending on the lifetime of the value it may not
>>>>> need to be moved to the register file.
>>>> <
>>>> Realistically, It is only a few gates of delay harder to compare the
>>>> register number
>>>> in the pipeline to the register number in the instructions and
>>>> determine forwarding
>>>> indirectly. And this gets rid of the problem EricP mentions.
>>>>
>>> Yeah. Conventional forwarding works pretty well, and does not introduce
>>> semantics which are overly brittle (this would break on interrupts, if
>>> one tries to change the width of the machine, ...).
>>>
>>> Likewise, "having sufficient registers" means needing to store things
>>> into registers isn't usually too much of an issue.
>>>
>>>
>>> Granted, forwarding adds cost, but:
>>> This doesn't usually become too big of an issue until the core gets
>>> wider.
>>>
>>> Cost Multiplier = Nr*Nw*Ns; Where Nr=Number of Read Ports, Nw=Number of
>>> Write Ports, Ns=Number of Execute Stages.
>>>
>>> So:
>>> 3 x 1 x 3 = 9 forwarders (Scalar core with 3 EX stages).
>>> 6 x 3 x 3 = 54 forwarders (Current WEX-3W in BJX2).
>>> 12 x 6 x 3 = 216 forwarders (Possible WEX-6W)
>>>
>>> Which is part of why I have been having so many thoughts about using
>>> interlocks instead of forwarding for 6W, and implementing the 12R+6W
>>> register-file as essentially 2x 6R+4W (144 forwarders).
>>>
>>>
>>> There is also an Nr*Nw cost multiplier for register arrays:
>>> Arrays are cheapest at 1W, 3 clones;
>>> At 6R+3W, 18 clones;
>>> At 12R+6W, 72 clones.
>>> At 2x 6R+4W, 48 clones.
>> <
>> I have built 6R6W register files. This is about as big as one can
>> build and wire,
>> using replication for more read ports.
>
> OK.
>
> Something is going on in any case as far as LUT costs and similar are
> concerned. The whole idea is kinda rendered pointless if it ends up too
> expensive to fit on the XC7A100 which I am using.
>
>
> As can be noted, the "4th write port" would differ from the other 3:
> It would not be connected directly to the pipeline;
> It would instead behave more like a virtual port for the two "halves" of
> the register file to be able to signal values from one side to the other.
>
> Though, the actual mechanism for getting the value across is not yet
> determined. One possibility is that it triggers an interlock every time
> this happens, then copies registers from one RF to the other until no
> more crosses remain (possibly, the write-ports are paired across the
> RF's, allowing for up to 3 registers per cycle in each direction).
>
>
>
> It is like, one has a primitive which is like:
>   6-bit address, 2 bit data in;
>   6-bit address, 2 bit data out.
>
> Where, 32 of these can give a 64x 64b collection of registers, but only
> between a single source and a single destination.
>
> Then, each write port needs to drive its output to all connected
> arrays, and each read-port needs to pull its value from any of these
> arrays, ...
>
>
> Starts wondering if there is any sort of way to build a "crossbar"
> between the read and write ports.
>

It's tough - we have the same problem when looking at a 30X30 crossbar
on our high-end configs. Mitch's 6r6w limit is another example of the
same problem. Our solution (patented) was a cascaded crossbar
partitioned by operation latency, so that a much more tractable 8X30 is
used for one-cycle instructions, while the rest use a 24X8 and can take
longer without impacting anybody's latency but their own.

Ours works, but it's a throughput design. If you can come up with
something that does a better job on a latency design then you can smile
to the bank :-)
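
As a back-of-the-envelope check on the sizes mentioned above, the sketch below compares crosspoint counts for a flat 30X30 crossbar against the 8X30 plus 24X8 split. Treating cost as simply inputs times outputs is an assumption made here for illustration; it ignores wiring, fan-out, and timing, and says nothing about how the patented design is actually built.

```python
def crosspoints(inputs, outputs):
    """Crosspoint count of a full crossbar (assumed cost proxy)."""
    return inputs * outputs


flat = crosspoints(30, 30)        # single monolithic crossbar
fast = crosspoints(8, 30)         # one-cycle instructions
slow = crosspoints(24, 8)         # everything else, allowed to take longer
partitioned = fast + slow

print(f"flat: {flat}, partitioned: {partitioned} ({fast} + {slow})")
```

By this measure the partitioned arrangement needs 432 crosspoints instead of 900, and only the 240 in the fast part sit on the one-cycle path.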

Re: Pipeline Registers

<bd58b642-afd4-4caa-bf53-e65a4d768b16n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20252&group=comp.arch#20252

Newsgroups: comp.arch
X-Received: by 2002:ac8:7a98:: with SMTP id x24mr8837569qtr.265.1630887751574;
Sun, 05 Sep 2021 17:22:31 -0700 (PDT)
X-Received: by 2002:a05:6808:14c5:: with SMTP id f5mr6565858oiw.84.1630887751297;
Sun, 05 Sep 2021 17:22:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 5 Sep 2021 17:22:31 -0700 (PDT)
In-Reply-To: <sh3kpr$48d$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=136.50.182.0; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 136.50.182.0
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<d877475f-c12f-4946-a98d-0cda28e0d172n@googlegroups.com> <sh35ap$351$1@dont-email.me>
<3b1add5f-9e90-4aed-9ba7-94a092bc2397n@googlegroups.com> <sh3kpr$48d$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bd58b642-afd4-4caa-bf53-e65a4d768b16n@googlegroups.com>
Subject: Re: Pipeline Registers
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Mon, 06 Sep 2021 00:22:31 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 214
 by: JimBrakefield - Mon, 6 Sep 2021 00:22 UTC

On Sunday, September 5, 2021 at 6:45:02 PM UTC-5, BGB wrote:
> On 9/5/2021 3:00 PM, MitchAlsup wrote:
> > On Sunday, September 5, 2021 at 2:21:00 PM UTC-5, BGB wrote:
> >> On 9/5/2021 12:13 PM, MitchAlsup wrote:
> >>> On Sunday, September 5, 2021 at 8:00:07 AM UTC-5, robf...@gmail.com wrote:
> >>>> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result’ using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> >>>> $P1 would be one pipeline stage back, $P2 would be two stages back.
> >>>>
> >>>> ADD $P0,$X3,$X4
> >>>> ADD $P0,$X5,$X6
> >>>> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
> >>>>
> >>>> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
> >>> <
> >>> Realistically, It is only a few gates of delay harder to compare the register number
> >>> in the pipeline to the register number in the instructions and determine forwarding
> >>> indirectly. And this gets rid of the problem EricP mentions.
> >>>
> >> Yeah. Conventional forwarding works pretty well, and does not introduce
> >> semantics which are overly brittle (this would break on interrupts, if
> >> one tries to change the width of the machine, ...).
> >>
> >> Likewise, "having sufficient registers" means needing to store things
> >> into registers isn't usually too much of an issue.
> >>
> >>
> >> Granted, forwarding adds cost, but:
> >> This doesn't usually become too big of an issue until the core gets wider.
> >>
> >> Cost Multiplier = Nr*Nw*Ns; Where Nr=Number of Read Ports, Nw=Number of
> >> Write Ports, Ns=Number of Execute Stages.
> >>
> >> So:
> >> 3 x 1 x 3 = 9 forwarders (Scalar core with 3 EX stages).
> >> 6 x 3 x 3 = 54 forwarders (Current WEX-3W in BJX2).
> >> 12 x 6 x 3 = 216 forwarders (Possible WEX-6W)
> >>
> >> Which is part of why I have been having so many thoughts about using
> >> interlocks instead of forwarding for 6W, and implementing the 12R+6W
> >> register-file as essentially 2x 6R+4W (144 forwarders).
> >>
> >>
> >> There is also an Nr*Nw cost multiplier for register arrays:
> >> Arrays are cheapest at 1W, 3 clones;
> >> At 6R+3W, 18 clones;
> >> At 12R+6W, 72 clones.
> >> At 2x 6R+4W, 48 clones.
> > <
> > I have built 6R6W register files. This is about as big as one can build and wire,
> > using replication for more read ports.
> OK.
>
> Something is going on in any case as far as LUT costs and similar are
> concerned. The whole idea is kinda rendered pointless if it ends up too
> expensive to fit on the XC7A100 which I am using.
>
>
> As can be noted, the "4th write port" would differ from the other 3:
> It would not be connected directly to the pipeline;
> It would instead behave more like a virtual port for the two "halves" of
> the register file to be able to signal values from one side to the other.
>
> Though, the actual mechanism for getting the value across is not yet
> determined. One possibility is that it triggers an interlock every time
> this happens, then copies registers from one RF to the other until no
> more crosses remain (possibly, the write-ports are paired across the
> RF's, allowing for up to 3 registers per cycle in each direction).
>
>
>
> It is like, one has a primitive which is like:
> 6-bit address, 2 bit data in;
> 6-bit address, 2 bit data out.
>
> Where, 32 of these can give a 64x 64b collection of registers, but only
> between a single source and a single destination.
>
> Then, each write port needs to drive its output to all connected
> arrays, and each read-port needs to pull its value from any of these
> arrays, ...
>
>
> Starts wondering if there is any sort of way to build a "crossbar"
> between the read and write ports.
> >>
> >>
> >> Between the 3-wide and 6-wide cases in this scenario, there is also a
> >> transition where it seems to become cheaper to implement the registers
> >> as state machines and flip-flop registers than to use arrays (however,
> >> trying to compare these cases using simplistic cost metrics falls on its
> >> face).
> > <
> > Somewhere in the 3W-6W arena is the transition where the vast majority
> > of operands arrive from the file versus where the vast majority of operands
> > arrive from the forwarding logic.
> Fair enough. Though, in this case, falling back to interlocks for 6W
> bundles if a register crosses an "interleave boundary or similar" isn't
> too bad of a tradeoff.
>
> Though it would mean that 6W code would need to be written in a way to
> minimize the number of crosses.
>
>
>
> Meanwhile, with simple cost-heuristics, it seems like the cloned arrays
> should always win over flip-flops, but in past experiments, there does
> seem to be a transition point, as the cost of flip-flop registers seems
> to grow at a slower rate relative to the number of ports (but is more
> strongly affected by the number of registers).
>
>
> E.g., in the flip-flop case, the logic is more like (pseudocode):
> // Write side: one priority mux per register; no match holds the old value.
> regGpr2 <=
> (regIdRn1A==GPR_R2) ? regValRn1A :
> (regIdRn1B==GPR_R2) ? regValRn1B :
> (regIdRn2A==GPR_R2) ? regValRn2A :
> (regIdRn2B==GPR_R2) ? regValRn2B :
> ... (Repeated for every write port )
> regGpr2;
> ... (Repeated for every register ) ...
>
> // Read side: one full-width mux per read port.
> case(regIdRsA)
> GPR_R0: tRegValRsA = regGpr0;
> GPR_R1: tRegValRsA = regGpr1;
> GPR_R2: tRegValRsA = regGpr2;
> ... (Repeated for every register )
> endcase
> case(regIdRtA)
> GPR_R0: tRegValRtA = regGpr0;
> ...
> endcase
> ... (Repeat for every read port) ...
>
>
> Don't necessarily want to go this way if it can be avoided.
> >>
> >> Still some internal debate as to whether or not WEX-6W is even viable
> >> (or if I would be better off just trying to find ways to make dual-core
> >> cheaper, or cost-effectively implement SMT).
> >>
> >> ...
> >>
> >>
> >> However, explicit numbering would not help here, it would likely
> >> actually make these sorts of problems significantly worse.
> > <
> > The only way explicit numbering would work is if you do it in such a way that
> > the numbers one uses today hold for eternity.
> >
> It would mean that any code which uses it would be tied to the specific
> width and organization of the pipeline, and thus not binary compatible
> with a processor using a different pipeline width, ...
>
> This would kinda suck.
>
> Also, as noted, it could not help with these sorts of issues, as the cost
> isn't so much figuring out where the value is forwarded from, but rather
> the cost of getting the value moved from point A to B (within a single
> clock cycle).

Ugh, you should be able to infer LUT RAM (or use the LUT RAM memory generator) to support multiple read ports per write port,
instead of using bare FFs, which drive the LUT count out of sight!
Each additional read port requires an additional set of LUTs (64x1 or 32x2 RAM each).
See Xilinx UG474, "7 Series FPGAs Configurable Logic Block", page 23; Altera/Intel is somewhat less capable?
Double clocking (running the LUT RAM at twice the clock rate of the rest of the system) should also reduce the LUT count.
However, I have not done this myself, so it is "theoretical".
For true multiple write ports, see: Efficient Multi-Ported Memories for FPGAs, Charles LaForest, 2009 & 2014.
When faced with the need for multiple write ports, I tend to use "shadow" registers, e.g. for the PC, status register, etc.,
which look like they are part of the register file but are instead mux'd onto the LUT-RAM output.
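
To make the replication idea concrete, here is a small Python behavioural sketch of a multi-ported register file built from one-write, one-read banks. The multi-write part uses a "live value table" (LVT) that remembers which write port last touched each register. That this is the scheme intended by the reference above is an assumption, and the class and method names are invented for illustration. The bank count comes out as Nr x Nw, which matches the clone counts quoted earlier in the thread (18 for 6R+3W, 72 for 12R+6W, 48 for 2x 6R+4W).

```python
class LvtRegFile:
    """Multi-ported register file from simple banks plus a live value table.

    banks[w][r] is a copy that is written only by write port w and read
    only by read port r, so each bank needs just one write and one read
    port, which is what a LUT RAM provides.
    """

    def __init__(self, num_regs, n_read, n_write):
        self.n_read = n_read
        self.n_write = n_write
        self.banks = [
            [[0] * num_regs for _ in range(n_read)] for _ in range(n_write)
        ]
        # lvt[reg] = index of the write port that wrote reg most recently.
        # In hardware this small table would itself need many ports and
        # is assumed here to be built from ordinary logic.
        self.lvt = [0] * num_regs

    @property
    def bank_count(self):
        return self.n_read * self.n_write

    def write(self, port, reg, value):
        for copy in self.banks[port]:     # update every read-side replica
            copy[reg] = value
        self.lvt[reg] = port

    def read(self, port, reg):
        live = self.lvt[reg]              # pick the bank row holding the live value
        return self.banks[live][port][reg]
```

A single-write file is the degenerate case: with n_write = 1 the LVT is always zero and the structure reduces to one replica per read port.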

Re: Pipeline Registers

<v72_I.37491$md6.21426@fx36.iad>

https://www.novabbs.com/devel/article-flat.php?id=20343&group=comp.arch#20343

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!4.us.feeder.erje.net!2.eu.feeder.erje.net!feeder.erje.net!feeder1.feed.usenet.farm!feed.usenet.farm!newsfeed.xs4all.nl!newsfeed7.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer01.ams1!peer.ams1.xlned.com!news.xlned.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx36.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Pipeline Registers
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
In-Reply-To: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Lines: 56
Message-ID: <v72_I.37491$md6.21426@fx36.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Wed, 08 Sep 2021 12:36:43 UTC
Date: Wed, 08 Sep 2021 08:35:58 -0400
X-Received-Bytes: 3483
 by: EricP - Wed, 8 Sep 2021 12:35 UTC

robf...@gmail.com wrote:
> Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result’ using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> $P1 would be one pipeline stage back, $P2 would be two stages back.
>
> ADD $P0,$X3,$X4
> ADD $P0,$X5,$X6
> ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
>
> There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.

I'll mention this as well as it was something unexpected I encountered
as a consequence of a similar design approach I thought of.

I had thought, if necessary, I could add complex instructions to my
RISC design that for macro-op instructions have the decoder emit
a sequence of uOps which used pseudo-registers (PsR) to communicate
between themselves. Pseudo-register numbers were physical register numbers
for which no actual register exists.
PsR like the above existed only in the forwarding network.

Interrupts are not a problem for this as they are not recognized
inside macro-op sequences anyway.

The problem which I did not foresee occurs with scheduling and issuing.

i1 ADD $P0,$X3,$X4
i2 ADD $P1,$X5,$X6
i3 ADD $X7,$P1,$P0 ; $x7 = sum of $x3 to $x6

The problem is that while i2 is dataflow dependent on i1 through $P0,
because $P0 is ephemeral and exists only for an instant then
i1 is _anti-dependent_ on i2 being queued and ready to launch
so that $P0 does not get missed and lost.

In effect there is a ReadyToReceive-> signal from i1 to i2,
and an <-Ack from i2 to i1 before i1 can launch.

What if i2 gets stuck in the front end because some resource like a
queue is full? What if i2 has other dependencies that prevent it from
launching, how do I propagate this NotReady signal backwards to i1
which can be in a different function unit. It gets worse, because
this anti-dependence is transitive across multiple uOps.

It also creates the possibility of deadlock.
In the above example, i1 and i2 are waiting in the ALU reservation
station ready to launch but can't because i3 is not issued and ready.
i3 can't issue from the front end because all the ALU reservation
stations are full.

There are probably solutions to all of these.
My point is what looked like a cheap-o solution to adding complex
instructions to a RISC design turns out to be anything but that.

The easiest solution was to dump pseudo-registers and always use real ones.
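
The deadlock described above can be reproduced with a toy model. The Python sketch below is a deliberate simplification, not a description of any real scheduler: it assumes a single reservation station of fixed capacity, in-order issue from the front end, and the rule that a producer of a pseudo-register value may launch only once every consumer has been issued into the station. The function name and the tuple layout are invented for illustration.

```python
def simulate(uops, capacity, max_cycles=32):
    """Return 'completed' or 'deadlock'.

    uops is a program-ordered list of (name, producers, consumers), where
    producers/consumers name the uops linked through pseudo-registers.
    """
    front_end = list(uops)   # not yet issued
    station = []             # issued, waiting to launch
    launched = set()
    for _ in range(max_cycles):
        progressed = False
        # In-order issue while the station has room.
        while front_end and len(station) < capacity:
            station.append(front_end.pop(0))
            progressed = True
        issued = {name for name, _, _ in station} | launched
        for uop in list(station):
            name, producers, consumers = uop
            sources_ready = all(p in launched for p in producers)
            receivers_ready = all(c in issued for c in consumers)
            if sources_ready and receivers_ready:
                station.remove(uop)
                launched.add(name)
                progressed = True
        if not front_end and not station:
            return "completed"
        if not progressed:
            return "deadlock"
    return "deadlock"


# The three-uop sequence from the post: i3 consumes the results of i1 and i2.
MACRO = [
    ("i1", [], ["i3"]),
    ("i2", [], ["i3"]),
    ("i3", ["i1", "i2"], []),
]

print(simulate(MACRO, capacity=3))
print(simulate(MACRO, capacity=2))
```

With room for all three uops the sequence drains; with only two entries, i1 and i2 fill the station and wait for i3, which can never issue.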

Re: Pipeline Registers

<dc35eb75-01cd-4e54-a7d8-830d5c7b432an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20348&group=comp.arch#20348

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:11ab:: with SMTP id c11mr4224496qkk.169.1631116771555;
Wed, 08 Sep 2021 08:59:31 -0700 (PDT)
X-Received: by 2002:a9d:450b:: with SMTP id w11mr4035594ote.254.1631116771331;
Wed, 08 Sep 2021 08:59:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 8 Sep 2021 08:59:31 -0700 (PDT)
In-Reply-To: <v72_I.37491$md6.21426@fx36.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com> <v72_I.37491$md6.21426@fx36.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <dc35eb75-01cd-4e54-a7d8-830d5c7b432an@googlegroups.com>
Subject: Re: Pipeline Registers
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 08 Sep 2021 15:59:31 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 73
 by: MitchAlsup - Wed, 8 Sep 2021 15:59 UTC

On Wednesday, September 8, 2021 at 7:36:45 AM UTC-5, EricP wrote:
> robf...@gmail.com wrote:
> > Not sure I can explain this adequately. It would be nice if one could say ‘Get this value from the previous pipeline result’ using a register designator. The idea is that the value does not need to be retrieved or sent to the register file. It exists only as a temporary value in a pipeline register.
> > $P1 would be one pipeline stage back, $P2 would be two stages back.
> >
> > ADD $P0,$X3,$X4
> > ADD $P0,$X5,$X6
> > ADD $X7,$P1,$P2 ; $x7 = sum of $x3 to $x6
> >
> > There are many circumstances where the target of instruction #one is used as a source operand for instruction #2. But then is not otherwise needed. Depending on the lifetime of the value it may not need to be moved to the register file.
> I'll mention this as well as it was something unexpected I encountered
> as a consequence of a similar design approach I thought of.
>
> I had thought, if necessary, I could add complex instructions to my
> RISC design that for macro-op instructions have the decoder emit
> a sequence of uOps which used pseudo-registers (PsR) to communicate
> between themselves. Pseudo-register numbers were physical register numbers
> for which no actual register exists.
> PsR like the above existed only in the forwarding network.
>
> Interrupts are not a problem for this as they are not recognized
> inside macro-op sequences anyway.
>
> The problem which I did not foresee occurs with scheduling and issuing.
>
> i1 ADD $P0,$X3,$X4
> i2 ADD $P1,$X5,$X6
> i3 ADD $X7,$P1,$P0 ; $x7 = sum of $x3 to $x6
>
> The problem is that while i2 is dataflow dependent on i1 through $P0,
> because $P0 is ephemeral and exists only for an instant then
> i1 is _anti-dependent_ on i2 being queued and ready to launch
> so that $P0 does not get missed and lost.
>
> In effect there is a ReadyToReceive-> signal from i1 to i2,
> and an <-Ack from i2 to i1 before i1 can launch.
<
OR you can capture $P0 in the S1 operand latch of i3.
Reservation stations make this problem moot.
>
> What if i2 gets stuck in the front end because some resource like a
> queue is full? What if i2 has other dependencies that prevent it from
> launching, how do I propagate this NotReady signal backwards to i1
> which can be in a different function unit. It gets worse, because
> this anti-dependence is transitive across multiple uOps.
<
You don't start a macro-op unless you have a minimum number of station
entries left. And the Decoder is probably in a position to know how many
of which are needed.
>
> It also creates the possibility of deadlock.
> In the above example, i1 and i2 are waiting in the ALU reservation
> station ready to launch but can't because i3 is not issued and ready.
> i3 can't issue from the front end because all the ALU reservation
> stations are full.
>
> There are probably solutions to all of these.
> My point is what looked like a cheap-o solution to adding complex
> instructions to a RISC design turns out to be anything but that.
<
This is one reason I built transcendental instructions INTO function
units and not as micro-ops.
>
> The easiest solution was to dump pseudo-registers and always use real ones.
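
The "don't start a macro-op unless you have enough station entries" rule can be sketched as an all-or-nothing reservation at decode. The Python below is only an illustration of that idea: the station kinds ("alu", "mem"), the dictionaries, and both function names are assumptions, not part of any real decoder.

```python
def can_start_macro_op(free_entries, needed):
    """True if every station kind has at least the entries the macro-op needs."""
    return all(free_entries.get(kind, 0) >= count for kind, count in needed.items())


def try_start_macro_op(free_entries, needed):
    """Reserve all required entries at once, or reserve nothing.

    Reserving atomically means a macro-op never ends up half issued, so
    its early uops cannot be left waiting on a consumer that has no slot.
    """
    if not can_start_macro_op(free_entries, needed):
        return False
    for kind, count in needed.items():
        free_entries[kind] -= count
    return True


free = {"alu": 2, "mem": 1}
three_adds = {"alu": 3}            # e.g. the i1/i2/i3 sequence quoted above
print(try_start_macro_op(free, three_adds))   # not enough ALU entries yet
free["alu"] += 1                               # an older uop launches
print(try_start_macro_op(free, three_adds))
```

The decoder stalls the whole macro-op until the check passes, which trades a little front-end latency for never entering the deadlocked state.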

Re: Pipeline Registers

<shb8in$qr5$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20364&group=comp.arch#20364

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Pipeline Registers
Date: Wed, 8 Sep 2021 14:05:27 -0700
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <shb8in$qr5$1@dont-email.me>
References: <1b6cd7ba-d827-407c-a4f5-6c13266d5984n@googlegroups.com>
<v72_I.37491$md6.21426@fx36.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 8 Sep 2021 21:05:27 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7eb767440171a710e666912332033e6e";
logging-data="27493"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/bn8537JG3laVcLVuHeyGI"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:d+zjtp7GcHm0N43hXlUQ311x9e8=
In-Reply-To: <v72_I.37491$md6.21426@fx36.iad>
Content-Language: en-US
 by: Ivan Godard - Wed, 8 Sep 2021 21:05 UTC

On 9/8/2021 5:35 AM, EricP wrote:
> robf...@gmail.com wrote:
>> Not sure I can explain this adequately. It would be nice if one could
>> say ‘Get this value from the previous pipeline result’ using a
>> register designator. The idea is that the value does not need to be
>> retrieved or sent to the register file. It exists only as a temporary
>> value in a pipeline register.
>> $P1 would be one pipeline stage back, $P2 would be two stages back.
>>
>> ADD    $P0,$X3,$X4
>> ADD    $P0,$X5,$X6
>> ADD    $X7,$P1,$P2        ; $x7 = sum of $x3 to $x6
>>
>> There are many circumstances where the target of instruction #one is
>> used as a source operand for instruction #2. But then is not otherwise
>> needed. Depending on the lifetime of the value it may not need to be
>> moved to the register file.
>
> I'll mention this as well as it was something unexpected I encountered
> as a consequence of a similar design approach I thought of.
>
> I had thought, if necessary, I could add complex instructions to my
> RISC design that for macro-op instructions have the decoder emit
> a sequence of uOps which used pseudo-registers (PsR) to communicate
> between themselves. Pseudo-register numbers were physical register numbers
> for which no actual register exists.
> PsR like the above existed only in the forwarding network.
>
> Interrupts are not a problem for this as they are not recognized
> inside macro-op sequences anyway.
>
> The problem which I did not foresee occurs with scheduling and issuing.
>
> i1 ADD    $P0,$X3,$X4
> i2 ADD    $P1,$X5,$X6
> i3 ADD    $X7,$P1,$P0        ; $x7 = sum of $x3 to $x6
>
> The problem is that while i2 is dataflow dependent on i1 through $P0,
> because $P0 is ephemeral and exists only for an instant then
> i1 is _anti-dependent_ on i2 being queued and ready to launch
> so that $P0 does not get missed and lost.
>
> In effect there is a ReadyToReceive-> signal from i1 to i2,
> and an <-Ack from i2 to i1 before i1 can launch.
>
> What if i2 gets stuck in the front end because some resource like a
> queue is full? What if i2 has other dependencies that prevent it from
> launching, how do I propagate this NotReady signal backwards to i1
> which can be in a different function unit. It gets worse, because
> this anti-dependence is transitive across multiple uOps.
>
> It also creates the possibility of deadlock.
> In the above example, i1 and i2 are waiting in the ALU reservation
> station ready to launch but can't because i3 is not issued and ready.
> i3 can't issue from the front end because all the ALU reservation
> stations are full.
>
> There are probably solutions to all of these.
> My point is what looked like a cheap-o solution to adding complex
> instructions to a RISC design turns out to be anything but that.
>
> The easiest solution was to dump pseudo-registers and always use real ones.

Or to use static scheduling, so the value is always just present in the
bypass when the consumer goes looking. No queues, no stalls, a firehose.
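
To see why a fixed schedule makes the $Pn idea well defined, here is a tiny Python interpreter for the three-instruction example from the original post. It assumes strictly in-order execution, one result per instruction, and no stalls, so "$P1" always means the previous instruction's result and "$P2" the one before that. Treating $P0 as a "pipeline only" destination is an interpretation of the example, and run() is an invented helper, not any real ISA's semantics.

```python
def run(program, regs):
    """Execute (dst, src_a, src_b) ADDs in order, one result per stage."""
    history = []  # results of earlier instructions, newest last

    def value(name):
        if name.startswith("$P"):
            # $P1 = previous result, $P2 = the one before that.
            return history[-int(name[2:])]
        return regs[name]

    for dst, a, b in program:
        result = value(a) + value(b)
        history.append(result)
        if not dst.startswith("$P"):   # $P0 destination: never written to the file
            regs[dst] = result
    return regs


regs = {"$X3": 1, "$X4": 2, "$X5": 3, "$X6": 4}
program = [
    ("$P0", "$X3", "$X4"),
    ("$P0", "$X5", "$X6"),
    ("$X7", "$P1", "$P2"),   # $X7 = sum of $X3 to $X6
]
run(program, regs)
print(regs["$X7"])
```

Any stall, interrupt, or change in issue width would shift what "one stage back" refers to, which is the brittleness raised earlier in the thread; with a static schedule that shift is ruled out by construction.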
