
Subject (Author)
* Operand forwarding: complexity and limits (Jonathan Brandmeyer)
+* Re: Operand forwarding: complexity and limits (MitchAlsup)
|+* Re: Operand forwarding: complexity and limits (EricP)
||+- Re: Operand forwarding: complexity and limits (MitchAlsup)
||`* Re: Operand forwarding: complexity and limits (Ivan Godard)
|| `- Re: Operand forwarding: complexity and limits (EricP)
|`* Re: Operand forwarding: complexity and limits (Jonathan Brandmeyer)
| +- Re: Operand forwarding: complexity and limits (MitchAlsup)
| `- Re: Operand forwarding: complexity and limits (MitchAlsup)
+- Re: Operand forwarding: complexity and limits (Ivan Godard)
`* Re: Operand forwarding: complexity and limits (Anton Ertl)
 +* Re: Operand forwarding: complexity and limits (Jonathan Brandmeyer)
 |`- Re: Operand forwarding: complexity and limits (MitchAlsup)
 `- Re: Operand forwarding: complexity and limits (Jonathan Brandmeyer)

Subject: Operand forwarding: complexity and limits
From: Jonathan Brandmeyer
Newsgroups: comp.arch
Date: Mon, 7 Sep 2020 19:03 UTC
How complete is a typical superscalar processor's forwarding network?  Every once in a while you can see some forwarding nonuniformity creep into the public record.  A64FX is one example (they just tell you).  Agner Fog frequently observes an extra cycle when switching operand data types from producer to consumer in carefully written benchmarks.  I'd like to understand more about the underlying tradeoffs being made in hardware that lead to different design points.

Assumptions:

- two cycles to read input operands
- one cycle to execute (example is for simple operations only: please exclude floating-point, integer multiply, etc for this discussion).
- one cycle to write back to physical register file
- two independent functional units

Different design points may give wildly different values for those numbers, but I want to anchor the discussion concretely.

Without operand forwarding, the dependent operation latency is four cycles.  What are the practical tradeoffs as forwarding gets more aggressive?

- Both functional units forward to each other, or only to themselves?
- Forwarding all three cycles behind, or just directly dependent instructions?
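
A minimal sketch of the latency question under the assumptions listed above. It is a toy model, not a description of any real pipeline: `bypass_distances` (a hypothetical name) is the set of producer-to-consumer distances, in cycles, that have a forwarding path, and anything at distance 4 or more is assumed to be readable from the register file.

```python
def dependent_latency(wanted, bypass_distances, rf_latency=4):
    """Smallest producer-to-consumer distance >= wanted at which the
    operand is obtainable, either from a bypass path or from the
    register file (assumed available from rf_latency cycles onward)."""
    d = wanted
    while d < rf_latency and d not in bypass_distances:
        d += 1  # no path at this distance: the consumer stalls a cycle
    return d

print(dependent_latency(1, set()))       # no forwarding at all
print(dependent_latency(1, {1, 2, 3}))   # forwarding all three cycles behind
print(dependent_latency(2, {1}))         # only directly dependent ops forwarded
```

In this model, forwarding only the directly dependent case gives back-to-back execution, but a consumer that misses that one slot falls all the way back to the register-file latency.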

My mental model of forwarding is that there is a mux tacked onto the head of the functional unit that selects between the register-read operand flip-flop and any of the other forwarding points.  Just how wide of a mux can a typical pipeline tolerate in that position?

I note that there is basically a back-to-back mux here: One muxes down the N different functional circuits to produce the functional unit's result, and one muxes down again from M bypass paths to the input operand.  Are there any "standard tricks" for optimizing those two muxes together?

At what point does fan-out become a problem?  For example, a pair of FMA pipes operating together have 6 input operands between them.  At what point does the fan-out become a routing hassle with the fan-in?  Is 2x three-input functional units getting close to the limit, or can you keep going?

If the Mill's wide and deep belt is little more than a glorified forwarding network, then the answer seems to be: Forwarding has a regular circuit geometry with an upper bound high enough that Something Else becomes a limit first.  Is that true, or do wide implementations of Mill need some kind of belt partitioning past a certain point?

Thanks,
- Jonathan Brandmeyer


Subject: Re: Operand forwarding: complexity and limits
From: MitchAlsup
Newsgroups: comp.arch
Date: Mon, 7 Sep 2020 19:51 UTC
References: 1
On Monday, September 7, 2020 at 2:03:45 PM UTC-5, Jonathan Brandmeyer wrote:
How complete is a typical superscalar processor's forwarding network?

All of the RISC machines I worked on had complete forwarding. When all of
your results and all of your operands are of the same size, it is easy.

The x86 machines had "special problems" that were solved by waiting for the result to arrive in the register file.

Every once in a while you can see some forwarding nonuniformity creep into the public record.  A64FX is one example (they just tell you).  Agner Fog frequently observes an extra cycle when switching operand data types from producer to consumer in carefully written benchmarks.  I'd like to understand more about the underlying tradeoffs being made in hardware that lead to different design points.

Assumptions:

- two cycles to read input operands
- one cycle to execute (example is for simple operations only: please exclude floating-point, integer multiply, etc for this discussion).
- one cycle to write back to physical register file
- two independent functional units

Different design points may give wildly different values for those numbers, but I want to anchor the discussion concretely.

Without operand forwarding, the dependent operation latency is four cycles.  What are the practical tradeoffs as forwarding gets more aggressive?

In RISC machines where the size of the result is the size of the operand,
it is easy to detect that the register you want is arriving (2-gates
plus Fan-In) and then drive the forwarding multiplexer (3-gates of buffering
inverters and a data-path height of wire delay).

In x86 it is difficult to forward AX into RAX, and vice versa. In general
the pipeline keeps track of what it is writing {AX, AL, AH, EAX, RAX}
which causes equality detection to take another gate (now 3-gates).

Should anyone try to forward AX to EAX or RAX you are going to have to
perform a data merge in the forwarding multiplexers which will add to
the logic delay, and add wire width to the select lines across the
data path as each section of the forwarding logic needs to be independent.
For example, an ADD instruction is waiting for AX and one function unit
delivers AL and another one AH. At some point the designers will say
no to the complexity.
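
A rough sketch of the partial-register case described above, assuming each register name is reduced to a mask of byte lanes. The `LANES` table and the three outcome labels are illustrative names, not anything from a real design, and the sketch ignores the x86-64 rule that a 32-bit write zero-extends into the full 64-bit register.

```python
# Byte-lane masks within one 64-bit register (bit i = byte i).
LANES = {"AL": 0x01, "AH": 0x02, "AX": 0x03, "EAX": 0x0F, "RAX": 0xFF}

def forwarding_case(consumer, producers):
    """Classify how a consumer could get its operand from in-flight results."""
    need = LANES[consumer]
    # Easy case: one in-flight result covers every byte the consumer needs.
    for p in producers:
        if LANES[p] & need == need:
            return "single"
    # Harder case: several results together cover it, so the forwarding
    # multiplexer would have to merge byte lanes from different buses.
    covered = 0
    for p in producers:
        covered |= LANES[p]
    if covered & need == need:
        return "merge"
    # Otherwise some bytes only exist in the register file.
    return "wait"

print(forwarding_case("AX", ["AL", "AH"]))  # the ADD-waiting-for-AX example
print(forwarding_case("AX", ["RAX"]))
print(forwarding_case("RAX", ["AX"]))
```

The "merge" outcome is the one that needs independent select lines per section of the data path; a design that declines that complexity would treat it like "wait".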

In lesser pipelined machines, register file read and forwarding is performed
in the same cycle; with RF read taking 2/3rds of the cycle and forwarding
the last 1/3rd. In more heavily pipelined machines, RF takes an entire cycle
(or more) and forwarding is often allotted a 1/2 cycle.

- Both functional units forward to each other, or only to themselves?

In reservation station machines no function unit forwards to another function
unit, any result is captured by the station, and if this result satisfies
the operands needed, the instruction becomes a candidate for launch.
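
The capture-and-wake-up behaviour described above can be sketched as follows. This is a software caricature with hypothetical names (`Station`, `broadcast`); a real station does the tag compare in parallel hardware, not in a loop.

```python
class Station:
    """One reservation station entry waiting on tagged source operands."""
    def __init__(self, op, sources):
        self.op = op
        # tag -> captured value (None while the result has not arrived)
        self.operands = {tag: None for tag in sources}

    def capture(self, tag, value):
        # Capture the result only if this entry is waiting on that tag.
        if tag in self.operands and self.operands[tag] is None:
            self.operands[tag] = value

    def ready(self):
        return all(v is not None for v in self.operands.values())

def broadcast(stations, tag, value):
    """Deliver one result to every station; return ops now ready to launch."""
    for s in stations:
        s.capture(tag, value)
    return [s.op for s in stations if s.ready()]

stations = [Station("add", ["p1", "p2"]), Station("sub", ["p2"])]
print(broadcast(stations, "p2", 7))
print(broadcast(stations, "p1", 3))
```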

- Forwarding all three cycles behind, or just directly dependent instructions?

With a 1 cycle function unit, the only case I ever had that did not have
complete forwarding was the 8-gates per cycle K9. While the adder itself
could perform an ADD each cycle; you did not have the time to drive the
result onto a bus and through a forwarding multiplexer and make setup
time to the adder input flip-flops. You would sit around hoping some OTHER
instruction could keep the FU busy in the cycle you were using to route data
around.

My mental model of forwarding is that there is a mux tacked onto the head of the functional unit that selects between the register-read operand flip-flop and any of the other forwarding points.  Just how wide of a mux can a typical pipeline tolerate in that position?

A 4-wide multiplexer is 1-gate of delay (AOI); so it is not the problem of
choosing which input, the problem is delivering each result to where it
can be chosen as an input. That is--the wire load (capacitance) and delay
(L-R-C) of the wire carrying the result.

If you could get 15-results and 1 RF read to every FU input, the forwarding
multiplexer data delay would be only 2-gates. {It would have a massively
dense set of buses, but that part can be done.}
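
As a quick check of the gate counts above, here is a small sketch that assumes the multiplexer is built as a tree of 4-input stages at one gate delay each. The function name is made up for illustration, and the model deliberately ignores the wire delay that the text says is the real problem.

```python
def mux_gate_levels(inputs, fan_in=4):
    """Number of 4-input mux stages needed to select one of `inputs`."""
    levels = 0
    capacity = 1
    while capacity < inputs:
        capacity *= fan_in  # each extra stage multiplies reachable inputs
        levels += 1
    return levels

print(mux_gate_levels(4))       # a single 4-wide multiplexer
print(mux_gate_levels(15 + 1))  # 15 results plus 1 RF read
```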

Another part of the analysis is "how wide is the data path" ?

A 1-wide machine needs about 14-wires per bit to be laid out regularly
{I know this from having built a 13-wires per bit data path--but I digress}

For every unit of scalarity you need to add 4-wires to this (2.3 operand
wires and 1 result wire and 0.7 because you can't predict everything
up front.) So a 6-wide machine would need 14+5*4 = 34 wires per bit.
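
The rule of thumb above written out as a one-line function, using only the figures quoted (14 wires for a 1-wide machine, 4 more per additional unit of scalarity). The parameter names are illustrative.

```python
def wires_per_bit(issue_width, base=14, per_extra_lane=4):
    """Estimated data-path wires per bit for a machine of the given width."""
    return base + (issue_width - 1) * per_extra_lane

for width in (1, 2, 4, 6):
    print(width, wires_per_bit(width))
```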

And then finally: you need to understand the relationship between the
frequency of operation f and the number of wire pitches that equals
1/2 clock and 1 clock. For the 64-bit K9 at 8-logic gates per cycle
the height of the data path could be traversed by a single well routed
wire in 0.9 clocks--just jumping over the data path with no loads was
a full clock cycle!

So (the big SO) driving the select lines of the forwarding multiplexer
took place the cycle before the data was placed on the result bus !
Which meant the tags to be matched were driven out 3-cycles before
result was delivered !

But let us return to more sane design space:: in current technologies
one basically has the choice of 20-gates per cycle, 16-gates per
cycle, and 12-gates per cycle. At 20-gates per cycle, you can access
the SRAMs twice per cycle, whereas at 16-gates and 12-gates, you cannot.
This interacts with your cache and register designs. Which then interacts
with your data path design and finally the forwarding ends up dependent
on all these other choices.

For most applications one can get a 20-gate machine to be as efficient
per unit wall-clock time as a 16-gate machine with only 2/3rds the
number of pipeline stages. {vector applications are the exception}
The 16-gate and 12-gate machines are trading efficiency for frequency.
In particular, the 12-gate machine is even willing to trade off some
forwarding efficiency for higher f.

I note that there is basically a back-to-back mux here: One muxes down the N different functional circuits to produce the functional unit's result, and one muxes down again from M bypass paths to the input operand.  Are there any "standard tricks" for optimizing those two muxes together?

There used to be, but they took T-gates and N-only multiplexers away
in the 250nm generation.

The std way of counting is to assume 4×2 AOI multiplexers: 4 inputs = 1 gate of delay, 16 inputs = 2 gates of delay (if you can route all the
data into the multiplexer--which quickly becomes the harder problem.)

At what point does fan-out become a problem?  For example, a pair of FMA pipes operating together have 6 input operands between them.  At what point does the fan-out become a routing hassle with the fan-in?  Is 2x three-input functional units getting close to the limit, or can you keep going?

Fan-Out is a measure of how many consumers of a gate's output there are.

But let's say we have a typical design with a pair of FMAs (3-operands)
and a std quantum of ALUs (2-operand) and a std quantum of cache ports
(3-operand the way I build them, 2-operand is generally not too bad).
To support the 2 FMAs, you are going to need at least 4 (and likely 6) AGEN
units, and 4 ALUs, and 2 branch units (1-operand typically, 2-operand
occasionally). So, with 4 AGEN units, we have 2×3 + 4×3 + 4×2 + 2×2 = 30 operand buses !!
from 10 result buses !

So the forwarding multiplexer is buried under 40+(RF read) sets of wires !
{and some wonder why they have 10+ layers of metal.....}
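
A small tally of the bus counts above. Two assumptions are made here so that the totals come out to the quoted 30 operand buses and 10 result buses: the lower AGEN count of 4 is used, and branch units are assumed to drive no result bus. With 6 AGEN units the same sum gives 36 operand buses.

```python
# unit name -> (count, operands per unit, drives a result bus?)
# The AGEN count of 4 and the "branch units produce no result" flag are
# assumptions chosen to match the totals quoted in the text.
UNITS = {
    "FMA":    (2, 3, True),
    "AGEN":   (4, 3, True),
    "ALU":    (4, 2, True),
    "BRANCH": (2, 2, False),
}

def bus_counts(units):
    operand_buses = sum(n * ops for n, ops, _ in units.values())
    result_buses = sum(n for n, _, has_result in units.values() if has_result)
    return operand_buses, result_buses

operands, results = bus_counts(UNITS)
print(operands, results, operands + results)
```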

If the Mill's wide and deep belt is little more than a glorified forwarding network, then the answer seems to be: Forwarding has a regular circuit geometry with an upper bound high enough that Something Else becomes a limit first.  Is that true, or do wide implementations of Mill need some kind of belt partitioning past a certain point?

It already embodies the notion of a PHI operator and is directly programmed
rather than being detected by register encodings. (IIRC)

Thanks,
- Jonathan Brandmeyer



Subject: Re: Operand forwarding: complexity and limits
From: Ivan Godard
Newsgroups: comp.arch
Organization: A noiseless patient Spider
Date: Mon, 7 Sep 2020 20:01 UTC
References: 1

The Mill already partitions the forwarding; the 20- to 40-way fan-in would otherwise require so many mux stages as to impact critical path. The split is between one-cycle results and multi-cycle results. As there can be at most one one-cycle result per slot, the one-cycle fan-in equals the number of slots (exu-side only; there are no flow-side one-cycle ops). This fan-in (4 to 12 in our current configs) is tractable without clock trouble.

Multi-cycle results go through a mux cascade whose added latency is pushed into the execution latency of the instruction itself. If the natural execution latency is right at the cycle boundary then the added muxing can push the execution into one more cycle - from a natural 3-cycle to a with-muxing 4-cycle for example. The effect does not seem significant in our measurements, because multicycle ops are rare in open code, and in loops the added latency disappears into the software pipeline.
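
To illustrate how added muxing can tip an operation into one more cycle, here is a sketch with invented numbers. The gate counts and the 16-gates-per-cycle clock are hypothetical and are not Mill figures; only the ceiling arithmetic is the point.

```python
def cycles_needed(logic_gates, mux_gates, gates_per_cycle):
    """Cycles for an op whose result must also pass through the mux cascade."""
    total = logic_gates + mux_gates
    return -(-total // gates_per_cycle)  # integer ceiling division

# An op sitting just under the 3-cycle boundary (hypothetical values).
print(cycles_needed(47, 0, 16))  # natural latency
print(cycles_needed(47, 3, 16))  # with three extra mux stages
```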

We have also considered having split slot banks, each with their own belt (and ops to move data between belts). This would be the belt equivalent of the TI split DSPs, where a logical 8-way VLIW is implemented as two 4-way VLIWs side-by-side. Like with the TI chip, splitting would reduce the fan-in. So far we have not seen a need to do such a split, but it's certainly an available option.


Subject: Re: Operand forwarding: complexity and limits
From: EricP
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 01:25 UTC
References: 1 2
MitchAlsup wrote:
On Monday, September 7, 2020 at 2:03:45 PM UTC-5, Jonathan Brandmeyer wrote:

- Both functional units forward to each other, or only to themselves?

In reservation station machines no function unit forwards to another function unit, any result is captured by the station, and if this result satisfies
the operands needed, the instruction becomes a candidate for launch.

- Forwarding all three cycles behind, or just directly dependent instructions?

With a 1 cycle function unit, the only case I ever had that did not have complete forwarding was the 8-gates per cycle K9. While the adder itself
could perform an ADD each cycle; you did not have the time to drive the
result onto a bus and through a forwarding multiplexer and make setup
time to the adder input flip-flops. You would sit around hoping some OTHER instruction could keep the FU busy in the cycle you were using to route data around.

I had an idea in this area so I might as well toss it out.

I want back-to-back operations (where one uOp result feeds into
another uOp operand immediately with no delay) and was a bit worried
about the critical path delay on the forwarding as you described above,
and not having access to a PDK's documentation I couldn't tell
whether such concerns were justified.

In a nutshell, the idea was, if necessary, to introduce a "fast path"
bypass mux that routes the result bus data directly into
the operand input of the FU, while it also goes to the
reservation station FF, and then propagates to the tri-state
bus and also to the other mux input.
The result tag is transmitted 1/2 cycle early to allow the
wake-up match matrix to do its thing. It knows the result value
will xmit on the next cycle so it enables the "fast path" mux.
(ascii art below)

This should make the data value arrive at the FU input so the
FU has stable inputs ASAP. Later, say 1/4 clock, the R.S.
is clocked to latch the result, and its output propagates
to the other "fast path" mux input.
But the FU is already processing the operand.

Then the clock switches the mux from the fast to slow path
but since both inputs are the same value, no glitch should occur.
Then the result bus can power down, with the FU operand
safely saved in the R.S. FF.

         ----- operand_bus
         |
         | ------------
         v v          |
         mux          |
         v            |
     Res_Stn_FF       |
         v            |
   tri-state_bus  -----
         v        v   ^
        fast_path_mux |
             |        |
             v        |
         FU_operand   |
             v        |
        FU_result_FF  |
             v        |
             ---------- result_bus

Clocks:

     ----------------
  ---|
    Result bus xmits value, arrives at fast path mux.
    It is routed directly to the FU operand input.
    FU begins processing operands.

         ------------
  -------|
        Res_Stn_FF clocked and result stabilizes to fast path mux
        alternate input.

              -------
  ------------|
             Fast path mux switches to source from Res_Stn_FF value
             but since values are the same, no glitch occurs.
             Now safe to power down result bus,
             Res_Stn_FF holds input operand stable to FU.
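
The three clock phases above can be traced at the value level with a short sketch. It only shows that the FU operand sees the same value in every phase; it does not model electrical behaviour, so it says nothing about whether a real mux would glitch when the select changes. All names are illustrative.

```python
def fast_path_trace(result_value):
    """Return the value seen at the FU operand input in each of the 3 phases."""
    res_stn_ff = None
    seen = []

    # Phase 1: result bus drives the value; mux selects the fast path.
    bus, select_fast = result_value, True
    seen.append(bus if select_fast else res_stn_ff)

    # Phase 2: reservation station FF latches the bus value.
    res_stn_ff = bus
    seen.append(bus if select_fast else res_stn_ff)

    # Phase 3: mux switches to the FF; the result bus may power down.
    select_fast, bus = False, None
    seen.append(bus if select_fast else res_stn_ff)
    return seen

print(fast_path_trace(42))
```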






Subject: Re: Operand forwarding: complexity and limits
From: MitchAlsup
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 02:06 UTC
References: 1 2 3
On Monday, September 7, 2020 at 8:26:22 PM UTC-5, EricP wrote:
MitchAlsup wrote:
On Monday, September 7, 2020 at 2:03:45 PM UTC-5, Jonathan Brandmeyer wrote:

- Both functional units forward to each other, or only to themselves?

In reservation station machines no function unit forwards to another function
unit, any result is captured by the station, and if this result satisfies
the operands needed, the instruction becomes a candidate for launch.

- Forwarding all three cycles behind, or just directly dependent instructions?

With a 1 cycle function unit, the only case I ever had that did not have
complete forwarding was the 8-gates per cycle K9. While the adder itself
could perform an ADD each cycle; you did not have the time to drive the
result onto a bus and through a forwarding multiplexer and make setup
time to the adder input flip-flops. You would sit around hoping some OTHER
instruction could keep the FU busy in the cycle you were using to route data
around.

I had an idea in this area so I might as well toss it out.

I want back-to-back operations (where one uOp result feeds into
another uOp operand immediately with no delay) and was a bit worried
about the critical path delay on the forwarding as you described above,
and not having access to a PDK's documentation I couldn't tell
whether such concerns were justified.

It is fairly easy, and I've seen it done, where one simple ALU op gets
forwarded into a second simple ALU op by having the decoder fuse the
ops together into a single wider, dual op--most often with a single
result.

The constraints are that the 2 ALUs sit in the same FU and the second is
one cycle after the first. Shift-Add, Add-Add, Add-Sub are typical. You
can't do LD-ops, however--well I guess you could, but I've never seen
that done--unless you remember how I plan on doing the cache for the
Virtual Vector Method.
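
A minimal sketch of decoder-side fusion as described above, under simplifying assumptions: instructions are `(opcode, dest, sources)` tuples, only adjacent pairs are considered, and the second op must overwrite the first op's destination so the fused op has a single result. The `FUSIBLE` table and opcode names are hypothetical.

```python
# Pairs the text calls typical: Shift-Add, Add-Add, Add-Sub.
FUSIBLE = {("shl", "add"), ("add", "add"), ("add", "sub")}

def fuse(ops):
    """Merge adjacent dependent simple-ALU pairs into one dual op."""
    out = []
    i = 0
    while i < len(ops):
        if i + 1 < len(ops):
            (op1, d1, s1), (op2, d2, s2) = ops[i], ops[i + 1]
            # Second op must consume the first result and overwrite it.
            if (op1, op2) in FUSIBLE and d1 in s2 and d1 == d2:
                srcs = tuple(s1) + tuple(s for s in s2 if s != d1)
                out.append((op1 + "+" + op2, d2, srcs))
                i += 2
                continue
        out.append(ops[i])
        i += 1
    return out

program = [
    ("shl", "r1", ("r2", "3")),
    ("add", "r1", ("r1", "r4")),
    ("ld",  "r5", ("r1",)),
]
print(fuse(program))
```

The load is left alone, matching the remark that LD-ops are not normally fused this way.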

In a nutshell, the idea was, if necessary, to introduce a "fast path"
bypass mux that routes the result bus data directly into
the operand input of the FU, while it also goes to the
reservation station FF, and then propagates to the tri-state
bus and also to the other mux input.

The result tag is transmitted 1/2 cycle early to allow the
wake-up match matrix to do its thing. It knows the result value
will xmit on the next cycle so it enables the "fast path" mux.
(ascii art below)

Result tags are generally already 2 cycles early, and the decoder
often sends the tag for 1-cycle ops.


This should make the data value arrive at the FU input so the
FU has stable inputs ASAP. Later, say 1/4 clock, the R.S.
is clocked to latch the result, and its output propagates
to the other "fast path" mux input.
But the FU is already processing the operand.

That is a typical plan. I, however, prefer that there be operand
flip-flops instead of result bus count flip-flops per FU.

Then the clock switches the mux from the fast to slow path
but since both inputs are the same value, no glitch should occur.

You might be surprised where glitches can crop up !!

Then the result bus can power down, with the FU operand
safely saved in the R.S. FF.

         ----- operand_bus
         |
         | ------------
         v v          |
         mux          |
         v            |
     Res_Stn_FF       |
         v            |
   tri-state_bus  -----
         v        v   ^
        fast_path_mux |
             |        |
             v        |
         FU_operand   |
             v        |
        FU_result_FF  |
             v        |
             ---------- result_bus

This is about what we did on Mc 88120, except we also had the operand
bus feed into the fast path mux--in effect, this puts the reservation
stations in the feedback loop in front of the attached FU.

Clocks:

     ----------------
  ---|
    Result bus xmits value, arrives at fast path mux.
    It is routed directly to the FU operand input.
    FU begins processing operands.

         ------------
  -------|
        Res_Stn_FF clocked and result stabilizes to fast path mux
        alternate input.

              -------
  ------------|
             Fast path mux switches to source from Res_Stn_FF value
             but since values are the same, no glitch occurs.
             Now safe to power down result bus,
             Res_Stn_FF holds input operand stable to FU.



Subject: Re: Operand forwarding: complexity and limits
From: Jonathan Brandmeyer
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 02:18 UTC
References: 1 2
Thanks for the detail!

On Monday, September 7, 2020 at 1:51:12 PM UTC-6, MitchAlsup wrote:

If you could get 15-results and 1 RF read to every FU input, the forwarding
multiplexer data delay would be only 2-gates. {It would have a massively
dense set of buses, but that part can be done.}

Another part of the analysis is "how wide is the data path" ?

A 1-wide machine needs about 14-wires per bit to be laid out regularly
{I know this from having built a 13-wires per bit data path--but I digress}

For every unit of scalarity you need to add 4-wires to this (2.3 operand
wires and 1 result wire and 0.7 because you can't predict everything
up front.) So a 6-wide machine would need 14+5*4 = 34 wires per bit.

I don't follow this.  Which structure drives the initial preference to be 14 wires per bit?  And why do entire functional units only consume 4 wires per bit in their width?

Suppose a functional unit provides a carry-lookahead adder, shift-permute network, and some boolean logic functions.  Are you saying that these three together are interleaved bit-by-bit and that the whole thing is about 4 wires per bit wide?  Or do you mean something else?


Subject: Re: Operand forwarding: complexity and limits
From: Ivan Godard
Newsgroups: comp.arch
Organization: A noiseless patient Spider
Date: Tue, 8 Sep 2020 06:27 UTC
References: 1 2 3
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: Operand forwarding: complexity and limits
Date: Mon, 7 Sep 2020 23:27:04 -0700
Organization: A noiseless patient Spider
Lines: 98
Message-ID: <rj787n$k4f$1@dont-email.me>
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com>
<e2293476-8c3e-4d82-831c-d6dfcd7c058do@googlegroups.com>
<_4B5H.70060$5l1.62999@fx10.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 8 Sep 2020 06:27:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="af682c546b7dd59f6898055ef729adee";
logging-data="20623"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+b2U9Wcplgb9gJzGCjWHOc"
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101
Thunderbird/68.12.0
Cancel-Lock: sha1:+rgggNHeTJI1eSsAjuLIeUJmkpg=
In-Reply-To: <_4B5H.70060$5l1.62999@fx10.iad>
Content-Language: en-US
On 9/7/2020 6:25 PM, EricP wrote:
MitchAlsup wrote:
On Monday, September 7, 2020 at 2:03:45 PM UTC-5, Jonathan Brandmeyer wrote:

- Both functional units forward to each other, or only to themselves?

In reservation station machines no function unit forwards to another function unit, any result is captured by the station, and if this result satisfies
the operands needed, the instruction becomes a candidate for launch.

- Forwarding all three cycles behind, or just directly dependent instructions?

With a 1 cycle function unit, the only case I ever had that did not have complete forwarding was the 8-gates per cycle K9. While the adder itself
could perform an ADD each cycle; you did not have the time to drive the
result onto a bus and through a forwarding multiplexer and make setup
time to the adder input flip-flops. You would sit around hoping some OTHER instruction could keep the FU busy in the cycle you were using to route data around.

I had an idea in this area so I might as well toss it out.

I want back-to-back operations (where one uOp result feeds into
another uOp operand immediately with no delay) and was a bit worried
about the critical path delay on the forwarding as you described above,
and not having access to a PDK's documentation I couldn't tell
whether such concerns were justified.

In a nutshell, the idea was, if necessary, to introduce a "fast path"
bypass mux that routes the result bus data directly into
the operand input of the FU, while it also goes to the
reservation station FF, and then propagates to the tri-state
bus and also to the other mux input.
The result tag is transmitted 1/2 cycle early to allow the
wake-up match matrix to do its thing. It knows the result value
will xmit on the next cycle so it enables the "fast path" mux.
(ascii art below)

This should make the data value arrive at the FU input so the
FU has stable inputs ASAP. Later, say 1/4 clock, the R.S.
is clocked to latch the result, and its output propagates
to the other "fast path" mux input.
But the FU is already processing the operand.

Then the clock switches the mux from the fast to slow path
but since both inputs are the same value, no glitch should occur.
Then the result bus can power down, with the FU operand
safely saved in the R.S. FF.

         ----- operand_bus
         |
         | ------------
         v v          |
         mux          |
         v            |
     Res_Stn_FF       |
         v            |
   tri-state_bus  -----
         v        v   ^
        fast_path_mux |
             |        |
             v        |
         FU_operand   |
             v        |
        FU_result_FF  |
             v        |
             ---------- result_bus

Clocks:

     ----------------
  ---|
    Result bus xmits value, arrives at fast path mux.
    It is routed directly to the FU operand input.
    FU begins processing operands.

         ------------
  -------|
        Res_Stn_FF clocked and result stabilizes to fast path mux
        alternate input.

              -------
  ------------|
             Fast path mux switches to source from Res_Stn_FF value
             but since values are the same, no glitch occurs.
             Now safe to power down result bus,
             Res_Stn_FF holds input operand stable to FU.
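
The hand-off described above can be sketched as a tiny phase-by-phase model. This is only an illustration of the intended behaviour, not a circuit: the class name, the field names, and the use of None for a powered-down bus are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FastPathOperand:
    """Toy model of one FU operand input fed by a fast-path mux (hypothetical names)."""
    result_bus: Optional[int] = None   # None models a powered-down (undriven) bus
    res_stn_ff: Optional[int] = None   # value held by the reservation-station flip-flop
    select_fast: bool = True           # True: mux passes result_bus; False: passes res_stn_ff

    def fu_input(self) -> Optional[int]:
        return self.result_bus if self.select_fast else self.res_stn_ff


def run_handoff(value: int) -> list:
    """Step through the three clock edges plus bus power-down; record the FU input."""
    op = FastPathOperand()
    seen = []

    op.result_bus = value          # edge 1: result bus transmits, fast path selected
    seen.append(op.fu_input())

    op.res_stn_ff = op.result_bus  # edge 2: reservation-station FF latches the result
    seen.append(op.fu_input())

    op.select_fast = False         # edge 3: mux switches to the FF; both inputs are equal
    seen.append(op.fu_input())

    op.result_bus = None           # bus powers down; the FF keeps the operand stable
    seen.append(op.fu_input())
    return seen


trace = run_handoff(0x2A)
```

In this model the FU input holds the same value at every step, which is the property the scheme relies on. Real glitch behaviour depends on mux implementation and timing margins that the model does not capture.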





How is this different from an accumulator? (said by an ignorant software type)


Subject: Re: Operand forwarding: complexity and limits
From: Anton Ertl
Newsgroups: comp.arch
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Date: Tue, 8 Sep 2020 07:42 UTC
References: 1
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Operand forwarding: complexity and limits
Date: Tue, 08 Sep 2020 07:42:40 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 174
Message-ID: <2020Sep8.094240@mips.complang.tuwien.ac.at>
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com>
Injection-Info: reader02.eternal-september.org; posting-host="ef8baf22638849e6007efe48e4c0ad83";
logging-data="18299"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18JFMqg64Ifk2wpMPr7xMEi"
Cancel-Lock: sha1:9+UL46IGRuAZ6io6YrP8qXVGPwQ=
X-newsreader: xrn 10.00-beta-3
Jonathan Brandmeyer <jonathan.brandmeyer@gmail.com> writes:
How complete is a typical superscalar processor's forwarding network?

I remember reading about 30 forward/bypass paths for the 21064 (which
is two-wide with one integer, one load/store, one FP, and one branch
unit).  I also remember a paper (maybe palacharla+97 below) that
proposed leaving away some bypasses and having the compiler reduce the
negative effect of that; possibly this was the same paper that
mentioned the 30 bypasses of the 21064.

Below you find the papers in my bibliography that mention "forward" or
"bypass" in a way that may be relevant to your question.

@InProceedings{palacharla+97,
  author = {Subbarao Palacharla and Norman P. Jouppi and J. E. Smith},
  title = {Complexity-Effective Superscalar Processors},
  crossref = {isca97},
  pages = {206--218},
  annote = {Estimates the cycle time of certain critical
                  (non-pipelinable) components of an OOO superscalar
                  processor at various feature sizes and for various
                  degrees of superscalarity. For a 8-issue superscalar
                  at 0.18$\mu$ the critical components are the bypass
                  logic and the wakeup and select logic. They then
                  propose a microarchitecture that avoids this
                  bottleneck: they partition the functional units into
                  two 4-issue clusters (with a one-cycle delay for
                  intercluster bypassing to avoid the bypass bottleneck)
                  and schedule the instructions into FIFOs of
                  (preferably) dependent instructions to avoid the wakeup and
                  select bottleneck. These changes have a small
                  negative effect on the IPC, but a large positive
                  effect on the (potential) cycle time, resulting in
                  an average improvement of 16\% in speed.}
}
@Proceedings{isca97,
  title = "$24^\textit{th}$ Annual International Symposium on Computer Architecture",
  booktitle = "$24^\textit{th}$ Annual International Symposium on Computer Architecture",
  year = "1997",
  key = "ISCA 24",
}

@InProceedings{sprangle&carmean02,
  author = {Eric Sprangle and Doug Carmean},
  title = {Increasing Processor Performance by Implementing
                  Deeper Pipelines},
  crossref = {isca02},
  pages = {25--34},
  url = {http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/deep-pipelines-isca02.pdf},
  annote = {This paper starts with the Willamette (Pentium~4)
                  pipeline and discusses and evaluates changes to the
                  pipeline length. In particular, it gives numbers on
                  how lengthening various latencies would affect IPC;
                  on a per-cycle basis the ALU latency is most
                  important, then L1 cache, then L2 cache, then branch
                  misprediction; however, the total effect of
                  lengthening the pipeline to double the clock rate
                  gives the reverse order (because branch
                  misprediction gains more cycles than the other
                  latencies). The paper reports 52 pipeline stages
                  with 1.96 times the original clock rate as optimal
                  for the Pentium~4 microarchitecture, resulting in a
                  reduction of 1.45 of core time and an overall
                  speedup of about 1.29 (including waiting for
                  memory). Various other topics are discussed, such as
                  nonlinear effects when introducing bypasses, and
                  varying cache sizes.  Recommended reading.}
}
@Proceedings{isca02,
  title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
  booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
  year = "2002",
  key = "ISCA 29",
}

Agner Fog frequently observes an extra cycle when switching operand data types from producer to consumer in carefully written benchmarks.

Can you elaborate on that?  What data types?

- two cycles to read input operands
- one cycle to execute (example is for simple operations only: please exclude floating-point, integer multiply, etc for this discussion).
- one cycle to write back to physical register file
- two independent functional units

Different design points may give wildly different values for those numbers, but I want to anchor the discussion concretely.

Without operand forwarding, the dependent operation latency is four cycles. What are the practical tradeoffs as forwarding gets more aggressive?
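
To anchor those numbers, here is a minimal sketch of the dependent-operation latency under that model. The function and parameter names are made up for illustration, and "full forwarding" is assumed to mean that a result reaches the consumer's execute stage in the very next cycle.

```python
def dependent_latency(read_cycles: int, exec_cycles: int,
                      writeback_cycles: int, full_forwarding: bool) -> int:
    """Cycles from the start of a producer's execute stage to the start of
    a dependent consumer's execute stage."""
    if full_forwarding:
        # The result is bypassed straight from the FU output to the FU input.
        return exec_cycles
    # Otherwise the consumer waits for execute and write-back to finish and
    # then spends the full operand-read time fetching the value from the
    # register file.
    return exec_cycles + writeback_cycles + read_cycles


def chain_cycles(n_ops: int, latency: int, exec_cycles: int = 1) -> int:
    """Execute-stage cycles for a chain of n_ops dependent operations."""
    return (n_ops - 1) * latency + exec_cycles


no_fwd = dependent_latency(2, 1, 1, full_forwarding=False)
fwd = dependent_latency(2, 1, 1, full_forwarding=True)
```

With the numbers listed above (two read cycles, one execute, one write-back) this gives four cycles without forwarding and one with it. Partial forwarding schemes would fall somewhere in between.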

About the practical tradeoffs, better read what the practitioners
write.

But anyway, here are my thoughts: The critical paths are for
back-to-back instruction execution.  Everything else can be worked
around (well, until the complexity becomes overwhelming).  E.g., if
you have the choice of making the path from the register file longer,
or a back-to-back bypass path, you probably choose the register file.
One cycle more in register reads just costs branch misprediction
penalty, design complexity and area, but not cycle time or a cycle on
the back-to-back execution.

The bypasses for results that are used a cycle (or several) later are
probably not time-critical.

Still given the ~10 functional units in a modern high-end core, a full
back-to-back bypass network would slow down the cycle time, so you
have to select those bypasses that have the most benefit.  I guess
that bypasses from far-away units incur an extra cycle (or several?)
anyway because of wire delay, but fan-out and mux-width also have to
be considered.

On the good side:

* Architectures nowadays tend to have a division between
  general-purpose and SIMD registers (somewhat similar to the CDC
  6600's A, B, and X registers, or the 68000s address and data
  registers), with not that much data flowing across this division
  (and when it flows, there's an instruction that makes it flow, so
  that becomes an issue of that instruction's latency).

* The dynamic scheduler can steer dependent instructions to
  well-connected functional units.  E.g., dependent ALU instructions
  to the same ALU unit, and an ALU instruction that depends on a load
  instruction to an ALU that is better connected to the load than some
  other ALU.  I don't know if the dynamic schedulers do this (or much
  of this); Intel's mainline seems to decide placement pretty early,
  without knowledge about resource availability at execution time.

* Also, for many ALU instructions you can exchange the inputs,
  so, if, e.g., the left operand is better connected, a
  decoder or scheduler that knows that the right operand of the
  original instruction is more likely to be back-to-back with a
  preceding instruction could rewrite the instruction into one with
  reversed operands.  I don't know if this is done in OoO CPUs.
  Again, it depends on the knowledge early on in the pipeline that two
  instructions will be back-to-back during the execution; this would
  be easier for in-order CPUs.
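
The steering and operand-swap ideas in the list above can be sketched roughly as follows. The bypass-delay table, the unit names, and the function names are invented for illustration; no claim is made that any real scheduler works this way.

```python
# Extra cycles to forward a result from a producer unit to a consumer unit.
# The numbers are made up: 0 means back-to-back forwarding is possible.
BYPASS_DELAY = {
    ("alu0", "alu0"): 0, ("alu0", "alu1"): 1,
    ("alu1", "alu0"): 1, ("alu1", "alu1"): 0,
    ("load", "alu0"): 0, ("load", "alu1"): 1,
}


def steer(producer_unit: str, candidates: list) -> str:
    """Pick the candidate unit that is best connected to the producer."""
    return min(candidates, key=lambda fu: BYPASS_DELAY[(producer_unit, fu)])


def maybe_swap(op: str, left: str, right: str, late_operand: str,
               fast_port: str = "left") -> tuple:
    """For a commutative op, move the late-arriving operand to the
    better-connected input port (assumed here to be the left one)."""
    commutative = {"add", "and", "or", "xor"}
    late_is_left = late_operand == left
    want_left = fast_port == "left"
    if op in commutative and late_is_left != want_left:
        return op, right, left
    return op, left, right
```

For example, steer("load", ["alu1", "alu0"]) picks "alu0" under this table, and maybe_swap("add", "r1", "r2", late_operand="r2") returns the operands reversed. Both decisions need to know early which producer will be back-to-back with the consumer, which is the difficulty noted above for OoO CPUs.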

And there are cases where all these tricks don't help, and where you
can observe an extra cycle.

- Both functional units forward to each other, or only to themselves?

To each other, but if that costs cycle length, it might be better to
have a one-cycle delay in forwarding to the other.

- Forwarding all three cycles behind, or just directly dependent instructions?

Subject: Re: Operand forwarding: complexity and limits
From: EricP
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 16:12 UTC
References: 1 2 3 4
Path: i2pn2.org!i2pn.org!aioe.org!peer02.ams4!peer.am4.highwinds-media.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx24.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: Operand forwarding: complexity and limits
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com> <e2293476-8c3e-4d82-831c-d6dfcd7c058do@googlegroups.com> <_4B5H.70060$5l1.62999@fx10.iad> <rj787n$k4f$1@dont-email.me>
In-Reply-To: <rj787n$k4f$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 46
Message-ID: <35O5H.63545$Ml5.22945@fx24.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Tue, 08 Sep 2020 16:13:51 UTC
Date: Tue, 08 Sep 2020 12:12:56 -0400
X-Received-Bytes: 2474
X-Received-Body-CRC: 3248685039
Ivan Godard wrote:
On 9/7/2020 6:25 PM, EricP wrote:

         ----- operand_bus
         |
         | ------------
         v v          |
         mux          |
         v            |
     Res_Stn_FF       |
         v            |
   tri-state_bus  -----
         v        v   ^
        fast_path_mux |
             |        |
             v        |
         FU_operand   |
             v        |
        FU_result_FF  |
             v        |
             ---------- result_bus


How is this different from an accumulator? (said by an ignorant software type)

Ideally you want dependent operations performed on the same or separate
function units to not stall for 1 clock because of the delay
in propagating a prior result all the way through all that logic.

Not shown on the diagram are multiple result/bypass buses,
multiple function units each with multiple reservation stations.
So there is more logic in the paths than shown.

This was me looking for a way to have my cake and eat it too,
having the operand muxes and reservation station logic be present,
but not totally paying for it in critical path delay.

Ultimately it may not be possible to avoid since wire delay would dominate.
Or maybe it's only possible for operations within a single FU like ALU.
So A+B+C can be back-to-back but A*B+C has to stall 1 extra clock while
the result moves from the MUL unit to ALU unit. But that could also
impact loads and stores so the cost is going to add up quick.
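
As a rough illustration of that trade-off, the sketch below times a chain of dependent operations when same-unit forwarding is free but cross-unit forwarding costs one extra clock. The 3-cycle multiply latency and all names are assumptions made for the example, not figures from this thread.

```python
EXEC_LATENCY = {"alu": 1, "mul": 3}   # mul latency is an assumed value
CROSS_UNIT_PENALTY = 1                # extra clock to move a result between units


def chain_latency(units: list) -> int:
    """Total cycles for a chain of dependent ops executed on the given units."""
    total = 0
    prev = None
    for unit in units:
        if prev is not None and prev != unit:
            total += CROSS_UNIT_PENALTY   # result must travel to another unit
        total += EXEC_LATENCY[unit]
        prev = unit
    return total


a_plus_b_plus_c = chain_latency(["alu", "alu"])    # two back-to-back adds
a_times_b_plus_c = chain_latency(["mul", "alu"])   # multiply, transfer, add
```

Under these assumptions A+B+C takes 2 cycles while A*B+C takes 5, one of which is pure transfer delay.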





Subject: Re: Operand forwarding: complexity and limits
From: MitchAlsup
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 17:07 UTC
References: 1 2 3
X-Received: by 2002:ac8:1c82:: with SMTP id f2mr1055643qtl.305.1599584823028; Tue, 08 Sep 2020 10:07:03 -0700 (PDT)
X-Received: by 2002:a9d:5b7:: with SMTP id 52mr37809otd.134.1599584822544; Tue, 08 Sep 2020 10:07:02 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!peer01.ams4!peer.am4.highwinds-media.com!news.highwinds-media.com!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 10:07:02 -0700 (PDT)
In-Reply-To: <f04c6c5a-88c6-4cfc-a6b3-ef51d8343c3fn@googlegroups.com>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:e97c:1d4c:7572:50d1; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:e97c:1d4c:7572:50d1
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com> <e2293476-8c3e-4d82-831c-d6dfcd7c058do@googlegroups.com> <f04c6c5a-88c6-4cfc-a6b3-ef51d8343c3fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f2cf9670-1564-4eaa-964f-07b3b4128b2eo@googlegroups.com>
Subject: Re: Operand forwarding: complexity and limits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 08 Sep 2020 17:07:03 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 41
X-Received-Bytes: 3158
X-Received-Body-CRC: 361162759
On Monday, September 7, 2020 at 9:18:41 PM UTC-5, Jonathan Brandmeyer wrote:
Thanks for the detail!

On Monday, September 7, 2020 at 1:51:12 PM UTC-6, MitchAlsup wrote:

If you could get 15-results and 1 RF read to every FU input, the forwarding
multiplexer data delay would be only 2-gates. {It would have a massively
dense set of buses, but that part can be done.}

Another part of the analysis is "how wide is the data path" ?

A 1-wide machine needs about 14-wires per bit to be laid out regularly
{I know this from having built a 13-wires per bit data path--but I digress}

For every unit of scalarity you need to add 4-wires to this (2.3 operand
wires and 1 result wire and 0.7 because you can't predict everything
up front.) So a 6-wide machine would need 14+5*4 = 34 wires per bit.

I don't follow this.  Which structure drives the initial preference to be 14 wires per bit?  And why do entire functional units only consume 4 wires per bit in their width?

This was derived back in the days where there were only 2 metal layers, and
accounts for the local wiring up of gates out of transistors.

Suppose a functional unit provides a carry-lookahead adder, shift-permute network, and some boolean logic functions.  Are you saying that these three together are interleaved bit-by-bit and that the whole thing is about 4 wires per bit wide?  Or do you mean something else?

If you run these calculation units in parallel you need the wires to
deliver operands and receive results. 14+3+3=20.
If you run the calculation units one at a time you only need 14 wires.
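
The wire-count arithmetic above can be written out as a small sketch. The function and constant names are hypothetical, and reading "14+3+3" as two extra parallel units at 3 wires each (two operands plus one result) is one interpretation of the sum, not something stated outright.

```python
BASE_WIRES_PER_BIT = 14      # 1-wide data path, per the rule of thumb above
WIRES_PER_EXTRA_ISSUE = 4    # 2.3 operand + 1 result + 0.7 slack
WIRES_PER_EXTRA_UNIT = 3     # assumed: 2 operand wires + 1 result wire


def wires_for_width(issue_width: int) -> int:
    """Wires per bit for a machine of the given scalarity."""
    return BASE_WIRES_PER_BIT + (issue_width - 1) * WIRES_PER_EXTRA_ISSUE


def wires_for_parallel_units(n_units: int) -> int:
    """Wires per bit when n_units calculation units run in parallel."""
    return BASE_WIRES_PER_BIT + (n_units - 1) * WIRES_PER_EXTRA_UNIT
```

wires_for_width(6) reproduces the 34 wires per bit quoted for a 6-wide machine, and wires_for_parallel_units(3) reproduces the 20 for adder, shifter and logic unit running side by side.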



Subject: Re: Operand forwarding: complexity and limits
From: MitchAlsup
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 18:35 UTC
References: 1 2 3
X-Received: by 2002:ac8:3902:: with SMTP id s2mr1500728qtb.258.1599590149389;
Tue, 08 Sep 2020 11:35:49 -0700 (PDT)
X-Received: by 2002:a9d:d35:: with SMTP id 50mr307752oti.166.1599590149099;
Tue, 08 Sep 2020 11:35:49 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!peer03.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 11:35:48 -0700 (PDT)
In-Reply-To: <f04c6c5a-88c6-4cfc-a6b3-ef51d8343c3fn@googlegroups.com>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:e97c:1d4c:7572:50d1;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:e97c:1d4c:7572:50d1
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com>
<e2293476-8c3e-4d82-831c-d6dfcd7c058do@googlegroups.com> <f04c6c5a-88c6-4cfc-a6b3-ef51d8343c3fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <94e9e24d-5a53-47a8-b41a-2a1c1d6f5c5eo@googlegroups.com>
Subject: Re: Operand forwarding: complexity and limits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 08 Sep 2020 18:35:49 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4036
X-Received-Body-CRC: 226858948
On Monday, September 7, 2020 at 9:18:41 PM UTC-5, Jonathan Brandmeyer wrote:
Thanks for the detail!

On Monday, September 7, 2020 at 1:51:12 PM UTC-6, MitchAlsup wrote:

If you could get 15-results and 1 RF read to every FU input, the forwarding
multiplexer data delay would be only 2-gates. {It would have a massively
dense set of buses, but that part can be done.}

Another part of the analysis is "how wide is the data path" ?

A 1-wide machine needs about 14-wires per bit to be laid out regularly
{I know this from having built a 13-wires per bit data path--but I digress}

For every unit of scalarity you need to add 4-wires to this (2.3 operand
wires and 1 result wire and 0.7 because you can't predict everything
up front.) So a 6-wide machine would need 14+5*4 = 34 wires per bit.

I don't follow this.  Which structure drives the initial preference to be 14 wires per bit?  And why do entire functional units only consume 4 wires per bit in their width?

You have to realize the things you might consider as a monolithic calculation
unit (integer adder) is composed of individual gates {input multiplexer,
P-G generator, Carry initiator, Carry propagator, output selector} and
each of these subsections needs a wire to get its result to the next section
along the way. Those wires only span a couple of gates in width, but they
have to actually exist. And where they exist, they block any other wire
from simultaneously existing there.

These "other" wires (and blockages) are what contribute to the number 14,
over and above the easy-to-count long-range buses.

It is true in modern technology with lots of metal layers that the large
number of wires (say 30) are carried in upper layers of the metal grid.
The problem then shifts to finding vias in which those wires can reach
(down) to the gates they need to feed, and those gates feed back up to
send a result "over there". The problem changes from a horizontal wiring
problem into a vertical wiring problem.

But most designers today don't see any of this as the tools simply
"wire everything up" and when the wiring gets too dense to route,
the tools move the gates farther apart to make room for the wires.

Suppose a functional unit provides a carry-lookahead adder, shift-permute network, and some boolean logic functions.  Are you saying that these three together are interleaved bit-by-bit and that the whole thing is about 4 wires per bit wide?  Or do you mean something else?



Subject: Re: Operand forwarding: complexity and limits
From: Jonathan Brandmeyer
Newsgroups: comp.arch
Date: Wed, 9 Sep 2020 00:45 UTC
References: 1 2
X-Received: by 2002:ac8:1c82:: with SMTP id f2mr1049098qtl.305.1599612311218;
Tue, 08 Sep 2020 17:45:11 -0700 (PDT)
X-Received: by 2002:aca:c6cd:: with SMTP id w196mr1077155oif.7.1599612310915;
Tue, 08 Sep 2020 17:45:10 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 17:45:10 -0700 (PDT)
In-Reply-To: <2020Sep8.094240@mips.complang.tuwien.ac.at>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=67.165.203.148; posting-account=Dwi7xQoAAACthC59yQ_kZSCsey4S5nWq
NNTP-Posting-Host: 67.165.203.148
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com> <2020Sep8.094240@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d67ce7f7-0ae0-421a-bdc5-3d8044a42773n@googlegroups.com>
Subject: Re: Operand forwarding: complexity and limits
From: jonathan...@gmail.com (Jonathan Brandmeyer)
Injection-Date: Wed, 09 Sep 2020 00:45:11 +0000
Content-Type: text/plain; charset="UTF-8"
On Tuesday, September 8, 2020 at 2:57:46 AM UTC-6, Anton Ertl wrote:
Jonathan Brandmeyer writes:

Agner Fog frequently observes an extra cycle when switching operand data types from producer to consumer in carefully written benchmarks.
Can you elaborate on that? What data types?

Nehalem had a complex set of forwarding delays between different domains.  A few generations later, Skylake appears to have no cross-domain delays (or at least, none that he could identify, which is as close to zero as makes no difference).

Zen 2 adds a cycle when switching between integer and floating-point domains.



Subject: Re: Operand forwarding: complexity and limits
From: MitchAlsup
Newsgroups: comp.arch
Date: Wed, 9 Sep 2020 01:31 UTC
References: 1 2 3
X-Received: by 2002:a05:6214:1752:: with SMTP id dc18mr1954877qvb.10.1599615115341; Tue, 08 Sep 2020 18:31:55 -0700 (PDT)
X-Received: by 2002:a4a:924b:: with SMTP id g11mr1100634ooh.9.1599615115130; Tue, 08 Sep 2020 18:31:55 -0700 (PDT)
Path: i2pn2.org!i2pn.org!paganini.bofh.team!news.etla.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 18:31:54 -0700 (PDT)
In-Reply-To: <d67ce7f7-0ae0-421a-bdc5-3d8044a42773n@googlegroups.com>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:e97c:1d4c:7572:50d1; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:e97c:1d4c:7572:50d1
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com> <2020Sep8.094240@mips.complang.tuwien.ac.at> <d67ce7f7-0ae0-421a-bdc5-3d8044a42773n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2abe8a7b-551b-4b6c-9c4b-55b66d299f58o@googlegroups.com>
Subject: Re: Operand forwarding: complexity and limits
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 09 Sep 2020 01:31:55 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 21
On Tuesday, September 8, 2020 at 7:45:13 PM UTC-5, Jonathan Brandmeyer wrote:
On Tuesday, September 8, 2020 at 2:57:46 AM UTC-6, Anton Ertl wrote:
Jonathan Brandmeyer writes:

Agner Fog frequently observes an extra cycle when switching operand data types from producer to consumer in carefully written benchmarks.
Can you elaborate on that? What data types?

Nehalem had a complex set of forwarding delays between different domains.  A few generations later, Skylake appears to have no cross-domain delays (or at least, none that he could identify, which is as close to zero as makes no difference).

My guess is that ::
a) either Intel designers figured out how to reduce the wire delays {doubtful}
or
b) Intel engineers moved the time of tag broadcast back one cycle (giving the
logic time to deal with the special cases) {likely}


Zen 2 adds a cycle when switching between integer and floating-point domains.

That is one of the things that happens when the FPU is in a different register
file space than the integer and memory-reference units.


Subject: Re: Operand forwarding: complexity and limits
From: Jonathan Brandmeyer
Newsgroups: comp.arch
Date: Wed, 9 Sep 2020 06:01 UTC
References: 1 2
X-Received: by 2002:ac8:424a:: with SMTP id r10mr1733454qtm.211.1599631291460;
Tue, 08 Sep 2020 23:01:31 -0700 (PDT)
X-Received: by 2002:aca:3e8b:: with SMTP id l133mr1761731oia.110.1599631291215;
Tue, 08 Sep 2020 23:01:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!peer03.ams4!peer.am4.highwinds-media.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 8 Sep 2020 23:01:30 -0700 (PDT)
In-Reply-To: <2020Sep8.094240@mips.complang.tuwien.ac.at>
Complaints-To: groups-abuse@google.com
Injection-Info: google-groups.googlegroups.com; posting-host=67.165.203.148; posting-account=Dwi7xQoAAACthC59yQ_kZSCsey4S5nWq
NNTP-Posting-Host: 67.165.203.148
References: <9b9a4e1d-fa6c-4e03-bd71-54a7be1cd509n@googlegroups.com> <2020Sep8.094240@mips.complang.tuwien.ac.at>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <6d9ffcbb-0922-423d-bc16-531b5e1afd4en@googlegroups.com>
Subject: Re: Operand forwarding: complexity and limits
From: jonathan...@gmail.com (Jonathan Brandmeyer)
Injection-Date: Wed, 09 Sep 2020 06:01:31 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 1
X-Received-Bytes: 1451
X-Received-Body-CRC: 416220284
Also, thanks for the references, Anton!  It will take some time to chew through them, but they are appreciated and did not go unnoticed.


