devel / comp.arch / Re: More complex instructions to reduce cycle overhead

Subject / Author
* Signed division by 2^nThomas Koenig
+* Re: Signed division by 2^nMarcus
|`- Re: Signed division by 2^nMitchAlsup
+- Re: Signed division by 2^nStephen Fuld
+* Re: Signed division by 2^nAnton Ertl
|+* Re: Signed division by 2^nMitchAlsup
||`* Re: Signed division by 2^nThomas Koenig
|| `* Re: saturating arithmetic, not Signed division by 2^nJohn Levine
||  +- Re: saturating arithmetic, not Signed division by 2^nMitchAlsup
||  +- Re: saturating arithmetic, not Signed division by 2^nBrian G. Lucas
||  `* Re: saturating arithmetic, not Signed division by 2^nJeremy Linton
||   +* Re: saturating arithmetic, not Signed division by 2^nStefan Monnier
||   |+* Re: saturating arithmetic, not Signed division by 2^nThomas Koenig
||   ||+- Re: saturating arithmetic, not Signed division by 2^nMitchAlsup
||   ||+- Re: saturating arithmetic, not Signed division by 2^nStefan Monnier
||   ||+- Re: saturating arithmetic, not Signed division by 2^nDavid Brown
||   ||`- Re: saturating arithmetic, not Signed division by 2^nAnton Ertl
||   |`- Re: saturating arithmetic, not Signed division by 2^nIvan Godard
||   +- Re: saturating arithmetic, not Signed division by 2^nEricP
||   `* Re: saturating arithmetic, not Signed division by 2^nAnton Ertl
||    +- Re: saturating arithmetic, not Signed division by 2^nMitchAlsup
||    `* Re: saturating arithmetic, not Signed division by 2^nGeorge Neuner
||     +* Re: saturating arithmetic, not Signed division by 2^nNiklas Holsti
||     |`- Re: saturating arithmetic, not Signed division by 2^nBill Findlay
||     +* Re: saturating arithmetic, not Signed division by 2^nBill Findlay
||     |`- Re: saturating arithmetic, not Signed division by 2^nTerje Mathisen
||     +* Re: saturating arithmetic, not Signed division by 2^nTerje Mathisen
||     |`* Re: saturating arithmetic, not Signed division by 2^nThomas Koenig
||     | `* Re: saturating arithmetic, not Signed division by 2^nTerje Mathisen
||     |  +- Re: saturating arithmetic, not Signed division by 2^nMitchAlsup
||     |  `* Re: saturating arithmetic, not Signed division by 2^nAndreas Eder
||     |   `* Re: saturating arithmetic, not Signed division by 2^nTerje Mathisen
||     |    `* Re: saturating arithmetic, not Signed division by 2^nThomas Koenig
||     |     `* Re: saturating arithmetic, not Signed division by 2^nTerje Mathisen
||     |      `* Re: saturating arithmetic, not Signed division by 2^nThomas Koenig
||     |       `- Re: saturating arithmetic, not Signed division by 2^nThomas Koenig
||     `- Re: saturating arithmetic, not Signed division by 2^nMitchAlsup
|+* Re: Signed division by 2^nBGB
||+* Re: Signed division by 2^nIvan Godard
|||+- Re: Signed division by 2^nAnton Ertl
|||+- Re: Signed division by 2^nTerje Mathisen
|||+- Re: Signed division by 2^nMitchAlsup
|||`* Re: Signed division by 2^nBGB
||| `* Re: Signed division by 2^nMitchAlsup
|||  `* Re: Signed division by 2^nBGB
|||   `* Re: Signed division by 2^nMitchAlsup
|||    +* More complex instructions to reduce cycle overheadStefan Monnier
|||    |+* Re: More complex instructions to reduce cycle overheadIvan Godard
|||    ||`* Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    || `- Re: More complex instructions to reduce cycle overheadIvan Godard
|||    |+* Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||+- Re: More complex instructions to reduce cycle overheadStefan Monnier
|||    ||`* Re: More complex instructions to reduce cycle overheadIvan Godard
|||    || `* Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||  `* Re: More complex instructions to reduce cycle overheadIvan Godard
|||    ||   `* Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||    `* Re: More complex instructions to reduce cycle overheadIvan Godard
|||    ||     +* Re: More complex instructions to reduce cycle overheadEricP
|||    ||     |+* Re: More complex instructions to reduce cycle overheadThomas Koenig
|||    ||     ||+* Re: More complex instructions to reduce cycle overheadEricP
|||    ||     |||+* Re: More complex instructions to reduce cycle overheadThomas Koenig
|||    ||     ||||`* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     |||| `* Re: More complex instructions to reduce cycle overheadEricP
|||    ||     ||||  +* Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||     ||||  |+* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||  ||`* Re: More complex instructions to reduce cycle overheadMarcus
|||    ||     ||||  || `- Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||  |`- Re: More complex instructions to reduce cycle overheadJimBrakefield
|||    ||     ||||  `* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||   +* Re: More complex instructions to reduce cycle overheadMarcus
|||    ||     ||||   |`* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||   | `* Re: More complex instructions to reduce cycle overheadEricP
|||    ||     ||||   |  `* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||   |   `- Re: More complex instructions to reduce cycle overheadEricP
|||    ||     ||||   `* Re: More complex instructions to reduce cycle overheadEricP
|||    ||     ||||    `* Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||     ||||     `* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||      +* Re: More complex instructions to reduce cycle overheadEricP
|||    ||     ||||      |`* Re: More complex instructions to reduce cycle overheadBGB
|||    ||     ||||      | +- Timing... (Re: More complex instructions to reduce cycle overhead)BGB
|||    ||     ||||      | `* Re: Timing... (Re: More complex instructions to reduce cycle overhead)JimBrakefield
|||    ||     ||||      |  `- Re: Timing... (Re: More complex instructions to reduce cycleBGB
|||    ||     ||||      `* Re: More complex instructions to reduce cycle overheadMarcus
|||    ||     ||||       `- Re: More complex instructions to reduce cycle overheadBGB
|||    ||     |||`* Re: More complex instructions to reduce cycle overheadpaul wallich
|||    ||     ||| `- Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||     ||`- Re: More complex instructions to reduce cycle overheadStefan Monnier
|||    ||     |`- Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    ||     +* Re: More complex instructions to reduce cycle overheadPaul A. Clayton
|||    ||     |`- Re: More complex instructions to reduce cycle overheadPaul A. Clayton
|||    ||     `- Re: More complex instructions to reduce cycle overheadMitchAlsup
|||    |`* Re: More complex instructions to reduce cycle overheadAnton Ertl
|||    | `- Re: More complex instructions to reduce cycle overheadTerje Mathisen
|||    `* Re: Signed division by 2^nBGB
|||     `* Re: Signed division by 2^nMitchAlsup
|||      `- Re: Signed division by 2^nBGB
||`- Re: Signed division by 2^nThomas Koenig
|`* Re: Signed division by 2^naph
| `- Re: Signed division by 2^nAnton Ertl
`- Re: Signed division by 2^nIvan Godard

Re: More complex instructions to reduce cycle overhead

<jwvim3k28bn.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=16775&group=comp.arch#16775

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Fri, 14 May 2021 19:48:01 -0400
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <jwvim3k28bn.fsf-monnier+comp.arch@gnu.org>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at>
<s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me>
<s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Injection-Info: reader02.eternal-september.org; posting-host="51ea0c12f7924b604dce53bd6c65a776";
logging-data="23140"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18qMEaD8m8zIedpmrlBUXXu"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:OOqPrkyIvi4lqtor2XyMPrfCNK8=
sha1:FgJrmMleK3qAwvkVUOW73oqco88=
 by: Stefan Monnier - Fri, 14 May 2021 23:48 UTC

> Finally note that the characteristic delay of the flip-flop is 5-gates when
> you include clock Jitter and clock Skew. so a 16-gate machine actually
> cycles at 21-gates of delay. These 5-gates of delay really hurt at an 8-gate
> delay pipeline.

So of those 21-gates of delay, only 12 are actual useful work (when
performing an IADD), and the rest is forwarding and flip-flop.

With instructions that include two ops without any flip-flop between the
two and either no forwarding at all or very restricted forwarding, we
could get an actual cycle of say 34-gates, 24 of which are (hopefully)
"actual work".

> And there is Mitch's second law:: When you take the logic in a pipelined
> machine and divide each stage by 2, you end up with 2.5× as many pipeline
> stages !!

Hence the desire to reduce the pipeline length ;-)

Stefan

Re: More complex instructions to reduce cycle overhead

<s7n2en$5na$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16776&group=comp.arch#16776

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Fri, 14 May 2021 16:51:52 -0700
Organization: A noiseless patient Spider
Lines: 45
Message-ID: <s7n2en$5na$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org> <s7n03h$jvm$1@dont-email.me>
<e1e2de3c-657b-4c13-9648-828713ffce70n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 May 2021 23:51:51 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="084bb58afc123c2913520798d93e8cc0";
logging-data="5866"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19bvkQ2KaN/fp/bjdz3Rq9J"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:sljzb1SZb9rYwBqTbbbQwhLCsRM=
In-Reply-To: <e1e2de3c-657b-4c13-9648-828713ffce70n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Fri, 14 May 2021 23:51 UTC

On 5/14/2021 4:39 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 6:11:47 PM UTC-5, Ivan Godard wrote:
>> On 5/14/2021 2:55 PM, Stefan Monnier wrote:
>>>
>>> IIUC cycle time (for the EX stage) can be split into:
>>> A- time to perform single-cycle operation
>>> B- time to propagate the result through the forwarding network
>>> C- time for the actual latch/flipflop
>>>
>>> Arguably, B and C are overheads.
>>> Has there been ISAs that aim to maximize the proportion of time spent in
>>> A rather than B and C by having instructions that perform several
>>> sequential operations.
>>>
>>> I guess the "negate inputs" options in MY66000 (and the shifts in ARM3)
>>> could be counted as such an example, tho a limited one.
>>>
>>> I'm thinking more of an ISA where an instruction is expected to do
>>> something like `(A op1 B) op2 C` in a single cycle (for various
>>> combinations of `op1` and `op2` like additions, shifts, and whatnot).
>>>
>>> I'm far from convinced it would work out well (there's a risk you'd end
>>> up having to use a NOP for `op1` or `op2` in too many cases), but I'm
>>> curious if someone has tried out something like that,
>>>
>>>
>>> Stefan
>>>
>> Bill Wulf (CMU) did this, but I forget what they called it.
> <
> I don't remember--when I knew Bill, he was involved with PDP-11 stuff
> and the BLISS compiler stuff.
> <
>> Mitch was at
>> CMU, maybe he remembers. It did address arithmetic real well, MAC, and A
>> <comp> B <rel> C, others not so mutch. A bit tough fitting two opcodes
>> and four regs into an instruction IIRC.
> <
> You don't need 4 registers as the calculations are serially dependent.
> More like 3 operands 1 result 2 calculations.

Miss Daily taught me that 3+1 = 4 in first grade. Were you absent that day?

Talking entropy here, not RF ports. :-)

Re: More complex instructions to reduce cycle overhead

<s7n2gj$5na$2@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16777&group=comp.arch#16777

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Fri, 14 May 2021 16:52:53 -0700
Organization: A noiseless patient Spider
Lines: 18
Message-ID: <s7n2gj$5na$2@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 May 2021 23:52:51 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="084bb58afc123c2913520798d93e8cc0";
logging-data="5866"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX185FH0nxqNHz+y8vVRzgFgB"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:LBLPxFo/vmY73ImbjIcxd6hDstc=
In-Reply-To: <049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Fri, 14 May 2021 23:52 UTC

On 5/14/2021 4:33 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 4:55:16 PM UTC-5, Stefan Monnier wrote:
>> IIUC cycle time (for the EX stage) can be split into:
>> A- time to perform single-cycle operation
>> B- time to propagate the result through the forwarding network
>> C- time for the actual latch/flipflop
>>
>> Arguably, B and C are overheads.
>> Has there been ISAs that aim to maximize the proportion of time spent in
>> A rather than B and C by having instructions that perform several
>> sequential operations.
> <
> For single cycle back-to-back, this is accurate. C, however, is not a delay
> one can get rid of, unless one is not building a fully pipelined machine
> (new operation starting every cycle in the same FU.)

You ever play with asynchronous logic?

Re: Signed division by 2^n

<s7n2va$8el$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16778&group=comp.arch#16778

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Signed division by 2^n
Date: Fri, 14 May 2021 19:00:40 -0500
Organization: A noiseless patient Spider
Lines: 402
Message-ID: <s7n2va$8el$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 15 May 2021 00:00:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="30aa43141088ac13a2e1f0fa95e34ad5";
logging-data="8661"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19tuZ2mYO6daW7O8stv7gdK"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:5sZ7EdmDzd6mzQ3YDpRg5oytzLs=
In-Reply-To: <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 15 May 2021 00:00 UTC

On 5/14/2021 4:29 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 2:23:49 PM UTC-5, BGB wrote:
>> On 5/14/2021 1:04 PM, MitchAlsup wrote:
>>> On Friday, May 14, 2021 at 11:00:53 AM UTC-5, BGB wrote:
>>>> On 5/14/2021 2:10 AM, Ivan Godard wrote:
>>>>> On 5/13/2021 11:59 PM, BGB wrote:
>>>>>> On 5/11/2021 12:32 PM, Anton Ertl wrote:
>>>>>>> Thomas Koenig <tko...@netcologne.de> writes:
>>>>>>>> Everybody should know that a signed division by 2^n cannot be
>>>>>>>> done with a single right shift :-)
>>>>>>>
>>>>>>> Depends on the language and sometimes on its implementation. E.g., on
>>>>>>> Gforth:
>>>>>>>
>>>>>>> -9 4 / . -3 ok
>>>>>>> -9 2 arshift . -3 ok
>>>>>>>
>>>>>>>> but having this as a single
>>>>>>>> instruction instead of four without branches, three with a branch
>>>>>>>> or two if you happen to own a POWER would make sense, especially
>>>>>>>> a conditional add of 2**n - 1 should be easier to do in hardware
>>>>>>>> than in software.
>>>>>>>>
>>>>>>>> Does ISA actually implement this?
>>>>>>>
>>>>>>> Even Aarch64, which supports pretty exotic stuff in some cases, needs
>>>>>>> 4 instructions for a signed symmetric division (or at least that's
>>>>>>> what I get with gcc).
>>>>>>>
>>>>>>> Is this frequent enough to merit a special instruction? Or do people
>>>>>>> use unsigned numbers or explicit shift if they are
>>>>>>> performance-conscious and want to divide by 2^n?
>>>>>>>
>>>>>>
>>>>>> It seems like it could be supported...
>>>>>>
>>>>>> However the ways I can think of it would likely add enough cost and
>>>>>> latency to the shift unit to make it "not likely worthwhile".
>>>>>>
>>>>>> Option 1, if input is negative:
>>>>>> Negate Input;
>>>>>> Do Shift;
>>>>>> Negate Output.
>>>>>>
>>>>>> Option 2, if input is negative:
>>>>>> Detect if ((Input&((1<<n)-1))!=0);
>>>>>> If So, Add 1 to output.
>>>>>>
>>>>>>
>>>>>> If one had an absolute-value instruction which set a status flag based
>>>>>> on whether or not the input was negative, this could be combined with
>>>>>> a conditional negate to reduce it to 3 instructions.
>>>>>>
>>>>>> ...
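
For reference, relative to the two options quoted above, a plain-C sketch of the
rounding they are after (truncation toward zero, rather than the floor that a
bare arithmetic shift gives), using the usual add-(2^n - 1)-before-the-shift
form. The function name and 64-bit width are arbitrary choices for the example,
not anything from an actual ISA, and it assumes the compiler implements >> on
negative values as an arithmetic shift, as essentially all do:

#include <stdint.h>
#include <stdio.h>

/* Divide by 2^n, truncating toward zero (what C's '/' does for signed
   operands), via an arithmetic shift plus a conditional bias of 2^n - 1. */
static int64_t sdiv_pow2(int64_t x, unsigned n)
{
    int64_t bias = (x >> 63) & ((INT64_C(1) << n) - 1); /* 2^n - 1 if x < 0, else 0 */
    return (x + bias) >> n;
}

int main(void)
{
    printf("%lld\n", (long long)sdiv_pow2(-9, 2)); /* -2, same as -9 / 4 in C          */
    printf("%lld\n", (long long)(-9 >> 2));        /* -3, bare shift rounds toward -inf */
    return 0;
}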
>>>>>
>>>>> Or just integrate the integer shifter with the FP normalization shifter
>>>>> and apply the right rounding mode.
>>>> It is possible.
>>>>
>>>> I was going to assert originally that these is a problem if the
>>>> renormalization shifter:
>>>> Only does a left-shift;
>>>> Isn't quite wide enough to be used for integer shifts;
>>> <
>>> The normalizer in and DP FMAC unit is at least 213-bits wide.
> <
>> Yeah ... 64 -> 54 bits (narrowing shift) ...
>>
>> With rounding carry propagation limited to 12-bits in this case.
>>
>> There is a possibility for long strings of 1s, but the cases that lead
>> to them mostly disappear if one does a carry-in during the main adder.
>>
>> One of the things that helped "kill" my Long-Double FPU effort was
>> trying to widen these parts to 90 bits.
>>
>>
>> The Long-Double FPU would have still been too narrow to ensure
>> correctly-rounded Double values though, but would have reduced their
>> probability somewhat.
>>
>> LUT cost was an issue, and timing was pretty tight at 50MHz.
>> I had to invoke a lot of fiddly to "make it work" here.
>>
>> More recently I have applied some of the tweaks to the DP FPU to get it
>> to pass timing at 75MHz (and make a change that ended up leading to FADD
>> and FMUL being 7-cycle operations).
>>
>>
>> I can only guess what sorts of damage a 213 bit mantissa would do...
>>
>> Occasional rounding errors seem like a better trade-off IMO.
> <
> Try having that argument with Kahan !!
>>>> ...
>>>>
>>>> But, then realized, this probably meant to use the input-side right
>>>> shifter, followed by the main adder (say, internally it uses a 64-bit
>>>> adder with a carry-in flag), and already implements some similar logic
>>>> for other reasons, ...
>>>>
>>>> It actually seems possible, as the logic is basically analogous to doing
>>>> the Int->Float and Float->Int conversion logic at the same time, and
>>>> fudging one of the exponents to cause it to implement a right shift.
>>>>
>>>> So, it should be technically possible to pull off something like this
>>>> via slight tweaks to an FADD unit...
>>> <
>>> Possibly, but the multiplier is dealing with 53+53 bit things minimum,
>>> if said multiplier also FDIV and SQRT then it is 57+57
>>> if said multiplier also Transcendentals then it is 58+58.....
> <
>> This is assuming one uses a "square" multiplier, rather than a
>> "triangular" multiplier.
> <
> The proper word is parallelogram not square

Could also call it rhombus or diamond...

>>
>> As-is:
>> Square Multiplier: 54*54 -> 108
>> Triangular Multiplier: 54*54 -> 72
> <
> I question your definition of triangular::
>> Triangular Multiplier: 54×54 -> 54 !?!
>>

It is built from DSPs, which can generate an output twice as wide as the
inputs.

They can be:
16*16 -> 32, Signed/Unsigned
17*17 -> 34, Signed/Unsigned
18*18 -> 36, Nominally Signed
Can fake Unsigned via extra LUTs.

If one builds a triangular multiplier, the bottom parts hang off the
bottom, so one gets an additional 16-18 bits of width.

However, because the low order bits are incomplete, they also tend to be
erroneous, and are discarded.

The actual usable portion is roughly the same width as the inputs, but I
keep these low-order bits, as this is what the intermediate adders tend
to work with, and they are discarded afterwards.

So, 54*54->54, 72*72->72, or 80*80->80, ...
Would be more what one would see after generating the final output.
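
To make the idea of dropping the low partial products concrete, here is a toy C
model of a truncated 32x32 multiplier built out of 16-bit pieces (a stand-in
for the DSP slices; the function name, widths, and test constants are made up
for the example, and this is an illustration rather than the actual Verilog):

#include <stdint.h>
#include <stdio.h>

/* Toy truncated ("triangular") multiplier: 32x32 -> high 32 bits only.
   Partial products come from 16-bit halves; the lowest one (al*bl) is
   dropped, so the low-order result bits are wrong and only the high
   half is kept -- it can differ from the exact high half by at most 1. */
static uint32_t trunc_mul_hi(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint64_t mid  = (uint64_t)al * bh + (uint64_t)ah * bl;  /* kept */
    uint64_t high = (uint64_t)ah * bh;                      /* kept */
    /* (uint64_t)al * bl is discarded: it only affects the low 32 bits
       of the full product, plus a carry of at most 1 into bit 32.    */
    return (uint32_t)(high + (mid >> 16));
}

int main(void)
{
    uint32_t a = 0x9E3779B9u, b = 0x7F4A7C15u;
    uint64_t full = (uint64_t)a * b;
    printf("exact high half     : %08x\n", (unsigned)(full >> 32));
    printf("truncated multiplier: %08x\n", trunc_mul_hi(a, b));
    return 0;
}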

>> The LongDouble used a wider multiplier:
>> 72*72->90 (initial)
>> 85*85->90 (likely needed to avoid some issues, *)
>>
>> *: Algorithms based on iterative convergence get stuck in an infinite
>> loop if FADDX and FMULX use different mantissa lengths.
>>
>> This means that I would need to make them agree on a fixed 80-bit mantissa.
>>
>> Or: S.E15.F80.P32 (where P=Zero Padding)
>>
>>
>> I did at one point start trying to implement a combined FMAC unit, but
>> then realized it was likely to have a fairly high latency.
> <
> Your implementation medium is harming your ability to pull off your design.

Very possibly...

I just spent like the past week battling with debugging and trying to
get stuff to pass timing reliably.

Switched over to trying to get it to pass timing at 75MHz, because if I
can get it to pass timing much at all at 75MHz, it will hopefully stop
unpredictably failing timing at 50MHz.

But, things like passing/failing timing, resource usage, estimated power
usage, ... are basically kinda like a roulette wheel which jumps all
over the place.

Similarly, whether or not it works in simulation is no guarantee it will
work on the actual FPGA (simulation typically starts with everything
holding zeroes, whereas the FPGA seems to start with pretty much
everything initialized to garbage values, requiring a global "reset"
strobe signal to try to pull everything into a "known good" state).

Also, the sort of "metastability" from clock-domain crossings isn't
really modeled in simulation, nor are the effects of random internal
corruption, or the apparent tendency of the FPGA to start experiencing
errors once it warms up (I stuck a RasPi heat-sink on it, probably also
need a case with a fan, *), ...


[Article truncated in this view.]
Re: More complex instructions to reduce cycle overhead

<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=16779&group=comp.arch#16779

Newsgroups: comp.arch
X-Received: by 2002:a37:4484:: with SMTP id r126mr9790376qka.18.1621039437675;
Fri, 14 May 2021 17:43:57 -0700 (PDT)
X-Received: by 2002:a4a:b389:: with SMTP id p9mr38129329ooo.71.1621039437441;
Fri, 14 May 2021 17:43:57 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 14 May 2021 17:43:57 -0700 (PDT)
In-Reply-To: <s7n2gj$5na$2@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:cd96:76eb:8d7:21eb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:cd96:76eb:8d7:21eb
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <2021May11.193250@mips.complang.tuwien.ac.at>
<s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com> <s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com> <jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com> <s7n2gj$5na$2@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
Subject: Re: More complex instructions to reduce cycle overhead
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 15 May 2021 00:43:57 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 15 May 2021 00:43 UTC

On Friday, May 14, 2021 at 6:52:53 PM UTC-5, Ivan Godard wrote:
> On 5/14/2021 4:33 PM, MitchAlsup wrote:
> > On Friday, May 14, 2021 at 4:55:16 PM UTC-5, Stefan Monnier wrote:
> >> IIUC cycle time (for the EX stage) can be split into:
> >> A- time to perform single-cycle operation
> >> B- time to propagate the result through the forwarding network
> >> C- time for the actual latch/flipflop
> >>
> >> Arguably, B and C are overheads.
> >> Has there been ISAs that aim to maximize the proportion of time spent in
> >> A rather than B and C by having instructions that perform several
> >> sequential operations.
> > <
> > For single cycle back-to-back, this is accurate. C, however, is not a delay
> > one can get rid of, unless one is not building a fully pipelined machine
> > (new operation starting every cycle in the same FU.)
<
> You ever play with asynchronous logic?
<
Yes, ever try to talk the chip testing people into testing a chip with asynchronous
pipelines ??

Re: Signed division by 2^n

<338af7a4-6369-4e4e-ae33-7c89cc11d2f5n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=16780&group=comp.arch#16780

Newsgroups: comp.arch
X-Received: by 2002:a37:684d:: with SMTP id d74mr18834422qkc.151.1621040083572; Fri, 14 May 2021 17:54:43 -0700 (PDT)
X-Received: by 2002:a05:6830:a:: with SMTP id c10mr33264956otp.114.1621040083286; Fri, 14 May 2021 17:54:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc3.netnews.com!feeder.usenetexpress.com!tr3.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 14 May 2021 17:54:43 -0700 (PDT)
In-Reply-To: <s7n2va$8el$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:cd96:76eb:8d7:21eb; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:cd96:76eb:8d7:21eb
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me> <c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com> <s7mio3$qfs$1@dont-email.me> <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com> <s7n2va$8el$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <338af7a4-6369-4e4e-ae33-7c89cc11d2f5n@googlegroups.com>
Subject: Re: Signed division by 2^n
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 15 May 2021 00:54:43 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 345
 by: MitchAlsup - Sat, 15 May 2021 00:54 UTC

On Friday, May 14, 2021 at 7:00:45 PM UTC-5, BGB wrote:
> On 5/14/2021 4:29 PM, MitchAlsup wrote:
> > On Friday, May 14, 2021 at 2:23:49 PM UTC-5, BGB wrote:

> >>> <
> >>> Possibly, but the multiplier is dealing with 53+53 bit things minimum,
> >>> if said multiplier also FDIV and SQRT then it is 57+57
> >>> if said multiplier also Transcendentals then it is 58+58.....
> > <
> >> This is assuming one uses a "square" multiplier, rather than a
> >> "triangular" multiplier.
> > <
> > The proper word is parallelogram not square
> Could also call it rhombus or diamond...
> >>
> >> As-is:
> >> Square Multiplier: 54*54 -> 108
> >> Triangular Multiplier: 54*54 -> 72
> > <
> > I question your definition of triangular::
> >> Triangular Multiplier: 54×54 -> 54 !?!
> >>
> It is built from DSPs, which can generate an output twice as wide as the
> inputs.
>
> They can be:
> 16*16 -> 32, Signed/Unsigned
> 17*17 -> 34, Signed/Unsigned
> 18*18 -> 36, Nominally Signed
> Can fake Unsigned via extra LUTs.
>
> If one builds a triangular multiplier, the bottom parts hang off the
> bottom, so one gets an additional 16-18 bits of width.
<
Are you talking about the more significant triangle of the multiplier
or the lesser significant triangle of the multiplier ?
<
<
This is a symptom of the library you are using, not a general property of
multipliers.
>
> However, because the low order bits are incomplete, they also tend to be
> be erroneous, and are discarded.
<
upper......
>
> The actual usable portion is roughly the same width of the inputs, but I
> could these low-order bits as this is what the intermediate adders tend
> to work with, and they are discarded afterwards.
>
> So, 54*54->54, 72*72->72, or 80*80->80, ...
> Would be more what one would see after generating the final output.
> >> The LongDouble used a wider multiplier:
> >> 72*72->90 (initial)
> >> 85*85->90 (likely needed to avoid some issues, *)
> >>
> >> *: Algorithms based on iterative convergence get stuck in an infinite
> >> loop if FADDX and FMULX use different mantissa lengths.
> >>
> >> This means that I would need to make them agree on a fixed 80-bit mantissa.
> >>
> >> Or: S.E15.F80.P32 (where P=Zero Padding)
> >>
> >>
> >> I did at one point start trying to implement a combined FMAC unit, but
> >> then realized it was likely to have a fairly high latency.
> > <
> > Your implementation medium is harming your ability to pull off your design.
> Very possibly...
>
>
> I just spent like the past week battling with debugging and trying to
> get stuff to pass timing reliably.
>
> Switched over to trying to get it to pass timing at 75MHz, because if I
> can get it to pass timing much at all at 75MHz, it will hopefully stop
> unpredictably failing timing at 50MHz.
>
>
> But, things like passing/failing timing, resource usage, estimated power
> usage, ... are basically kinda like a roulette wheel which jumps all
> over the place.
>
>
> Similarly, whether or not it works in simulation is no guarantee it will
> work on the actual FPGA (simulation starts typically with everything
> holding zeroes, whereas the FPGA seems to start with pretty much
> everything initialized to garbage values; requiring a global "reset"
> strobe signal to try to pull everything into a "known good" state).
>
> Also, the sort of "metastability" from clock-domain crossings isn't
> really modeled at in simulation, nor the effects of random internal
> corruptions, or the apparent tendency of the FPGA to start experiencing
> errors once it warms up (I stuck a RasPi heat-sink on it, probably also
> need a case with a fan, *), ...
>
>
> *: Basically, once the FPGA gets much over ~45C or so, its reliability
> seems to get a lot worse (and stuff gets a lot more crash-prone).
<
That is getting hot enough that the LUTs lose their programming !
>
> It didn't come with a heat-sink though, as I guess the board designers
> figured that passive air-cooling from the bare FPGA was sufficient?...
<
More likely they wanted the user of the chip to add appropriate amounts
of cooling.
>
> Can't mount a fan directly to the heatsink though, as ~ 12mm fans aren't
> really a thing (smallest I can find are ~ 30mm). Also seemingly not a
> thing: a 30mm aluminum heatsink that narrows and sticks onto a 10mm die
> (via thermal adhesive). Also preferable if the fan could run at 3.3v and
> under 50mA (so it could be powered via a PMOD connector or similar).
<
Piece of cake to machine, starting from a 30×30 heat sink. I could knock one
out in 10 minutes if I had a 30×30 to start with--the machining of the fins and
the anodizing are the hard parts. Relieving the bottom so it is only 10×10 is
easy.
>
> ...
>
> Current strategy though is mostly turning it off when it starts getting
> too warm.
> >>
> >> An FPU with a separate FADD and FMUL unit could give lower latency for
> >> FADD and FMUL, and could fake FMAC with only slightly higher latency
> >> than the combined unit.
> > <
> > Maybe,
> > FADD: 2-cycles is darned hard, 3-cycles is pretty easy.
> > FMUL: 4-cycles is rather standard for 16-gates/cycle machines.
> > FMAC: 4-cycles is pretty hard, 5-cycles is a bit better
> > <
> > AMD Athlon and Opteron used FADD=4 and FMUL=4 to simplify the
> > pipelineing and to prevent having both units deliver result in the same
> > cycle.
<
> Not sure how gate-delay compares with FPGA logic levels; ATM I am mostly
> looking at 12 .. 14 (some parts are 10 or 11 logic levels).
>
> Looking at traces, they mostly seem to be LUT3/LUT4/LUT5 with the
> occasional CARRY4 or similar, traveling between pairs of FDRE elements.
>
>
> Internally, the FADD and FMUL units still have a 5-cycle latency, but
> gain an extra 2 cycles due to an input/output buffering mechanism (also
> needed for SIMD).
>
> Eg:
> EX1: FPU receives inputs from pipeline;
> EX2: Inputs fed into FADD or FMUL;
> ... Work Cycles ...
> EX3: Get result from FPU.
>
> Previously, the FADD and FMUL would recieve inputs directly during EX1,
> but then they needed to deal with pipeline stalls. Adding the extra
> cycle (with the outer FPU module managing input/output buffering) makes
> them independent of the stall, which helps timing, but also adds an
> extra clock-cycle of latency to the operation.
>
> The extra buffering cycle also helps with timing, allowing more time for
> the value to get from the register-forwarding logic to the FPU.
>
>
> So, FADD Stages:
> C1 Unpack Input Arguments
> Find difference of exponents
> Decide which side is 'A' and which is 'B'
> Right-Shift FracB
> C2 Optionally Invert FracB
> Add (FracA+FracB+Cin)
> C3 CLZ (Renorm 1)
> C4 Left-Shift (Renorm 2)
> Try to round
> C5 Pack Output / Done
>
> FMUL Stages:
> C1 Unpack Input Arguments
> Set up Exponents
> Multiply Input Fragments
> C2 ADD Stuff
> C3 ADD Stuff
> C4 Renorm Adjust / Round
> C5 Pack Output / Done
>
> Renorm is easier for FMUL, as it assumes that the values falls in the
> range of 1.0 .. 4.0, as opposed to FADD where it can be anything.
> > <
> > On the other hand, a single FMAC unit can do it all::
> > FADD: FMAC 1*Rs1+Rs2
> > FMUL: FMAC Rs1*Rs2+0
> > <
> > So if you find yourself in a position where you need FMAC (say to meet
> > IEEE 754-2008+) you can have the design team build the FMAC unit.
> > Later on, when building the next and wider machine, you can add an
> > FADD or FMUL or both based on statistics you have gathered from
> > generation 1. Given and FMAC, FADD is a degenerate subset which
> > a GOOD Verilog compiler can autogenerate if you feed it the above
> > fixed values {FMAC 1*Rs1+Rs2 and FMAC Rs1*Rs2+0}. THis REALLY
> > reduces the designer workloads.
<
> My position is that I don't feel full IEEE conformance is a realistic
> goal for this.
>
>
> From what I can gather, in a loose-sense it does seem to provide most
> of what IEEE-754-1985 seems to ask for, with a few exceptions:
> Denormal as Zero;
> FADD/FSUB/FMUL Only;
> Compare Ops (via ALU);
> Format Conversion (via FPU or ALU);
> ...
>
>
> Native FP Formats:
> Double / Binary64 (Scalar, 2x SIMD)
> Single / Binary32 (Conv, 2x | 4x SIMD)
> Half / Binary16 (Conv, 4x SIMD)
>
> Long Double Extension (Optional, Cost):
> Truncated Quad / Binary128
>
> RGBF Extension:
> FP8S / FP8U (Packed Conv Only)
> >>
> >>
> >> There are some operations though which could exist with an FMAC unit
> >> which would not work correctly with an FMUL+FADD glued together, but I
> >> am already pushing the limits of what seems viable on the XC7A100T.
> > <
> > Yep, your implementation medium is getting in your way. So are some of
> > your tools.
> >
> Yeah, probably...
> Verilator is seemingly pretty buggy in some areas.
>
> I am using the freeware / feature-limited version of Vivado, not sure
> what the Commercial / EDA version is like, or what all features they
> disabled.
>
> From what I can gather, Vivado is sorta like:
> Free: Spartan and Artix FPGAs, some lower-end Zynq and Kintex devices.
> Per-device vouchers: They enable certain FPGAs with the purchase of the
> associated dev boards;
> Commercial: AFAICT, $1k per seat per year?...
>
>
> Say, if I bought to get one of the Kintex dev-boards, they would
> apparently come with a voucher to allow Vivado to target them (well,
> otherwise, it is a lot of money for a board one can't use).
>
> But, the Kintex boards generally go for upwards of $1000, and I still
> don't have a job at the moment, so this is pretty steep...
>
>
> Though, synthesis on a Kintex at a -2 speed grade (for one of the FPGA's
> supported by Vivado WebPack) implies I can achieve clock speeds of ~ 150
> to 200 MHz, as opposed to the 50MHz or 75MHz I can get on an Artix.
>
> Someone else had apparently once tested it on a Kintex and got it to
> pass timing at ~ 180MHz.
>
>
>
> When I tried before using Quartus on a Cyclone V (targeting the same
> type as in the DE10), was able to get it up to ~ 110 MHz, but this
> didn't seem like enough of a speedup to justify me buying a DE10 (more
> so when the DE10 had less RAM for the FPGA part, and I could only manage
> to fit a single BJX2 core into the FPGA).
>
> These boards have an ARM SoC + FPGA part, there is like 1GB for the ARM
> SoC, but with a separate 64MB RAM module for the FPGA.
>
> In theory, the number of LUTS/ALMS in the DE10 is large enough that it
> should be more competitive with an Artix or Spartan, not sure what is
> going on there...
>
> But, as noted, I could clock it a little higher than the Spartan or
> Artix, but not enough to convince me to throw money at buying the actual
> hardware or figure out how to deal with interacting with an ARM SoC...
>
>
> Zynq is kinda similar, just I would have to figure out how to go about
> plugging the BJX2 into an AXI Bus, which would be pretty much the only
> way it could access RAM or similar.
>
> Granted, If I wanted to use Vivado's MIG (Memory Interface Generator), I
> would also need to figure out AXI.
>
>
> Though, I suspect MIG may know how to make the RAM work correctly in its
> rated speed window (vs my DDR controller which is apparently running the
> RAM in a sort of low-power standby mode).
>
> I did write a controller which could, in theory, run the RAM at 150MHz
> (within its rated speed), but couldn't figure out how to make it
> "actually work" on the actual hardware.
>
>
> But, memory bandwidth is hard...
> An still a pretty big bottleneck it seems.
>
> ...


[Article truncated in this view.]
Re: More complex instructions to reduce cycle overhead

<s7n6ah$t1$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16781&group=comp.arch#16781

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Fri, 14 May 2021 17:57:53 -0700
Organization: A noiseless patient Spider
Lines: 28
Message-ID: <s7n6ah$t1$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 15 May 2021 00:57:53 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="084bb58afc123c2913520798d93e8cc0";
logging-data="929"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ZVa2BjSf7N6Mqe+Ii4CIc"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:QdNxGMalxxgsku1F84zTNc7gD3c=
In-Reply-To: <5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Sat, 15 May 2021 00:57 UTC

On 5/14/2021 5:43 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 6:52:53 PM UTC-5, Ivan Godard wrote:
>> On 5/14/2021 4:33 PM, MitchAlsup wrote:
>>> On Friday, May 14, 2021 at 4:55:16 PM UTC-5, Stefan Monnier wrote:
>>>> IIUC cycle time (for the EX stage) can be split into:
>>>> A- time to perform single-cycle operation
>>>> B- time to propagate the result through the forwarding network
>>>> C- time for the actual latch/flipflop
>>>>
>>>> Arguably, B and C are overheads.
>>>> Has there been ISAs that aim to maximize the proportion of time spent in
>>>> A rather than B and C by having instructions that perform several
>>>> sequential operations.
>>> <
>>> For single cycle back-to-back, this is accurate. C, however, is not a delay
>>> one can get rid of, unless one is not building a fully pipelined machine
>>> (new operation starting every cycle in the same FU.)
> <
>> You ever play with asynchronous logic?
> <
> Yes, ever try to talk the chip testing people into testing a chip with asynchronous
> pipelines ??
>

I remind you: IANAHG :-)

Synchronous has become the de facto standard, like x86 has. Do you, as a
HG, feel that it's time to re-examine asynchronous?

Re: More complex instructions to reduce cycle overhead

<590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=16782&group=comp.arch#16782

Newsgroups: comp.arch
X-Received: by 2002:a37:e113:: with SMTP id c19mr47303446qkm.329.1621042277655;
Fri, 14 May 2021 18:31:17 -0700 (PDT)
X-Received: by 2002:a05:6830:40a4:: with SMTP id x36mr38579228ott.342.1621042277445;
Fri, 14 May 2021 18:31:17 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 14 May 2021 18:31:17 -0700 (PDT)
In-Reply-To: <s7n6ah$t1$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:cd96:76eb:8d7:21eb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:cd96:76eb:8d7:21eb
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <2021May11.193250@mips.complang.tuwien.ac.at>
<s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com> <s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com> <jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com> <s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com> <s7n6ah$t1$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
Subject: Re: More complex instructions to reduce cycle overhead
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 15 May 2021 01:31:17 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Sat, 15 May 2021 01:31 UTC

On Friday, May 14, 2021 at 7:57:55 PM UTC-5, Ivan Godard wrote:
> On 5/14/2021 5:43 PM, MitchAlsup wrote:
> > On Friday, May 14, 2021 at 6:52:53 PM UTC-5, Ivan Godard wrote:
<
> >>> For single cycle back-to-back, this is accurate. C, however, is not a delay
> >>> one can get rid of, unless one is not building a fully pipelined machine
> >>> (new operation starting every cycle in the same FU.)
> > <
> >> You ever play with asynchronous logic?
> > <
> > Yes, ever try to talk the chip testing people into testing a chip with asynchronous
> > pipelines ??
> >
> I remind you: IANAHG :-)
>
> Synchronous has become the de facto standard, like x86 has. Do you, as a
> HG, feel that it's time to re-examine asynchronous?
<
There are no tool sets that can do GBOoO designs in asynchronous forms.
Simple in-order 1-wide designs might be possible for a very brave design
team.
<
Without the tool sets, and some demonstrable way to test them, no it is
not time to go asynch.
<
Now, Ivan, consider a Mill implementation where the FP Multiplier can take 3,
4, or 5 cycles to deliver its results. I can see a reservation station-like machine
or a scoreboard machine being able to "deal with this" (with extra logic), but
a statically scheduled machine has no chance.
<
Sutherland's paper on asynchronous only touched on the difficulty of the
concordance problem and this is where it really gets out of control.
<
So, in the case of Mill, it is completely plausible that the delivery of a result
to the belt is an easily timed event. However, consuming a result from the
belt is not -- and essentially you would need all data to pass through a
synchronizer to be consumed as an operand. These synchronizers, operating
IN the same clock domain, are still 3 latches, about 1.2 clocks of delay--for forwarding !!

Re: More complex instructions to reduce cycle overhead

<s7nchh$b58$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16783&group=comp.arch#16783

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: iva...@millcomputing.com (Ivan Godard)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Fri, 14 May 2021 19:44:00 -0700
Organization: A noiseless patient Spider
Lines: 50
Message-ID: <s7nchh$b58$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
<s7n6ah$t1$1@dont-email.me>
<590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 15 May 2021 02:44:01 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="084bb58afc123c2913520798d93e8cc0";
logging-data="11432"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19neTWfYT3K7Ca76iryXmXa"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:mG6LEikAvbbWcdOypynl4Fj78hY=
In-Reply-To: <590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
Content-Language: en-US
 by: Ivan Godard - Sat, 15 May 2021 02:44 UTC

On 5/14/2021 6:31 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 7:57:55 PM UTC-5, Ivan Godard wrote:
>> On 5/14/2021 5:43 PM, MitchAlsup wrote:
>>> On Friday, May 14, 2021 at 6:52:53 PM UTC-5, Ivan Godard wrote:
> <
>>>>> For single cycle back-to-back, this is accurate. C, however, is not a delay
>>>>> one can get rid of, unless one is not building a fully pipelined machine
>>>>> (new operation starting every cycle in the same FU.)
>>> <
>>>> You ever play with asynchronous logic?
>>> <
>>> Yes, ever try to talk the chip testing people into testing a chip with asynchronous
>>> pipelines ??
>>>
>> I remind you: IANAHG :-)
>>
>> Synchronous has become the de facto standard, like x86 has. Do you, as a
>> HG, feel that it's time to re-examine asynchronous?
> <
> There are no tool sets that can do GBOoO designs in asynchronous forms.
> Simple in order 1-wide designs might be possible for a very brave design
> team.
> <
> Without the tool sets, and some demonstrable way to test them, no it is
> not time to go asynch.
> <
> Now, Ivan, consider a Mill implementation where the FP Multiplier can take 3,
> 4, or 5 cycles to deliver its results. I can see a reservation station-like machine
> or a scoreboard machine being able to "deal with this" (with extra logic), but
> a statically scheduled machine has no chance.
> <
> Sutherland's paper on asynchronous only touched on the difficulty of the
> concordance problem and this is where it really gets out of control.
> <
> So, in the case of Mill, it is completely plausible that the delivery of a result
> to the belt is an easily timed event. However, consuming a result from the
> belt is not -- and essentially you would need for all data to pass through a
> synchronizer to be consumed as an operand. These synchronizers operating
> IN the same clock domain are still 3-latches 1.2 clocks of delay--for forwarding !!
>

Actually I was asking in the abstract, not for Mill. Even as !HG I can
see that the pulsed nature of static scheduling doesn't lend itself to
asynch; you would have to have something that looks quite like your
scoreboard to do what we do with the crossbar.

So I was most interested in asynch in the pipelines - using it to get
rid of the stage flipflops. Would you consider using an FPU (for example)
that was asynch internally but synch at both ends, in search of fewer
stages/faster clock?

Re: More complex instructions to reduce cycle overhead

<KiHnI.151039$wd1.100928@fx41.iad>


https://www.novabbs.com/devel/article-flat.php?id=16784&group=comp.arch#16784

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!newsfeed.xs4all.nl!newsfeed9.news.xs4all.nl!news-out.netnews.com!news.alt.net!fdc2.netnews.com!peer03.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx41.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me> <c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com> <s7mio3$qfs$1@dont-email.me> <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com> <jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org> <049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com> <s7n2gj$5na$2@dont-email.me> <5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com> <s7n6ah$t1$1@dont-email.me> <590ea343-cd96-4082-800c-f02412204262n@googlegroups.com> <s7nchh$b58$1@dont-email.me>
In-Reply-To: <s7nchh$b58$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 18
Message-ID: <KiHnI.151039$wd1.100928@fx41.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 15 May 2021 03:34:02 UTC
Date: Fri, 14 May 2021 23:33:38 -0400
X-Received-Bytes: 2144
 by: EricP - Sat, 15 May 2021 03:33 UTC

Ivan Godard wrote:
>
> Actually I was asking in the abstract, not for Mill. Even as !HG I can
> see that the pulsed nature of static scheduling doesn't lend itself to
> asynch; you would have to have something that looks quite like your
> scoreboard to do what we do with the crossbar.
>
> So I was most interested in asynch in the pipelines - using it to get
> rid of the stage flipflops. Would you consider using a FPU (for example)
> that was asynch internally but synch at both ends, in search of fewer
> stages/faster clock?

There used to be a thing people were kicking about called wave pipelining.
I gather the signals flowed through the circuits in waves with
no synchronization (except presumably at the end).

A quick search finds multiple recent hits, so it's not dead.

Re: Signed division by 2^n

<s7nm10$lok$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16785&group=comp.arch#16785

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Signed division by 2^n
Date: Sat, 15 May 2021 00:25:50 -0500
Organization: A noiseless patient Spider
Lines: 364
Message-ID: <s7nm10$lok$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<s7n2va$8el$1@dont-email.me>
<338af7a4-6369-4e4e-ae33-7c89cc11d2f5n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 15 May 2021 05:25:52 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="30aa43141088ac13a2e1f0fa95e34ad5";
logging-data="22292"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18efyfa52bENGmG1Yb7xIjY"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:PGByJZUyPdfnJVtZ75wfAHRxE8c=
In-Reply-To: <338af7a4-6369-4e4e-ae33-7c89cc11d2f5n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 15 May 2021 05:25 UTC

On 5/14/2021 7:54 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 7:00:45 PM UTC-5, BGB wrote:
>> On 5/14/2021 4:29 PM, MitchAlsup wrote:
>>> On Friday, May 14, 2021 at 2:23:49 PM UTC-5, BGB wrote:
>
>>>>> <
>>>>> Possibly, but the multiplier is dealing with 53+53 bit things minimum,
>>>>> if said multiplier also FDIV and SQRT then it is 57+57
>>>>> if said multiplier also Transcendentals then it is 58+58.....
>>> <
>>>> This is assuming one uses a "square" multiplier, rather than a
>>>> "triangular" multiplier.
>>> <
>>> The proper word is parallelogram not square
>> Could also call it rhombus or diamond...
>>>>
>>>> As-is:
>>>> Square Multiplier: 54*54 -> 108
>>>> Triangular Multiplier: 54*54 -> 72
>>> <
>>> I question your definition of triangular::
>>>> Triangular Multiplier: 54×54 -> 54 !?!
>>>>
>> It is built from DSPs, which can generate an output twice as wide as the
>> inputs.
>>
>> They can be:
>> 16*16 -> 32, Signed/Unsigned
>> 17*17 -> 34, Signed/Unsigned
>> 18*18 -> 36, Nominally Signed
>> Can fake Unsigned via extra LUTs.
>>
>> If one builds a triangular multiplier, the bottom parts hang off the
>> bottom, so one gets an additional 16-18 bits of width.
> <
> Are you talking about the more significant triangle of the multiplier
> or the lesser significant triangle of the multiplier ?
> <
> <
> This is a symptom of the library you are using not a general property of
> multipliers.

I suspect it is more related to the DSP48s, which are hard logic blocks
in this FPGA. But, one is sorta stuck using them, whether or not they
are a good fit, because there is no other cost-effective alternative.

>>
>> However, because the low order bits are incomplete, they also tend to
>> be erroneous, and are discarded.
> <
> upper......

An interesting property of multiplication is that the high-order digits
still tend to be correct even if one throws out most of the low-order
digits...
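
A minimal C sketch of that property (mine, not BGB's actual RTL): build a
32x32 product from 16x16 partial products, drop the low*low term the way a
truncated ("triangular") array would, and the upper word is off by at most 1.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustration only: compare the exact upper 32 bits of a 32x32->64
   product against a version that discards the al*bl partial product. */
int main(void)
{
    uint64_t max_err = 0;
    srand(1);
    for (int i = 0; i < 1000000; i++) {
        uint32_t a = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        uint32_t b = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        uint32_t ah = a >> 16, al = a & 0xFFFF;
        uint32_t bh = b >> 16, bl = b & 0xFFFF;

        uint64_t exact = (uint64_t)a * b;
        /* Keep ah*bh, ah*bl, al*bh; discard al*bl. */
        uint64_t trunc = ((uint64_t)ah * bh << 32)
                       + ((uint64_t)ah * bl << 16)
                       + ((uint64_t)al * bh << 16);

        uint64_t err = (exact >> 32) - (trunc >> 32);
        if (err > max_err) max_err = err;
    }
    /* The dropped term is < 2^32, so the upper word changes by at most 1. */
    printf("max error in upper 32 bits: %llu\n", (unsigned long long)max_err);
    return 0;
}

Over the random trials the printed maximum comes out to at most 1, matching
the bound in the comment.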

>>
>> The actual usable portion is roughly the same width as the inputs, but I
>> keep these low-order bits as this is what the intermediate adders tend
>> to work with, and they are discarded afterwards.
>>
>> So, 54*54->54, 72*72->72, or 80*80->80, ...
>> Would be more what one would see after generating the final output.
>>>> The LongDouble used a wider multiplier:
>>>> 72*72->90 (initial)
>>>> 85*85->90 (likely needed to avoid some issues, *)
>>>>
>>>> *: Algorithms based on iterative convergence get stuck in an infinite
>>>> loop if FADDX and FMULX use different mantissa lengths.
>>>>
>>>> This means that I would need to make them agree on a fixed 80-bit mantissa.
>>>>
>>>> Or: S.E15.F80.P32 (where P=Zero Padding)
>>>>
>>>>
>>>> I did at one point start trying to implement a combined FMAC unit, but
>>>> then realized it was likely to have a fairly high latency.
>>> <
>>> Your implementation medium is harming your ability to pull off your design.
>> Very possibly...
>>
>>
>> I just spent like the past week battling with debugging and trying to
>> get stuff to pass timing reliably.
>>
>> Switched over to trying to get it to pass timing at 75MHz, because if I
>> can get it to pass timing much at all at 75MHz, it will hopefully stop
>> unpredictably failing timing at 50MHz.
>>
>>
>> But, things like passing/failing timing, resource usage, estimated power
>> usage, ... are basically kinda like a roulette wheel which jumps all
>> over the place.
>>
>>
>> Similarly, whether or not it works in simulation is no guarantee it will
>> work on the actual FPGA (simulation starts typically with everything
>> holding zeroes, whereas the FPGA seems to start with pretty much
>> everything initialized to garbage values; requiring a global "reset"
>> strobe signal to try to pull everything into a "known good" state).
>>
>> Also, the sort of "metastability" from clock-domain crossings isn't
>> really modeled in simulation, nor the effects of random internal
>> corruptions, or the apparent tendency of the FPGA to start experiencing
>> errors once it warms up (I stuck a RasPi heat-sink on it, probably also
>> need a case with a fan, *), ...
>>
>>
>> *: Basically, once the FPGA gets much over ~45C or so, its reliability
>> seems to get a lot worse (and stuff gets a lot more crash-prone).
> <
> That is getting hot enough that the LUTs lose their programming!

Yeah.

Whatever is going on, it seems this is enough to cause a 10mm die to get
pretty warm absent some sort of active cooling.

But, if the FPGA gets too warm, whatever it is running will tend to
either get crash-prone and/or deadlock. Usual solution is to turn it off
and let it cool off.

>>
>> It didn't come with a heat-sink though, as I guess the board designers
>> figured that passive air-cooling from the bare FPGA was sufficient?...
> <
> More likely they wanted the user of the chip to add appropriate amounts
> of cooling.

Dunno, Artix-7 and Spartan-7 are in the power-and-cost-optimized category.

Marketing materials say the FPGA operates in the milliwatt range.

Vivado tends to give an estimate that it uses a little over 1 watt with
this project (but, this estimate varies wildly anywhere from ~ 0.7W to
1.3W).

It could be:
milliwatt range, doesn't need a heatsink.
watt range, probably needs a heatsink.
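
A quick way to tell the two cases apart (the thetaJA value below is an
assumed placeholder, not the Artix-7 datasheet figure): junction temperature
is roughly ambient plus P * thetaJA.

#include <stdio.h>

/* Back-of-envelope junction temperature: Tj = Ta + P * thetaJA.
   thetaJA here is an assumed round number; substitute the real
   package/board figure to get a usable estimate. */
int main(void)
{
    double ambient_c = 25.0;            /* room temperature            */
    double theta_ja  = 25.0;            /* assumed C/W, package+board  */
    double power_w[] = { 0.1, 0.7, 1.0, 1.3 };

    for (int i = 0; i < 4; i++) {
        double tj = ambient_c + power_w[i] * theta_ja;
        printf("P = %.1f W  ->  Tj ~ %.0f C\n", power_w[i], tj);
    }
    return 0;
}

With these assumed numbers the watt-range estimates land around the ~45C
point mentioned above, while a genuinely milliwatt-range part would barely
rise above ambient.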

>>
>> Can't mount a fan directly to the heatsink though, as ~ 12mm fans aren't
>> really a thing (smallest I can find are ~ 30mm). Also seemingly not a
>> thing: a 30mm aluminum heatsink that narrows and sticks onto a 10mm die
>> (via thermal adhesive). Also preferable if the fan could run at 3.3v and
>> under 50mA (so it could be powered via a PMOD connector or similar).
> <
> Piece of cake to machine, starting from a 30×30 heat sink. I could knock one
> out in 10 minutes if I had a 30×30 to start with--the machining of the fins and
> the anodizing is the hard parts. Relieving the bottom so it is only 10×10 is
> easy.

Could be.

I suspect work-holding would be the hard part here:
This part would be too small to be held effectively in a vise.

It has been a while since I have machined anything, mostly as the garage
is an endless crap-storm. Though, I could probably do a lot of it using
a file, rather than a milling machine and an endmill.

Or, if one has a 30mm copper heatsink, and a 10mm x 10mm x 3mm spacer or
similar, they could solder it onto the bottom of the heatsink.

The heatsink I have on there right now is basically a 12mm x 12mm x 6mm
aluminum square with fins. Roughly matches the size of the die, but is
of reduced effectiveness without some form of active airflow.

>>
>> ...
>>
>> Current strategy though is mostly turning it off when it starts getting
>> too warm.
>>>>
>>>> An FPU with a separate FADD and FMUL unit could give lower latency for
>>>> FADD and FMUL, and could fake FMAC with only slightly higher latency
>>>> than the combined unit.
>>> <
>>> Maybe,
>>> FADD: 2-cycles is darned hard, 3-cycles is pretty easy.
>>> FMUL: 4-cycles is rather standard for 16-gates/cycle machines.
>>> FMAC: 4-cycles is pretty hard, 5-cycles is a bit better
>>> <
>>> AMD Athlon and Opteron used FADD=4 and FMUL=4 to simplify the
>>> pipelineing and to prevent having both units deliver result in the same
>>> cycle.
> <
>> Not sure how gate-delay compares with FPGA logic levels; ATM I am mostly
>> looking at 12 .. 14 (some parts are 10 or 11 logic levels).
>>
>> Looking at traces, they mostly seem to be LUT3/LUT4/LUT5 with the
>> occasional CARRY4 or similar, traveling between pairs of FDRE elements.
>>
>>
>> Internally, the FADD and FMUL units still have a 5-cycle latency, but
>> gain an extra 2 cycles due to an input/output buffering mechanism (also
>> needed for SIMD).
>>
>> Eg:
>> EX1: FPU receives inputs from pipeline;
>> EX2: Inputs fed into FADD or FMUL;
>> ... Work Cycles ...
>> EX3: Get result from FPU.
>>
>> Previously, the FADD and FMUL would receive inputs directly during EX1,
>> but then they needed to deal with pipeline stalls. Adding the extra
>> cycle (with the outer FPU module managing input/output buffering) makes
>> them independent of the stall, which helps timing, but also adds an
>> extra clock-cycle of latency to the operation.
>>
>> The extra buffering cycle also helps with timing, allowing more time for
>> the value to get from the register-forwarding logic to the FPU.
>>
>>
>> So, FADD Stages:
>> C1 Unpack Input Arguments
>> Find difference of exponents
>> Decide which side is 'A' and which is 'B'
>> Right-Shift FracB
>> C2 Optionally Invert FracB
>> Add (FracA+FracB+Cin)
>> C3 CLZ (Renorm 1)
>> C4 Left-Shift (Renorm 2)
>> Try to round
>> C5 Pack Output / Done
>>
>> FMUL Stages:
>> C1 Unpack Input Arguments
>> Set up Exponents
>> Multiply Input Fragments
>> C2 ADD Stuff
>> C3 ADD Stuff
>> C4 Renorm Adjust / Round
>> C5 Pack Output / Done
>>
>> Renorm is easier for FMUL, as it assumes that the value falls in the
>> range of 1.0 .. 4.0, as opposed to FADD where it can be anything.
>>> <
>>> On the other hand, a single FMAC unit can do it all::
>>> FADD: FMAC 1*Rs1+Rs2
>>> FMUL: FMAC Rs1*Rs2+0
>>> <
>>> So if you find yourself in a position where you need FMAC (say to meet
>>> IEEE 754-2008+) you can have the design team build the FMAC unit.
>>> Later on, when building the next and wider machine, you can add an
>>> FADD or FMUL or both based on statistics you have gathered from
>>> generation 1. Given and FMAC, FADD is a degenerate subset which
>>> a GOOD Verilog compiler can autogenerate if you feed it the above
>>> fixed values {FMAC 1*Rs1+Rs2 and FMAC Rs1*Rs2+0}. THis REALLY
>>> reduces the designer workloads.
> <
>> My position is that I don't feel full IEEE conformance is a realistic
>> goal for this.
>>
>>
>> From what I can gather, in a loose-sense it does seem to provide most
>> of what IEEE-754-1985 seems to ask for, with a few exceptions:
>> Denormal as Zero;
>> FADD/FSUB/FMUL Only;
>> Compare Ops (via ALU);
>> Format Conversion (via FPU or ALU);
>> ...
>>
>>
>> Native FP Formats:
>> Double / Binary64 (Scalar, 2x SIMD)
>> Single / Binary32 (Conv, 2x | 4x SIMD)
>> Half / Binary16 (Conv, 4x SIMD)
>>
>> Long Double Extension (Optional, Cost):
>> Truncated Quad / Binary128
>>
>> RGBF Extension:
>> FP8S / FP8U (Packed Conv Only)
>>>>
>>>>
>>>> There are some operations though which could exist with an FMAC unit
>>>> which would not work correctly with an FMUL+FADD glued together, but I
>>>> am already pushing the limits of what seems viable on the XC7A100T.
>>> <
>>> Yep, your implementation medium is getting in your way. So are some of
>>> your tools.
>>>
>> Yeah, probably...
>> Verilator is seemingly pretty buggy in some areas.
>>
>> I am using the freeware / feature-limited version of Vivado, not sure
>> what the Commercial / EDA version is like, or what all features they
>> disabled.
>>
>> From what I can gather, Vivado is sorta like:
>> Free: Spartan and Artix FPGAs, some lower-end Zynq and Kintex devices.
>> Per-device vouchers: They enable certain FPGAs with the purchase of the
>> associated dev boards;
>> Commercial: AFAICT, $1k per seat per year?...
>>
>>
>> Say, if I bought to get one of the Kintex dev-boards, they would
>> apparently come with a voucher to allow Vivado to target them (well,
>> otherwise, it is a lot of money for a board one can't use).
>>
>> But, the Kintex boards generally go for upwards of $1000, and I still
>> don't have a job at the moment, so this is pretty steep...
>>
>>
>> Though, synthesis on a Kintex at a -2 speed grade (for one of the FPGA's
>> supported by Vivado WebPack) implies I can achieve clock speeds of ~ 150
>> to 200 MHz, as opposed to the 50MHz or 75MHz I can get on an Artix.
>>
>> Someone else had apparently once tested it on a Kintex and got it to
>> pass timing at ~ 180MHz.
>>
>>
>>
>> When I tried before using Quartus on a Cyclone V (targeting the same
>> type as in the DE10), was able to get it up to ~ 110 MHz, but this
>> didn't seem like enough of a speedup to justify me buying a DE10 (more
>> so when the DE10 had less RAM for the FPGA part, and I could only manage
>> to fit a single BJX2 core into the FPGA).
>>
>> These boards have an ARM SoC + FPGA part, there is like 1GB for the ARM
>> SoC, but with a separate 64MB RAM module for the FPGA.
>>
>> In theory, the number of LUTS/ALMS in the DE10 is large enough that it
>> should be more competitive with an Artix or Spartan, not sure what is
>> going on there...
>>
>> But, as noted, I could clock it a little higher than the Spartan or
>> Artix, but not enough to convince me to throw money at buying the actual
>> hardware or figure out how to deal with interacting with an ARM SoC...
>>
>>
>> Zynq is kinda similar, just I would have to figure out how to go about
>> plugging the BJX2 into an AXI Bus, which would be pretty much the only
>> way it could access RAM or similar.
>>
>> Granted, If I wanted to use Vivado's MIG (Memory Interface Generator), I
>> would also need to figure out AXI.
>>
>>
>> Though, I suspect MIG may know how to make the RAM work correctly in its
>> rated speed window (vs my DDR controller which is apparently running the
>> RAM in a sort of low-power standby mode).
>>
>> I did write a controller which could, in theory, run the RAM at 150MHz
>> (within its rated speed), but couldn't figure out how to make it
>> "actually work" on the actual hardware.
>>
>>
>> But, memory bandwidth is hard...
>> And still a pretty big bottleneck, it seems.
>>
>> ...


Click here to read the complete article
Re: More complex instructions to reduce cycle overhead

<s7nv66$u2t$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=16786&group=comp.arch#16786

 by: Thomas Koenig - Sat, 15 May 2021 08:02 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:
> Ivan Godard wrote:
>>
>> Actually I was asking in the abstract, not for Mill. Even as !HG I can
>> see that the pulsed nature of static scheduling doesn't lend itself to
>> asynch; you would have to have something that looks quite like your
>> scoreboard to do what we do with the crossbar.
>>
>> So I was most interested in asynch in the pipelines - using it to get
>> rid of the stage flipflops. Would you consider using a FPU (for example)
>> that was asynch internally but synch at both ends, in search of fewer
>> stages/faster clock?
>
> There used to be a thing people were kicking about called wave pipelining.
> I gather the signals flowed through the circuits in waves with
> no synchronization (except presumably at the end).

That sounds scary - in effect, the synchronization between the different
bits in, let's say, an adder would be implied by the gate timing?

You would need very narrow tolerances on your gates, then (both
too fast and too slow would be deadly).

Or is some other mechanism proposed?

Re: More complex instructions to reduce cycle overhead

<2021May15.125605@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16787&group=comp.arch#16787

 by: Anton Ertl - Sat, 15 May 2021 10:56 UTC

Stefan Monnier <monnier@iro.umontreal.ca> writes:
>
>IIUC cycle time (for the EX stage) can be split into:
>A- time to perform single-cycle operation
>B- time to propagate the result through the forwarding network
>C- time for the actual latch/flipflop
>
>Arguably, B and C are overheads.
>Have there been ISAs that aim to maximize the proportion of time spent in
>A rather than B and C by having instructions that perform several
>sequential operations.
>
>I guess the "negate inputs" options in MY66000 (and the shifts in ARM3)
>could be counted as such an example, tho a limited one.
>
>I'm thinking more of an ISA where an instruction is expected to do
>something like `(A op1 B) op2 C` in a single cycle (for various
>combinations of `op1` and `op2` like additions, shifts, and whatnot).

SuperSPARC did A+B+C in a single cycle (combining two add instructions).

Willamette (and Northwood) could do two back-to-back ALU instructions
in a single cycle. They used 16-bit ALUs that were staggered, with
the high part running half a cycle after the low part.

>I'm far from convinced it would work out well (there's a risk you'd end
>up having to use a NOP for `op1` or `op2` in too many cases), but I'm
>curious if someone has tried out something like that,

SuperSPARC was not particularly competitive, mainly because others had
much higher clock rates by the time.

Willamette and Northwood had high clock rate, but, despite this
advantage in dependent ALU operations, a relatively low IPC, so it was
at best neck and neck with AMD CPUs of the time (with lower clock and
higher IPC). It was never really clear to me where the Pentium 4 lost
IPC; I sometimes read about replays playing a big role, but never got
a good understanding of the effects involved.

In any case, this ALU feature was dropped already with the Prescott,
without obvious adverse effects on IPC; but Prescott had 64-bit ALUs,
so the staggered 16-bit style may have been inappropriate. Would two
staggered 32-bit ALUs be possible given the slightly longer (in terms
of gates) stages of current Intel and AMD CPUs (not to mention the
M1)? But it has not been done, so the benefits are probably not that
great.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Signed division by 2^n

<rImdnfoEmcoqJQL9nZ2dnUU78aHNnZ2d@supernews.com>

https://www.novabbs.com/devel/article-flat.php?id=16788&group=comp.arch#16788

 by: aph...@littlepinkcloud.invalid - Sat, 15 May 2021 12:04 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>
> Even Aarch64, which supports pretty exotic stuff in some cases, needs
> 4 instructions for a signed symmetric division (or at least that's
> what I get with gcc).

Three, if you use a conditional select-and-increment (RISC? Don't make
me larf :-) but on some implementations it might not be a win.

cmp w0, #0
cinc w8, w0, lt
asr w0, w8, #1

Andrew.

Re: More complex instructions to reduce cycle overhead

<PcQnI.365058$2A5.181861@fx45.iad>

https://www.novabbs.com/devel/article-flat.php?id=16790&group=comp.arch#16790

 by: EricP - Sat, 15 May 2021 13:41 UTC

Thomas Koenig wrote:
> EricP <ThatWouldBeTelling@thevillage.com> schrieb:
>> Ivan Godard wrote:
>>> Actually I was asking in the abstract, not for Mill. Even as !HG I can
>>> see that the pulsed nature of static scheduling doesn't lend itself to
>>> asynch; you would have to have something that looks quite like your
>>> scoreboard to do what we do with the crossbar.
>>>
>>> So I was most interested in asynch in the pipelines - using it to get
>>> rid of the stage flipflops. Would you consider using a FPU (for example)
>>> that was asynch internally but synch at both ends, in search of fewer
>>> stages/faster clock?
>> There used to be a thing people were kicking about called wave pipelining.
>> I gather the signals flowed through the circuits in waves with
>> no synchronization (except presumably at the end).
>
> That sounds scary - in effect, the synchronization between the different
> bits in, let's say, an adder would be implied by the gate timing?
>
> You would need very narrow tolerances on your gates, then (both
> too fast and to slow would be deadly).
>
> Or is some other mechanism proposed?

They eliminate intermediate pipeline stage registers,
then tools insert buffers so that all pathways through the combo logic
have the same propagation delay ensuring all output signals arrive at
the same instant. That allows them to change inputs for a new wave
before prior results have finished.

The computation is proceeding through the combo circuit as a wavelet.
I imagine it is extremely susceptible to process variation,
maybe temperature, which would widen or narrow the wavelet and skew
its relative time position. Different paths may change differently.

The result along all paths must not be sampled too soon or too late,
so one issue would be getting the clock to arrive at just the right
time when the whole wavelet is valid, for all variations.

Then if there are multiple wave pipelines for different calculations,
there are the meta-stability issues to deal with when they interact.

Sounds like a pain.
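
A toy version of the constraint being described (all numbers invented for
illustration): with no internal registers, the minimum clock period is set
roughly by the spread between the fastest and slowest paths plus the
sampling overhead, while the total delay only sets how many waves are in
flight at once.

#include <stdio.h>
#include <math.h>

/* Rough wave-pipelining model: the clock is limited by the delay
   *dispersion* (Dmax - Dmin) plus setup/hold/skew allowance, not by
   Dmax itself; Dmax/Tclk gives the number of waves in flight. */
int main(void)
{
    double d_max = 4.0;   /* slowest path through the block, ns */
    double d_min = 3.4;   /* fastest path, ns                   */
    double t_ovh = 0.3;   /* setup/hold + skew allowance, ns    */

    double t_clk_min = (d_max - d_min) + t_ovh;       /* min period */
    int    waves     = (int)ceil(d_max / t_clk_min);  /* in flight  */

    printf("min clock period ~ %.2f ns (%.0f MHz)\n",
           t_clk_min, 1000.0 / t_clk_min);
    printf("latency ~ %d cycles (%d waves in flight)\n", waves, waves);
    return 0;
}

Shrink the spread and the clock can run well above 1/Dmax; let it grow
(process, temperature, data-dependent paths) and the scheme collapses,
which is exactly the tolerance problem raised above.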

Re: More complex instructions to reduce cycle overhead

<jwvczts159e.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=16791&group=comp.arch#16791

 by: Stefan Monnier - Sat, 15 May 2021 13:44 UTC

> That sounds scary - in effect, the synchronization between the different
> bits in, let's say, an adder would be implied by the gate timing?
> You would need very narrow tolerances on your gates, then (both
> too fast and to slow would be deadly).

This does sound rather unworkable with today's variable frequencies.

Stefan

Re: More complex instructions to reduce cycle overhead

<s7onqq$ape$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=16792&group=comp.arch#16792

 by: Thomas Koenig - Sat, 15 May 2021 15:02 UTC

EricP <ThatWouldBeTelling@thevillage.com> schrieb:
> Thomas Koenig wrote:

>> That sounds scary - in effect, the synchronization between the different
>> bits in, let's say, an adder would be implied by the gate timing?
>>
>> You would need very narrow tolerances on your gates, then (both
>> too fast and to slow would be deadly).
>>
>> Or is some other mechanism proposed?
>
> They eliminate intermediate pipeline stage registers,
> then tools insert buffers so that all pathways through the combo logic
> have the same propagation delay ensuring all output signals arrive at
> the same instant.

I'm an engineer, and I know full well that, in anything we build,
there can not be such a thing as the same _anything_ ...

Rather, they must be counting on the inevitable dispersion to
be small enough that they can still catch it after a presumably
small number of cycles.

> That allows them to change inputs for a new wave
> before prior results have finished.
>
> The computation is proceeding through the combo circuit as a wavelet.
> I imagine it is extremely susceptible to process variation,
> maybe temperature, which would widen or narrow the wavelet and skew
> its relative time position. Different paths may change differently.
>
> The result along all paths must not be sampled too soon or too late,
> so one issue would be getting the clock to arrive at just the right
> time when the whole wavelet is valid, for all variations.
>
> Then if there are multiple wave pipelines for different calculations,
> there are the meta-stability issues to deal with when they interact.

> Sounds like a pain.

A royal pain, indeed...

Re: More complex instructions to reduce cycle overhead

<04b0a21b-151d-478d-9d38-49576c580ad4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16793&group=comp.arch#16793

 by: Paul A. Clayton - Sat, 15 May 2021 15:31 UTC

On Friday, May 14, 2021 at 10:44:04 PM UTC-4, Ivan Godard wrote:
> On 5/14/2021 6:31 PM, MitchAlsup wrote:
[snip]
>> Without the tool sets, and some demonstrable way to test them, no it is
>> not time to go asynch.
>> <
>> Now, Ivan, consider a Mill implementation where the FP Multiplier can take 3,
>> 4, or 5 cycles to deliver its results. I can see a reservation station-like machine
>> or a scoreboard machine being able to "deal with this" (with extra logic), but
>> a statically scheduled machine has no chance.
>> <
>> Sutherland's paper on asynchronous only touched on the difficulty of the
>> concordance problem and this is where it really gets out of control.

[snip]
> Actually I was asking in the abstract, not for Mill. Even as !HG I can
> see that the pulsed nature of static scheduling doesn't lend itself to
> asynch; you would have to have something that looks quite like your
> scoreboard to do what we do with the crossbar.
>
> So I was most interested in asynch in the pipelines - using it to get
> rid of the stage flipflops. Would you consider using a FPU (for example)
> that was asynch internally but synch at both ends, in search of fewer
> stages/faster clock?

Technically, one does not need to use variable-timing asynchronous
logic to avoid (some) inter-stage buffering. Wave pipelining avoids
latching intermediate results; however, that depends on one wave
never catching up to another. (Such would typically still have latching
but the amount could be reduced.) In the past, this was used (by HP, e.g.)
for L1 cache access. With dynamic voltage-frequency scaling, such
becomes even more difficult to implement.

Just as one can speculatively overclock a design and replay on
rare failures, one could speculate on asynchronous timing while
using a static schedule. (The difference between timing that succeeds
frequently enough to overcome the replay overhead [and justify the
design complexity] and worst-case timing may well not justify the
effort, but such is an abstract theoretical possibility.)
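
A back-of-envelope version of that trade-off (all numbers assumed for
illustration): speculating on the fast timing only wins while the expected
replay cost stays below the worst-case margin.

#include <stdio.h>

/* Break-even for "speculate on fast timing, replay on misses":
   expected cost per op is t_fast * (1 + p * replay_cycles); the
   speculation pays off while this stays below t_worst. */
int main(void)
{
    double t_fast        = 0.8;   /* cycle assuming typical timing    */
    double t_worst       = 1.0;   /* cycle assuming worst-case timing */
    double replay_cycles = 10.0;  /* penalty per timing miss          */

    double p_breakeven = (t_worst / t_fast - 1.0) / replay_cycles;

    printf("speculation wins while miss rate < %.1f%%\n",
           100.0 * p_breakeven);
    return 0;
}

With these particular numbers the crossover is a 2.5% miss rate; the real
margin between typical and worst-case asynchronous timing would decide
whether that is ever attractive.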

Re: Signed division by 2^n

<2021May15.182817@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=16795&group=comp.arch#16795

 by: Anton Ertl - Sat, 15 May 2021 16:28 UTC

aph@littlepinkcloud.invalid writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>>
>> Even Aarch64, which supports pretty exotic stuff in some cases, needs
>> 4 instructions for a signed symmetric division (or at least that's
>> what I get with gcc).
>
>Three, if you use a conditional select-and-increment (RISC? Don't make
>me larf :-) but on some implementations it might not be a win.
>
> cmp w0, #0
> cinc w8, w0, lt
> asr w0, w8, #1

It looks to me like you are dividing by 2. Here's what gcc-7.5 gives
me for that:

18: 8b40fc00 add x0, x0, x0, lsr #63
1c: 9341fc00 asr x0, x0, #1

For dividing by 4 it gives me:

0: 91000c01 add x1, x0, #0x3
4: f100001f cmp x0, #0x0
8: 9a80b020 csel x0, x1, x0, lt // lt = tstop
c: 9342fc00 asr x0, x0, #2
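
In C terms, both sequences implement the usual round-toward-zero
("symmetric") signed division by a power of two. A small sketch of my own
(not from the post), assuming the arithmetic right shift gcc actually
provides for signed types:

#include <assert.h>
#include <stdint.h>

/* Signed division by 2^n, quotient rounded toward zero.
   Assumes >> on a negative signed value is an arithmetic shift,
   as gcc provides on aarch64. */
static int64_t sdiv_pow2(int64_t x, unsigned n)
{
    int64_t bias = (x < 0) ? (((int64_t)1 << n) - 1) : 0;   /* 2^n - 1 */
    return (x + bias) >> n;
}

/* The two-instruction divide-by-2 above: the bias is just the sign bit,
   i.e.  add x0, x0, x0, lsr #63 ; asr x0, x0, #1  */
static int64_t sdiv2(int64_t x)
{
    return (x + (int64_t)((uint64_t)x >> 63)) >> 1;
}

int main(void)
{
    assert(sdiv_pow2(-7, 2) == -1 && sdiv_pow2(7, 2) == 1);
    assert(sdiv2(-7) == -3 && sdiv2(7) == 3);
    return 0;
}

The bias-before-shift form is why the divide-by-4 case needs the extra
"add #3" that the gcc output shows.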

Concerning "RISC?", it's a variant of a generalized conditional select
instruction; it's a register-to-register instruction with two source
registers, a flag input and one target, and seems to fit the bill
nicely for a RISC with flags. And in particular, it does not need the
destination register as one source, unlike Alpha's conditional move,
which the 21264 had to split into two microinstructions. The general
instruction is

rd = cc ? rn : op(rm)

where op(x) is x, ~x, x+1, or -x (i.e. (~x)+1). So we have to do a
separate instruction for adding 3 in the divide-by-4 case.

This instruction appears in the assembly language under 9 different
names (for specialized versions), with the most general ones being

CSEL rd, rn, rm, cc if(cc) rd = rn; else rd = rm
CSINC rd, rn, rm, cc if(cc) rd = rn; else rd = rm + 1
CSINV rd, rn, rm, cc if(cc) rd = rn; else rd = ~rm
CSNEG rd, rn, rm, cc if(cc) rd = rn; else rd = -rm

Other variants have rn=rm or rn=rm=XZR/WZR (zero). Once you add a
conditional move instruction, applying simple unary ALU stuff to one
operand is a logical extension. What makes you think that on some
implementations this may not be a win?
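
A behavioural C model of that family (my sketch of the semantics listed
above, not anybody's RTL), plus aph's three-instruction divide-by-2 written
in terms of it:

#include <stdint.h>
#include <stdio.h>

/* rd = cc ? rn : op(rm), with op = identity, +1, ~, or negate,
   matching CSEL / CSINC / CSINV / CSNEG as listed above. */
enum csop { CSEL_OP, CSINC_OP, CSINV_OP, CSNEG_OP };

static uint64_t csel_family(enum csop op, int cc, uint64_t rn, uint64_t rm)
{
    if (cc)
        return rn;
    switch (op) {
    case CSINC_OP: return rm + 1;
    case CSINV_OP: return ~rm;
    case CSNEG_OP: return (uint64_t)0 - rm;
    default:       return rm;          /* plain CSEL */
    }
}

int main(void)
{
    /* cmp w0,#0 ; cinc w8,w0,lt ; asr w0,w8,#1
       CINC rd,rn,lt is CSINC rd,rn,rn,!lt: add 1 only when x < 0. */
    int64_t x  = -7;
    int     lt = (x < 0);
    int64_t t  = (int64_t)csel_family(CSINC_OP, !lt, (uint64_t)x, (uint64_t)x);
    printf("%lld / 2 = %lld\n", (long long)x, (long long)(t >> 1));
    return 0;
}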

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: More complex instructions to reduce cycle overhead

<s7ovs8$26k$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=16796&group=comp.arch#16796

 by: Terje Mathisen - Sat, 15 May 2021 17:20 UTC

Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>
>> IIUC cycle time (for the EX stage) can be split into:
>> A- time to perform single-cycle operation
>> B- time to propagate the result through the forwarding network
>> C- time for the actual latch/flipflop
>>
>> Arguably, B and C are overheads.
>> Have there been ISAs that aim to maximize the proportion of time spent in
>> A rather than B and C by having instructions that perform several
>> sequential operations.
>>
>> I guess the "negate inputs" options in MY66000 (and the shifts in ARM3)
>> could be counted as such an example, tho a limited one.
>>
>> I'm thinking more of an ISA where an instruction is expected to do
>> something like `(A op1 B) op2 C` in a single cycle (for various
>> combinations of `op1` and `op2` like additions, shifts, and whatnot).
>
> SuperSPARC did A+B+C in a single cycle (combining two add instructions).
>
> Willamette (and Northwood) could do two back-to-back ALU instructions
> in a single cycle. They used 16-bit ALUs that were staggered, with
> the high part running half a cycle after the low part.
>
>> I'm far from convinced it would work out well (there's a risk you'd end
>> up having to use a NOP for `op1` or `op2` in too many cases), but I'm
>> curious if someone has tried out something like that,
>
> SuperSPARC was not particularly competitive, mainly because others had
> much higher clock rates by the time.
>
> Willamette and Northwood had high clock rate, but, despite this
> advantage in dependent ALU operations, a relatively low IPC, so it was
> at best neck and neck with AMD CPUs of the time (with lower clock and
> higher IPC). It was never really clear to me where the Pentium 4 lost
> IPC; I sometimes read about replays playing a big role, but never got
> a good understanding of the effects involved.
>
> In any case, this ALU feature was dropped already with the Prescott,
> without obvious adverse effects on IPC; but Prescott had 64-bit ALUs,
> so the staggered 16-bit style may have been inappropriate. Would two
> staggered 32-bit ALUs be possible given the slightly longer (in terms
> of gates) stages of current Intel and AMD CPUs (not to mention the
> M1)? But it has not been done, so the benefits are probably not that
> great.

The P4 was an extreme "hurry up and wait" machine, in that it only ran
fast as long as you had a big chunk of fast-core-only instructions: the
switches between the fast & slow cores cost a lot, and most compiled code
tended to cause them far more often than the optimistic architects assumed
would be the case.

It did have one huge Achilles' Heel imho: They made both integer MUL and
shifts very slow, so there was no fast way to do address arithmetic
unless it all fit in LEAs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: More complex instructions to reduce cycle overhead

<ee2316da-9876-4ab0-919c-651f3e98b3cbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16797&group=comp.arch#16797

 by: MitchAlsup - Sat, 15 May 2021 18:23 UTC

On Friday, May 14, 2021 at 9:44:04 PM UTC-5, Ivan Godard wrote:
> On 5/14/2021 6:31 PM, MitchAlsup wrote:
> > On Friday, May 14, 2021 at 7:57:55 PM UTC-5, Ivan Godard wrote:
> >> On 5/14/2021 5:43 PM, MitchAlsup wrote:
> >>> On Friday, May 14, 2021 at 6:52:53 PM UTC-5, Ivan Godard wrote:
> > <
> >>>>> For single cycle back-to-back, this is accurate. C, however, is not a delay
> >>>>> one can get rid of, unless one is not building a fully pipelined machine
> >>>>> (new operation starting every cycle in the same FU.)
> >>> <
> >>>> You ever play with asynchronous logic?
> >>> <
> >>> Yes, ever try to talk the chip testing people into testing a chip with asynchronous
> >>> pipelines ??
> >>>
> >> I remind you: IANAHG :-)
> >>
> >> Synchronous has become the de facto standard, like x86 has. Do you, as a
> >> HG, feel that it's time to re-examine asynchronous?
> > <
> > There are no tool sets that can do GBOoO designs in asynchronous forms.
> > Simple in order 1-wide designs might be possible for a very brave design
> > team.
> > <
> > Without the tool sets, and some demonstrable way to test them, no it is
> > not time to go asynch.
> > <
> > Now, Ivan, consider a Mill implementation where the FP Multiplier can take 3,
> > 4, or 5 cycles to deliver its results. I can see a reservation station-like machine
> > or a scoreboard machine being able to "deal with this" (with extra logic), but
> > a statically scheduled machine has no chance.
> > <
> > Sutherland's paper on asynchronous only touched on the difficulty of the
> > concordance problem and this is where it really gets out of control.
> > <
> > So, in the case of Mill, it is completely plausible that the delivery of a result
> > to the belt is an easily timed event. However, consuming a result from the
> > belt is not -- and essentially you would need for all data to pass through a
> > synchronizer to be consumed as an operand. These synchronizers operating
> > IN the same clock domain are still 3-latches 1.2 clocks of delay--for forwarding !!
> >
> Actually I was asking in the abstract, not for Mill. Even as !HG I can
> see that the pulsed nature of static scheduling doesn't lend itself to
> asynch; you would have to have something that looks quite like your
> scoreboard to do what we do with the crossbar.
<
If there is a fairly easy way to calculate how long an instruction calculation
takes, so you can send the "I'm done" signal 1 or 2 cycles before data is
delivered, I can see designing FUs that run asynchronously inside.
>
> So I was most interested in asynch in the pipelines - using it to get
> rid of the stage flipflops. Would you consider using a FPU (for example)
> that was asynch internally but synch at both ends, in search of fewer
> stages/faster clock?
<
This would result in fewer cycles, but not a faster clock.

Re: More complex instructions to reduce cycle overhead

<61b63c44-4cea-4ee4-884a-dbc71412be8en@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16798&group=comp.arch#16798

 by: MitchAlsup - Sat, 15 May 2021 18:24 UTC

On Friday, May 14, 2021 at 10:34:05 PM UTC-5, EricP wrote:
> Ivan Godard wrote:
> >
> > Actually I was asking in the abstract, not for Mill. Even as !HG I can
> > see that the pulsed nature of static scheduling doesn't lend itself to
> > asynch; you would have to have something that looks quite like your
> > scoreboard to do what we do with the crossbar.
> >
> > So I was most interested in asynch in the pipelines - using it to get
> > rid of the stage flipflops. Would you consider using a FPU (for example)
> > that was asynch internally but synch at both ends, in search of fewer
> > stages/faster clock?
<
> There used to be a thing people were kicking about called wave pipelining.
> I gather the signals flowed through the circuits in waves with
> no synchronization (except presumably at the end).
<
Wave pipelining is dependent on knowing the dispersion of the logic
block you are running through (fastest path to slowest path). Heaven
help you if the path is data dependent.
>
> A quick search finds multiple recent hits, so it's not dead.

Re: More complex instructions to reduce cycle overhead

<s7pape$8bh$1@reader1.panix.com>

https://www.novabbs.com/devel/article-flat.php?id=16801&group=comp.arch#16801

 by: paul wallich - Sat, 15 May 2021 20:26 UTC

On 5/15/21 9:41 AM, EricP wrote:

> They eliminate intermediate pipeline stage registers,
> then tools insert buffers so that all pathways through the combo logic
> have the same propagation delay ensuring all output signals arrive at
> the same instant. That allows them to change inputs for a new wave
> before prior results have finished.
>
> The computation is proceeding through the combo circuit as a wavelet.
> I imagine it is extremely susceptible to process variation,
> maybe temperature, which would widen or narrow the wavelet and skew
> its relative time position. Different paths may change differently.

Didn't Mitch just mention that flip-flops add roughly 30% to a pipe
stage as things are designed now? That would seem to give you a ballpark
figure for how much margin you could potentially make available. So even
if it costs you a fair amount in buffers and other tricks you could
still come out ahead. Also seems to me that you might be able to
propagate some sort of "done" signal through the system as well to let
you know when to latch.

paul

Re: More complex instructions to reduce cycle overhead

<7be04135-ac2a-4641-b18b-41a5a214dfb8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16802&group=comp.arch#16802

 by: MitchAlsup - Sat, 15 May 2021 20:47 UTC

On Saturday, May 15, 2021 at 3:26:24 PM UTC-5, paul wallich wrote:
> On 5/15/21 9:41 AM, EricP wrote:
>
> > They eliminate intermediate pipeline stage registers,
> > then tools insert buffers so that all pathways through the combo logic
> > have the same propagation delay ensuring all output signals arrive at
> > the same instant. That allows them to change inputs for a new wave
> > before prior results have finished.
> >
> > The computation is proceeding through the combo circuit as a wavelet.
> > I imagine it is extremely susceptible to process variation,
> > maybe temperature, which would widen or narrow the wavelet and skew
> > its relative time position. Different paths may change differently.
<
> Didn't Mitch just mention that flip-flops add roughly 30% to a pipe
> stage as things are designed now?
<
16 logic gates plus 5 flip-flop gates makes the flip-flop overhead 5/(16+5) ~ 24%
<
> That would seem to give you a ballpark
> figure for how much margin you could potentially make available. So even
> if it costs you a fair amount in buffers and other tricks you could
> still come out ahead. Also seems to me that you might be able to
> propagate some sort of "done" signal through the system as well to let
> you know when to latch.
<
Only if the done signal is wave-front-incident with the data.
<
Back when I did a few SRAMs, we used a pair of dummy bit lines and a column
of SRAM cells to produce a signal we could use to time the flip-flops out of
the sense amps. So, yes, it can be done and done reliably, but only if the
"I'm done" signal has exactly the same characteristics as a data signal through
the same block of logic, AND that logic has the same parasitics as the real
calculation path.
<
It is a shame so few of us did real circuit design down at the transistor
and layout level.
<
>
> paul

Re: More complex instructions to reduce cycle overhead

<c3d0574a-273c-4e52-9e62-a90c85869303n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16813&group=comp.arch#16813

 by: Paul A. Clayton - Sun, 16 May 2021 14:27 UTC

On Saturday, May 15, 2021 at 11:31:30 AM UTC-4, Paul A. Clayton wrote:
[snip]
> Technically, one does not need to use variable-timing asynchronous
> logic to avoid (some) inter-stage buffering. Wave pipelining avoids
> latching intermediate results; however, that depends on one wave
> never catching up to another. (Such would typically still have latching
> but the amount could be reduced.) In the past, this was used (by HP, e.g.)
> for L1 cache access. With dynamic voltage-frequency scaling, such
> becomes even more difficult to implement.

I should have read previous posts before posting: wave pipelining
was already mentioned! Sigh.
