comp.arch / Re: More complex instructions to reduce cycle overhead

Subject -- Author
* Signed division by 2^n -- Thomas Koenig
+* Re: Signed division by 2^n -- Marcus
|`- Re: Signed division by 2^n -- MitchAlsup
+- Re: Signed division by 2^n -- Stephen Fuld
+* Re: Signed division by 2^n -- Anton Ertl
|+* Re: Signed division by 2^n -- MitchAlsup
||`* Re: Signed division by 2^n -- Thomas Koenig
|| `* Re: saturating arithmetic, not Signed division by 2^n -- John Levine
||  +- Re: saturating arithmetic, not Signed division by 2^n -- MitchAlsup
||  +- Re: saturating arithmetic, not Signed division by 2^n -- Brian G. Lucas
||  `* Re: saturating arithmetic, not Signed division by 2^n -- Jeremy Linton
||   +* Re: saturating arithmetic, not Signed division by 2^n -- Stefan Monnier
||   |+* Re: saturating arithmetic, not Signed division by 2^n -- Thomas Koenig
||   ||+- Re: saturating arithmetic, not Signed division by 2^n -- MitchAlsup
||   ||+- Re: saturating arithmetic, not Signed division by 2^n -- Stefan Monnier
||   ||+- Re: saturating arithmetic, not Signed division by 2^n -- David Brown
||   ||`- Re: saturating arithmetic, not Signed division by 2^n -- Anton Ertl
||   |`- Re: saturating arithmetic, not Signed division by 2^n -- Ivan Godard
||   +- Re: saturating arithmetic, not Signed division by 2^n -- EricP
||   `* Re: saturating arithmetic, not Signed division by 2^n -- Anton Ertl
||    +- Re: saturating arithmetic, not Signed division by 2^n -- MitchAlsup
||    `* Re: saturating arithmetic, not Signed division by 2^n -- George Neuner
||     +* Re: saturating arithmetic, not Signed division by 2^n -- Niklas Holsti
||     |`- Re: saturating arithmetic, not Signed division by 2^n -- Bill Findlay
||     +* Re: saturating arithmetic, not Signed division by 2^n -- Bill Findlay
||     |`- Re: saturating arithmetic, not Signed division by 2^n -- Terje Mathisen
||     +* Re: saturating arithmetic, not Signed division by 2^n -- Terje Mathisen
||     |`* Re: saturating arithmetic, not Signed division by 2^n -- Thomas Koenig
||     | `* Re: saturating arithmetic, not Signed division by 2^n -- Terje Mathisen
||     |  +- Re: saturating arithmetic, not Signed division by 2^n -- MitchAlsup
||     |  `* Re: saturating arithmetic, not Signed division by 2^n -- Andreas Eder
||     |   `* Re: saturating arithmetic, not Signed division by 2^n -- Terje Mathisen
||     |    `* Re: saturating arithmetic, not Signed division by 2^n -- Thomas Koenig
||     |     `* Re: saturating arithmetic, not Signed division by 2^n -- Terje Mathisen
||     |      `* Re: saturating arithmetic, not Signed division by 2^n -- Thomas Koenig
||     |       `- Re: saturating arithmetic, not Signed division by 2^n -- Thomas Koenig
||     `- Re: saturating arithmetic, not Signed division by 2^n -- MitchAlsup
|+* Re: Signed division by 2^n -- BGB
||+* Re: Signed division by 2^n -- Ivan Godard
|||+- Re: Signed division by 2^n -- Anton Ertl
|||+- Re: Signed division by 2^n -- Terje Mathisen
|||+- Re: Signed division by 2^n -- MitchAlsup
|||`* Re: Signed division by 2^n -- BGB
||| `* Re: Signed division by 2^n -- MitchAlsup
|||  `* Re: Signed division by 2^n -- BGB
|||   `* Re: Signed division by 2^n -- MitchAlsup
|||    +* More complex instructions to reduce cycle overhead -- Stefan Monnier
|||    |+* Re: More complex instructions to reduce cycle overhead -- Ivan Godard
|||    ||`* Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    || `- Re: More complex instructions to reduce cycle overhead -- Ivan Godard
|||    |+* Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||+- Re: More complex instructions to reduce cycle overhead -- Stefan Monnier
|||    ||`* Re: More complex instructions to reduce cycle overhead -- Ivan Godard
|||    || `* Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||  `* Re: More complex instructions to reduce cycle overhead -- Ivan Godard
|||    ||   `* Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||    `* Re: More complex instructions to reduce cycle overhead -- Ivan Godard
|||    ||     +* Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     |+* Re: More complex instructions to reduce cycle overhead -- Thomas Koenig
|||    ||     ||+* Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     |||+* Re: More complex instructions to reduce cycle overhead -- Thomas Koenig
|||    ||     ||||`* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     |||| `* Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     ||||  +* Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||     ||||  |+* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||  ||`* Re: More complex instructions to reduce cycle overhead -- Marcus
|||    ||     ||||  || `- Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||  |`- Re: More complex instructions to reduce cycle overhead -- JimBrakefield
|||    ||     ||||  `* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||   +* Re: More complex instructions to reduce cycle overhead -- Marcus
|||    ||     ||||   |`* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||   | `* Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     ||||   |  `* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||   |   `- Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     ||||   `* Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     ||||    `* Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||     ||||     `* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||      +* Re: More complex instructions to reduce cycle overhead -- EricP
|||    ||     ||||      |`* Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     ||||      | +- Timing... (Re: More complex instructions to reduce cycle overhead) -- BGB
|||    ||     ||||      | `* Re: Timing... (Re: More complex instructions to reduce cycle overhead) -- JimBrakefield
|||    ||     ||||      |  `- Re: Timing... (Re: More complex instructions to reduce cycle -- BGB
|||    ||     ||||      `* Re: More complex instructions to reduce cycle overhead -- Marcus
|||    ||     ||||       `- Re: More complex instructions to reduce cycle overhead -- BGB
|||    ||     |||`* Re: More complex instructions to reduce cycle overhead -- paul wallich
|||    ||     ||| `- Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||     ||`- Re: More complex instructions to reduce cycle overhead -- Stefan Monnier
|||    ||     |`- Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    ||     +* Re: More complex instructions to reduce cycle overhead -- Paul A. Clayton
|||    ||     |`- Re: More complex instructions to reduce cycle overhead -- Paul A. Clayton
|||    ||     `- Re: More complex instructions to reduce cycle overhead -- MitchAlsup
|||    |`* Re: More complex instructions to reduce cycle overhead -- Anton Ertl
|||    | `- Re: More complex instructions to reduce cycle overhead -- Terje Mathisen
|||    `* Re: Signed division by 2^n -- BGB
|||     `* Re: Signed division by 2^n -- MitchAlsup
|||      `- Re: Signed division by 2^n -- BGB
||`- Re: Signed division by 2^n -- Thomas Koenig
|`* Re: Signed division by 2^n -- aph
| `- Re: Signed division by 2^n -- Anton Ertl
`- Re: Signed division by 2^n -- Ivan Godard

Re: More complex instructions to reduce cycle overhead

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Date: Mon, 17 May 2021 13:29:10 -0500
Message-ID: <s7uclo$23r$1@dont-email.me>

On 5/15/2021 10:02 AM, Thomas Koenig wrote:
> EricP <ThatWouldBeTelling@thevillage.com> schrieb:
>> Thomas Koenig wrote:
>
>>> That sounds scary - in effect, the synchronization between the different
>>> bits in, let's say, an adder would be implied by the gate timing?
>>>
>>> You would need very narrow tolerances on your gates, then (both
>>> too fast and too slow would be deadly).
>>>
>>> Or is some other mechanism proposed?
>>
>> They eliminate intermediate pipeline stage registers,
>> then tools insert buffers so that all pathways through the combo logic
>> have the same propagation delay ensuring all output signals arrive at
>> the same instant.
>
> I'm an engineer, and I know full well that, in anything we build,
> there can not be such a thing as the same _anything_ ...
>
> Rather, they must be counting on the inevitable dispersion to
> be small enough that they can still catch it after a presumably
> small number of cycles.
>

I have made the observation that data forwarded between clock domains in
an FPGA is not reliable, even if "most of the time" it gets through
intact...

In my recent battles with stability issues, I was seeing crashes whose
debug prints showed the "expected value" and "got value" differing by a
single flipped bit, ...

Ended up adding integrity checking, and filtering to reject messages
(and reuse the prior values) if the check values didn't match after the
clock-domain crossing (generally XOR'ing everything together and
checking the result for equality).
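
As a rough illustration of that filtering (a minimal sketch with
made-up names, not the actual core's code; it assumes the message is
three words wide and the sender XOR-folds them into a check word):

  module cdc_xor_filter #(parameter W = 16) (
      input              newclock,
      input  [W-1:0]     msgA, msgB, msgC,   // message words, after the synchronizer flops
      input  [W-1:0]     msgCheck,           // XOR-fold of the words, computed sender-side
      output reg [W-1:0] recvA, recvB, recvC // last message that passed the check
  );
      // Recompute the fold on the receiving side; a mismatch means at
      // least one bit was mangled in the crossing.
      wire ok = ((msgA ^ msgB ^ msgC) == msgCheck);

      always @(posedge newclock)
          if (ok) begin       // accept only a self-consistent snapshot
              recvA <= msgA;
              recvB <= msgB;
              recvC <= msgC;
          end                 // on mismatch: hold the prior values
  endmodule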

Stuff online generally implied:
  always @(posedge oldclock)
  begin
    tempValue <= sendValue;
  end
  always @(posedge newclock)
  begin
    recvValue <= tempValue;
  end

But, it wasn't nearly so easy outside of simulation...

Ended up having to do something like:
  always @(posedge oldclock)
  begin
    tempValue1 <= sendValue;
    tempValue2 <= tempValue1;
  end

  always @(posedge newclock)
  begin
    tempValue3 <= tempValue2;
    tempValue4 <= tempValue3;
    ...
    recvValue <= tempValueN;
  end

Multiple forwarding stages tend to improve stability; at a 50<->75 MHz
interface, I generally needed 2 stages on the 'old' side and 6 on the
'new' side. Similar for 50<->150 MHz.

Even then, reliability was still an issue (it would break
intermittently; even 50<->100 interfacing was still unreliable, if not
quite as bad). Similar seems to apply to 75<->150.
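
For reference, that structure in parameterized form (a sketch with
illustrative names; the stage counts here are just defaults, nothing
from the actual core):

  module multi_stage_sync #(parameter W = 32, N_OLD = 2, N_NEW = 6) (
      input          oldclock,
      input          newclock,
      input  [W-1:0] sendValue,
      output [W-1:0] recvValue
  );
      // N_OLD forwarding flops in the sending domain...
      reg [W-1:0] oldStage [0:N_OLD-1];
      // ...followed by N_NEW flops in the receiving domain.
      reg [W-1:0] newStage [0:N_NEW-1];
      integer i;

      always @(posedge oldclock) begin
          oldStage[0] <= sendValue;
          for (i = 1; i < N_OLD; i = i + 1)
              oldStage[i] <= oldStage[i-1];
      end

      always @(posedge newclock) begin
          newStage[0] <= oldStage[N_OLD-1];
          for (i = 1; i < N_NEW; i = i + 1)
              newStage[i] <= newStage[i-1];
      end

      assign recvValue = newStage[N_NEW-1];
  endmodule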

Note that using both 100 and 150 MHz on the FPGA seems to be a problem,
as the synthesis then goes and forces them into different clock-tiles
(limiting the usable space in the FPGA); whereas other (sub 100 MHz)
frequencies seem able to coexist within the same clock tiles.

No mention was made, though, that "yeah, bits might occasionally get
randomly flipped here", or that one might need to do some form of
error-checking across the clock-domain crossing. Though, I guess it
does make sense in retrospect.

Then again, I probably shouldn't be too surprised, as "forward a bunch
of times to make data pass between two different clocks" does seem a
little questionable...

May test, though, whether these integrity checks allow reducing the
number of forwarding stages used (since this does visibly affect RAM
speed).

That, and also observing that the ability to reliably read data from the
SDcard is dependent on how well the SDcard extension cable is seated,
which is often "not great" as it doesn't really like to stay in place.

Like a lot of "stuff isn't working correctly" episodes: unhook the
SDcard cable, reinsert it, reset the FPGA, things work now...

This part is made annoying as neither the SDcard interface nor FAT32
offer any kind of data-integrity checking (and a checksum error from
loading a binary could be either due to SDcard or due to memory issues).

Though, either way, I am using checksum checks for binaries, because
formerly binaries failing to load correctly were also a source of issues.

An ability to detect bad data reads would allow retrying though.

But, yeah, this can get kinda annoying.

Luckily, at least the Block-RAM seems to be reliable; apparently the
Artix-7 and Spartan-7 use ECC'ed BRAM (and presumably Vivado is using
said ECC... but I can't find anything anywhere to confirm this).

>> That allows them to change inputs for a new wave
>> before prior results have finished.
>>
>> The computation is proceeding through the combo circuit as a wavelet.
>> I imagine it is extremely susceptible to process variation,
>> maybe temperature, which would widen or narrow the wavelet and skew
>> its relative time position. Different paths may change differently.
>>
>> The result along all paths must not be sampled too soon or too late,
>> so one issue would be getting the clock to arrive at just the right
>> time when the whole wavelet is valid, for all variations.
>>
>> Then if there are multiple wave pipelines for different calculations,
>> there are the meta-stability issues to deal with when they interact.
>
>> Sounds like a pain.
>
> A royal pain, indeed...
>

Re: More complex instructions to reduce cycle overhead

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Date: Mon, 17 May 2021 16:15:30 -0400
Message-ID: <2aAoI.606122$%W6.592987@fx44.iad>

BGB wrote:
> [...]
>
> Ended up having to do something like:
>   always @(posedge oldclock)
>   begin
>     tempValue1 <= sendValue;
>     tempValue2 <= tempValue1;
>   end
>
>   always @(posedge newclock)
>   begin
>     tempValue3 <= tempValue2;
>     tempValue4 <= tempValue3;
>     ...
>     recvValue <= tempValueN;
>   end

This looks like a metastability synchronizer but with more stages.
https://en.wikipedia.org/wiki/Metastability_(electronics)

> Where multiple forwarding stages tends to improve stability, at a
> 50<->75 MHz interface, generally needed 2 stages on the 'old' side, and
> 6 on the 'new' side. Similar for 50<->150 MHz.
>
> Even then, reliability was still an issue (and it would break
> intermittently, and even 50<->100 interfacing was still unreliable, if
> not quite as bad). Similar seems to apply to 75<->150.

>
> Note that using both 100 and 150 MHz on the FPGA seems to be a problem,
> as the synthesis then goes and forces them into different clock-tiles
> (limiting the usable space in the FPGA); whereas other (sub 100 MHz)
> frequencies seem able to coexist within the same clock tiles.

But a metastability synchronizer shouldn't be necessary as
all the clocks are multiples of 25 MHz and presumably derived
from the same source.

Maybe it is due to clock skew between the domains.


Re: More complex instructions to reduce cycle overhead

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Date: Mon, 17 May 2021 14:29:33 -0700 (PDT)
Message-ID: <3ea71dea-28e2-47a2-9073-d49cfe92cde4n@googlegroups.com>

On Monday, May 17, 2021 at 3:16:03 PM UTC-5, EricP wrote:
> [...]
>
> But a metastability synchronizer shouldn't be necessary as
> all the clocks are multiples of 25 MHz and presumably derived
> from the same source.
>
> Maybe it is due to clock skew between the domains.
<
That much is obvious, the question is:: is that clock skew variable with
load ? That is: when you do a 3-instruction bundle, does the clock skew
change from when you run only 1 instruction ???

AND why do you have clock domains in an FPGA ?

Re: More complex instructions to reduce cycle overhead

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Date: Mon, 17 May 2021 17:08:13 -0500
Message-ID: <s7upgf$v6r$1@dont-email.me>

On 5/17/2021 3:15 PM, EricP wrote:
> BGB wrote:
>> [...]
>
> This looks like a metastability synchronizer but with more stages.
> https://en.wikipedia.org/wiki/Metastability_(electronics)
>

I need a certain minimum number of stages, otherwise it all turns to
garbage.

Granted, this doesn't happen in simulation, but seems to be a bigger
issue on the FPGA.

I am less sure why the random bit-flipping seems to be a thing, but
doing what is effectively a glorified parity check seems to help here
(though, multi-bit flipping could still escape detection).

As noted, the "parity check and optionally reject" strategy does seem to
allow a "50<->150MHz" interface to work with a 1+4 stage synchronizer,
which is an improvement over a 2+7 stage approach.

>> [...]
>
> But a metastability synchronizer shouldn't be necessary as
> all the clocks are multiples of 25 MHz and presumably derived
> from the same source.
>
> Maybe it is due to clock skew between the domains.
>

Dunno.

The board itself has a 100MHz input clock, from which I derive all of
the other clocks using a PLL.

I followed what examples I could find online, which basically involved
using a PLL and feeding all the clock outputs through buffers.

There were actually fewer issues earlier on, before I used the PLL,
when the 50MHz signal was generated by dividing the 100MHz signal in
half using something like:
  always @(posedge clock_100)
    clock_50 <= !clock_50;

In this case, it was more or less possible to semi-directly pass signals
between 50 and 100MHz.

As noted:
  50<->100, Works reasonably well.
  50<->150, Much less reliable, but works.
  50<->75,  Works OK.
  75<->100, Doesn't work much at all.
  75<->150, Works reasonably well.

I didn't have much luck with other speeds, like 66 or 133 MHz. Trying
to pass a signal with one of these other clock-speeds, even directly
between flipflops, results in failed timing.


Re: More complex instructions to reduce cycle overhead

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Date: Mon, 17 May 2021 17:38:31 -0500
Message-ID: <s7ur99$9qd$1@dont-email.me>

On 5/17/2021 4:29 PM, MitchAlsup wrote:
> [...]
>
> That much is obvious, the question is:: is that clock skew variable with
> load ? That is: when you do a 3-instruction bundle, does the clock skew
> change from when you run only 1 instruction ???
>
> AND why do you have clock domains in an FPGA ?

In this case, pretty much all of the CPU core itself runs at a single
speed (e.g., 50 or 75MHz).

MMIO can also run at a different speed; ATM I am keeping it at 50.

The DDR controller internally operates at 100 or 150 MHz, or 2x the RAM
frequency, and drives internal logic on both the rising and falling edges.

I had designed another DDR controller (DDR-B) that runs at 1:1 speeds,
but wasn't able to get it to work on actual hardware (did work in
simulation though).

Note that DDR-B does both the rising and falling edge logic in parallel
in the main part, and then uses posedge+negedge logic to drive the pins
at a faster speed.
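
For illustration, driving a pin on both edges in plain Verilog usually
looks something like the sketch below (made-up names; on Xilinx parts
one would normally instantiate the vendor's ODDR primitive instead):

  module ddr_out (
      input  clk,
      input  d_rise,   // data for the high half of the clock period
      input  d_fall,   // data for the low half
      output q         // pin toggling at 2x the data rate
  );
      reg r_rise, r_fall;
      always @(posedge clk) r_rise <= d_rise;
      always @(negedge clk) r_fall <= d_fall;
      // Mux on the clock level so each half-period presents its own data.
      assign q = clk ? r_rise : r_fall;
  endmodule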

However, either way, it is necessary to step between the faster clock
speeds used internally to the DDR controller, and the slower clock speed
used by the CPU core and ringbus.

It is also (theoretically) possible to run the DDR controller at 50MHz
internally, with the DDR itself running at 25...

However, ATM, this reduces DRAM bandwidth to around 3 MB/sec. In
theory, it should be a bit faster even at these speeds (RAS and CAS
overheads probably shouldn't be quite this high).

Note that the DDR controller itself still uses the old bus because,
unlike the ringbus, the old bus can deal with the two ends operating at
different clock speeds (with the ringbus, everything on the bus needs
to operate at a single clock speed).

The main MMIO bus also still uses the old bus, but more in this case
because I would effectively need a non-trivial redesign to use a ringbus
for this part.

Re: More complex instructions to reduce cycle overhead

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Date: Tue, 18 May 2021 20:28:53 +0200
Message-ID: <s81116$7p5$1@dont-email.me>

On 2021-05-18, BGB wrote:
> [...]
>
> In this case, pretty much all of the CPU core itself runs at a single
> speed (e.g., 50 or 75MHz).
>
> MMIO can also run at a different speed; ATM I am keeping it at 50.
>
> The DDR controller internally operates at 100 or 150 MHz, or 2x the RAM
> frequency, and drives internal logic on both the rising and falling edges.
>

I decided to make things simple, so all memory interfaces that the CPU
talks to run at CPU speed (typically 60-120 MHz). This means, for
instance, that the SDRAM controller runs at whatever speed the CPU is
configured to run at.

However, in my FPGA computer I have another clock domain for the video
logic, which needs to be synchronized to the pixel clock (148.5 MHz for
1920x1080@60Hz). Almost all communication between the CPU and the video
logic is one-directional and happens via the dual-ported VRAM (CPU r/w,
video ro). I only have one signal going back from the video logic to the
CPU-clocked MMIO registers - via a clock domain crossing - and that is
the current video raster line counter (used for things like vsync).


Re: More complex instructions to reduce cycle overhead

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Date: Tue, 18 May 2021 20:44:26 +0200
Message-ID: <s811ub$ebh$1@dont-email.me>

On 2021-05-18, BGB wrote:
> On 5/17/2021 3:15 PM, EricP wrote:
>> [...]
>> This looks like a metastability synchronizer but with more stages.
>> https://en.wikipedia.org/wiki/Metastability_(electronics)
>>
>
> I need a certain minimum number of stages, otherwise it all turns to
> garbage.
>
> Granted, this doesn't happen in simulation, but seems to be a bigger
> issue on the FPGA.
>
>
> I am less sure why the random bit-flipping seems to be a thing, but
> doing what is effectively a glorified parity check seems to help here
> (though, multi-bit flipping could still escape detection).

Isn't it a question of uncertainty in when the different bits arrive at
the synchronizer flops? Each bit can potentially travel along a
different route, and at least for me, setting up the timing constraints
for the tools to do "the right thing" has always felt like black magic.

I use a simple trick that works well for low-frequency signals (i.e.
spanning several cycles in the target domain), where a jitter of +/-
one or two cycles in the target domain is acceptable: wait for all bits
of the word to be stable for at least N cycles in the target domain.

My implementation looks like this:

https://github.com/mrisc32/mc1/blob/master/src/rtl/synchronizer.vhd

This is actually my first CDC design, so it is probably not optimal,
but it seems to work well for my use cases. However, if you need
higher-frequency signals, this method will obviously not work.
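
The linked VHDL file is the actual implementation; a rough Verilog
rendering of the same idea (illustrative names, simplified) might look
like:

  module stable_word_sync #(parameter W = 16, STABLE_CYCLES = 3) (
      input              clk,       // target-domain clock
      input  [W-1:0]     async_in,  // word coming from the other clock domain
      output reg [W-1:0] q          // published copy, updated only when stable
  );
      reg [W-1:0] sample, prev;
      reg [7:0]   count;            // wide enough for any reasonable STABLE_CYCLES

      always @(posedge clk) begin
          sample <= async_in;       // ordinary capture flop
          prev   <= sample;
          if (sample == prev) begin
              if (count == STABLE_CYCLES)
                  q <= sample;      // unchanged for N cycles: publish it
              else
                  count <= count + 1'b1;
          end else
              count <= 0;           // word changed: restart the stability wait
      end
  endmodule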

>
> As noted, the "parity check and optionally reject" strategy does seem to
> allow a "50<->150MHz" interface to work with a 1+4 stage synchronizer,
> which is an improvement over a 2+7 stage approach.
>
>
>
>>> Where multiple forwarding stages tends to improve stability, at a
>>> 50<->75 MHz interface, generally needed 2 stages on the 'old' side,
>>> and 6 on the 'new' side. Similar for 50<->150 MHz.
>>>
>>> Even then, reliability was still an issue (and it would break
>>> intermittently, and even 50<->100 interfacing was still unreliable,
>>> if not quite as bad). Similar seems to apply to 75<->150.
>>
>>>
>>> Note that using both 100 and 150 MHz on the FPGA seems to be a
>>> problem, as the synthesis then goes and forces them into different
>>> clock-tiles (limiting the usable space in the FPGA); whereas other
>>> (sub 100 MHz) frequencies seem able to coexist within the same clock
>>> tiles.
>>
>> But a metastability synchronizer shouldn't be necessary as
>> all the clocks are multiples of 25 MHz and presumably derived
>> from the same source.
>>
>> Maybe it is due to clock skew between the domains.
>>
>
> Dunno.
>
> The board itself has a 100MHz input clock, which I derive all of the
> other clocks from using a PLL.
>
> I followed what examples I could find online, which basically involved
> using a PLL and feeding all the clock outputs through buffers.
>
>
> There were actually fewer issues earlier on before I used the PLL, where
> the 50MHz signal was generated by dividing the 100MHz signal in half
> using something like:
>   always @(posedge clock_100)
>     clock_50 <= !clock_50;
>
> In this case, it was more or less possible to semi-directly pass signals
> between 50 and 100MHz.
>
> As noted:
>   50<->100, Works reasonably well.
>   50<->150, Much less reliable, but works.
>   50<->75, Works OK.
>   75<->100, Doesn't work much at all.
>   75<->150, Works reasonably well.
>
>
> I didn't have much luck with other speeds, like 66 or 133 MHz.
> Trying to pass a signal with one of those clock-speeds, even directly
> between flip-flops, results in failed timing.
>
>
>
>
>>> No mention was made though that, "yeah, bits might occasionally get
>>> randomly flipped here", or that one might need to do some form of
>>> error-checking across the clock-domain crossing. Though, I guess it
>>> does make sense in retrospect.
>>>
>>>
>>> Then again, probably shouldn't be too surprised, as "forward a bunch
>>> of times to make data pass between two different clocks" does seem a
>>> little questionable...
>>>
>>> May test though whether these integrity checks allow reducing the
>>> number of forwarding stages used (since this does visibly affect RAM
>>> speed).
>>>
>>>
>>>
>>> That, and also observing that the ability to reliably read data from
>>> the SDcard is dependent on how well the SDcard extension cable is
>>> seated, which is often "not great" as it doesn't really like to stay
>>> in place.
>>>
>>> Like a lot of "stuff isn't working correctly", unhooks SDcard cable,
>>> reinserts it, resets FPGA, things work now...
>>>
>>> This part is made annoying as neither the SDcard interface nor FAT32
>>> offer any kind of data-integrity checking (and a checksum error from
>>> loading a binary could be either due to SDcard or due to memory issues).
>>>
>>> Though, either way, I am using checksum checks for binaries, because
>>> formerly binaries failing to load correctly were also a source of
>>> issues.
>>>
>>> An ability to detect bad data reads would allow retrying though.
>>>
>>>
>>> But, yeah, this can get kinda annoying.
>>>
>>>
>>> Luckily, at least the Block-RAM seems to be reliable, but apparently
>>> the Artix-7 and Spartan-7 use ECC'ed BRAM (and, presumably Vivado is
>>> using said ECC... But, can't find anything anywhere to confirm this).
>


Re: More complex instructions to reduce cycle overhead

<uZToI.146185$lyv9.30173@fx35.iad>


https://www.novabbs.com/devel/article-flat.php?id=16862&group=comp.arch#16862

From: ThatWoul...@thevillage.com (EricP)
 by: EricP - Tue, 18 May 2021 18:46 UTC

BGB wrote:
> On 5/17/2021 3:15 PM, EricP wrote:
>> BGB wrote:
>>>
>>>
>>> Stuff online generally implied:
>>> always @(posedge oldclock)
>>> begin
>>> tempValue <= sendValue;
>>> end
>>> always @(posedge newclock)
>>> begin
>>> recvValue <= tempValue;
>>> end
>>>
>>> But, it wasn't nearly so easy outside of simulation...
>>>
>>>
>>> Ended up having to do something like:
>>> always @(posedge oldclock)
>>> begin
>>> tempValue1 <= sendValue;
>>> tempValue2 <= tempValue1;
>>> end
>>>
>>> always @(posedge newclock)
>>> begin
>>> tempValue3 <= tempValue2;
>>> tempValue4 <= tempValue3;
>>> ...
>>> recvValue <= tempValueN;
>>> end
>>
>> This looks like a metastability synchronizer but with more stages.
>> https://en.wikipedia.org/wiki/Metastability_(electronics)
>>
>
> I need a certain minimum number of stages, otherwise it all turns to
> garbage.
>
> Granted, this doesn't happen in simulation, but seems to be a bigger
> issue on the FPGA.

I was thinking that if, as you said, this is all driven from a common
clock, and if it is just skew as Mitch said, then, provided the skew
and jitter are small enough, you might change the above logic to set
the send buffer on the rising clock and fill the receive buffer on the
falling clock. That gives 1/2 cycle of the faster clock of leeway for
the edges to move about relative to each other, and eliminates the
extra stages.

always @(posedge sendClock)
begin
    tempValue1 <= sendValue;
end

always @(negedge recvClock)
begin
    recvValue <= tempValue1;
end

plus some buffer full/empty handshake logic.

So the faster clock is 150 MHz, 6.66 ns, and 1/2 cycle 3.33 ns.
The send and receive rising clock edges would have to differ by
more than 3.33 ns minus setup and hold times and clk-to-Q-out time,
to cause the above to mis-sample.
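
The handshake itself could be sketched as a toggle-flag pair, something
like the below (names invented; in a general CDC the req/ack flags
would themselves need synchronizing, the premise here being the common
low-skew clock):

reg req, ack;    // equal = buffer empty, different = buffer full

always @(posedge sendClock)
begin
    if (haveData && (req == ack)) begin   // buffer empty
        tempValue1 <= sendValue;
        req <= !req;                      // mark buffer full
    end
end

always @(negedge recvClock)
begin
    if (req != ack) begin                 // buffer full
        recvValue <= tempValue1;
        ack <= !ack;                      // mark buffer empty
    end
end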

Re: More complex instructions to reduce cycle overhead

<98f17a50-83b9-4eb9-bdc0-6f3f6787e7c7n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=16869&group=comp.arch#16869

From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 18 May 2021 20:19 UTC

On Tuesday, May 18, 2021 at 1:47:57 PM UTC-5, EricP wrote:
> BGB wrote:
> > On 5/17/2021 3:15 PM, EricP wrote:
> >> BGB wrote:
> >>> [...]
> >>
> >> This looks like a metastability synchronizer but with more stages.
> >> https://en.wikipedia.org/wiki/Metastability_(electronics)
> >>
> >
> > I need a certain minimum number of stages, otherwise it all turns to
> > garbage.
> >
> > Granted, this doesn't happen in simulation, but seems to be a bigger
> > issue on the FPGA.
<
> I was thinking that if as you said this all driven from a common clock,
> and if it is just skew as Mitch said, then if the skew and jitter are
> small enough, you might change the above logic to set the send buffer
> on the rising clock, and fill the receive buffer on the falling clock.
<
When chips started to get "fast" (say above 2 GHz) we started to lay out
the whole clock buffer tree with a different Vdd/Gnd routing scheme
as one way to better control clock skew. Vdd/Gnd were routed away
from the logic Vdd/Gnd that the logic gates "saw". The length of the
clock tree in K9 was a bit longer than a clock cycle! And the buffers
produced a rather constant amount of noise on their Vdd/Gnd route, while
logic had a big moment right after the rising clock edge which then
tended downward as the cycle progressed.
<
> That gives 1/2 cycle of the faster clock of leeway for the edges to
> move about relative to each other, and eliminates the extra stages.
<
The FPGA will not have access to this kind of low skew engineering.
>
> always @(posedge sendClock)
> begin
> tempValue1 <= sendValue;
> end
>
> always @(negedge recvClock)
> begin
> recvValue <= tempValue1;
> end
>
> plus some buffer full/empty handshake logic.
>
> So the faster clock is 150 MHz, 6.66 ns, and 1/2 cycle 3.33 ns.
> The send and receive rising clock edges would have to differ by
> more than 3.33 ns minus setup and hold times and clk-to-Q-out time,
> to cause the above to mis-sample.

Re: More complex instructions to reduce cycle overhead

<s81bkf$l4s$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16872&group=comp.arch#16872

From: cr88...@gmail.com (BGB)
 by: BGB - Tue, 18 May 2021 21:29 UTC

On 5/18/2021 1:28 PM, Marcus wrote:
> On 2021-05-18, BGB wrote:
>> On 5/17/2021 4:29 PM, MitchAlsup wrote:
>>> On Monday, May 17, 2021 at 3:16:03 PM UTC-5, EricP wrote:
>>>> [...]
>>> <
>>> That much is obvious, the question is:: is that clock skew variable with
>>> load ? That is: when your do a 3-instruction bundle does the clock skew
>>> change from that when you run only 1 instruction ???
>>>
>>> AND why do you have clock domains in an FPGA ?
>>>
>>
>> In this case, pretty much all of the CPU core itself runs at a single
>> speed (eg: 50 or 75MHz).
>>
>> MMIO can also run at a different speed, ATM I am keeping it at 50.
>>
>>
>>
>> The DDR controller internally operates at 100 or 150 MHz, or 2x the
>> RAM frequency, and drives internal logic on both the rising and
>> falling edges.
>>
>
> I decided to make things simple so all memory interfaces that the CPU
> talks to run at CPU speed (typically 60-120 MHz). This means, for
> instance, that the SDRAM controller runs at whatever speed the CPU is
> configured to run at.
>

With DDR there is a need for logic on both the rising and falling edges
of the clock pulse. As a result, absent extra trickery, one needs to
drive the internal logic at 2x the frequency the RAM is running at.
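
As a rough sketch of the 2x arrangement (names invented; a real
controller also has to deal with DQS, tristating, and the like):

// clk2x runs at 2x the RAM clock, so both RAM clock edges land on
// rising edges of clk2x and everything stays posedge-only internally.
reg        phase;      // which half of the RAM clock cycle we are in
reg [15:0] dq_out;     // data value for the current RAM clock edge

always @(posedge clk2x)
begin
    phase  <= !phase;
    dq_out <= phase ? data_fall : data_rise;
end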

The alternative (1:1 operation) requires running logic on both the
rising and falling edges of clock pulses, which adds its own issues.
I did experiment with a design like this, but wasn't able to make this
RAM controller work on the actual FPGA.

SDR SDRAM would have been a little easier here.

If I could drive the RAM at 150MHz, this is fast enough that I should be
able to run it with DLL enabled, and thus within-spec for the chip.

As-is, I am running it with DLL disabled, and below its minimum official
operating frequency. From what I can tell though, DLL-disabled operation
is mostly intended for devices in a low-power standby mode.

I didn't mess as much with it in the past, as previously L1<->L2 was the
main bottleneck with the old bus.

After getting the newer/faster bus working, now it appears that the
speeds I am seeing from the DRAM via the L2 cache are ~ 80% those coming
through the raw DRAM interface.

Some testing was showing (DDR = 75MHz):
direct forwarding: was getting ~ 34 cycles per DDR access (*1);
1+4 synchronizer: was getting ~ 50 cycles per DDR access;
2+7 synchronizer: was getting ~ 66 cycles per DDR access.

Where, also (DDR = 50MHz):
direct forwarding: was getting ~ 48 cycles per access;
1+4 synchronizer: was getting ~ 64 cycles per access;
2+7 synchronizer: was getting ~ 80 cycles per access.

And, if I were to run the DDR at 25MHz:
direct forwarding: 68 cycles per access.

*1: This only works when the RAM controller and L2 run at the same
clock-speed, eg, originally both ran at 100MHz.
The L2 was later dropped to 50MHz as it was faster to have a
synchronizer at L2<->DDR than at L1<->L2 (in this case, the L2 speed was
almost as slow as the DDR speed).

Was able to add some "speculative fast teardown" logic to the 1+4
synchronizer and get the latency down to ~ 40 cycles per DDR access
(with the error-checking now also allowing me to use 1+4 synchronization
with 75MHz RAM, as opposed to 2+7).

This now means 40-cycle RAM rather than 64 or 66 cycles.

This logic partly gains some speed by behaving as if the RAM request had
already torn down, and makes the assumption that the last request will
finish the teardown process before the next request arrives at the
internal state-machine (a sane assumption, since teardown is a simple
state-machine transition, and this state machine is running at ~ 2x-3x
the clock-speed of the front-end bus interface).

It will hold the 'OK' status in the 'READY' state until a 'READY' is
seen (from the DDR controller), since the time for the L2 cache to
respond to seeing a ready signal and send the next request is less than
the time it takes for the 'READY' signal from the last request to cross
the synchronizer.
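
As a very loose sketch of that kind of state-holding (all names
invented, not the actual module):

// Hold OK once READY is seen from the DDR controller; drop it when the
// next request is accepted, assuming teardown completes in between.
always @(posedge busClock)
begin
    if (ddr_ready_sync)        // READY crossed over from the DDR side
        stat_ok <= 1;
    else if (next_req_seen)    // next request now in flight
        stat_ok <= 0;
end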


Re: More complex instructions to reduce cycle overhead

<s81j26$gb$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16874&group=comp.arch#16874

From: cr88...@gmail.com (BGB)
 by: BGB - Tue, 18 May 2021 23:36 UTC

On 5/18/2021 3:19 PM, MitchAlsup wrote:
> On Tuesday, May 18, 2021 at 1:47:57 PM UTC-5, EricP wrote:
>> BGB wrote:
>>> On 5/17/2021 3:15 PM, EricP wrote:
>>>> BGB wrote:
>>>>> [...]
>>>>
>>>> This looks like a metastability synchronizer but with more stages.
>>>> https://en.wikipedia.org/wiki/Metastability_(electronics)
>>>>
>>>
>>> I need a certain minimum number of stages, otherwise it all turns to
>>> garbage.
>>>
>>> Granted, this doesn't happen in simulation, but seems to be a bigger
>>> issue on the FPGA.
> <
>> I was thinking that if as you said this all driven from a common clock,
>> and if it is just skew as Mitch said, then if the skew and jitter are
>> small enough, you might change the above logic to set the send buffer
>> on the rising clock, and fill the receive buffer on the falling clock.
> <
> When chips started to get "fast" (say above 2 GHz) we started to layout
> the whole clock buffer tree with a different Vdd/Gnd routing scheme
> as one way to better control clock skew. Vdd/Gnd were routed away
> from the logic Vdd/Gnd that the logic gate "saw". The length of the
> clock tree in K9 was a bit longer than a clock ! and the buffers produced
> a rather constant amount of noise on their Vdd/GNd route while logic
> had a big moment right after the rising clock edge which then tended
> downward as the cycle progressed.
> <

In theory, the fastest signals which can exist with this FPGA are 450MHz
/ 2.2ns.

Though, IME, about all it can really do at these speeds is:
reg2 <= reg1;

The Kintex and Virtex can apparently go a lot faster than the Artix and
Spartan in this regard though.

Logic that runs at 150 or 200 MHz is fairly painful in terms of timing
and what it can do.

And, 75 or 100 MHz is a little easier, but still not super easy.

At 50MHz one can do a lot more stuff...
This seems like it is probably the "sweet spot" for the FPGAs I am using.

At 25 or 33 MHz, it is possible that one could mostly ignore timing
constraints (as one writes code that casually does arithmetic on 200 bit
numbers or similar, ...).

>> That gives 1/2 cycle of the faster clock of leeway for the edges to
>> move about relative to each other, and eliminates the extra stages.
> <
> The FPGA will not have access to this kind of low skew engineering.

It is possible, though, from what I can gather:

Below ~ 100MHz, the FPGA seems able to use generic buffers for clock
signaling. So, clocks below 100MHz can be used more freely.

Above 100MHz, it seems to use clock tiles and global clock buffers, and
seems to partition logic between clock tiles (say, putting 100MHz logic
in one tile, and 150MHz logic in another). In this case, there are 8
such clock tiles in the XC7A100.

Apparently the FPGA doesn't actually support negedge clocking per-se, so
when one tries to use negedge, it either uses an inverted version of the
posedge signal, or otherwise fakes it by behaving "as-if" it had used a
negedge (while still actually using posedge signaling internally).

Not entirely sure how this interacts with the clock tiles, ...

Apparently for external IO pins, it is able to drive them for IO on both
the rising and falling edges, but from what I can gather it works by running
the two cases in parallel (off the same clock), with the external IO pin
having separate sets of input and output signals internally for both the
rising and falling edges (and joining the outputs together for SDR
signaling).

Apparently, this also can be used to drive external IO pins at roughly
2x the internal clock-frequency (so, say, 150MHz logic driving a 300MHz
IO signal by using the DDR capabilities as 2x SDR).
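
On the 7-series, this capability is exposed as the ODDR output
primitive; instantiated directly it looks roughly like this (the net
names here are made up):

ODDR #(
    .DDR_CLK_EDGE("SAME_EDGE"),   // capture D1/D2 on the same rising edge
    .INIT(1'b0),
    .SRTYPE("SYNC")
) io_ddr_out (
    .Q (pad_out),      // to the IO pad
    .C (clk150),
    .CE(1'b1),
    .D1(bit_rise),     // driven out on the rising edge of C
    .D2(bit_fall),     // driven out on the falling edge of C
    .R (1'b0),
    .S (1'b0)
);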

Some amount of other stuff generally recommends not using negedge at all
if it can be avoided.

>>
>> always @(posedge sendClock)
>> begin
>> tempValue1 <= sendValue;
>> end
>>
>> always @(negedge recvClock)
>> begin
>> recvValue <= tempValue1;
>> end
>>
>> plus some buffer full/empty handshake logic.
>>
>> So the faster clock is 150 MHz, 6.66 ns, and 1/2 cycle 3.33 ns.
>> The send and receive rising clock edges would have to differ by
>> more than 3.33 ns minus setup and hold times and clk-to-Q-out time,
>> to cause the above to mis-sample.

It is possible, though I'm uncertain how well it would work in this case.

Re: More complex instructions to reduce cycle overhead

<s81l2o$agu$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16876&group=comp.arch#16876

From: cr88...@gmail.com (BGB)
 by: BGB - Wed, 19 May 2021 00:11 UTC

On 5/18/2021 1:44 PM, Marcus wrote:
> On 2021-05-18, BGB wrote:
>> On 5/17/2021 3:15 PM, EricP wrote:
>>> BGB wrote:
>>>> [...]
>>
>> I need a certain minimum number of stages, otherwise it all turns to
>> garbage.
>>
>> Granted, this doesn't happen in simulation, but seems to be a bigger
>> issue on the FPGA.
>>
>>
>> I am less sure why the random bit-flipping seems to be a thing, but
>> doing what is effectively a glorified parity check seems to help here
>> (though, multi-bit flipping could still escape detection).
>
> Isn't it a question of uncertainty in when the different bits arrive
> to the synchronizer flops? Each bit can potentially travel along a
> different routing, and at least for me setting up the timing constraints
> for the tools to do "the right thing" has always felt like black magic.
>

One would think that (and it would make sense), but some of my efforts
implied that there is some other factor at play (not sure what exactly).

For example, I had already tried delaying certain signals relative to
the others, say:
Data and Address signals arrive first;
Command signals arrive afterwards.

So, the assumption would be that once the command arrives with an intact
bit-pattern, the data and address would already be stable (and one would
only need to integrity-check the command and response codes).

Except it didn't work out this way...
It seems more like there is some sort of "noise" between clock domains,
that might flip bits even when they should otherwise be stable.

This noise level seems to be inversely correlated with the number of
synchronizer stages, but as can be noted, adding lots of stages is
pretty bad for latency...

Too few stages, and the data seems to turn into garbage; and the
integrity checking can produce paradoxical results (such as letting
stuff through that should have failed).

Similarly, trying to interact with data (in any way) during the
transition between clock domains seemingly tends to cause it to turn
into random garbage, ...

> I made a simple trick that works well for low frequency signals (i.e.
> spanning several cycles in the target domain), and where a +/- one or
> two cycles jitter in the target domain is acceptable, and that is
> to wait for all bits of the word to be stable for at least N cycles
> in the target domain.
>
> My implementation looks like this:
>
> https://github.com/mrisc32/mc1/blob/master/src/rtl/synchronizer.vhd
>
> This is actually my first CDC design, so probably not optimal, but
> it seems to work well for my use cases. However if you want higher
> frequency signals this method will obviously not work.
>

OK.

I ended up using XOR-based parity checking for the larger signals, and
"bit inverted duplicate" checking for smaller signals.

This at least doesn't add any additional latency beyond that needed for
the synchronizer itself.
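
As a sketch of what that amounts to (widths and names are examples, not
the actual bus):

// Sender: XOR-fold the wide payload into a check word, and send the
// small command field along with a bitwise-inverted duplicate.
wire [15:0] chk_tx     = data[63:48] ^ data[47:32] ^
                         data[31:16] ^ data[15:0]  ^
                         addr[31:16] ^ addr[15:0];
wire [7:0]  cmd_inv_tx = ~cmd;

// Receiver, after the synchronizer stages: recompute and only latch
// the transfer when both checks pass, else keep the prior values.
wire pass = (chk_rx == (data_rx[63:48] ^ data_rx[47:32] ^
                        data_rx[31:16] ^ data_rx[15:0]  ^
                        addr_rx[31:16] ^ addr_rx[15:0]))
         && (cmd_inv_rx == ~cmd_rx);

always @(posedge recvClock)
    if (pass) begin
        dataHeld <= data_rx;
        addrHeld <= addr_rx;
        cmdHeld  <= cmd_rx;
    end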

Re: More complex instructions to reduce cycle overhead

<zB9pI.338700$Skn4.250226@fx17.iad>


https://www.novabbs.com/devel/article-flat.php?id=16893&group=comp.arch#16893

From: ThatWoul...@thevillage.com (EricP)
 by: EricP - Wed, 19 May 2021 14:51 UTC

BGB wrote:
>
> In theory, the fastest signals which can exist with this FPGA are 450MHz
> / 2.2ns.
>
> Though, IME, about all it can really do at these speeds is:
> reg2 <= reg1;
>
>
> The Kintex and Vertex can apparently go a lot faster than the Artix and
> Spartan in this regard though.

Which Xilinx chips do you use?
Is it Artix-7 (28 nm), and which model?

Re: More complex instructions to reduce cycle overhead

<N%9pI.385882$2A5.241067@fx45.iad>


https://www.novabbs.com/devel/article-flat.php?id=16900&group=comp.arch#16900

From: ThatWoul...@thevillage.com (EricP)
 by: EricP - Wed, 19 May 2021 15:19 UTC

BGB wrote:
> On 5/18/2021 1:44 PM, Marcus wrote:
>>
>> Isn't it a question of uncertainty in when the different bits arrive
>> to the synchronizer flops? Each bit can potentially travel along a
>> different routing, and at least for me setting up the timing constraints
>> for the tools to do "the right thing" has always felt like black magic.
>>
>
> One would think that (and it would make sense), but some of my efforts
> implied that there is some other factor at play (not sure what exactly).
>
> For example, I had already tried delaying certain signals relative to
> the others, say:
> Data and Address signals arrive first;
> Command signals arrive afterwards.
>
> So, the assumption would be that once the command arrives with an intact
> bit-pattern, the data and address would already be stable (and one would
> only need to integrity-check the command and response codes).
>
>
> Except it didn't work out this way...
> It seems more like there is some sort of "noise" between clock domains,
> that might flip bits even when they should otherwise be stable.
>
>
> This noise level seems to be inversely correlated with the number of
> synchronizer stages, but as can be noted, adding lots of stages is
> pretty bad for latency...
>
> Too few stages, and the data seems to turn into garbage; and the
> integrity checking can produce paradoxical results (such as letting
> stuff through that should have failed).
>
> Similarly, trying to interact with data (in any way) during the
> transition between clock domains, seemingly tends to cause it to turn
> into random garbage, ...

Do the tools have a display to show the actual route chosen for signals?
I'm wondering if as you switch data sources and destinations,
the individual bit lines are taking radically different paths
through the interconnect, giving the appearance of randomly
changing skew for each bit.

Re: More complex instructions to reduce cycle overhead

<s83ds5$ccq$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16905&group=comp.arch#16905

From: cr88...@gmail.com (BGB)
 by: BGB - Wed, 19 May 2021 16:20 UTC

On 5/19/2021 9:51 AM, EricP wrote:
> BGB wrote:
>>
>> In theory, the fastest signals which can exist with this FPGA are
>> 450MHz / 2.2ns.
>>
>> Though, IME, about all it can really do at these speeds is:
>>   reg2 <= reg1;
>>
>>
>> The Kintex and Vertex can apparently go a lot faster than the Artix
>> and Spartan in this regard though.
>
> Which Xilinx chips do you use?
> It is Artix-7 (28 nm) and which model?
>

Primary Board:
XC7A100TCSG324-1 (Nexys A7-100T)

Others:
XC7S50TCSG324-1 (Arty S7-50)
XC7S25TCSG225-1 (CMod S7-25)

Generally using Vivado 2018.3.1, ...

Typically using strategies:
Flow_PerfOptimized_high (synthesis, *1)
Vivado Implementation Defaults (implementation, *2)

*1: This makes timing more likely to pass, but at the cost of higher
resource usage and similar.

*2: Changing this option tends to break the ability to see
post-implementation resource usage stats.

Of these, the Artix seems to have a slightly harder time with passing
timing than either of the Spartan devices, for whatever reason.
However, the Spartans are smaller.

In theory, they should be about the same speed, or if anything the Artix
should be faster due to having more space and thus better able to route
stuff.

As can be noted, whether or not things pass timing is mostly determined
by Vivado. Vivado allows synthesizing for any FPGA it supports, but
unless the FPGA matches the one on the board, the bitstream doesn't
work.

The Nexys A7 board has VGA and an SDcard holder, which is fairly useful.
It was ~ $269 when I bought it.

The CMod-S7 is limited to a microcontroller-like configuration with the
BJX2 core, as this board lacks any peripherals or external RAM.

Timings I can manage on the BJX2 Core:
50MHz, works pretty well
75MHz, can work, prone to fail timing, needs L1 caches reduced to ~ 2K

I have the RAM working at either 50 or 75 MHz (internally, RAM
controller logic runs at 100 or 150 MHz, driving the chip at 1/2 the
internal speed).

I had another DDR module (DdrB) that runs the DDR chip at 1:1 speeds,
but as noted, it worked in simulation but doesn't seem to work on actual
hardware.

One difference between the modules is that the 1/2 clocked module can
adjust Data/DQS alignment by 1/4 cycle, whereas the 1:1 module is
limited to 1/2 cycle.

With DLL enabled (and above the 120MHz minimum), the data should be
correctly aligned with the clock to make this fine-adjustment
unnecessary, but best I could tell, the RAM chip was effectively
becoming non-responsive.

The L2 in the Ringbus design operates at the same speed as the CPU core
(as does the ringbus), and the L2 interfaces with the RAM Module (using
the old/original bus design).

The RAM module then contains the glue-logic to step between the external
"master clock" and its internal clock speed. The spot where it steps
clock speeds seems to be prone to random bit-flips.

I now have logic which basically XORs everything together and currently
generates an 18-bit check value. The OPM and OK signals are also checked
against a bitwise-inverted duplicate (also keyed to the main check
value). The way things are XOR'ed effectively functions like a
horizontal parity over all the bits (probably sufficient for now, may
miss multi-bit errors though).

....

Re: More complex instructions to reduce cycle overhead

<s83gne$1ju$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16909&group=comp.arch#16909

From: cr88...@gmail.com (BGB)
 by: BGB - Wed, 19 May 2021 17:09 UTC

On 5/19/2021 10:19 AM, EricP wrote:
> BGB wrote:
>> [...]
>
> Do the tools have a display to show the actual route chosen for signals?
> I'm wondering if as you switch data sources and destinations,
> the individual bit lines are taking radically different paths
> through the interconnect, giving the appearance of randomly
> changing skew for each bit.
>
>

I am able to look at the top-10 "worst" routes (in terms of WNS), and
going to inter-clock paths (between 50<->150 MHz): yeah, they are
spread all over the place. They seem to be placed semi-randomly, at
various distances apart, within an area spanning roughly two clock
tiles.

Some routes have the source and destination flip-flop right next to each
other, others cross most of the way across the clock tile, ...

They are 0-level paths, with delays of around 2.2 to 2.4 ns (against a
6.7 ns requirement).

Note that, in general, I tend to see a lot of logic paths which
semi-randomly cross nearly the whole FPGA (particularly when timing
fails); often these paths wander from one side of the device to the
other, sometimes making a few zigzags across some big "void" area near
the middle of the device (presumably hard-logic of some sort), ...

Note that how stuff ends up organized in the FPGA tends to be pretty
much random, and the overall topology tends to vary from one run to
another (sort of like how terrain is different each time when making new
worlds in Minecraft).

I once did wonder if it was fully random, and tried as an experiment
redoing synthesis a few times without changing anything. In this case,
layout stayed the same in the tests. But, as soon as one changes
*anything* in the code, the whole topology reorganizes itself...

....

Re: More complex instructions to reduce cycle overhead

<s83jm1$muk$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16919&group=comp.arch#16919

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Wed, 19 May 2021 19:59:29 +0200
Organization: A noiseless patient Spider
Lines: 162
Message-ID: <s83jm1$muk$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
<s7n6ah$t1$1@dont-email.me>
<590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
<s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad>
<s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad>
<s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me>
<2aAoI.606122$%W6.592987@fx44.iad> <s7upgf$v6r$1@dont-email.me>
<uZToI.146185$lyv9.30173@fx35.iad>
<98f17a50-83b9-4eb9-bdc0-6f3f6787e7c7n@googlegroups.com>
<s81j26$gb$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 19 May 2021 17:59:29 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c2bc2388f8ab3b493b0311e7eda3587a";
logging-data="23508"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18HEd1HdRjZAFfi1rLQelRBUk/ByXLYAYo="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.8.1
Cancel-Lock: sha1:fVYn4ikmtlI/S8L+r4O3U1wBcxo=
In-Reply-To: <s81j26$gb$1@dont-email.me>
Content-Language: en-US
 by: Marcus - Wed, 19 May 2021 17:59 UTC

On 2021-05-19, BGB wrote:
> On 5/18/2021 3:19 PM, MitchAlsup wrote:
>> On Tuesday, May 18, 2021 at 1:47:57 PM UTC-5, EricP wrote:
>>> BGB wrote:
>>>> On 5/17/2021 3:15 PM, EricP wrote:
>>>>> BGB wrote:
>>>>>>
>>>>>>
>>>>>> Stuff online generally implied:
>>>>>> always @(posedge oldclock)
>>>>>> begin
>>>>>> tempValue <= sendValue;
>>>>>> end
>>>>>> always @(posedge newclock)
>>>>>> begin
>>>>>> recvValue <= tempValue;
>>>>>> end
>>>>>>
>>>>>> But, it wasn't nearly so easy outside of simulation...
>>>>>>
>>>>>>
>>>>>> Ended up having to do something like:
>>>>>> always @(posedge oldclock)
>>>>>> begin
>>>>>> tempValue1 <= sendValue;
>>>>>> tempValue2 <= tempValue1;
>>>>>> end
>>>>>>
>>>>>> always @(posedge newclock)
>>>>>> begin
>>>>>> tempValue3 <= tempValue2;
>>>>>> tempValue4 <= tempValue3;
>>>>>> ...
>>>>>> recvValue <= tempValueN;
>>>>>> end
>>>>>
>>>>> This looks like a metastability synchronizer but with more stages.
>>>>> https://en.wikipedia.org/wiki/Metastability_(electronics)
>>>>>
>>>>
>>>> I need a certain minimum number of stages, otherwise it all turns to
>>>> garbage.
>>>>
>>>> Granted, this doesn't happen in simulation, but seems to be a bigger
>>>> issue on the FPGA.
>> <
>>> I was thinking that if as you said this all driven from a common clock,
>>> and if it is just skew as Mitch said, then if the skew and jitter are
>>> small enough, you might change the above logic to set the send buffer
>>> on the rising clock, and fill the receive buffer on the falling clock.
>> <
>> When chips started to get "fast" (say above 2 GHz) we started to layout
>> the whole clock buffer tree with a different Vdd/Gnd routing scheme
>> as one way to better control clock skew. Vdd/Gnd were routed away
>> from the logic Vdd/Gnd that the logic gate "saw". The length of the
>> clock tree in K9 was a bit longer than a clock ! and the buffers produced
>> a rather constant amount of noise on their Vdd/GNd route while logic
>> had a big moment right after the rising clock edge which then tended
>> downward as the cycle progressed.
>> <
>
> In theory, the fastest signals which can exist with this FPGA are 450MHz
> / 2.2ns.
>
> Though, IME, about all it can really do at these speeds is:
>   reg2 <= reg1;
>
>
> The Kintex and Virtex can apparently go a lot faster than the Artix and
> Spartan in this regard though.
>
>
> Logic that runs at 150 or 200 MHz is fairly painful in terms of timing
> and what it can do.
>
> And, 75 or 100 MHz is a little easier, but still not super easy.
>
> At 50MHz one can do a lot more stuff...
> This seems like it is probably the "sweet spot" for the FPGAs I am using.
>
>
> At 25 or 33 MHz, it is possible that one could mostly ignore timing
> constraints (as one writes code that casually does arithmetic on 200 bit
> numbers or similar, ...).

I have, for a long time, "overclocked" my design. Timing usually passes
at about 70MHz, but if I set the clock (from the PLL) to 100MHz or even
120MHz timing will fail (usually the timing report will say that Fmax is
68-72MHz or so) - but the FPGA will happily run the design at those
speeds.

It's obviously not a good strategy, but I wanted to see how fast I could
take the design.

BTW, I'm using a Cyclone V FPGA.

>
>
>>> That gives 1/2 cycle of the faster clock of leeway for the edges to
>>> move about relative to each other, and eliminates the extra stages.
>> <
>> The FPGA will not have access to this kind of low skew engineering.
>
>
> It is possible, though, from what I can gather:
>
> Below ~ 100MHz, the FPGA seems able to use generic buffers for clock
> signaling. So, clocks below 100MHz can be used more freely.
>
>
> Above 100MHz, it seems to use clock tiles and global clock buffers, and
> seem to partition logic between clock tiles (say, putting 100MHz logic
> in one tile, and 150MHz logic in another). In this case, there are 8
> such clock tiles in the XC7A100.
>
> Apparently the FPGA doesn't actually support negedge clocking per-se, so
> when one tries to use negedge, it either uses an inverted version of the
> posedge signal, or otherwise fakes it by behaving "as-if" it had used a
> negedge (while still actually using posedge signaling internally).
>
> Not entirely sure how this interacts with the clock tiles, ...
>
>
> Apparently for external IO pins, it is able to drive them for IO on both
> the rising and falling edges, but what I can gather it works by running
> the two cases in parallel (off the same clock), with the external IO pin
> having separate sets of input and output signals internally for both the
> rising and falling edges (and joining the outputs together for SDR
> signaling).
>
>
> Apparently, this also can be used to drive external IO pins at roughly
> 2x the internal clock-frequency (so, say, 150MHz logic driving a 300MHz
> IO signal by using the DDR capabilities as 2x SDR).
>
>
> Some amount of other stuff generally recommends not using negedge at all
> if it can be avoided.
>
>
>>>
>>> always @(posedge sendClock)
>>> begin
>>> tempValue1 <= sendValue;
>>> end
>>>
>>> always @(negedge recvClock)
>>> begin
>>> recvValue <= tempValue1;
>>> end
>>>
>>> plus some buffer full/empty handshake logic.
>>>
>>> So the faster clock is 150 MHz, 6.66 ns, and 1/2 cycle 3.33 ns.
>>> The send and receive rising clock edges would have to differ by
>>> more than 3.33 ns minus setup and hold times and clk-to-Q-out time,
>>> to cause the above to mis-sample.
>
>
> It is possible, though I have uncertainty how well it would work in this
> case.

Re: More complex instructions to reduce cycle overhead

<s83p24$uaj$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=16930&group=comp.arch#16930

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
Date: Wed, 19 May 2021 14:31:14 -0500
Organization: A noiseless patient Spider
Lines: 197
Message-ID: <s83p24$uaj$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me>
<s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
<s7n6ah$t1$1@dont-email.me>
<590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
<s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad>
<s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad>
<s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me>
<2aAoI.606122$%W6.592987@fx44.iad> <s7upgf$v6r$1@dont-email.me>
<uZToI.146185$lyv9.30173@fx35.iad>
<98f17a50-83b9-4eb9-bdc0-6f3f6787e7c7n@googlegroups.com>
<s81j26$gb$1@dont-email.me> <s83jm1$muk$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 19 May 2021 19:31:16 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="8733e9da9e10b141cde1cd05e74615d1";
logging-data="31059"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1++2DkP+ERE4Mc0trDneO+4"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:4gKD6AufZzq6V5WZulnnBQAOhNI=
In-Reply-To: <s83jm1$muk$1@dont-email.me>
Content-Language: en-US
 by: BGB - Wed, 19 May 2021 19:31 UTC

On 5/19/2021 12:59 PM, Marcus wrote:
> On 2021-05-19, BGB wrote:
>> On 5/18/2021 3:19 PM, MitchAlsup wrote:
>>> On Tuesday, May 18, 2021 at 1:47:57 PM UTC-5, EricP wrote:
>>>> BGB wrote:
>>>>> On 5/17/2021 3:15 PM, EricP wrote:
>>>>>> BGB wrote:
>>>>>>>
>>>>>>>
>>>>>>> Stuff online generally implied:
>>>>>>> always @(posedge oldclock)
>>>>>>> begin
>>>>>>> tempValue <= sendValue;
>>>>>>> end
>>>>>>> always @(posedge newclock)
>>>>>>> begin
>>>>>>> recvValue <= tempValue;
>>>>>>> end
>>>>>>>
>>>>>>> But, it wasn't nearly so easy outside of simulation...
>>>>>>>
>>>>>>>
>>>>>>> Ended up having to do something like:
>>>>>>> always @(posedge oldclock)
>>>>>>> begin
>>>>>>> tempValue1 <= sendValue;
>>>>>>> tempValue2 <= tempValue1;
>>>>>>> end
>>>>>>>
>>>>>>> always @(posedge newclock)
>>>>>>> begin
>>>>>>> tempValue3 <= tempValue2;
>>>>>>> tempValue4 <= tempValue3;
>>>>>>> ...
>>>>>>> recvValue <= tempValueN;
>>>>>>> end
>>>>>>
>>>>>> This looks like a metastability synchronizer but with more stages.
>>>>>> https://en.wikipedia.org/wiki/Metastability_(electronics)
>>>>>>
>>>>>
>>>>> I need a certain minimum number of stages, otherwise it all turns to
>>>>> garbage.
>>>>>
>>>>> Granted, this doesn't happen in simulation, but seems to be a bigger
>>>>> issue on the FPGA.
>>> <
>>>> I was thinking that if as you said this all driven from a common clock,
>>>> and if it is just skew as Mitch said, then if the skew and jitter are
>>>> small enough, you might change the above logic to set the send buffer
>>>> on the rising clock, and fill the receive buffer on the falling clock.
>>> <
>>> When chips started to get "fast" (say above 2 GHz) we started to layout
>>> the whole clock buffer tree with a different Vdd/Gnd routing scheme
>>> as one way to better control clock skew. Vdd/Gnd were routed away
>>> from the logic Vdd/Gnd that the logic gate "saw". The length of the
>>> clock tree in K9 was a bit longer than a clock ! and the buffers
>>> produced
>>> a rather constant amount of noise on their Vdd/GNd route while logic
>>> had a big moment right after the rising clock edge which then tended
>>> downward as the cycle progressed.
>>> <
>>
>> In theory, the fastest signals which can exist with this FPGA are
>> 450MHz / 2.2ns.
>>
>> Though, IME, about all it can really do at these speeds is:
>>    reg2 <= reg1;
>>
>>
>> The Kintex and Virtex can apparently go a lot faster than the Artix
>> and Spartan in this regard though.
>>
>>
>> Logic that runs at 150 or 200 MHz is fairly painful in terms of timing
>> and what it can do.
>>
>> And, 75 or 100 MHz is a little easier, but still not super easy.
>>
>> At 50MHz one can do a lot more stuff...
>> This seems like it is probably the "sweet spot" for the FPGAs I am using.
>>
>>
>> At 25 or 33 MHz, it is possible that one could mostly ignore timing
>> constraints (as one writes code that casually does arithmetic on 200
>> bit numbers or similar, ...).
>
> I have, for a long time, "overclocked" my design. Timing usually passes
> at about 70MHz, but if I set the clock (from the PLL) to 100MHz or even
> 120MHz timing will fail (usually the timing report will say that Fmax is
> 68-72MHz or so) - but the FPGA will happily run the design at those
> speeds.
>
> It's obviously not a good strategy, but I wanted to see how fast I could
> take the design.
>
> BTW, I'm using a Cyclone V FPGA.
>

I had previously tried running a core at 100MHz despite it failing
timing...
It sorta worked, but there was a lot of obvious glitching (3D objects in
Quake were a mess of vertices exploding off in random directions, ...).

After a short time, it crashed in such a way that it fouled up the
SDcard badly enough that I had to reformat it and start back from a
clean filesystem.

In other news, I have come up with an idea: if I add
request-sequence-numbers to the L2<->DDR interface, then it should (in
theory) be possible to eliminate the Teardown/READY steps, which could
result in a significant reduction in latency-related overheads.

In effect, the existing scheme needs several round-trips per request,
whereas with the addition of a sequence number only a single round-trip
is needed: the DDR controller can note that, if the sequence number
changes, the prior request has terminated.

On the other side, when the sequence number for the current request
arrives back, it is known that the current request has finished
(sequence numbers serve a vaguely similar purpose in the ringbus).

>>
>>
>>>> That gives 1/2 cycle of the faster clock of leeway for the edges to
>>>> move about relative to each other, and eliminates the extra stages.
>>> <
>>> The FPGA will not have access to this kind of low skew engineering.
>>
>>
>> It is possible, though, from what I can gather:
>>
>> Below ~ 100MHz, the FPGA seems able to use generic buffers for clock
>> signaling. So, clocks below 100MHz can be used more freely.
>>
>>
>> Above 100MHz, it seems to use clock tiles and global clock buffers,
>> and seem to partition logic between clock tiles (say, putting 100MHz
>> logic in one tile, and 150MHz logic in another). In this case, there
>> are 8 such clock tiles in the XC7A100.
>>
>> Apparently the FPGA doesn't actually support negedge clocking per-se,
>> so when one tries to use negedge, it either uses an inverted version
>> of the posedge signal, or otherwise fakes it by behaving "as-if" it
>> had used a negedge (while still actually using posedge signaling
>> internally).
>>
>> Not entirely sure how this interacts with the clock tiles, ...
>>
>>
>> Apparently for external IO pins, it is able to drive them for IO on
>> both the rising and falling edges, but what I can gather it works by
>> running the two cases in parallel (off the same clock), with the
>> external IO pin having separate sets of input and output signals
>> internally for both the rising and falling edges (and joining the
>> outputs together for SDR signaling).
>>
>>
>> Apparently, this also can be used to drive external IO pins at roughly
>> 2x the internal clock-frequency (so, say, 150MHz logic driving a
>> 300MHz IO signal by using the DDR capabilities as 2x SDR).
>>
>>
>> Some amount of other stuff generally recommends not using negedge at
>> all if it can be avoided.
>>
>>
>>>>
>>>> always @(posedge sendClock)
>>>> begin
>>>> tempValue1 <= sendValue;
>>>> end
>>>>
>>>> always @(negedge recvClock)
>>>> begin
>>>> recvValue <= tempValue1;
>>>> end
>>>>
>>>> plus some buffer full/empty handshake logic.
>>>>
>>>> So the faster clock is 150 MHz, 6.66 ns, and 1/2 cycle 3.33 ns.
>>>> The send and receive rising clock edges would have to differ by
>>>> more than 3.33 ns minus setup and hold times and clk-to-Q-out time,
>>>> to cause the above to mis-sample.
>>
>>
>> It is possible, though I have uncertainty how well it would work in
>> this case.
>

Timing... (Re: More complex instructions to reduce cycle overhead)

<s88rmh$77e$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17013&group=comp.arch#17013

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Timing... (Re: More complex instructions to reduce cycle overhead)
Date: Fri, 21 May 2021 12:46:54 -0500
Organization: A noiseless patient Spider
Lines: 163
Message-ID: <s88rmh$77e$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
<c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
<s7n6ah$t1$1@dont-email.me>
<590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
<s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad>
<s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad>
<s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me>
<2aAoI.606122$%W6.592987@fx44.iad> <s7upgf$v6r$1@dont-email.me>
<uZToI.146185$lyv9.30173@fx35.iad>
<98f17a50-83b9-4eb9-bdc0-6f3f6787e7c7n@googlegroups.com>
<s81j26$gb$1@dont-email.me> <zB9pI.338700$Skn4.250226@fx17.iad>
<s83ds5$ccq$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 21 May 2021 17:46:57 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="31e7d7b0f7b5e66557a059550dd9aa34";
logging-data="7406"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19BJs84mIPU/8A051E6lsj/"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:1ufruXFFwlG/LUEmbmrQwAbllGE=
In-Reply-To: <s83ds5$ccq$1@dont-email.me>
Content-Language: en-US
 by: BGB - Fri, 21 May 2021 17:46 UTC

On 5/19/2021 11:20 AM, BGB wrote:
> On 5/19/2021 9:51 AM, EricP wrote:
>> BGB wrote:
>>>
>>> In theory, the fastest signals which can exist with this FPGA are
>>> 450MHz / 2.2ns.
>>>
>>> Though, IME, about all it can really do at these speeds is:
>>>   reg2 <= reg1;
>>>
>>>
>>> The Kintex and Virtex can apparently go a lot faster than the Artix
>>> and Spartan in this regard though.
>>
>> Which Xilinx chips do you use?
>> It is Artix-7 (28 nm) and which model?
>>
>
>
> Primary Board:
>   XC7A100TCSG324-1  (Nexys A7-100T)
>
> Others:
>   XC7S50TCSG324-1 (Arty S7-50)
>   XC7S25TCSG225-1 (CMod S7-25)
>
> Generally using Vivado 2018.3.1, ...
>
> Typically using strategies:
>   Flow_PerfOptimized_high (synthesis, *1)
>   Vivado Implementation Defaults (latter, *2)
>
> *1: This makes timing more likely to pass, but at the cost of higher
> resource usage and similar.
>
> *2: Changing this option tends to break the ability to see
> post-implementation resource usage stats.
>
>
> Of these, the Artix seems to have a slightly harder time with passing
> timing than either of the Spartan devices, for whatever reason.
> However, the Spartans are smaller.
>
> In theory, they should be about the same speed, or if anything the Artix
> should be faster due to having more space and thus better able to route
> stuff.
>
>
> As can be noted, whether or not things pass timing is mostly determined
> by Vivado. Vivado allows synthesizing for any FPGA it supports, but as
> can be noted, unless the FPGA matches the one on the board, the
> bitstream doesn't work.
>
>
> The Nexys A7 board has VGA and an SDcard holder, which is fairly useful.
> It was ~ $269 when I bought it,
>
> The CMod-S7 is limited to a microcontroller like configuration with the
> BJX2 core, as this board lacks any peripherals or external RAM.
>
>
> Timings I can manage on the BJX2 Core:
>   50MHz, works pretty well
>   75MHz, can work, prone to fail timing, needs L1 caches reduced to ~ 2K
>
>
> I have the RAM working at either 50 or 75 MHz (internally, RAM
> controller logic runs at 100 or 150 MHz, driving the chip at 1/2 the
> internal speed).
>
>
> I had another DDR module (DdrB) that runs the DDR chip at 1:1 speeds,
> but as noted, it worked in simulation but doesn't seem to work on actual
> hardware.
>
> One difference between the modules is that the 1/2 clocked module can
> adjust Data/DQS alignment by 1/4 cycle, whereas the 1:1 module is
> limited to 1/2 cycle.
>
> With DLL enabled (and above the 120MHz minimum), the data should be
> correctly aligned with the clock to make this fine-adjustment
> unnecessary, but best I could tell, the RAM chip was effectively
> becoming non-responsive.
>

Goes and adds timing constraints to the DDR-related IO pins, after
noting that they were absent (without constraints, these paths
apparently just get treated as unconstrained)...

However, this resulted in basically a crap-storm of failed timing
constraints; apparently, the DDR controller module wasn't actually
maintaining timing on its IO pins...
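For reference, roughly the sort of IO constraints that were missing
(the clock and pin names here are made up for illustration; the actual
delay numbers would have to come from the DDR chip's datasheet):

create_clock -period 10.000 -name clk_ddr [get_ports ddr_ck_p]
set_output_delay -clock clk_ddr -max 1.500 [get_ports {ddr_dq[*]}]
set_output_delay -clock clk_ddr -min -0.500 [get_ports {ddr_dq[*]}]
set_input_delay -clock clk_ddr -max 2.000 [get_ports {ddr_dq[*]}]
set_input_delay -clock clk_ddr -min 0.500 [get_ports {ddr_dq[*]}]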

>
> The L2 in the Ringbus design operates at the same speed as the CPU core
> (as does the ringbus), and the L2 interfaces with the RAM Module (using
> the old/original bus design).
>
>
> The RAM module then contains the glue-logic to step between the external
> "master clock" and its internal clock speed. The spot where it steps
> clock speeds seems to be prone to random bit-flips.
>
> I now have logic which basically XORs everything together and currently
> generates an 18-bit check value. The OPM and OK signals are also checked
> against a bitwise-inverted duplicate (also keyed to the main check
> value). The way things are XOR'ed effectively functions like a
> horizontal parity over all the bits (probably sufficient for now, may
> miss multi-bit errors though).
>

And, all of this suddenly falling on its face started to indicate that
this may not have actually been the source of the random bit flipping to
begin with...

Rather, poking around with logic in one place was affecting the timing
at the IO pins...

I have also determined:
The 50MHz DDR is seemingly the fastest I can drive it and still pass
timing.

The 50MHz is actually driving some of the IO pins on a 100MHz clock, but
given the way DDR works, it looks like I either need to drive the pins
internally at 2x the external speed, or provide multiple clocks which
are slightly out of phase to get the required timings.

I looked at the datasheets some more, and it turns out ~ 100 MHz is the
fastest these IO pins can actually go (in their base form), or a little
faster if both posedge and negedge are used.
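FWIW, the usual way to drive a pin on both the rising and falling edge
from a single internal clock on these parts is the ODDR output
primitive; something like this (a sketch, with invented signal names):

ODDR #(
    .DDR_CLK_EDGE("SAME_EDGE"), // D1/D2 both captured on the rising edge
    .INIT(1'b0),
    .SRTYPE("SYNC")
) ddr_pin_oddr (
    .Q  (ddrPin),   // output to the IO pad
    .C  (clk100),   // the single internal clock
    .CE (1'b1),
    .D1 (bitHi),    // driven onto the pin after the rising edge
    .D2 (bitLo),    // driven onto the pin after the falling edge
    .R  (1'b0),
    .S  (1'b0)
);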

But, it appears that MIG was able to get the pins to go faster via
another "secret weapon" trick: SERDES.

So, I guess with SERDES, it is possible to feed parallel data at 50MHz
in one side, and get 400MHz out the other side. The reverse is also
possible with these pins (receiving a fast serial stream as parallel
data). The SERDES pins apparently operate at ~ 400MHz to 800MHz.

And, as it would so happen, the RAM is connected up to the pins which
have this capability, leaning in the direction of this probably being
what is going on.
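Something like this, presumably (a rough sketch of an 8:1 OSERDESE2 on
one DQ pin; the names are invented, and MIG's actual PHY is no doubt
more involved):

OSERDESE2 #(
    .DATA_RATE_OQ  ("DDR"),    // shift a bit out on both edges of CLK
    .DATA_RATE_TQ  ("SDR"),
    .DATA_WIDTH    (8),        // 8 parallel bits per CLKDIV cycle
    .SERDES_MODE   ("MASTER"),
    .TRISTATE_WIDTH(1)
) dq_serdes (
    .OQ     (dqPin),           // serial output to the pad
    .CLK    (clk200),          // fast clock: 200MHz DDR = 400 Mb/s
    .CLKDIV (clk50),           // parallel side: 8 bits per 50MHz cycle
    .D1(par[0]), .D2(par[1]), .D3(par[2]), .D4(par[3]),
    .D5(par[4]), .D6(par[5]), .D7(par[6]), .D8(par[7]),
    .OCE    (1'b1),
    .RST    (rst),
    .TCE    (1'b0),
    .T1(1'b0), .T2(1'b0), .T3(1'b0), .T4(1'b0),
    .SHIFTIN1(1'b0), .SHIFTIN2(1'b0),
    .TBYTEIN(1'b0)
);

The ISERDESE2 primitive handles the reverse (capture) direction.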

Yeah...

Though I guess in this case, it would be a tradeoff of:
Leave it as-is, and live with RAM at 50MHz (safely), or 75MHz (and
technically failing timing);
Look into a DDR controller based around using SERDES as well;
Being like "screw it" and implementing support for an AXI Bus
interface, and then using MIG.

Though, I am also left to suspect that the gains I would get from moving
to faster speeds would be limited, given I would (then) have to deal
with significantly larger CAS and RAS latencies and similar, likely
eating most of the gains...

Re: More complex instructions to reduce cycle overhead

<692535ba-3b6e-4cf3-aac2-702e324bd212n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17021&group=comp.arch#17021

Newsgroups: comp.arch
X-Received: by 2002:ac8:4d42:: with SMTP id x2mr4434419qtv.178.1621644151839; Fri, 21 May 2021 17:42:31 -0700 (PDT)
X-Received: by 2002:a9d:19ed:: with SMTP id k100mr8012252otk.329.1621644151566; Fri, 21 May 2021 17:42:31 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!feeder1.feed.usenet.farm!feed.usenet.farm!tr3.eu1.usenetexpress.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 21 May 2021 17:42:31 -0700 (PDT)
In-Reply-To: <3ea71dea-28e2-47a2-9073-d49cfe92cde4n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:a0d0:9f90:65ed:dbcf:9ab5:d836; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 2600:1700:a0d0:9f90:65ed:dbcf:9ab5:d836
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me> <c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com> <s7mio3$qfs$1@dont-email.me> <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com> <jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org> <049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com> <s7n2gj$5na$2@dont-email.me> <5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com> <s7n6ah$t1$1@dont-email.me> <590ea343-cd96-4082-800c-f02412204262n@googlegroups.com> <s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad> <s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad> <s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me> <2aAoI.606122$%W6.592987@fx44.iad> <3ea71dea-28e2-47a2-9073-d49cfe92cde4n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <692535ba-3b6e-4cf3-aac2-702e324bd212n@googlegroups.com>
Subject: Re: More complex instructions to reduce cycle overhead
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Sat, 22 May 2021 00:42:31 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 101
 by: JimBrakefield - Sat, 22 May 2021 00:42 UTC

On Monday, May 17, 2021 at 4:29:35 PM UTC-5, MitchAlsup wrote:
> On Monday, May 17, 2021 at 3:16:03 PM UTC-5, EricP wrote:
> > BGB wrote:
> > > On 5/15/2021 10:02 AM, Thomas Koenig wrote:
> > >> EricP <ThatWould...@thevillage.com> schrieb:
> > >>> Thomas Koenig wrote:
> > >>
> > >>>> That sounds scary - in effect, the synchronization between the
> > >>>> different
> > >>>> bits in, let's say, an adder would be implied by the gate timing?
> > >>>>
> > >>>> You would need very narrow tolerances on your gates, then (both
> > >>>> too fast and to slow would be deadly).
> > >>>>
> > >>>> Or is some other mechanism proposed?
> > >>>
> > >>> They eliminate intermediate pipeline stage registers,
> > >>> then tools insert buffers so that all pathways through the combo logic
> > >>> have the same propagation delay ensuring all output signals arrive at
> > >>> the same instant.
> > >>
> > >> I'm an engineer, and I know full well that, in anything we build,
> > >> there can not be such a thing as the same _anything_ ...
> > >>
> > >> Rather, they must be counting on the inevitable dispersion to
> > >> be small enough that they an still catch it after a presumably
> > >> small number of cycles.
> > >>
> > >
> > > I have made the observation that data forwarded between clock domains in
> > > an FPGA is not reliable, even if "most of the time" it gets through
> > > intact...
> > >
> > > In my recent battling against stability issues, I was seeing crashes
> > > with debug prints where the "expected value" and "got value" would
> > > differ by 1 bit being flipped, ...
> > >
> > >
> > > Ended up adding integrity checking and filtering to reject messages (and
> > > use the prior values) if the check values didn't match after the
> > > clock-domain crossing (generally XOR'ing everything together and
> > > checking for equality).
> > >
> > >
> > > Stuff online generally implied:
> > > always @(posedge oldclock)
> > > begin
> > > tempValue <= sendValue;
> > > end
> > > always @(posedge newclock)
> > > begin
> > > recvValue <= tempValue;
> > > end
> > >
> > > But, it wasn't nearly so easy outside of simulation...
> > >
> > >
> > > Ended up having to do something like:
> > > always @(posedge oldclock)
> > > begin
> > > tempValue1 <= sendValue;
> > > tempValue2 <= tempValue1;
> > > end
> > >
> > > always @(posedge newclock)
> > > begin
> > > tempValue3 <= tempValue2;
> > > tempValue4 <= tempValue3;
> > > ...
> > > recvValue <= tempValueN;
> > > end
> > This looks like a metastability synchronizer but with more stages.
> > https://en.wikipedia.org/wiki/Metastability_(electronics)
> > > Where multiple forwarding stages tends to improve stability, at a
> > > 50<->75 MHz interface, generally needed 2 stages on the 'old' side, and
> > > 6 on the 'new' side. Similar for 50<->150 MHz.
> > >
> > > Even then, reliability was still an issue (and it would break
> > > intermittently, and even 50<->100 interfacing was still unreliable, if
> > > not quite as bad). Similar seems to apply to 75<->150.
> >
> > >
> > > Note that using both 100 and 150 MHz on the FPGA seems to be a problem,
> > > as the synthesis then goes and forces them into different clock-tiles
> > > (limiting the usable space in the FPGA); whereas other (sub 100 MHz)
> > > frequencies seem able to coexist within the same clock tiles.
> > But a metastability synchronizer shouldn't be necessary as
> > all the clocks are multiples of 25 Mhz and presumably derived
> > from the same source.
> >
> > Maybe it is due to clock skew between the domains.
> <
> That much is obvious, the question is:: is that clock skew variable with
> load ? That is: when your do a 3-instruction bundle does the clock skew
> change from that when you run only 1 instruction ???
>
> AND why do you have clock domains in an FPGA ?

Communications designs are notorious for having several independent,
unrelated clocks.
Larger FPGAs support multiple clock domains. Presumably they work as intended.

Fortunately metastability issues are taught to undergraduates these days.

Re: Timing... (Re: More complex instructions to reduce cycle overhead)

<2f79fea6-3fb5-40e5-b53d-7f41e99d5b6dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=17022&group=comp.arch#17022

Newsgroups: comp.arch
X-Received: by 2002:a37:8245:: with SMTP id e66mr15058317qkd.439.1621645093184;
Fri, 21 May 2021 17:58:13 -0700 (PDT)
X-Received: by 2002:a05:6830:1015:: with SMTP id a21mr400070otp.240.1621645092867;
Fri, 21 May 2021 17:58:12 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.snarked.org!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 21 May 2021 17:58:12 -0700 (PDT)
In-Reply-To: <s88rmh$77e$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:a0d0:9f90:65ed:dbcf:9ab5:d836;
posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 2600:1700:a0d0:9f90:65ed:dbcf:9ab5:d836
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <s7l7os$75r$1@dont-email.me>
<s7m6ri$vta$1@dont-email.me> <c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
<s7mio3$qfs$1@dont-email.me> <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org> <049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me> <5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
<s7n6ah$t1$1@dont-email.me> <590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
<s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad>
<s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad>
<s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me>
<2aAoI.606122$%W6.592987@fx44.iad> <s7upgf$v6r$1@dont-email.me>
<uZToI.146185$lyv9.30173@fx35.iad> <98f17a50-83b9-4eb9-bdc0-6f3f6787e7c7n@googlegroups.com>
<s81j26$gb$1@dont-email.me> <zB9pI.338700$Skn4.250226@fx17.iad>
<s83ds5$ccq$1@dont-email.me> <s88rmh$77e$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2f79fea6-3fb5-40e5-b53d-7f41e99d5b6dn@googlegroups.com>
Subject: Re: Timing... (Re: More complex instructions to reduce cycle overhead)
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Sat, 22 May 2021 00:58:13 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 169
 by: JimBrakefield - Sat, 22 May 2021 00:58 UTC

On Friday, May 21, 2021 at 12:47:00 PM UTC-5, BGB wrote:
> On 5/19/2021 11:20 AM, BGB wrote:
> > On 5/19/2021 9:51 AM, EricP wrote:
> >> BGB wrote:
> >>>
> >>> In theory, the fastest signals which can exist with this FPGA are
> >>> 450MHz / 2.2ns.
> >>>
> >>> Though, IME, about all it can really do at these speeds is:
> >>> reg2 <= reg1;
> >>>
> >>>
> >>> The Kintex and Virtex can apparently go a lot faster than the Artix
> >>> and Spartan in this regard though.
> >>
> >> Which Xilinx chips do you use?
> >> It is Artix-7 (28 nm) and which model?
> >>
> >
> >
> > Primary Board:
> > XC7A100TCSG324-1 (Nexys A7-100T)
> >
> > Others:
> > XC7S50TCSG324-1 (Arty S7-50)
> > XC7S25TCSG225-1 (CMod S7-25)
> >
> > Generally using Vivado 2018.3.1, ...
> >
> > Typically using strategies:
> > Flow_PerfOptimized_high (synthesis, *1)
> > Vivado Implementation Defaults (latter, *2)
> >
> > *1: This makes timing more likely to pass, but at the cost of higher
> > resource usage and similar.
> >
> > *2: Changing this option tends to break the ability to see
> > post-implementation resource usage stats.
> >
> >
> > Of these, the Artix seems to have a slightly harder time with passing
> > timing than either of the Spartan devices, for whatever reason.
> > However, the Spartans are smaller.
> >
> > In theory, they should be about the same speed, or if anything the Artix
> > should be faster due to having more space and thus better able to route
> > stuff.
> >
> >
> > As can be noted, whether or not things pass timing is mostly determined
> > by Vivado. Vivado allows synthesizing for any FPGA it supports, but as
> > can be noted, unless the FPGA matches the one on the board, the
> > bitstream doesn't work.
> >
> >
> > The Nexys A7 board has VGA and an SDcard holder, which is fairly useful.
> > It was ~ $269 when I bought it,
> >
> > The CMod-S7 is limited to a microcontroller like configuration with the
> > BJX2 core, as this board lacks any peripherals or external RAM.
> >
> >
> > Timings I can manage on the BJX2 Core:
> > 50MHz, works pretty well
> > 75MHz, can work, prone to fail timing, needs L1 caches reduced to ~ 2K
> >
> >
> > I have the RAM working at either 50 or 75 MHz (internally, RAM
> > controller logic runs at 100 or 150 MHz, driving the chip at 1/2 the
> > internal speed).
> >
> >
> > I had another DDR module (DdrB) that runs the DDR chip at 1:1 speeds,
> > but as noted, it worked in simulation but doesn't seem to work on actual
> > hardware.
> >
> > One difference between the modules is that the 1/2 clocked module can
> > adjust Data/DQS alignment by 1/4 cycle, whereas the 1:1 module is
> > limited to 1/2 cycle.
> >
> > With DLL enabled (and above the 120MHz minimum), the data should be
> > correctly aligned with the clock to make this fine-adjustment
> > unnecessary, but best I could tell, the RAM chip was effectively
> > becoming non-responsive.
> >
> Goes and adds timing constraints to the DDR-related IO pins, after
> noting that they were absent (without constraints, these paths
> apparently just get treated as unconstrained)...
>
> However, this resulted in basically a crap-storm of failed timing
> constraints; apparently, the DDR controller module wasn't actually
> maintaining timing on its IO pins...
> >
> > The L2 in the Ringbus design operates at the same speed as the CPU core
> > (as does the ringbus), and the L2 interfaces with the RAM Module (using
> > the old/original bus design).
> >
> >
> > The RAM module then contains the glue-logic to step between the external
> > "master clock" and its internal clock speed. The spot where it steps
> > clock speeds seems to be prone to random bit-flips.
> >
> > I now have logic which basically XORs everything together and currently
> > generates an 18-bit check value. The OPM and OK signals are also checked
> > against a bitwise-inverted duplicate (also keyed to the main check
> > value). The way things are XOR'ed effectively functions like a
> > horizontal parity over all the bits (probably sufficient for now, may
> > miss multi-bit errors though).
> >
> And, all of this suddenly falling on its face started to indicate that
> this may not have actually been the source of the random bit flipping
> to begin with...
>
> Rather, poking around with logic in one place was affecting the timing
> at the IO pins...
>
>
> I have also determined:
> The 50MHz DDR is seemingly the fastest I can drive it and still pass
> timing.
>
> The 50MHz is actually driving some of the IO pins on a 100MHz clock, but
> given the way DDR works, it looks like I either need to drive the pins
> internally at 2x the external speed, or provide multiple clocks which
> are slightly out of phase to get the required timings.
>
>
> I looked at the datasheets some more, and it turns out ~ 100 MHz is the
> fastest these IO pins can actually go (in their base form), or a little
> faster if both posedge and negedge are used.
>
>
> But, it appears that MIG was able to get the pins to go faster via
> another "secret weapon" trick: SERDES.
>
> So, I guess with SERDES, it is possible to feed parallel data at 50MHz
> in one side, and get 400MHz out the other side. The reverse is also
> possible with these pins (receiving a fast serial stream as parallel
> data). The SERDES pins apparently operate at ~ 400MHz to 800MHz.
>
> And, as it would so happen, the RAM is connected up to the pins which
> have this capability, leaning in the direction of this probably being
> what is going on.
>
> Yeah...
>
>
> Though I guess in this case, it would be a tradeoff of:
> Leave it as-is, and live with RAM at 50MHz (safely), or 75MHz (and
> technically failing timing);
> Look into a DDR controller based around using SERDES as well;
> Being like "screw it" and implementing support for an AXI Bus
> interface, and then using MIG.
>
>
> Though, I am also left to suspect that the gains I would get from moving
> to faster speeds would be limited, given I would (then) have to deal
> with significantly larger CAS and RAS latencies and similar, likely
> eating most of the gains...

|> Goes and adds timing constraints to the DDR-related IO pins, after
|> noting that they were absent (without constraints, these paths
|> apparently just get treated as unconstrained)...

Timing-driven place and route on an FPGA is much different from ASIC:
the FPGA router puts its effort into routing the paths with the least
slack and can be very lazy with paths that have plenty of slack.
It can also duplicate flip-flops or entire paths if that helps.
My knowledge in this area is limited, but I have seen delays moved onto
unconstrained IO signals to the point that the IO delays were enormous.

Re: More complex instructions to reduce cycle overhead

<zaZpI.435433$2A5.79008@fx45.iad>


https://www.novabbs.com/devel/article-flat.php?id=17023&group=comp.arch#17023

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news-out.netnews.com!news.alt.net!fdc3.netnews.com!peer02.ams1!peer.ams1.xlned.com!news.xlned.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx45.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: More complex instructions to reduce cycle overhead
References: <s7dn5p$78r$1@newsreader4.netcologne.de> <2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me> <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me> <c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com> <s7mio3$qfs$1@dont-email.me> <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com> <jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org> <049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com> <s7n2gj$5na$2@dont-email.me> <5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com> <s7n6ah$t1$1@dont-email.me> <590ea343-cd96-4082-800c-f02412204262n@googlegroups.com> <s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad> <s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad> <s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me> <2aAoI.606122$%W6.592987@fx44.iad> <s7upgf$v6r$1@dont-email.me> <s811ub$ebh$1@dont-email.me> <s81l2o$agu$1@dont-email.me> <N%9pI.385882$2A5.241067@fx45.iad> <s83gne$1ju$1@dont-email.me>
In-Reply-To: <s83gne$1ju$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 132
Message-ID: <zaZpI.435433$2A5.79008@fx45.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 22 May 2021 01:32:15 UTC
Date: Fri, 21 May 2021 21:31:58 -0400
X-Received-Bytes: 6777
 by: EricP - Sat, 22 May 2021 01:31 UTC

BGB wrote:
> On 5/19/2021 10:19 AM, EricP wrote:
>> BGB wrote:
>>> On 5/18/2021 1:44 PM, Marcus wrote:
>>>>
>>>> Isn't it a question of uncertainty in when the different bits arrive
>>>> to the synchronizer flops? Each bit can potentially travel along a
>>>> different routing, and at least for me setting up the timing
>>>> constraints
>>>> for the tools to do "the right thing" has always felt like black magic.
>>>>
>>>
>>> One would think that (and it would make sense), but some of my
>>> efforts implied that there is some other factor at play (not sure
>>> what exactly).
>>>
>>> For example, I had already tried delaying certain signals relative to
>>> the others, say:
>>> Data and Address signals arrive first;
>>> Command signals arrive afterwards.
>>>
>>> So, the assumption would be that once the command arrives with an
>>> intact bit-pattern, the data and address would already be stable (and
>>> one would only need to integrity-check the command and response codes).
>>>
>>>
>>> Except it didn't work out this way...
>>> It seems more like there is some sort of "noise" between clock
>>> domains, that might flip bits even when they should otherwise be stable.
>>>
>>>
>>> This noise level seems to be inversely correlated with the number of
>>> synchronizer stages, but as can be noted, adding lots of stages is
>>> pretty bad for latency...
>>>
>>> Too few stages, and the data seems to turn into garbage; and the
>>> integrity checking can produce paradoxical results (such as letting
>>> stuff through that should have failed).
>>>
>>> Similarly, trying to interact with data (in any way) during the
>>> transition between clock domains, seemingly tends to cause it to turn
>>> into random garbage, ...
>>
>> Do the tools have a display to show the actual route chosen for signals?
>> I'm wondering if as you switch data sources and destinations,
>> the individual bit lines are taking radically different paths
>> through the interconnect, giving the appearance of randomly
>> changing skew for each bit.
>>
>>
>
>
> I am able to look at the top-10 "worst" routes (in terms of WNS), and
> looking at the inter-clock paths (between 50<->150 MHz):
> Yeah, they are spread all over the place;
> They seem to be placed semi-randomly, at various distances apart,
> within an area spanning roughly two clock tiles.
>
> Some routes have the source and destination flip-flop right next to each
> other, others cross most of the way across the clock tile, ...
>
> They are 0-level paths, with delays of around 2.2 to 2.4 ns (for a
> 6.7 ns requirement).
>
>
> Note that, in general, I tend to see a lot of logic paths which
> semi-randomly cross nearly the whole FPGA (particularly when timing
> fails); it often seems to be paths wandering from one side of the device
> to the other, sometimes making a few zigzags across some big "void" area
> near the middle of the device (presumably hard logic of some sort), ...
>
>
>
> Note that how stuff ends up organized in the FPGA tends to be pretty
> much random, and the overall topology tends to vary from one run to
> another (sort of like how terrain is different each time when making new
> worlds in Minecraft).
>
> I once did wonder if it was fully random, and tried as an experiment
> redoing synthesis a few times without changing anything. In this case,
> layout stayed the same in the tests. But, as soon as one changes
> *anything* in the code, the whole topology reorganizes itself...
>
> ....
>

Ok, I think I see what's going on (maybe).

The routing between each register has widely different latencies
for each register bit. And it changes for each compile.

If you transmit the Valid and Data bits at the same time,
the Valid bit can arrive first. Since the Recv clock is pretty much
asynchronous to the Send clock, the Valid flag can be received before
the data has arrived and stabilized. You clock in garbage.

You need to transmit the data bits, and delay the Valid bit by 1 sender
clock to allow all the data bits a minimum of 6.66 ns to arrive at Recv
before Valid goes 1.

When Recv sees Valid == 1 it enables the Recv synchronizer stage.
There are two Recv stages for the metastability synchronizer,
so when the second-stage Valid bit goes high, you have received
the data and can Ack the sender to return it to an empty state.

Something like...

always @(posedge sendclock)
begin
    tempValue1 <= sendValue;

    tempValid1 <= 1;              // Valid takes 1 sendclock extra to
    tempValid2 <= tempValid1;     // propagate after sendValue arrives.
end

always @(posedge recvclock)
begin
    // the posedge already provides the clock; the Valid bits act as
    // synchronous enables
    if (tempValid2) begin
        tempValue2 <= tempValue1;
        tempValid3 <= tempValid2;
    end

    if (tempValid3) begin
        recvValue <= tempValue2;
        recvAck <= tempValid3;
    end
end

recvAck can return and mark the sender as empty.
Note that recvAck is also asynchronous so may need a synchronizer.

Re: Timing... (Re: More complex instructions to reduce cycle overhead)

<s8a3no$jc8$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=17024&group=comp.arch#17024

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Timing... (Re: More complex instructions to reduce cycle
overhead)
Date: Sat, 22 May 2021 00:10:13 -0500
Organization: A noiseless patient Spider
Lines: 198
Message-ID: <s8a3no$jc8$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<s7mio3$qfs$1@dont-email.me>
<00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
<jwv1ra92e0t.fsf-monnier+comp.arch@gnu.org>
<049b46dd-4544-4fe7-861b-85f97b3269c3n@googlegroups.com>
<s7n2gj$5na$2@dont-email.me>
<5eb5bb76-37e9-4363-8d56-b1139e2d384bn@googlegroups.com>
<s7n6ah$t1$1@dont-email.me>
<590ea343-cd96-4082-800c-f02412204262n@googlegroups.com>
<s7nchh$b58$1@dont-email.me> <KiHnI.151039$wd1.100928@fx41.iad>
<s7nv66$u2t$1@newsreader4.netcologne.de> <PcQnI.365058$2A5.181861@fx45.iad>
<s7onqq$ape$1@newsreader4.netcologne.de> <s7uclo$23r$1@dont-email.me>
<2aAoI.606122$%W6.592987@fx44.iad> <s7upgf$v6r$1@dont-email.me>
<uZToI.146185$lyv9.30173@fx35.iad>
<98f17a50-83b9-4eb9-bdc0-6f3f6787e7c7n@googlegroups.com>
<s81j26$gb$1@dont-email.me> <zB9pI.338700$Skn4.250226@fx17.iad>
<s83ds5$ccq$1@dont-email.me> <s88rmh$77e$1@dont-email.me>
<2f79fea6-3fb5-40e5-b53d-7f41e99d5b6dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 22 May 2021 05:10:16 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7eba654dd3364d58c666703fb24f3432";
logging-data="19848"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18oXwlwhDT4SMS1gWVBttDx"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.1
Cancel-Lock: sha1:Pp7tgMUVi2FHJelEXR2wYpWkb00=
In-Reply-To: <2f79fea6-3fb5-40e5-b53d-7f41e99d5b6dn@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 22 May 2021 05:10 UTC

On 5/21/2021 7:58 PM, JimBrakefield wrote:
> On Friday, May 21, 2021 at 12:47:00 PM UTC-5, BGB wrote:
>> On 5/19/2021 11:20 AM, BGB wrote:
>>> On 5/19/2021 9:51 AM, EricP wrote:
>>>> BGB wrote:
>>>>>
>>>>> In theory, the fastest signals which can exist with this FPGA are
>>>>> 450MHz / 2.2ns.
>>>>>
>>>>> Though, IME, about all it can really do at these speeds is:
>>>>> reg2 <= reg1;
>>>>>
>>>>>
>>>>> The Kintex and Virtex can apparently go a lot faster than the Artix
>>>>> and Spartan in this regard though.
>>>>
>>>> Which Xilinx chips do you use?
>>>> It is Artix-7 (28 nm) and which model?
>>>>
>>>
>>>
>>> Primary Board:
>>> XC7A100TCSG324-1 (Nexys A7-100T)
>>>
>>> Others:
>>> XC7S50TCSG324-1 (Arty S7-50)
>>> XC7S25TCSG225-1 (CMod S7-25)
>>>
>>> Generally using Vivado 2018.3.1, ...
>>>
>>> Typically using strategies:
>>> Flow_PerfOptimized_high (synthesis, *1)
>>> Vivado Implementation Defaults (latter, *2)
>>>
>>> *1: This makes timing more likely to pass, but at the cost of higher
>>> resource usage and similar.
>>>
>>> *2: Changing this option tends to break the ability to see
>>> post-implementation resource usage stats.
>>>
>>>
>>> Of these, the Artix seems to have a slightly harder time with passing
>>> timing than either of the Spartan devices, for whatever reason.
>>> However, the Spartans are smaller.
>>>
>>> In theory, they should be about the same speed, or if anything the Artix
>>> should be faster due to having more space and thus better able to route
>>> stuff.
>>>
>>>
>>> As can be noted, whether or not things pass timing is mostly determined
>>> by Vivado. Vivado allows synthesizing for any FPGA it supports, but as
>>> can be noted, unless the FPGA matches the one on the board, the
>>> bitstream doesn't work.
>>>
>>>
>>> The Nexys A7 board has VGA and an SDcard holder, which is fairly useful.
>>> It was ~ $269 when I bought it,
>>>
>>> The CMod-S7 is limited to a microcontroller like configuration with the
>>> BJX2 core, as this board lacks any peripherals or external RAM.
>>>
>>>
>>> Timings I can manage on the BJX2 Core:
>>> 50MHz, works pretty well
>>> 75MHz, can work, prone to fail timing, needs L1 caches reduced to ~ 2K
>>>
>>>
>>> I have the RAM working at either 50 or 75 MHz (internally, RAM
>>> controller logic runs at 100 or 150 MHz, driving the chip at 1/2 the
>>> internal speed).
>>>
>>>
>>> I had another DDR module (DdrB) that runs the DDR chip at 1:1 speeds,
>>> but as noted, it worked in simulation but doesn't seem to work on actual
>>> hardware.
>>>
>>> One difference between the modules is that the 1/2 clocked module can
>>> adjust Data/DQS alignment by 1/4 cycle, whereas the 1:1 module is
>>> limited to 1/2 cycle.
>>>
>>> With DLL enabled (and above the 120MHz minimum), the data should be
>>> correctly aligned with the clock to make this fine-adjustment
>>> unnecessary, but best I could tell, the RAM chip was effectively
>>> becoming non-responsive.
>>>
>> Goes and adds timing constraints to the DDR-related IO pins, after
>> noting that they were absent (without constraints, these paths
>> apparently just get treated as unconstrained)...
>>
>> However, this resulted in basically a crap-storm of failed timing
>> constraints; apparently, the DDR controller module wasn't actually
>> maintaining timing on its IO pins...
>>>
>>> The L2 in the Ringbus design operates at the same speed as the CPU core
>>> (as does the ringbus), and the L2 interfaces with the RAM Module (using
>>> the old/original bus design).
>>>
>>>
>>> The RAM module then contains the glue-logic to step between the external
>>> "master clock" and its internal clock speed. The spot where it steps
>>> clock speeds seems to be prone to random bit-flips.
>>>
>>> I now have logic which basically XORs everything together and currently
>>> generates an 18-bit check value. The OPM and OK signals are also checked
>>> against a bitwise-inverted duplicate (also keyed to the main check
>>> value). The way things are XOR'ed effectively functions like a
>>> horizontal parity over all the bits (probably sufficient for now, may
>>> miss multi-bit errors though).
>>>
>> And, all of this suddenly falling on its face started to indicate that
>> this may not have actually been the source of the random bit flipping
>> to begin with...
>>
>> Rather, poking around with logic in one place was affecting the timing
>> at the IO pins...
>>
>>
>> I have also determined:
>> The 50MHz DDR is seemingly the fastest I can drive it and still pass
>> timing.
>>
>> The 50MHz is actually driving some of the IO pins on a 100MHz clock, but
>> given the way DDR works, it looks like I either need to drive the pins
>> internally at 2x the external speed, or provide multiple clocks which
>> are slightly out of phase to get the required timings.
>>
>>
>> I looked at the datasheets some more, and it turns out ~ 100 MHz is the
>> fastest these IO pins can actually go (in their base form), or a little
>> faster if both posedge and negedge are used.
>>
>>
>> But, it appears that MIG was able to get the pins to go faster via
>> another "secret weapon" trick: SERDES.
>>
>> So, I guess with SERDES, it is possible to feed parallel data at 50MHz
>> in one side, and get 400MHz out the other side. The reverse is also
>> possible with these pins (receiving a fast serial stream as parallel
>> data). The SERDES pins apparently operate at ~ 400MHz to 800MHz.
>>
>> And, as it would so happen, the RAM is connected up to the pins which
>> have this capability, leaning in the direction of this probably being
>> what is going on.
>>
>> Yeah...
>>
>>
>> Though I guess in this case, it would be a tradeoff of:
>> Leave it as-is, and live with RAM at 50MHz (safely), or 75MHz (and
>> technically failing timing);
>> Look into a DDR controller based around using SERDES as well;
>> Being like "screw it" and implementing support for an AXI Bus
>> interface, and then using MIG.
>>
>>
>> Though, I am also left to suspect that the gains I would get from moving
>> to faster speeds would be limited, given I would (then) have to deal
>> with significantly larger CAS and RAS latencies and similar, likely
>> eating most of the gains...
>
> |> Goes and adds timing constraints to the DDR-related IO pins, after
> |> noting that they were absent (without constraints, these paths
> |> apparently just get treated as unconstrained)...
>
> Timing-driven place and route on an FPGA is much different from ASIC:
> the FPGA router puts its effort into routing the paths with the least
> slack and can be very lazy with paths that have plenty of slack.
> It can also duplicate flip-flops or entire paths if that helps.
> My knowledge in this area is limited, but I have seen delays moved onto
> unconstrained IO signals to the point that the IO delays were enormous.
>

Yeah. Stuff was being very inconsistent, and then I added some
constraints and everything blew up, with timing failing in a few places
(such as the Data and DQS pins).

Fixed this, and things are at least a little more consistent.

Stuff is no longer blowing up at random when things are changed, and now
behavior seems to be mostly back into "deterministic" territory.

Operation at 75MHz is still unreliable / prone to break; some test data
showed that 16-bit words were apparently getting duplicated in a way
that implies a Data/DQS timing issue, ... Granted, this involves driving
these pins at 150 MHz, which is apparently out-of-spec when used this
way.

As for why direct-driving the pins is limited to ~ 100MHz while SERDES
can apparently drive the same physical pins at 800MHz, I don't know...
(Unless maybe SERDES uses a separate driver, say one that can drive the
pin a lot faster but at a lower output strength or similar.)


Re: saturating arithmetic, not Signed division by 2^n

<0296c592-a553-66a5-4984-2147c5101cd1@nospam.org>


https://www.novabbs.com/devel/article-flat.php?id=17122&group=comp.arch#17122

Path: i2pn2.org!i2pn.org!aioe.org!MDj0sRZbUF4XYOKsuayNxg.user.gioia.aioe.org.POSTED!not-for-mail
From: reply-to...@nospam.org (Jeremy Linton)
Newsgroups: comp.arch
Subject: Re: saturating arithmetic, not Signed division by 2^n
Date: Mon, 24 May 2021 13:52:03 -0500
Organization: Aioe.org NNTP Server
Lines: 23
Message-ID: <0296c592-a553-66a5-4984-2147c5101cd1@nospam.org>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at>
<888390de-f439-4653-a57c-c0febaa51c8fn@googlegroups.com>
<s7en70$n5v$1@newsreader4.netcologne.de> <s7er6f$2qlr$1@gal.iecc.com>
NNTP-Posting-Host: MDj0sRZbUF4XYOKsuayNxg.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.9.0
Content-Language: en-US
X-Notice: Filtered by postfilter v. 0.9.2
 by: Jeremy Linton - Mon, 24 May 2021 18:52 UTC

On 5/11/2021 3:58 PM, John Levine wrote:
> It appears that Thomas Koenig <tkoenig@netcologne.de> said:
>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>
>>> What percentage of integer {signed and unsigned} arithmetic
>>> uses saturated semantics ?
>>
>> What programming language supports or demands it? I am
>> only aware of the "unsigned wraps around, integer overflow
>> is undefined" variety.
>
> Given that many popular chips have dedicated hardware that does saturated
> arithmetic, there must be something that uses it.

It also seems to be the case that few C/etc. programmers want overflow,
and most would likely be just as happy with saturating operations by
default. The advantage, beyond it just making more sense, is that it
closes a whole class of security holes caused by incorrect bounds
checking.
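
For concreteness, a minimal sketch of what a saturating signed add
means (plain C; many DSPs provide this as a single instruction):

    #include <limits.h>
    #include <stdint.h>

    /* Saturating 32-bit signed add: clamp to INT32_MAX / INT32_MIN
       instead of wrapping on overflow. */
    static int32_t sat_add32(int32_t a, int32_t b)
    {
        int64_t r = (int64_t)a + (int64_t)b;  /* widened, cannot overflow */
        if (r > INT32_MAX) return INT32_MAX;
        if (r < INT32_MIN) return INT32_MIN;
        return (int32_t)r;
    }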

This, like the alias/restrict mess, is something that should probably
be corrected with something similar to "use strict": a switch in a
compilation unit that reverses the standard behavior unless the
programmer explicitly asks for overflow or aliasing.

Re: saturating arithmetic, not Signed division by 2^n

<jwvbl8zx5la.fsf-monnier+comp.arch@gnu.org>


https://www.novabbs.com/devel/article-flat.php?id=17126&group=comp.arch#17126

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Re: saturating arithmetic, not Signed division by 2^n
Date: Mon, 24 May 2021 16:14:03 -0400
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <jwvbl8zx5la.fsf-monnier+comp.arch@gnu.org>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
<2021May11.193250@mips.complang.tuwien.ac.at>
<888390de-f439-4653-a57c-c0febaa51c8fn@googlegroups.com>
<s7en70$n5v$1@newsreader4.netcologne.de> <s7er6f$2qlr$1@gal.iecc.com>
<0296c592-a553-66a5-4984-2147c5101cd1@nospam.org>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="b059f047b32debe1dcc77851a1741db4";
logging-data="610"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+W78GYMpIMVlTJijaeZZvV"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
Cancel-Lock: sha1:G9GOqgBXjtHA5E4Ml2nR+rotVJ0=
sha1:HTxwah7zc3v6Fy64Y/5YBSEKxss=
 by: Stefan Monnier - Mon, 24 May 2021 20:14 UTC

> Its also seems to be the cases that few C/etc programmers want overflow,
> and likely would be just as happy with saturating operations by default.

In my case, the vast majority of uses of something like `int` are for
numbers which should be treated as belonging to the mathematical set of
integers (aka Z).

I just happen to (think I) know that I use them in a way which will
never overflow (e.g. because it's bounded by something like the number
of heap allocated objects or the number of elements in an array).

So ideally I want the compiler to feel free to perform optimizations
that assume things like associativity (even though those optimizations
may cause code which previously overflowed to no longer overflow, and
vice versa). But I'd also like to know when my assumptions were wrong
(i.e. when `int` failed to model the set Z), so I'd like the code to
check for overflows and coredump when one happens. It doesn't have to
coredump exactly when the offending operation executes, but it should
coredump some time between the moment the operation took place and the
next externally-visible effect.
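
One way to approximate that today (a sketch; __builtin_add_overflow is
a GCC/Clang extension, and abort() stands in for "coredump before the
next externally-visible effect"):

    #include <stdlib.h>

    /* Checked add: compute a+b, but abort() if `int` failed to model Z.
       A standard-C version would range-check against INT_MAX/INT_MIN
       before adding. */
    static int checked_add(int a, int b)
    {
        int r;
        if (__builtin_add_overflow(a, b, &r))
            abort();
        return r;
    }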

Stefan
