
devel / comp.arch / Re: 64 bit 68080 CPU

Subject  (Author)
* 64 bit 68080 CPU  (Brett)
+- Re: 64 bit 68080 CPU  (Josh Vanderhoof)
+* Re: 64 bit 68080 CPU  (Quadibloc)
|`* Re: 64 bit 68080 CPU  (BGB)
| +* Re: 64 bit 68080 CPU  (Anton Ertl)
| |`* Re: 64 bit 68080 CPU  (BGB)
| | +* Re: 64 bit 68080 CPU  (MitchAlsup)
| | |`* Re: 64 bit 68080 CPU  (BGB)
| | | `* Re: 64 bit 68080 CPU  (MitchAlsup)
| | |  `- Re: 64 bit 68080 CPU  (BGB)
| | `* Re: 64 bit 68080 CPU  (Anton Ertl)
| |  +- Re: 64 bit 68080 CPU  (BGB)
| |  `- Re: 64 bit 68080 CPU  (Thomas Koenig)
| +* Re: 64 bit 68080 CPU  (Theo)
| |+* Re: 64 bit 68080 CPU  (BGB)
| ||`* Re: 64 bit 68080 CPU  (Torbjorn Lindgren)
| || `* Re: 64 bit 68080 CPU  (BGB)
| ||  `* Re: 64 bit 68080 CPU  (robf...@gmail.com)
| ||   `- Re: 64 bit 68080 CPU  (BGB)
| |`- Re: 64 bit 68080 CPU  (Michael S)
| +* Re: 64 bit 68080 CPU  (EricP)
| |`- Re: 64 bit 68080 CPU  (BGB)
| `* Re: 64 bit 68080 CPU  (MitchAlsup)
|  +* Re: 64 bit 68080 CPU  (Marcus)
|  |+- Re: 64 bit 68080 CPU  (BGB)
|  |`* Re: 64 bit 68080 CPU  (MitchAlsup)
|  | `* Re: 64 bit 68080 CPU  (Marcus)
|  |  +- Re: 64 bit 68080 CPU  (MitchAlsup)
|  |  `* Re: 64 bit 68080 CPU  (BGB)
|  |   `* Re: 64 bit 68080 CPU  (MitchAlsup)
|  |    `- Re: 64 bit 68080 CPU  (BGB)
|  `* Re: 64 bit 68080 CPU  (EricP)
|   +- Re: 64 bit 68080 CPU  (MitchAlsup)
|   `- Re: 64 bit 68080 CPU  (BGB)
`- Re: 64 bit 68080 CPU  (John Dallman)

Re: 64 bit 68080 CPU

<624d1103-327b-44ce-a94c-cb8ce65298ddn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27027&group=comp.arch#27027

 by: MitchAlsup - Wed, 3 Aug 2022 18:14 UTC

On Wednesday, August 3, 2022 at 2:41:42 AM UTC-5, Marcus wrote:
> On 2022-08-02, MitchAlsup wrote:
> > On Tuesday, August 2, 2022 at 3:54:38 PM UTC-5, Marcus wrote:
> >> On 2022-08-01, MitchAlsup wrote:
> >>> On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
> >>>>
> >>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >>>> how to do Reg/Mem ops without introducing significant complexities or
> >>>> penalty cases?...
> >>>>
> >>> You build a pipeline which has a pre-execute stage (which calculates
> >>> AGEN) and then a stage or 2 of cache access, and then you get to the
> >>> normal execute---writeback part of the pipeline. I have called this the
> >>> 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
> >>> machines used such a pipeline.
> > <
> >> Don't you need two data access points along such a pipeline?
> > <
> > Sorry cannot parse your question.
> > <
> > But the AGEN unit at the front of the pipeline is speculative, and all
> > actual calculations are done after inbound memory references (if any)
> > have shown up.
> Sorry, I meant: In the pre-execute stages you need to read memory
> operands, no? And later in the pipeline you need to write to (or read
> from?) memory? Thus you would need (at least) two concurrent ports to
> the L1D$?
<
No !! There is a trick to all of this:
<
Consider the cache as {tag, TLB, data}
LDs read {tag, TLB, data} in the early stage
STs read {tag, TLB} in the early stage and use {data} in the later stage.
Anytime there is not an AGEN or a ST in the early stage, the later ST
stage can use {data}.
<
This, BTW, is the HP store pipeline patented circa 1988, just applied
to LD-Op pipeline design.
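To make the port-sharing rule concrete, here is a minimal C sketch of the
scheme described above, assuming a single-ported {data} array and a
one-deep store buffer (all names are illustrative, not from any real design):

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool early_is_mem;  /* an AGEN (LD or ST) occupies the early stage */
    bool late_st_ready; /* a buffered ST is waiting to write {data}    */
} Cycle;

/* "Anytime there is not an AGEN or a ST in the early stage,
   the later ST stage can use {data}." */
static bool late_store_may_write(const Cycle *c)
{
    return c->late_st_ready && !c->early_is_mem;
}

int main(void)
{
    Cycle trace[] = {
        { true,  false }, /* LD in early stage: data port busy          */
        { true,  true  }, /* another LD: the buffered ST keeps waiting  */
        { false, true  }, /* bubble in early stage: ST drains to {data} */
    };
    for (int i = 0; i < 3; i++)
        printf("cycle %d: late ST %s\n", i,
               late_store_may_write(&trace[i]) ? "writes {data}" : "waits");
    return 0;
}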

Re: 64 bit 68080 CPU

<20cd5f9a-9992-4994-b5f9-dc05ff44d893n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27028&group=comp.arch#27028

 by: MitchAlsup - Wed, 3 Aug 2022 18:20 UTC

On Wednesday, August 3, 2022 at 11:33:58 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
> >> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >> how to do Reg/Mem ops without introducing significant complexities or
> >> penalty cases?...
> >>
> > You build a pipeline which has a pre-execute stage (which calculates
> > AGEN) and then a stage or 2 of cache access, and then you get to the
> > normal execute---writeback part of the pipeline. I have called this the
> > 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
> > machines used such a pipeline.
> The thing is that a fixed layout pipeline can only accommodate
> the specific situations that it is designed for.
> Forwarding allows limited topological rearrangement.
> Putting an extra AGEN at an early pipeline stage makes
<
It is not "extra": there is only 1 AGEN, and it is in the early part of the pipe.
Only through AGEN can you access {tag, TLB}. There is no second AGEN
in EXECUTE; by then you have already "missed" the cache access port.
<
> all uOps perform an extra stage, which adds extra latency
> that makes it costlier to fill in bubbles.
<
One of the obvious drawbacks. TANSTAAFL.
>
> A pipeline might dynamically rearrange while maintaining In-Order (InO)
> simplicity such that can fill in bubbles as best as possible.
> (I have a mental picture of a dynamic Pert chart.
> It is not OoO but it does allow concurrency to fill in bubbles.)
<
A GBOoO machine would not want such a pipeline.
>
> For example, a ST with immediate data and immediate address doesn't
> need either a RR Register Read stage or AGEN, and can go straight
> from Decode to LSU.
<
What if there is another memory reference in the cycle preceding
ST #7,[global_a]? Even though there is no arithmetic that needs to occur,
there are pipeline "events" to monitor, and in the end it is easier to let
the pipeline "flow" naturally.
<
> That ST can launch concurrent with an earlier
> RR-ALU uOp, or a following RR-ALU op can launch concurrent with ST.
> ST doesn't need the WB stage so a subsequent uOp can use that stage.

Re: 64 bit 68080 CPU

<tceem6$2d57u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27029&group=comp.arch#27029

 by: BGB - Wed, 3 Aug 2022 18:27 UTC

On 8/3/2022 2:41 AM, Marcus wrote:
> Sorry, I meant: In the pre-execute stages you need to read memory
> operands, no? And later in the pipeline you need to write to (or read
> from?) memory? Thus you would need (at least) two concurrent ports to
> the L1D$?
>

Not necessarily.

While a Store also needs the data from a Load, in principle one could
delay the store part of the mechanism by N clock cycles and allow the
value for the Store part to arrive from the main pipeline N cycles after
the Load part.

Main issue is that this creates some potential memory consistency issues:
What happens if a following instruction would lead to an L1 miss?
What if one tries to access the line again before it has stored?
...

Scenario, two consecutive stores to the same index,
Say, 01100100 and 01120100:
First store, Fetches index 008;
Second store, Misses on 008, Loads a different cache line;
First store writes its result to index 008;
Second store writes its result to index 008;
The result of the first store is lost.

Scenario, two consecutive stores to the same line,
Say, 01100100 and 01100108:
First store, Fetches index 008;
Second store, Fetches index 008 (No miss this time);
First store writes its result to index 008;
Second store writes its result to index 008;
However, this line is stale, lacking the prior store;
The result of the first store is lost.

Scenario, store followed by load of same address,
Say, both 01100100:
Store, Fetches index 008;
Load, Fetches index 008 (No miss this time);
Store writes its result to index 008;
But, Load has a stale value.

Simplest option being to generate an interlock stall if the new
instruction would fall into the same spot in the L1 cache as an
in-flight store, but this is "not ideal" for performance (things like
stack spill and fill are likely to perform poorly in this case; these often
involve mixed loads and stores to adjacent memory addresses).

One could instead have an "Early Store" and a "Late Store" (would only
require interlocking on a "Late Store"), but this creates a new problem:
What if a prior "Late Store" and a following "Early Store" happen to
land on the same clock cycle? This is also not good.

These cases could be handled by a bunch of special case interlock
checks, but this is not ideal.
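As a concrete illustration, a minimal C sketch of the same-index interlock
check, assuming a direct-mapped L1 with 32-byte lines and 256 sets (which
makes the example addresses 01100100 and 01120100 above both map to index
008; names and geometry are illustrative, not from the actual Verilog):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS 5u  /* assumed 32-byte lines */
#define SET_BITS  8u  /* assumed 256 sets      */

static inline uint32_t l1_index(uint32_t addr)
{
    return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1u);
}

/* The delayed "store part" still in flight occupies one slot; stall any
   new access that would land on the same L1 spot before it drains. */
static bool must_stall(uint32_t new_addr,
                       bool st_in_flight, uint32_t st_addr)
{
    return st_in_flight && (l1_index(new_addr) == l1_index(st_addr));
}

With this check, the second access in each of the scenarios above stalls
until the in-flight store has written back, at the performance cost noted.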

....

If designing an ISA, one possible option is to have only LoadOp as a
generic case, but then StoreOp only for simple ALU instructions. In this
case, these ALU ops could be put into the L1 cache directly, and the
Late Store mechanism and scenario could be eliminated (or, at least,
reduced to a 1 cycle delay).

This could be done while adding roughly 1 cycle of pipeline latency
on-average vs a pure Load/Store design.

At least from superficial checks, x86 seems to mostly fit this latter
pattern.

However, x86 has the drawback that pretty much every ALU instruction
updates status bits in EFLAGS/rFLAGS, which seems like a bit of a
hassle, since one would have to route in the flags updates from several
different areas.

Though, a likely option would be that, rather than routing the bits
linearly through all the ALUs, they could be expanded along each path
to, say:
00: Bit is clear (No Change)
01: Bit is set (No Change)
10: Clear the bit (Changed)
11: Set the bit (Changed)

In which case, the different paths can "update" the flags bits
independent of each other.
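In C terms, the per-flag lattice might look like this (a sketch of the idea
only, not any particular implementation):

#include <stdint.h>

enum {
    FL_CLEAR   = 0, /* 00: bit is clear (no change) */
    FL_SET     = 1, /* 01: bit is set   (no change) */
    FL_CLR_CHG = 2, /* 10: clear the bit (changed)  */
    FL_SET_CHG = 3  /* 11: set the bit   (changed)  */
};

/* Each path reports one 2-bit state per flag; merging the paths just
   lets any "changed" state win over an unchanged one.
   e.g. merge_flag(FL_SET, FL_CLR_CHG) == FL_CLR_CHG */
static uint8_t merge_flag(uint8_t incoming, uint8_t path)
{
    return (path & 2) ? path : incoming;
}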

As can be noted, in BJX2 I have partly ended up with something a little
more limited:
A few LoadOp and StoreOp cases being shoved into the L1 cache;
A few LoadOp style instructions which shim in the "Op" part onto the end
of the "Load" mechanism.

This avoids adding extra latency, but makes these instructions only able
to exist as special cases, and generally limited to 1-cycle operations.

Previous examples:
FMOV.S (Rm, Disp), Rn //Load Binary32 -> Binary64
FMOV.H (Rm, Disp), Rn //Load Binary16 -> Binary64
LDTEX (Rm, Ri), Rn //Load from Texture

Newer cases (with ALU in L1):
ADDS.L (...), Rn //LoadOp, Rn=Rn+Mem
ADDU.L (...), Rn //LoadOp
SUBS.L (...), Rn //LoadOp, Rn=Mem-Rn
SUBU.L (...), Rn //LoadOp
RSUBS.L (...), Rn //LoadOp, Rn=Rn-Mem
RSUBU.L (...), Rn //LoadOp

ADDS.L Rn, (...) //StoreOp, Mem=Mem+Rn
ADDU.L Rn, (...) //StoreOp
SUBS.L Rn, (...) //StoreOp, Mem=Mem-Rn
SUBU.L Rn, (...) //StoreOp

ADDS.L Imm6u, (...) //StoreOp, Mem=Mem+Imm6u
ADDU.L Imm6u, (...) //StoreOp
SUBS.L Imm6u, (...) //StoreOp, Mem=Mem-Imm6u
SUBU.L Imm6u, (...) //StoreOp

AND.L (...), Rn //LoadOp, Rn=Mem&Rn
AND.L Rn, (...) //StoreOp, Mem=Mem&Rn
AND.L Imm6u, (...) //StoreOp, Mem=Mem&Imm6u

XCHG.L (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn
XCHG.Q (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn

...

Also adds new encodings for:
MOV.L Imm6u, (...) //Store Imm6 encodings.
MOV.L Imm6n, (...) //XCHG Imm6 encodings.
...

The encodings cover Byte, Word, DWord, and QWord operations (with signed
and unsigned variants for LoadOp).

Will assume Byte and Word will also work (this is more a hassle of
adding all of these cases to my compiler and emulator than an issue
for the Verilog). So, a fairly minor change to the Verilog, but
expanding it out would add a big chunk of new instructions and encodings
to the listing (annoyingly, listing would be a lot longer if I fully
expanded out all of the encodings which exist due to internal
combinations of features); along with a whole bunch of new instruction
mnemonics to deal with it.

Will not cover immediate values bigger than 6 bits, but OTOH, if one
needs an immediate bigger than this, loading a constant into a register
isn't all that expensive.

I am still on the fence about whether QWORD ops should be supported here
(mostly due to the higher cost and latency of 64-bit ADD/SUB vs 32-bit),
but it makes sense for things like pointer increment/decrement (though,
these are more likely to be in registers, as one tends to be less likely
to increment a pointer without otherwise interacting with it in some
other way, such as a pointer dereference or similar).

Implicitly, this extension will (presumably) have the prior RiMOV
extension as a prerequisite, so if one gets this, they also get the (Rm,
Ri*Sc, Disp) addressing mode and similar.

For now, I will consider all this to be an "experimental" extension.


Re: 64 bit 68080 CPU

<88ddeeb3-5632-419c-8d88-a2097165b3ban@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27030&group=comp.arch#27030

 by: MitchAlsup - Wed, 3 Aug 2022 19:46 UTC

On Wednesday, August 3, 2022 at 1:27:21 PM UTC-5, BGB wrote:
> ADDS.L Rn, (...) //StoreOp, Mem=Mem+Rn
> ADDU.L Rn, (...) //StoreOp
> SUBS.L Rn, (...) //StoreOp, Mem=Mem-Rn
> SUBU.L Rn, (...) //StoreOp
<
Why not:
<
RSUBS.L Rn,(...) // StoreOp Mem=Rn-Mem
RSUBU.L Rn,(...) // StoreOp
<
??
>
> ADDS.L Imm6u, (...) //StoreOp, Mem=Mem+Imm6u
> ADDU.L Imm6u, (...) //StoreOp
> SUBS.L Imm6u, (...) //StoreOp, Mem=Mem-Imm6u
> SUBU.L Imm6u, (...) //StoreOp
<
Why not:
<
RSUBS.L Imm6u,(...) // StoreOp Mem=Imm6u-Mem
RSUBU.L Imm6u,(...) // StoreOp
<
??
> The encodings cover Byte, Word, DWord, and QWord operations (with signed
> and unsigned variants for LoadOp).
<
It is this great waste of entropy which caused these kinds of ISAs to
drop out of favor.


Re: 64 bit 68080 CPU

<tceplk$2g08q$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27033&group=comp.arch#27033

 by: BGB - Wed, 3 Aug 2022 21:34 UTC

On 8/3/2022 11:32 AM, EricP wrote:
> The thing is that a fixed layout pipeline can only accommodate
> the specific situations that it is designed for.
> Forwarding allows limited topological rearrangement.
> Putting an extra AGEN at an early pipeline stage makes
> all uOps perform an extra stage, which adds extra latency
> that makes it costlier to fill in bubbles.
>

This is a big drawback:
Even the best cases for generic LoadOp add latency.

At least found a way to support a few of these cases (in a non-generic
way) in my case without adding a latency penalty (or changing the
pipeline). Maybe not ideal for timing, but sorta works.

Would make a lot more sense for an ISA like x86, where presumably one is
already prepared to pay these penalties for an in-order core as an
artifact of the ISA design.

> A pipeline might dynamically rearrange while maintaining In-Order (InO)
> simplicity such that can fill in bubbles as best as possible.
> (I have a mental picture of a dynamic Pert chart.
> It is not OoO but it does allow concurrency to fill in bubbles.)
>
> For example, a ST with immediate data and immediate address doesn't
> need either a RR Register Read stage or AGEN, and can go straight
> from Decode to LSU. That ST can launch concurrent with an earlier
> RR-ALU uOp, or a following RR-ALU op can launch concurrent with ST.
> ST doesn't need the WB stage so a subsequent uOp can use that stage.
>

Dunno there.

Ironically, I am seemingly gradually getting closer to how one would
design a "reasonable cost" x86 core...

Main remaining ugly part is mostly the instruction decoder, which is
likely mostly a manner of having a bunch of duplicated logic for "What
if a Mod/Rm byte happens right here?".

I am almost to the point where I could try to implement such a thing if
I wanted to (and, ironically, an x86-64 core may well be cheaper than an
IA-64 core, on account of IA-64's stupidly large register file).

Much less confident about performance though.
It is like a bit of an enigma:
Sometimes, x86's performance is unexpectedly fast;
Sometimes, it is meh, or downright terrible (*).

*: Like the early versions of the Atom (such as in an ASUS Eee), where a
RasPi can seemingly run circles around it in terms of general performance.

Meanwhile, the original MS-DOS builds of Doom are seemingly unexpectedly
fast, whereas x86 versions based on back-ports of the
Linuxdoom source release seem to be dragging around a boat anchor
(in comparison). Well, then there is Hexen which, despite being based on
the Doom engine, manages to somehow be almost as slow as Quake.

I could still consider trying to write an x86 emulator on top of BJX2; I
wrote a basic x86 emulator once before (which in a way was the origin of a
lot of the designs used in my later emulators and interpreters).

Basically, previous to this, many of my interpreters had used a fairly
naive strategy:
Decode an instruction;
Execute an instruction;
Decode next instruction;
Execute next instruction;
...
Generally spinning in a loop (which then bottle-necked on the "decode
and dispatch" part of the process).

When I first tried writing an x86 emulator, this was no longer a
workable strategy, as the logic for pattern matching the instructions
was way too slow for handling this part inline.

So, solution:
Decode a trace of instructions in advance, and have each instruction as
a struct with "do your thing" function pointer (mostly eliminating the
decoding cost from the running execution time).

Then, there was another trick of detecting cases where future
instructions in a trace would mask out the previous instructions' EFLAGS
updates, allowing them to be replaced with faster "non EFLAGS updating"
variants (worked OK, as most of the EFLAGS updates were being masked).
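A compressed C sketch of both tricks (pre-decoded traces of structs with a
"do your thing" function pointer, plus a pass that masks dead EFLAGS
updates); the field names and the liveness pass are assumptions for
illustration, not the actual emulator code:

#include <stdint.h>
#include <stddef.h>

typedef struct Insn Insn;
struct Insn {
    void   (*run)(Insn *);         /* "do your thing" handler, set at decode */
    void   (*run_noflags)(Insn *); /* faster variant without EFLAGS update   */
    uint32_t flags_written;        /* EFLAGS bits this op defines            */
    uint32_t flags_read;           /* EFLAGS bits this op consumes           */
};

/* Backward scan over one decoded trace: if every flag an op writes is
   overwritten again before any later op reads it, swap in the faster
   non-flags-updating handler. */
static void elide_dead_flag_updates(Insn *trace, size_t n)
{
    uint32_t live = ~0u;           /* assume all flags live at trace exit */
    for (size_t i = n; i-- > 0; ) {
        Insn *op = &trace[i];
        if (op->run_noflags && (op->flags_written & live) == 0)
            op->run = op->run_noflags;
        live = (live & ~op->flags_written) | op->flags_read;
    }
}

Execution then reduces to chasing the run pointers, much as in the trace
loop quoted further down.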

However, I have doubts this would be sufficient to give usable
performance on an otherwise already slow CPU core.

But, I suspect I could probably do better this time around, if I got
around to it (in later emulators, I had realized a few more potentially
relevant tricks).

At this point, even on my PC, my BJX2 emulator seemingly can't maintain
real-time emulation much over ~ 150MHz; seemingly mostly bogged down
with handling Load/Store operations.

Apart from stuff related to Load/Store handling, no other "major
hotspots" in the profiler.

There are special lookup hint cases to speed up Load/Store cases (eg:
caching previously accessed page addresses and pointers, as a sort of
small emulator-side TLB cache, vs always needing to go through the main
TLB and looking up a memory span), but these aren't really sufficient to
fully defeat this issue.
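A minimal sketch of such a lookup-hint path in C (page size, table size,
and all names are assumptions; slow_translate stands in for the full TLB
and memory-span lookup):

#include <stdint.h>

#define PAGE_SHIFT 14u                 /* assumed 16K pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define HINT_SLOTS 64u

typedef struct { uint64_t gpage; uint8_t *host; } PageHint;
static PageHint hints[HINT_SLOTS];

static uint8_t guest_ram[1u << 20];    /* stand-in backing store */
static uint8_t *slow_translate(uint64_t gaddr)
{
    return &guest_ram[gaddr % sizeof(guest_ram)];
}

/* Fast path for Load/Store: check the hint table before doing the full
   TLB walk; on a miss, refill the slot via the slow path. */
static uint8_t *mem_ptr(uint64_t gaddr)
{
    uint64_t  gpage = gaddr >> PAGE_SHIFT;
    PageHint *h     = &hints[gpage % HINT_SLOTS];
    if (h->host == NULL || h->gpage != gpage) {
        h->gpage = gpage;
        h->host  = slow_translate(gaddr & ~(uint64_t)(PAGE_SIZE - 1u));
    }
    return h->host + (gaddr & (PAGE_SIZE - 1u));
}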

The main trace loop:
while(tr && (cyc>0))                    //until trace list or cycle budget runs out
{ cyc-=tr->n_cyc; tr=tr->Run(tr); }     //charge this trace's cycles, run it, follow to the next

Is still making a showing in the profile, implying the emulator is still
running at close to full speed (and a lot of the longer trace dispatch
functions are still making a showing, so the problem isn't mostly of
being overrun with overly short instruction traces).

....

However, granted, for normal use the emulator just needs to be faster than
what the FPGA version can do.

Kinda annoying on a RasPi though, as at present the RasPi can't emulate
the BJX2 core at much faster than about 16 MHz, and my previous attempts
to use a JIT compiler (and NEON instructions) to improve performance on
ARM were decidedly unsuccessful.

Granted, my attempts at running TKRA-GL on a RasPi "weren't so hot"
either; it seemingly kinda really sucks at it (much as doing so on an
early 2000s laptop). Neither is really fast enough to give a particularly
usable GLQuake experience with software GL, despite having a fairly
significant clock-speed advantage.

Extrapolating backwards, this would imply that trying to run software
rasterized OpenGL or similar on a 486 or similar would have been
borderline glacial.

But, it seems like my current ISA is putting up more of a fight than
previous ISAs at "emulate ISA at a higher clock speed".

However, most of the previous ISAs:
Didn't have to account for bundling:
Superscalar would have a similar effect here;
More ops have a "free ride" as per the clock-cycle accounting.
Lower density of Load/Store ops:
Limited address modes, and needing ALU ops for address calcs, ...
This would make "cheap to emulate" ops more common.

In this case, the emulator is seemingly running into a limit of not
being able to get much under around 20 (real) clock-cycles per emulated
instruction (at least, with a high density of Load/Store instructions).

Though, as long as it stays below an average of around 60 clock-cycles
per emulated instruction, this is fast enough to keep up with real-time
emulation of a 50MHz CPU core.
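Spelling out where that budget comes from (assuming a ~3 GHz host PC,
which is an assumption here, and one emulated instruction per cycle):

#include <stdio.h>

int main(void)
{
    double host_hz   = 3.0e9;   /* assumed host clock        */
    double target_hz = 50.0e6;  /* emulated core at 50MHz    */
    printf("budget: %.0f host cycles per emulated insn\n",
           host_hz / target_hz);    /* -> 60 */
    return 0;
}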

Though, if emulating x86 on BJX2, there would at least be the advantage
that I could map the x86 virtual address space into the BJX2 virtual
address space, and then make use of the hardware MMU for most of the
address translation (even if the reverse isn't really true).

But, still have serious doubts as to whether it could be fast enough that
something like Doom would be playable with it. Then again, seemingly the
RasPi also fails at this task.

Neither QEMU nor DOSBox on a RasPi gives usable performance for playing
Doom (it was basically a slide show when I tested it).

Running Doom in my BJX2 emulator on a RasPi still somehow manages to
give more playable performance than the RasPi port of DOSBox.

Well, and the relative oddity that Software GL in the emulator isn't all
that much slower than trying to do it using a natively compiled version.

This is very much unlike Doom; which is apparently able to run crazy
fast in a native ARM build.

Performance is weird sometimes...

....

Re: 64 bit 68080 CPU

<tceq5r$2g4h9$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27034&group=comp.arch#27034

 by: BGB - Wed, 3 Aug 2022 21:43 UTC

On 8/3/2022 2:46 PM, MitchAlsup wrote:
>> ADDS.L Rn, (...) //StoreOp, Mem=Mem+Rn
>> ADDU.L Rn, (...) //StoreOp
>> SUBS.L Rn, (...) //StoreOp, Mem=Mem-Rn
>> SUBU.L Rn, (...) //StoreOp
> <
> Why not:
> <
> RSUBS.L Rn,(...) // StoreOp Mem=Rn-Mem
> RSUBU.L Rn,(...) // StoreOp
> <
> ??

Actually, these cases should exist as well, just didn't think to mention it.

>>
>> ADDS.L Imm6u, (...) //StoreOp, Mem=Mem+Imm6u
>> ADDU.L Imm6u, (...) //StoreOp
>> SUBS.L Imm6u, (...) //StoreOp, Mem=Mem-Imm6u
>> SUBU.L Imm6u, (...) //StoreOp
> <
> Why not:
> <
> RSUBS.L Imm6u,(...) // StoreOp Mem=Imm6u-Mem
> RSUBU.L Imm6u,(...) // StoreOp
> <
> ??

Likewise, this wasn't exactly an exhaustive list of every new
"instruction" which pops into existence as a side effect of this feature...

>> The encodings cover Byte, Word, DWord, and QWord operations (with signed
>> and unsigned variants for LoadOp).
> <
> It is this great waste of entropy which caused these kinds of ISAs to
> drop out of favor.

Given these were a hack onto the RiMOV encodings (already using a 64-bit
instruction format), the added entropy cost was at least in a part of
the space where it didn't eat into the 32-bit encoding space.

There is not currently any plan to migrate these encodings into the
32-bit part of the encoding space.

But, yeah, something like:
AND.B 63, (R4, 0)
Would take 8 bytes to encode, unlike x86, where its equivalent could be
encoded in 3 bytes.



Re: 64 bit 68080 CPU

<tcf184$2htmd$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27035&group=comp.arch#27035

 by: Torbjorn Lindgren - Wed, 3 Aug 2022 23:44 UTC

BGB <cr88192@gmail.com> wrote:
>On 8/1/2022 4:06 AM, Theo wrote:
>> BGB <cr88192@gmail.com> wrote:
>>> But, what sort of FPGA, exactly?...
>>
>> Cyclone V 5CEFA5F23C, 77k LE:
>> http://www.apollo-computer.com/icedrakev4.php
>
> From what I can gather, vs an XC7A100T:
> Fewer LUTs (kinda hard to compare directly here);
> Less Block RAM;
> More IO pins;
> Faster maximum internal clock speed.
>
[...]
>> Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
>> pretty antique as far as Arm cores go). Those are pretty comparable with
>> the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
>> there's no transceivers and no Arm, hence it's at the cheap end of the line
>> (the A5 meaning 77k LE is the middle of the range).
>>
>> https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html
>>
>> This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).

Yeah, the A2 to A9 summary[1] lists memory as Mb/Gb/Gb/Mb/Mb, this is
clearly wrong (and I confirmed it's all Mb via other sources).

The part they use is the 5CEA5; if we follow that link we get to the
Intel ARK page which includes lots of different order codes, three of
them are "5CEFA5F23C" models (different speed grades in the same
package & pin-out, C6 is the fastest).

>> The I/Os are typically good to drive DDR3, which is what the Arm uses for
>> DRAM.
>
>On Artix, one is typically using FPGA logic and IO to drive the DDR.
>In my case, I am driving the DDR2 module at 50MHz on the board I have.

According to the Intel ARK page[2] the 5CEA5 model has "hard memory
controllers" which support DDR2, DDR3 and LPDDR2. The Product Table
confirms this and also reveals the A5 (and up) actually has two "hard
memory controller (FPGA)".

>Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
>in this case, and this is pushing the limits of the IO pin speeds.

If I'm reading Intel's External Memory Interface Spec Estimator
[4] correctly, it reveals that Cyclone V's memory channels are "up to
40-bit" (IE like mobile, not like PC) and that these models can do up
to 80-bit (confirming the dual-channel).

It looks like the faster C6 & C7 grades can run DDR3-800 (400MHz) if
you restrict it to a single chip select, though it needs DDR3-1066
rated memory chips for that due to an errata. The slower C8 only
supports up to DDR3-666 using DDR3-800 rated memory chips. Speed with
two chip selects is a bit lower (DDR3-666 for C6/C7, DDR3-606 for C8).

It's possible Intel wrote MHz and meant MT/s; if so, the numbers would
be half this, but 800MT is AFAIK the slowest JEDEC DDR3 standard speed,
which hints these are the correct numbers.

The figures for DDR2 are the same (800MT/s) and LPDDR2 is a bit slower
(666MT/s, can't do 2 chip selects), these speeds are also completely
reasonable for DDR2 and LPDDR2. I suspect the reason for supporting
three different memory interfaces is to give the designers more
choices in memory sizes.

800MT/s and a 64-bit/8-byte wide memory interface gives us a best case
total "interface speed" of 6.4 GB/s. Obviously it will never HIT that
but if the implementation is competent there's no reason it couldn't
be capable of 3-5 GB/s.
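The arithmetic above, as a tiny self-contained check:

#include <stdio.h>

int main(void)
{
    double xfers_per_s = 800e6; /* DDR3-800: 800 MT/s             */
    double bytes_per_x = 8.0;   /* 64-bit interface = 8 bytes/xfer */
    printf("peak = %.1f GB/s\n",
           xfers_per_s * bytes_per_x / 1e9);   /* -> 6.4 GB/s */
    return 0;
}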

Given that all the "68080" boards I found have 512MB of (soldered)
memory that seems like it should be sufficient bandwidth assuming they
have L1 and L2 caches on the CPU which seems very likely.

AFAIK your core has order(s?) of magnitude less memory bandwidth than
this (the 80MHz above also hints at that).

Hmmm! Looks like the Cyclone V is the ONLY low-end Intel/Altera FPGA
with hard memory controllers; even some of the mid-range models
don't have them. So those have to rely on much slower soft memory
controllers.

So perhaps that's the secret sauce!

>> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
>> their distributor is almost certainly less.

It says "non-stocked", so I suspect all this would do is put in an
order with their supplier, and then Mouser comes back to you with some
kind of estimate (or "when we get it"!) - I'd definitely check with
them before ordering any non-stocked parts.

The fact that there are suppliers in Asia that want $450 and $650 for it
respectively, and provide NO volume discount at all, kind of suggests
availability via "proper" channels may be very limited.

Certainly the Cyclone V SE A6 (SE is the SOC variant with lots of
extra hard stuff) that the Terasic DE10-Nano uses is in very short
supply - the MISTer game FPGA system uses this, so I know the lead
times are very long.

1. https://ark.intel.com/content/www/us/en/ark/products/series/141579/cyclone-v-e-fpga.html
2. https://ark.intel.com/content/www/us/en/ark/products/210443/cyclone-v-5cea5-fpga.html
3. https://www.intel.com/content/dam/support/us/en/programmable/support-resources/bulk-container/pdfs/literature/pt/cyclone-v-product-table.pdf
4. https://www.intel.com/content/www/us/en/support/programmable/support-resources/support-centers/emif-spec-estimator.html

Re: 64 bit 68080 CPU

<tcfd6h$2ngkv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27036&group=comp.arch#27036

 by: BGB - Thu, 4 Aug 2022 03:07 UTC

On 8/3/2022 6:44 PM, Torbjorn Lindgren wrote:
> Yeah, the A2 to A9 summary[1] lists memory as Mb/Gb/Gb/Mb/Mb, this is
> clearly wrong (and I confirmed it's all Mb via other sources).
>

Yeah, probably nothing in this category is going to have Gb of Block RAM...

> The part they use is the 5CEA5; if we follow that link we get to the
> Intel ARK page which includes lots of different order codes, three of
> them are "5CEFA5F23C" models (different speed grades in the same
> package & pin-out, C6 is the fastest).
>

The FPGA I am using is a "-1" speed grade, which in Artix-7 terms, is
basically the slowest.

>
>>> The I/Os are typically good to drive DDR3, which is what the Arm uses for
>>> DRAM.
>>
>> On Artix, one is typically using FPGA logic and IO to drive the DDR.
>> In my case, I am driving the DDR2 module at 50MHz on the board I have.
>
> According to the Intel ARK page[2] the 5CEA5 model has "hard memory
> controllers" which support DDR2, DDR3 and LPDDR2. The Product Table
> confirms this and also reveals the A5 (and up) actually has two "hard
> memory controller (FPGA)".
>

Yes, but no hard controllers on Artix-7, where one only has soft
controllers.

Usual idea is that Xilinx wants people to use Vivado's MIG tool, but
then one would need to deal with AXI.

In theory, one could use SERDES (the RAM being connected up to SERDES
capable pins), but the specifics of how to use it are a bit sparse, and
most of the "official" stuff here amounts to "Instantiate these random
IP Cores from the IP Catalog...".

Me: "How about NO."
Don't really want "IP Cores", nor do I necessarily want to deal with AXI.

I would prefer it if they bothered to actually document their
FPGA primitives, vs just endlessly saying "Invoke X from the IP
Catalog Wizard"...

In my case, I am mostly testing stuff in simulations using Verilator,
with the testbenches using a mock-up of the RAM chip based on
descriptions from various RAM module datasheets and similar (which was,
luckily, apparently accurate enough to allow interfacing with the actual
RAM chips; however, the actual standards here are located behind JEDEC
paywalls, so I don't have these...).
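
As a rough sketch of the idea (illustrative only; the module and port
names here are made up, and this is nowhere near a full DDR2 model),
such a mock is conceptually something like:

   // Bare-bones behavioral stand-in for a RAM chip in a Verilator
   // testbench. A real mock also has to model bank/row state, CAS
   // latency, and burst behavior per the module datasheet.
   module mock_ram #(
       parameter ABITS = 20,
       parameter DBITS = 16
   )(
       input  wire             clk,
       input  wire             cs,     // chip select
       input  wire             we,     // write enable
       input  wire [ABITS-1:0] addr,
       input  wire [DBITS-1:0] wdata,
       output reg  [DBITS-1:0] rdata
   );
       reg [DBITS-1:0] mem[0:(1<<ABITS)-1];

       always @(posedge clk) begin
           if (cs) begin
               if (we)
                   mem[addr] <= wdata;
               rdata <= mem[addr];   // 1-cycle read latency
           end
       end
   endmodule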

>
>> Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
>> in this case, and this is pushing the limits of the IO pin speeds.
>
> If I'm reading Intel's External Memory Interface Spec Estimator
> [4] correctly, it reveals that Cyclone V's memory channels are "up to
> 40-bit" (i.e. like mobile, not like PC) and that these models can do up
> to 80-bit (confirming the dual-channel).
>

Didn't mention it, but the board I am using has a 16-bit RAM interface
(128MB DDR2, 16-bit, 15ns CAS latency).

Some other boards are using 8-bit DDR (32MB or 64MB modules), and some
lower-end boards are using 4-bit QSPI (512K or 1024K). This sort of
thing seems more common with XC7S25 and XC7A35T based boards.

But, running a 16-bit RAM module at 50 MHz doesn't give all that much
RAM bandwidth, even in the best case...

I am effectively running the RAM module in a low-power standby mode (DLL
disabled, with 3-1-0 timings, ...).

This isn't really a proper way to use the chip, but seems to work in
this case.

> It looks like the faster C6 & C7 grades can run DDR3-800 (400MHz) if
> you restrict it to a single chip select, though it needs DDR3-1066
> rated memory chips for that due to an errata. The slower C8 only
> supports up to DDR3-666 using DDR3-800 rated memory chips. Speed with
> two chip selects is a bit lower (DDR3-666 for C6/C7, DDR3-606 for C8).
>
> It's possible Intel wrote MHz and meant MT/s; if so, the numbers would
> be half this, but 800 MT/s is AFAIK the slowest JEDEC DDR3 standard
> speed, which hints these are the correct numbers.
>
> The figures for DDR2 are the same (800 MT/s) and LPDDR2 is a bit
> slower (666 MT/s, and can't do 2 chip selects); these speeds are also
> completely reasonable for DDR2 and LPDDR2. I suspect the reason for
> supporting three different memory interfaces is to give the designers
> more choices in memory sizes.
>
> 800 MT/s and a 64-bit/8-byte wide memory interface gives us a
> best-case total "interface speed" of 6.4 GB/s. Obviously it will never
> HIT that, but if the implementation is competent there's no reason it
> couldn't be capable of 3-5 GB/s.
>
> Given that all the "68080" boards I found have 512MB of (soldered)
> memory, that seems like it should be sufficient bandwidth, assuming
> they have L1 and L2 caches on the CPU, which seems very likely.
>

Yeah.

> AFAIK your core has order(s?) of magnitude less memory bandwidth than
> this (the 80MHz above also hints at that).
>

Peak unidirectional DDR bandwidth is ~ 90 MB/s in my case at 50 MHz
(Unidirectional Load or Store), with a bidirectional speed of ~ 54 MB/s
(SWAP).

Theoretical extrapolated speed from the DDR tables, for running the RAM
at 50 MHz with a 16-bit RAM interface: 100 MB/s.
Seems it is pretty close...
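
(That is, the measured 90 MB/s is about 90% of the 100 MB/s
extrapolated ceiling.)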

For memcpy(), the average case is generally around 26-30 MB/s.

For accesses within the L1 or L2, things are generally a bit faster:
L1:
Memcpy ~ 270 MB/s (hard limit = 400)
Memset ~ 407 MB/s (hard limit = 800)
Memload ~ 483 MB/s (hard limit = 800)
L2:
Memcpy ~ 77 MB/s
Memset ~ 142 MB/s
Memload ~ 223 MB/s
DDR:
Memcpy ~ 27 MB/s
Memset ~ 56 MB/s
Memload ~ 78 MB/s

Given L1 speeds are greater than 50% of the hard-limit, this means that
the majority of the L1 local accesses are 1-cycle. The hard limit here
is due to the clock speed (50 MHz), access width (128 bit), and single
memory port.
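
Spelled out: a 128-bit port moves 16 bytes per cycle, and 16 B * 50 MHz
= 800 MB/s for a one-way stream (the memset/memload hard limit). A
memcpy has to pass every byte through the same single port twice (once
as a load, once as a store), so its ceiling is 800 / 2 = 400 MB/s.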

Vs theoretical limits:
Memcpy is 50% of theoretical limit (Swap Only);
It is 79% of adjusted limit (Load+Swap);
Memset is 62% of theoretical limit;
Memload is 87% of theoretical limit.

Here, the limit would be the speed achievable if the caches and ring
bus did not add any additional latency.
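
(That is: 27/54 = 50%, 27/34 ~ 79%, 56/90 ~ 62%, and 78/90 ~ 87%,
measured against the ~54 MB/s Swap and ~90 MB/s unidirectional figures
above.)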

Some of the overhead here is due to a multiplier effect, where, say, 1
access in the L1 may result in 2 accesses to the L2, and 4 to DRAM
(though, the latter is reduced due to the DDR controller having SWAP
operation, turning this scenario into 1,2,2).

The overhead of memcpy can partly be explained by the fact that this
case effectively tends to result in a Load+Swap access pattern for DDR,
which lowers the limit down to 34 MB/s.
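
Roughly, and assuming the Load and Swap phases simply serialize: each
chunk copied pays for one Load (~90 MB/s) plus one Swap (~54 MB/s), so
the combined ceiling works out to 1/(1/90 + 1/54) ~ 34 MB/s.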

> Hmmm! Looks like the Cyclone V is the ONLY low-end Intel/Altera FPGA
> with hard memory controllers; even some of the mid-range models
> don't have them. So those have to rely on much slower soft memory
> controllers.
>
> So perhaps that's the secret sauce!
>

Quite possibly...

>
>>> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
>>> their distributor is almost certainly less.
>
> It says "non-stocked" so I suspect all this would do is put in an
> order with their supplier, and then Mouser comes back to you with some
> kind of estimate (or "when we get it"!) - I'd definitely check with
> them before ordering any non-stocked parts.
>
> The fact that there are suppliers in Asia that want $450 and $650 for
> it respectively, and provide NO volume discount at all, kind of
> suggests that availability via "proper" channels may be very limited.
>
> Certainly the Cyclone V SE A6 (SE is the SOC variant with lots of
> extra hard stuff) that the Terasic DE10-Nano uses is in very short
> supply - the MISTer game FPGA system uses this so I know the lead
> times are very long.
>
>
> 1. https://ark.intel.com/content/www/us/en/ark/products/series/141579/cyclone-v-e-fpga.html
> 2. https://ark.intel.com/content/www/us/en/ark/products/210443/cyclone-v-5cea5-fpga.html
> 3. https://www.intel.com/content/dam/support/us/en/programmable/support-resources/bulk-container/pdfs/literature/pt/cyclone-v-product-table.pdf
> 4. https://www.intel.com/content/www/us/en/support/programmable/support-resources/support-centers/emif-spec-estimator.html


Re: 64 bit 68080 CPU

<8d8a13de-cd53-4a44-9f18-d81296a2a501n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=27104&group=comp.arch#27104

X-Received: by 2002:a37:43d4:0:b0:6b8:e3ba:ddfc with SMTP id q203-20020a3743d4000000b006b8e3baddfcmr7134003qka.192.1659750763648;
Fri, 05 Aug 2022 18:52:43 -0700 (PDT)
X-Received: by 2002:a05:622a:38e:b0:342:e878:5575 with SMTP id
j14-20020a05622a038e00b00342e8785575mr3416912qtx.291.1659750763509; Fri, 05
Aug 2022 18:52:43 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 5 Aug 2022 18:52:43 -0700 (PDT)
In-Reply-To: <tcfd6h$2ngkv$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1dde:6a00:d48a:6090:c4a0:ed5f;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1dde:6a00:d48a:6090:c4a0:ed5f
References: <tc6mng$fm5h$1@dont-email.me> <tc7rm1$o7cl$1@dont-email.me>
<Vhx*IxEUy@news.chiark.greenend.org.uk> <tc97ol$12u23$2@dont-email.me>
<tcf184$2htmd$1@dont-email.me> <tcfd6h$2ngkv$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8d8a13de-cd53-4a44-9f18-d81296a2a501n@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Sat, 06 Aug 2022 01:52:43 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2869
 by: robf...@gmail.com - Sat, 6 Aug 2022 01:52 UTC

> Yes, but no hard controllers on Artix-7, where one only has soft
> controllers.
>
> Usual idea is that Xilinx wants people to use Vivado's MIG tool, but
> then one would need to deal with AXI.

One does not have to use the AXI interface. It is an option in the MIG tool.

> In theory, one could use SERDES (the RAM being connected up to SERDES

I believe this is what the Xilinx core does. The softcore can interface to the
DDR RAM at full speed. Probably why there is not a hard core. The SERDES and
other components are more general in nature and can be applied for other
interfacing.

> I would prefer it if they bothered to actually document their
> FPGA primitives, vs just endlessly saying "Invoke X from the IP
> Catalog Wizard"...

I have found Xilinx to generally have good documentation. There are many
user guides available, describing the IP cores and operation.

> >> Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
> >> in this case, and this is pushing the limits of the IO pin speeds.

For my system, 800 MT/s (400 MHz clock) DDR3 is being used, driven by the
Xilinx soft core. This is using the Artix-7 -1 (slow part). The DDR RAM is
16 bits wide, so that's 1.6 GB/s.

I have built my own system read cache core and multi-port memory
controller to try and make use of the bandwidth. The pipeline for the
Xilinx core is pretty deep; it is something like 25 clock cycles. But then
it can transfer every clock. The core breaks data into 16-byte chunks so
that a lower clock frequency can be used.
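
(Presumably the arithmetic there is: 800 MT/s * 2 bytes = 1.6 GB/s at
the pins, while 16-byte chunks on the user side sustain that same rate
at only 1.6 GB/s / 16 B = 100 MHz.)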

Re: 64 bit 68080 CPU

<tcknij$3jv0t$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=27108&group=comp.arch#27108

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Fri, 5 Aug 2022 22:35:41 -0500
Organization: A noiseless patient Spider
Lines: 72
Message-ID: <tcknij$3jv0t$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me> <tc7rm1$o7cl$1@dont-email.me>
<Vhx*IxEUy@news.chiark.greenend.org.uk> <tc97ol$12u23$2@dont-email.me>
<tcf184$2htmd$1@dont-email.me> <tcfd6h$2ngkv$1@dont-email.me>
<8d8a13de-cd53-4a44-9f18-d81296a2a501n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 6 Aug 2022 03:35:48 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="59bf300d5d6d30495bdf3091f15d7b77";
logging-data="3800093"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/pBaLtsbKM9nMB10fbquRR"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:Xqtzmu4xn0WWVMz7SXvOv8Euuv4=
In-Reply-To: <8d8a13de-cd53-4a44-9f18-d81296a2a501n@googlegroups.com>
Content-Language: en-US
 by: BGB - Sat, 6 Aug 2022 03:35 UTC

On 8/5/2022 8:52 PM, robf...@gmail.com wrote:
>
>> Yes, but no hard controllers on Artix-7, where one only has soft
>> controllers.
>>
>> Usual idea is that Xilinx wants people to use Vivado's MIG tool, but
>> then one would need to deal with AXI.
>
> One does not have to use the AXI interface. It is an option in the MIG tool.
>

Might have to look into it; all the stuff I had read said it used AXI.

>> In theory, one could use SERDES (the RAM being connected up to SERDES
>
> I believe this is what the Xilinx core does. The softcore can interface to the
> DDR RAM at full speed. Probably why there is not a hard core. The SERDES and
> other components are more general in nature and can be applied for other
> interfacing.
>

Probably true enough.

As noted, my DDR controller doesn't use SERDES, but this was partly
because I did not know about SERDES when I wrote it. I had just sorta
figured people were writing tight FIFOs and running them at high clock
speeds, partly because some of the early RAM controller code I had
looked at worked this way.

> I would prefer it if they bothered to actually document their
> FPGA primitives, vs just endlessly saying "Invoke X from the IP
> Catalog Wizard"...
>
> I have found Xilinx to generally have good documentation. There are many
> user guides available, describing the IP cores and operation.
>

But the assumption seems to be that people use the IP Cores, and do not
try to use the SERDES directly. Decent documentation for the actual FPGA
primitives is rather more lacking here.

If possible, I also want to write Verilog that does not depend on the
specifics of Xilinx tooling (e.g., relatively generic Verilog that I can
fill in the gaps for and simulate in Verilator, or maybe synthesize in
Quartus if I decide I want to run it on a Cyclone V or something,
ideally while keeping all of the toolchain-specific stuff to a minimum,
...).
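
As a sketch of the sort of generic Verilog meant here (illustrative
only, not from any particular codebase), a block RAM written so that
Vivado, Quartus, and Verilator all accept it without vendor primitives
or IP:

   // Simple dual-port RAM in plain Verilog. Both Vivado and Quartus
   // infer block RAM from this pattern; the registered (synchronous)
   // read is what makes the BRAM mapping possible.
   module generic_bram #(
       parameter ABITS = 10,
       parameter DBITS = 32
   )(
       input  wire             clk,
       input  wire             we,
       input  wire [ABITS-1:0] waddr,
       input  wire [DBITS-1:0] wdata,
       input  wire [ABITS-1:0] raddr,
       output reg  [DBITS-1:0] rdata
   );
       reg [DBITS-1:0] mem[0:(1<<ABITS)-1];

       always @(posedge clk) begin
           if (we)
               mem[waddr] <= wdata;
           rdata <= mem[raddr];   // synchronous read
       end
   endmodule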

The IP Cores seem to go against this, seeming almost like a trap to try
to create vendor lock-in.

>>>> Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
>>>> in this case, and this is pushing the limits of the IO pin speeds.
>
> For my system, 800 MT/s (400 MHz clock) DDR3 is being used, driven by the
> Xilinx soft core. This is using the Artix-7 -1 (slow part). The DDR RAM is
> 16 bits wide, so that's 1.6 GB/s.
>

I am using a board with 16-bit DDR2, its rated speed being DDR2-667
(667 MT/s).

> I have built my own system read cache core and multi-port memory
> controller to try and make use of the bandwidth. The pipeline for the
> Xilinx core is pretty deep; it is something like 25 clock cycles. But then
> it can transfer every clock. The core breaks data into 16-byte chunks so
> that a lower clock frequency can be used.
>

OK.
