devel / comp.arch / 64 bit 68080 CPU

Subject  Author

* 64 bit 68080 CPU  Brett
+- Re: 64 bit 68080 CPU  Josh Vanderhoof
+* Re: 64 bit 68080 CPU  Quadibloc
|`* Re: 64 bit 68080 CPU  BGB
| +* Re: 64 bit 68080 CPU  Anton Ertl
| |`* Re: 64 bit 68080 CPU  BGB
| | +* Re: 64 bit 68080 CPU  MitchAlsup
| | |`* Re: 64 bit 68080 CPU  BGB
| | | `* Re: 64 bit 68080 CPU  MitchAlsup
| | |  `- Re: 64 bit 68080 CPU  BGB
| | `* Re: 64 bit 68080 CPU  Anton Ertl
| |  +- Re: 64 bit 68080 CPU  BGB
| |  `- Re: 64 bit 68080 CPU  Thomas Koenig
| +* Re: 64 bit 68080 CPU  Theo
| |+* Re: 64 bit 68080 CPU  BGB
| ||`* Re: 64 bit 68080 CPU  Torbjorn Lindgren
| || `* Re: 64 bit 68080 CPU  BGB
| ||  `* Re: 64 bit 68080 CPU  robf...@gmail.com
| ||   `- Re: 64 bit 68080 CPU  BGB
| |`- Re: 64 bit 68080 CPU  Michael S
| +* Re: 64 bit 68080 CPU  EricP
| |`- Re: 64 bit 68080 CPU  BGB
| `* Re: 64 bit 68080 CPU  MitchAlsup
|  +* Re: 64 bit 68080 CPU  Marcus
|  |+- Re: 64 bit 68080 CPU  BGB
|  |`* Re: 64 bit 68080 CPU  MitchAlsup
|  | `* Re: 64 bit 68080 CPU  Marcus
|  |  +- Re: 64 bit 68080 CPU  MitchAlsup
|  |  `* Re: 64 bit 68080 CPU  BGB
|  |   `* Re: 64 bit 68080 CPU  MitchAlsup
|  |    `- Re: 64 bit 68080 CPU  BGB
|  `* Re: 64 bit 68080 CPU  EricP
|   +- Re: 64 bit 68080 CPU  MitchAlsup
|   `- Re: 64 bit 68080 CPU  BGB
`- Re: 64 bit 68080 CPU  John Dallman

64 bit 68080 CPU

<tc6mng$fm5h$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26971&group=comp.arch#26971

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: ggt...@yahoo.com (Brett)
Newsgroups: comp.arch
Subject: 64 bit 68080 CPU
Date: Sun, 31 Jul 2022 19:55:28 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <tc6mng$fm5h$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 31 Jul 2022 19:55:28 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="0618e20ea6b0bcdf676dc6d6c858414d";
logging-data="514225"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX180GlEv1rnFhLXY7Lv5rcYt"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:hqrxfV5H75QxCUvuDOmse6o9tiY=
sha1:hM34/7C/AuDv58X3VoYh+mn4DO0=
 by: Brett - Sun, 31 Jul 2022 19:55 UTC

Here is the 64 bit 68080 CPU, interesting.
Surprising that the economics to build such a thing exists.

http://www.apollo-core.com/index.htm?page=coding&tl=1

Mostly Amiga upgrades with antique Apollo workstations also mentioned, and
probably lots of embedded systems for machinery?

Re: 64 bit 68080 CPU

<ygno7x5f0ss.fsf@y.z>

https://www.novabbs.com/devel/article-flat.php?id=26975&group=comp.arch#26975

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx40.iad.POSTED!not-for-mail
From: x...@y.z (Josh Vanderhoof)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
References: <tc6mng$fm5h$1@dont-email.me>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)
Reply-To: Josh Vanderhoof <jlv@mxsimulator.com>
Message-ID: <ygno7x5f0ss.fsf@y.z>
Cancel-Lock: sha1:/53GLlbhpGBGeT/Ai/DGRCKN+KU=
MIME-Version: 1.0
Content-Type: text/plain
Lines: 13
X-Complaints-To: https://www.astraweb.com/aup
NNTP-Posting-Date: Sun, 31 Jul 2022 21:08:04 UTC
Date: Sun, 31 Jul 2022 17:08:03 -0400
X-Received-Bytes: 1091
 by: Josh Vanderhoof - Sun, 31 Jul 2022 21:08 UTC

Brett <ggtgp@yahoo.com> writes:

> Here is the 64 bit 68080 CPU, interesting.
> Surprising that the economics to build such a thing exists.
>
> http://www.apollo-core.com/index.htm?page=coding&tl=1
>
> Mostly Amiga upgrades with antique Apollo workstations also mentioned, and
> probably lots of embedded systems for machinery?

That's really cool! The SAGA Amiga chipset looks interesting as well.
Chunky pixels and bilinear z buffered texture mapping on top of AGA. I
had no idea such a thing existed. Neat!

Re: 64 bit 68080 CPU

<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26976&group=comp.arch#26976

Newsgroups: comp.arch
X-Received: by 2002:a05:622a:248:b0:31e:ec07:7c28 with SMTP id c8-20020a05622a024800b0031eec077c28mr12948207qtx.595.1659328802543;
Sun, 31 Jul 2022 21:40:02 -0700 (PDT)
X-Received: by 2002:a05:620a:2892:b0:6b6:50d0:88fa with SMTP id
j18-20020a05620a289200b006b650d088famr10394396qkp.89.1659328802392; Sun, 31
Jul 2022 21:40:02 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 31 Jul 2022 21:40:02 -0700 (PDT)
In-Reply-To: <tc6mng$fm5h$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2001:56a:fb70:6300:6947:3c86:73e1:a64e;
posting-account=1nOeKQkAAABD2jxp4Pzmx9Hx5g9miO8y
NNTP-Posting-Host: 2001:56a:fb70:6300:6947:3c86:73e1:a64e
References: <tc6mng$fm5h$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: jsav...@ecn.ab.ca (Quadibloc)
Injection-Date: Mon, 01 Aug 2022 04:40:02 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1397
 by: Quadibloc - Mon, 1 Aug 2022 04:40 UTC

On Sunday, July 31, 2022 at 1:55:32 PM UTC-6, gg...@yahoo.com wrote:

> Surprising that the economics to build such a thing exists.

It's a soft core, in Verilog or VHDL, that gets used to program
an FPGA, so the economics aren't so daunting as to be all that
surprising.

John Savard

Re: 64 bit 68080 CPU

<tc7rm1$o7cl$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26977&group=comp.arch#26977

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 01:26:06 -0500
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <tc7rm1$o7cl$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 1 Aug 2022 06:26:09 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9d32d29e45373b9b98f3c5e5d5523fca";
logging-data="794005"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19b+2MtfQ/FH87qH4c6D16e"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:TxWBsrIgGBEbFagVLdc8OP5FMhQ=
Content-Language: en-US
In-Reply-To: <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
 by: BGB - Mon, 1 Aug 2022 06:26 UTC

On 7/31/2022 11:40 PM, Quadibloc wrote:
> On Sunday, July 31, 2022 at 1:55:32 PM UTC-6, gg...@yahoo.com wrote:
>
>> Surprising that the economics to build such a thing exists.
>
> It's a soft core, in Verilog or VHDL, that gets used to program
> an FPGA, so the economics aren't so daunting as to be all that
> surprising.
>

But, what sort of FPGA, exactly?...

Their claimed stats seem a little higher than what I could expect from
something in a similar class as a Spartan or Artix based on my
experience thus far.

In one of the images, albeit low res, I can sort of make out an Altera
logo; FPGA appears to be a Cyclone 3, but can't make out anything beyond
this (eg: what size of Cyclone III ?...).

Looking stuff up, it would appear that Cyclone 3 stats are in a similar
range to the Artix-7 family, albeit it would appear to be balanced
towards more logic elements but less block RAM.

Not sure much beyond this, relative comparisons between Xilinx and
Altera FPGAs is a bit sparse, particularly for the lower-end families.

Then again, I guess retro-computers like the Amiga are not cheap, so it
seems plausible they could justify the costs of (potentially) using a
"slightly expensive" FPGA on their Amiga upgrade boards (so, in any
case, probably not using the low-end entries of the product line).

Their boards also seem to come with significantly more RAM than a lot of
the "low cost" FPGA dev boards, ...

...

Then again, I guess it is a question of whether there is a good way to
get a significant increase in how much performance I can get out of an
FPGA without a significant resource cost increase?...

I guess main areas to look at would be:
Find a way to reduce both the cost and latency of the L1 caches;
Find a way to reduce the cost of dealing with pipeline stall signals;
Try to find a way to make the interrupt dispatch mechanism cheaper;
...

Well, and would also help for performance:
Figure out some way to make L2 misses and DRAM access both cheaper and
lower latency.

...

Well, and further reaching issues, say, whether a Reg/Mem ISA could
compare well with a Load/Store ISA?... Can use fewer instructions, but
how to do Reg/Mem ops without introducing significant complexities or
penalty cases?...

...

> John Savard

Re: 64 bit 68080 CPU

<2022Aug1.105418@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=26978&group=comp.arch#26978

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 01 Aug 2022 08:54:18 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 20
Message-ID: <2022Aug1.105418@mips.complang.tuwien.ac.at>
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com> <tc7rm1$o7cl$1@dont-email.me>
Injection-Info: reader01.eternal-september.org; posting-host="b07bfaa2017a214780693d0ffe67b172";
logging-data="862952"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+MTVYC69yD11lOHcRSe4MM"
Cancel-Lock: sha1:HYr6zVHFYjmQfuj4+KNaF8Q+cKQ=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Mon, 1 Aug 2022 08:54 UTC

BGB <cr88192@gmail.com> writes:
> Find a way to reduce the cost of dealing with pipeline stall signals;

It seems to me that OoO helps with that, and the 68080 is OoO.

>Well, and further reaching issues, say, whether a Reg/Mem ISA could
>compare well with a Load/Store ISA?... Can use fewer instructions, but
>how to do Reg/Mem ops without introducing significant complexities or
>penalty cases?...

Intel and AMD are doing it just fine with OoO implementations. The
68020 (and consequently the 68080) has the additional complexity of
memory-indirect addressing, but the main problem I see here is one of
verification (can you guarantee forward progress in the presence of
TLB misses and page faults), not performance problems.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: 64 bit 68080 CPU

<Vhx*IxEUy@news.chiark.greenend.org.uk>

https://www.novabbs.com/devel/article-flat.php?id=26979&group=comp.arch#26979

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!nntp.terraraq.uk!nntp-feed.chiark.greenend.org.uk!ewrotcd!.POSTED.chiark.greenend.org.uk!not-for-mail
From: theom+n...@chiark.greenend.org.uk (Theo)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: 01 Aug 2022 10:06:19 +0100 (BST)
Organization: University of Cambridge, England
Message-ID: <Vhx*IxEUy@news.chiark.greenend.org.uk>
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com> <tc7rm1$o7cl$1@dont-email.me>
Injection-Info: chiark.greenend.org.uk; posting-host="chiark.greenend.org.uk:212.13.197.229";
logging-data="16577"; mail-complaints-to="abuse@chiark.greenend.org.uk"
User-Agent: tin/1.8.3-20070201 ("Scotasay") (UNIX) (Linux/5.10.0-15-amd64 (x86_64))
Originator: theom@chiark.greenend.org.uk ([212.13.197.229])
 by: Theo - Mon, 1 Aug 2022 09:06 UTC

BGB <cr88192@gmail.com> wrote:
> But, what sort of FPGA, exactly?...

Cyclone V 5CEFA5F23C, 77k LE:
http://www.apollo-computer.com/icedrakev4.php

> Looking stuff up, it would appear that Cyclone 3 stats are in a similar
> range to the Artix-7 family, albeit it would appear to be balanced
> towards more logic elements but less block RAM.
>
> Not sure much beyond this, relative comparisons between Xilinx and
> Altera FPGAs is a bit sparse, particularly for the lower-end families.

Cyclone is Altera's 'cheap' FPGA line, fitting between the MAX CPLDs and the
bigger Arria and Stratix parts. 'Cheap' means ~$100 list price, so not
cheap for the rest of us. Cyclone V is the old mainstream part, Cyclone
10LP is I think a rebrand of the Cyclone IV, and Cyclone 10GX is higher end
with transceivers.

Cyclone V parts go up to 300K LE, and can have an Arm Cortex A9 on them
(yes, pretty antique as far as Arm cores go). Those are pretty comparable
with the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
there are no transceivers and no Arm, hence it's at the cheap end of the line
(the A5 meaning 77k LE is the middle of the range).

https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html

This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).
The I/Os are typically good to drive DDR3, which is what the Arm uses for
DRAM.

Mouser will sell me one for $127 (in MOQ 60), and the price they get from
their distributor is almost certainly less.

I'm not quite sure how it matches up with Xilinx, but I'd expect an Artix is
probably comparable.

With 800Kbyte of BRAM I think you could make some decent caches - after all
the Amiga 500 only had 512Kbyte DRAM to begin with.

Theo

Re: 64 bit 68080 CPU

<DxSFK.86262$Eh2.9405@fx41.iad>

https://www.novabbs.com/devel/article-flat.php?id=26981&group=comp.arch#26981

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!fx41.iad.POSTED!not-for-mail
From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com> <tc7rm1$o7cl$1@dont-email.me>
In-Reply-To: <tc7rm1$o7cl$1@dont-email.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 39
Message-ID: <DxSFK.86262$Eh2.9405@fx41.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Mon, 01 Aug 2022 15:44:35 UTC
Date: Mon, 01 Aug 2022 11:44:17 -0400
X-Received-Bytes: 2089
 by: EricP - Mon, 1 Aug 2022 15:44 UTC

BGB wrote:
>
> Then again, I guess it is a question of whether there is a good way to
> get a significant increase in how much performance I can get out of an
> FPGA without a significant resource cost increase?...
>
> I guess main areas to look at would be:
> Find a way to reduce both the cost and latency of the L1 caches;

IIRC you were having cross clock-domain issues at one point where
it took many clocks to synchronize, and still had reliability issues.
Is this still the case?

Also are you still using a token ring to talk to L1?

> Find a way to reduce the cost of dealing with pipeline stall signals;

Which costs are you referring to, FPGA logic elements or pipeline latency?

If it is the second, are you using a global pipeline stall signal where
if any stage stalls they all stall (I think you were at one point)?
I've mentioned 'elastic' pipeline stages previously that use local
stalls to allow bubbles to compact out.
But they are more than twice the LE cost.
Instead of one set of FF for each stage, it is 2 sets of FF, plus a MUX
to select between FF, plus control logic.

> Try to find a way to make the interrupt dispatch mechanism cheaper;

What are you currently doing?

> Well, and would also help for performance:
> Figure out some way to make L2 misses and DRAM access both cheaper and
> lower latency.

Is this also across a clock domain?

Re: 64 bit 68080 CPU

<fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26983&group=comp.arch#26983

Newsgroups: comp.arch
X-Received: by 2002:a05:6214:d82:b0:473:b41:aabf with SMTP id e2-20020a0562140d8200b004730b41aabfmr14928811qve.115.1659372987211;
Mon, 01 Aug 2022 09:56:27 -0700 (PDT)
X-Received: by 2002:a37:58c6:0:b0:6b5:d169:7b99 with SMTP id
m189-20020a3758c6000000b006b5d1697b99mr12453644qkb.709.1659372987096; Mon, 01
Aug 2022 09:56:27 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 1 Aug 2022 09:56:26 -0700 (PDT)
In-Reply-To: <tc7rm1$o7cl$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:bc24:120a:ec70:7071;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:bc24:120a:ec70:7071
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 01 Aug 2022 16:56:27 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 1820
 by: MitchAlsup - Mon, 1 Aug 2022 16:56 UTC

On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>
> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> compare well with a Load/Store ISA?... Can use fewer instructions, but
> how to do Reg/Mem ops without introducing significant complexities or
> penalty cases?...
>
You build a pipeline which has a pre-execute stage (which calculates
AGEN) and then a stage or 2 of cache access, and then you get to the
normal execute---writeback part of the pipeline. I have called this the
360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
machines used such a pipeline.
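[A toy model of that arrangement, as a Python sketch: the stage boundaries are shown as comments, and the staging is illustrative, not taken from any actual 360 or S.E.L design. The point is that AGEN and the cache access happen before the "normal" execute stage ever sees the instruction, so a reg/mem op needs no separate load.]

```python
# Toy model of the "360-style" reg/mem pipeline sketched above:
# decode -> pre-execute (AGEN) -> cache access -> execute -> writeback.

def run_reg_mem_add(regs, mem, rd, rs_base, disp):
    addr = regs[rs_base] + disp   # pre-execute stage: address generation only
    operand = mem[addr]           # cache-access stage(s): memory operand is
                                  # fetched before the execute stage runs
    result = regs[rd] + operand   # execute stage: a plain ALU op
    regs[rd] = result             # writeback stage
    return regs

regs = {0: 10, 1: 100}
mem = {104: 32}
run_reg_mem_add(regs, mem, rd=0, rs_base=1, disp=4)
print(regs[0])  # one reg/mem ADD did the work of a load + add pair
```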

Re: 64 bit 68080 CPU

<tc97np$12u23$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26986&group=comp.arch#26986

Newsgroups: comp.arch
Path: i2pn2.org!rocksolid2!i2pn.org!weretis.net!feeder8.news.weretis.net!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 13:57:57 -0500
Organization: A noiseless patient Spider
Lines: 192
Message-ID: <tc97np$12u23$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <DxSFK.86262$Eh2.9405@fx41.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 1 Aug 2022 18:58:01 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9d32d29e45373b9b98f3c5e5d5523fca";
logging-data="1144899"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ZrekqisL4Cx7ylo4GU2GO"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:mqAoZ6MMzswLNDVKxawfQqp9N5M=
Content-Language: en-US
In-Reply-To: <DxSFK.86262$Eh2.9405@fx41.iad>
 by: BGB - Mon, 1 Aug 2022 18:57 UTC

On 8/1/2022 10:44 AM, EricP wrote:
> BGB wrote:
>>
>> Then again, I guess it is a question of whether there is a good way to
>> get a significant increase in how much performance I can get out of an
>> FPGA without a significant resource cost increase?...
>>
>> I guess main areas to look at would be:
>>   Find a way to reduce both the cost and latency of the L1 caches;
>
> IIRC you were having cross clock-domain issues at one point where
> it took many clocks to synchronize, and still had reliability issues.
> Is this still the case?
>

This is for L2<->DRAM, pretty much everything else is running on a
global 50MHz clock.

The issue is that for 16/32K L1 caches, the fastest I can currently run
them is around 50MHz. Otherwise, it seems the logic for fetching the
cache line data and tag bits from the arrays, and checking for
match/mismatch, takes too many nanoseconds (mostly with routes that
zigzag across a big part of the FPGA).

This part can pass timing easier with smaller arrays (LUTRAM), but then
the L1 is only around 2K, and the hit rate "kinda sucks".

The L2 has it a little easier, because it can stick in an extra delay
cycle between fetching the block from the array, and checking whether or
not it matches (otherwise, the 256K cache would be difficult even at 50MHz).
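[For illustration, the lookup being described can be modeled as a direct-mapped 32K cache with 16B lines: split the address into offset/index/tag, read the tag array at the index, compare. The field widths, single way, and fill-on-miss policy here are illustrative, not the actual BJX2 layout.]

```python
# Direct-mapped 32 KB, 16-byte lines -> 2048 sets.
# Address = | tag | 11-bit index | 4-bit offset |
LINE = 16
SETS = 32 * 1024 // LINE   # 2048

tags = [None] * SETS       # tag array only; no data array modeled

def split(addr):
    index = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    return index, tag

def access(addr):
    index, tag = split(addr)
    hit = tags[index] == tag   # the compare that has to fit in the cycle
    tags[index] = tag          # fill on miss
    return hit

print(access(0x12340))               # cold miss
print(access(0x12348))               # same line -> hit
print(access(0x12340 + LINE * SETS)) # same index, different tag -> miss
```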

> Also are you still using a token ring to talk to L1?
>

The L1<->L2 interface uses a big token-ring style bus.

The L1 caches plug directly into the main pipeline (effectively, the L1s
and main pipeline operate in lock-step with each other).

While it seems like the ring-bus would have a pretty high latency, on
average the latency is compensated for by the ability to send several
requests over the bus at the same time (overall performance being
significantly higher than my original "one request at a time" bus).

From what I can gather, the OpenCores Wishbone bus operates in a
similar way to my original bus, not sure if/how they avoid it suffering
from poor performance as a result.

Something like AXI seems a fair bit more complicated, and like it would
likely result in a significantly higher resource cost than the ring-bus.
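[The latency trade described above shows up even in back-of-envelope numbers: with a fixed per-request round-trip of L cycles, a one-request-at-a-time bus pays L per request, while a pipelined ring can keep several requests in flight. L here is an assumed figure, not a measurement from any core.]

```python
# "One request at a time" bus vs. pipelined ring bus, n requests deep.

def serial_bus_cycles(n_requests, latency):
    # Next request cannot start until the previous response returns.
    return n_requests * latency

def ring_bus_cycles(n_requests, latency):
    # Requests are injected back-to-back and overlap in flight.
    return latency + (n_requests - 1)

L = 24  # assumed round-trip latency in cycles
print(serial_bus_cycles(8, L))  # 192 cycles
print(ring_bus_cycles(8, L))    # 31 cycles
```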

>>   Find a way to reduce the cost of dealing with pipeline stall signals;
>
> Which costs are you referring to, FPGA logic elements or pipeline latency?
>
> If it is the second, are you using a global pipeline stall signal where
> if any stage stalls they all stall (I think you were at one point)?

Yeah. If the L1 D$ stalls, it asserts a global stall. The entire
pipeline stalls.

This includes:
All the forwarding between the main pipeline stages;
All the forwarding within the various 'EX' stage units
This includes the FPUs and SIMD FPUs;
Forwarding within the internal stages within the L1 caches;
...

There are actually two stalls, A and B:
A: Stalls *everything*;
B: Stalls only the Fetch and Decode stages
Used for interlocks (injects NOPs into the EX stages).

In this case, the stall makes sure that the memory access always
finishes on the EX3 stage regardless of how long the memory access took
in reality (the FPU and similar may also assert this stall, since FPU
operations take a lot longer than what the 3 EX stages can accommodate).

I had noted that some other cores had used a FIFO queue for accessing
memory, but this leaves the issue of getting Load results written back
to the register file (and needing to keep track somehow of when Load
results have arrived).

This seems more complicated, though, than the "whole pipeline stalls if
the L1 misses" approach.

I had experimented before with making the L1 I$ generate a stream of
0-length NOP bundles on miss, rather than stall the pipeline, but was
faced with "technical issues" with making this work reliably, so mostly
stuck with the "stall the whole pipeline" mechanism (though, the
0-length NOP bundles are still needed to be able to handle I$ TLB misses
effectively).
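[The two stall signals can be sketched as a toy model: Stall-A holds every stage, Stall-B holds only Fetch/Decode and injects a NOP bubble into the EX stages. Stage names and the 5-stage layout are made up for the sketch.]

```python
# pipe: stage contents, index 0 = Fetch ... last = EX3/WB.

def step(pipe, stall_a, stall_b):
    if stall_a:
        return pipe[:]                 # A: everything holds its state
    if stall_b:
        # B: Fetch/Decode hold, a NOP enters EX1, EX stages advance,
        # and the oldest instruction retires.
        return pipe[:2] + ["NOP"] + pipe[2:-1]
    return ["I_next"] + pipe[:-1]      # normal advance

pipe = ["I4", "I3", "I2", "I1", "I0"]  # IF, ID, EX1, EX2, EX3
print(step(pipe, stall_a=True,  stall_b=False))
print(step(pipe, stall_a=False, stall_b=True))
print(step(pipe, stall_a=False, stall_b=False))
```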

> I've mentioned 'elastic' pipeline stages previously that use local
> stalls to allow bubbles to compact out.
> But they are more than twice the LE cost.
> Instead of one set of FF for each stage, it is 2 sets of FF, plus a MUX
> to select between FF, plus control logic.
>

LUT cost is already a big issue.
LUTs and BRAM's are the main resources I have mostly used up.

Still have plenty of DSP48s left though...
My CPU core only needs so much low-precision multiply though.

>>   Try to find a way to make the interrupt dispatch mechanism cheaper;
>
> What are you currently doing?
>

Current mechanism (ISA level):
Save SR state into EXSR;
Twiddle some bits in SR;
MD, RB, and BL are set (Sets to Supervisor+ISR mode);
WXE and WX2 copied from VBR (51:50)
Swap SP and SSP;
Save PC to SPC;
Jump to a computed address relative to VBR.

Internally, the CPU also does:
Figure out which pipeline stage we can validly revert to;
Revert DLR, DHR, LR, SP, etc, to their values at that pipeline stage;
Normal GPR writes are handled via an "invalidate" flag:
This blocks the register's value at the WB stage.
...
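[The ISA-level entry sequence above can be sketched as updates to a state dict. The SR bit positions and vector offset below are placeholders, not the actual encodings, and the WXE/WX2-from-VBR step is omitted.]

```python
MD, RB, BL = 1 << 0, 1 << 1, 1 << 2   # made-up SR bit positions

def interrupt_entry(s, vector_offset):
    s["EXSR"] = s["SR"]                    # save SR state into EXSR
    s["SR"] |= MD | RB | BL                # set Supervisor+ISR mode bits
    s["SP"], s["SSP"] = s["SSP"], s["SP"]  # swap SP and SSP
    s["SPC"] = s["PC"]                     # save PC to SPC
    s["PC"] = s["VBR"] + vector_offset     # VBR-relative dispatch
    return s

s = {"SR": 0, "EXSR": 0, "SP": 0x8000, "SSP": 0xF000,
     "PC": 0x1234, "SPC": 0, "VBR": 0xC000}
interrupt_entry(s, vector_offset=8)
print(hex(s["PC"]), hex(s["SPC"]))
```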

The logic for forwarding and then reverting the state of some of the
various special registers is expensive. But, special handling is needed
for any register which may be updated via side-channel mechanisms
(rather than via normal GPR style access patterns).

This is not needed for SRs/CRs where the side-channel is effectively
read-only (TTB, MMCR, KRR, GBR, etc), since these are only modified via
the WB stage (and thus can use the same mechanism as the GPRs).

A likely change would be:
* Eliminate the writable side-channels for DLR, DHR, SP, and LR.
** DLR/DHR: Would become read-only side-channels;
** SP could become mostly GPR-like
*** Given PUSH/POP no longer exist in the ISA.
*** SP <-> SSP swap could be handled in the instruction decoder.
** LR update would be handled via a GPR style update
*** This is already used for RISC-V mode.

Some of this would require "awkward" rewrites to parts of the Verilog
code, and I hadn't poked at it yet.

>> Well, and would also help for performance:
>> Figure out some way to make L2 misses and DRAM access both cheaper and
>> lower latency.
>
> Is this also across a clock domain?
>

Yes.

The DDR controller logic operates at a higher clock speed than
everything else, so it needs a clock-domain crossing to access.

Still uses the original "one request at a time" bus, with several
request types:
Load (fetch cache line from DDR)
Store (write cache line to DDR)
Swap (do a combined Store + Load)

The majority of accesses here are Load and Swap.

Given this uses 64B cache lines, this is able to mostly cover a lot of
the clock-domain crossing and state-transition overheads.

But, access latency is still "not great", and the LUT costs of having
this part work with 64B (512-bit) cache lines, is pretty steep.

...

Re: 64 bit 68080 CPU

<tc97ol$12u23$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26987&group=comp.arch#26987

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 13:58:27 -0500
Organization: A noiseless patient Spider
Lines: 164
Message-ID: <tc97ol$12u23$2@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <Vhx*IxEUy@news.chiark.greenend.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 1 Aug 2022 18:58:29 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9d32d29e45373b9b98f3c5e5d5523fca";
logging-data="1144899"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/9E1u01dlrfKJerko+X3Qj"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:ujeAm64hdMc7hTBzTCEnFd+UA7s=
In-Reply-To: <Vhx*IxEUy@news.chiark.greenend.org.uk>
Content-Language: en-US
 by: BGB - Mon, 1 Aug 2022 18:58 UTC

On 8/1/2022 4:06 AM, Theo wrote:
> BGB <cr88192@gmail.com> wrote:
>> But, what sort of FPGA, exactly?...
>
> Cyclone V 5CEFA5F23C, 77k LE:
> http://www.apollo-computer.com/icedrakev4.php
>

From what I can gather, vs an XC7A100T:
Fewer LUTs (kinda hard to compare directly here);
Less Block RAM;
More IO pins;
Faster maximum internal clock speed.

>> Looking stuff up, it would appear that Cyclone 3 stats are in a similar
>> range to the Artix-7 family, albeit it would appear to be balanced
>> towards more logic elements but less block RAM.
>>
>> Not sure much beyond this, relative comparisons between Xilinx and
>> Altera FPGAs is a bit sparse, particularly for the lower-end families.
>
> Cyclone is Altera's 'cheap' FPGA line, fitting between the MAX CPLDs and the
> bigger Arria and Stratix parts. 'Cheap' means ~$100 list price, so not
> cheap for the rest of us. Cyclone V is the old mainstream part, Cyclone
> 10LP is I think a rebrand of the Cyclone IV, and Cyclone 10GX is higher end
> with transceivers.
>
> Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
> pretty antique as far as Arm cores go). Those are pretty comparable with
> the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
> there's no transceivers and no Arm, hence it's at the cheap end of the line
> (the A5 meaning 77k LE is the middle of the range).
>
> https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html
>
> This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).
> The I/Os are typically good to drive DDR3, which is what the Arm uses for
> DRAM.
>

On Artix, one is typically using FPGA logic and IO to drive the DDR.

In my case, I am driving the DDR2 module at 50MHz on the board I have.

Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
in this case, and this is pushing the limits of the IO pin speeds.

One could drive it faster (via SERDES) but still not climbed that
ladder. Theoretical estimates were typically showing only modest
improvement from faster RAM IO speeds here.

Also still not climbed the AXI ladder either (would be needed to use
Vivado's MIG tool).

> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
> their distributor is almost certainly less.
>
> I'm not quite sure how it matches up with Xilinx, but I'd expect an Artix is
> probably comparable.
>

Yeah.

> With 800Kbyte of BRAM I think you could make some decent caches - after all
> the Amiga 500 only had 512Kbyte DRAM to begin with.
>

Probably.

In my case, with an XC7A100T, and a single CPU core, the maxed out
settings are basically:
256K L2 + 64K L1 I$ + 64K L1 D$

But, a fair chunk of block-RAM is eaten by internal overheads, like
tagging arrays (particularly for the L1s, which are roughly 50% tagging
overhead in this case).

Overhead is somewhat less for L2, given the L2 is using 64B cache lines
rather than 16B cache lines.

Where:
64B lines in the L2 allow OK bandwidth to external RAM;
16B lines in the L1 keep the LUT cost more modest.

A case could possibly be made here for making both be 32B though.
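[The overhead numbers are roughly fixed-cost arithmetic: metadata bits per line stay about constant, so shorter lines pay proportionally more. The 64 bits of tag+flags per line below is an assumed figure for illustration, not the actual layout.]

```python
# Tag-storage overhead as a fraction of data storage, per line size.

def tag_overhead(line_bytes, meta_bits=64):
    data_bits = line_bytes * 8
    return meta_bits / data_bits

for lb in (16, 32, 64):
    print(f"{lb}B lines: {tag_overhead(lb):.1%} overhead")
# 16B -> 50%, 32B -> 25%, 64B -> 12.5%
```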

I am mostly using 32K L1s at the moment, given:
Timing doesn't really like 64K L1s;
The bigger L1 caches would only slightly improve hit rate.

I got the bigger L2 at the cost of dropping the use of Block-RAM for the
framebuffer, and instead mapping it to RAM (via the L2 cache).

Practically, still mostly limited to around 128K VRAM:
My original MMIO interface design doesn't deal very well with more than
128K;
Going much bigger than this, and the L2 cache DRAM can't keep up with
"keeping the VGA refresh fed".

So, the actual "usable part" of the L2 is reduced somewhat, partly as it
gets repeatedly hit by the screen refresh (as it sweeps across the
frame-buffer at around 60 times per second).

Ironically though, because of the way it works, the bandwidth is
actually "slightly less" in the 800x600 mode, because it is still using
128K (at around 2bpp), but the effective screen refresh speed drops from
60Hz down to 36Hz (still not yet confirmed to work on a real monitor).
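As a rough sanity check on the claim above, one can model refresh bandwidth naively as buffer size times refresh rate (a sketch; this ignores blanking intervals and burst behavior):

```c
/* Refresh bandwidth modeled naively as framebuffer size times refresh
 * rate (ignores blanking intervals and burst behavior). */
long refresh_bw_bytes(long fb_bytes, long hz)
{
    return fb_bytes * hz;
}
```

With a 128K buffer this gives 7.5 MB/s at 60Hz versus 4.5 MB/s at 36Hz, consistent with the "slightly less" observation despite the higher resolution.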

Performance in 640x400 and 800x600 mode is partly limited also by things
like the need to use a color-cell encoder.

640x400 using a 4-bpp color cell format:
4x4 cells with two RGB555 endpoints, 2b per pixel (A, B, 3/8 + 5/8)
Packed into 256b 8x8 blocks though.
800x600 using a 2-bpp color cell format:
8x8 cells with RGB555 endpoints, 1b per pixel (A, B)
Looks "kinda awful" for Doom and similar.
Arguably "less awful" than a 4-color mode would look:
"Ah yeah, Black/White/Cyan/Magenta"...
This cell format works pretty well for text and similar at least.
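As a sketch, decoding one 4x4 cell of the 4-bpp format described above might look like this (names and packing order are illustrative guesses, not the actual BJX2 layout; the 3/8 + 5/8 blend weights follow the description):

```c
#include <stdint.h>

/* Blend two RGB555 values per channel as (3*a + 5*b) / 8. */
static uint16_t blend_rgb555(uint16_t a, uint16_t b)
{
    uint16_t r = (3 * ((a >> 10) & 31) + 5 * ((b >> 10) & 31)) >> 3;
    uint16_t g = (3 * ((a >>  5) & 31) + 5 * ((b >>  5) & 31)) >> 3;
    uint16_t bl = (3 * (a & 31) + 5 * (b & 31)) >> 3;
    return (r << 10) | (g << 5) | bl;
}

/* ca/cb: the two RGB555 endpoints; sel: 16 pixels x 2 bits,
 * selecting A, B, or one of the two 3/8-5/8 blends per pixel. */
void decode_cell_4bpp(uint16_t ca, uint16_t cb, uint32_t sel, uint16_t out[16])
{
    uint16_t pal[4] = { ca, cb, blend_rgb555(ca, cb), blend_rgb555(cb, ca) };
    for (int i = 0; i < 16; i++)
        out[i] = pal[(sel >> (2 * i)) & 3];
}
```

The 2-bpp 8x8 format would be the degenerate case where the selector is 1 bit per pixel and only the A/B entries exist.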

If one uses the "draw into 16bpp framebuffer and then color-cell encode
this to VRAM" approach, hard to get much over 10Hz in 640x400 mode or
6Hz in 800x600 (not really all that usable for games or similar; would
probably work mostly OK for GUI or similar).

An optimization though is to flag which parts of the screen have been
updated, and then skip color-cell encoding for unchanged parts of the
screen.

Say, for a GUI-like scenario, running Doom or similar:
Doom draws to its internal framebuffer;
This is copied to the window's backbuffer via a bitmap draw operation;
Relevant parts of windows' bitmap are marked dirty.
If a window has been marked dirty:
Color or pattern fill the screen buffer
This is itself kinda expensive at 640x400 (roughly a 512K memset)
Window stack is redrawn to display's screen buffer;
Color-cell encode screen buffer, copy to VRAM.

Not entirely sure how early GUIs worked, so maybe they had more
efficient approaches.

For 320x200 (what I am mostly using for 3D stuff), the 128K of VRAM can
give a bitmapped RGB555 mode, which is a little more usable for this
(mostly just need to redraw frames and then copy them to VRAM unchanged).

> Theo

Re: 64 bit 68080 CPU

<tc9c81$13df2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26990&group=comp.arch#26990

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 15:14:53 -0500
Organization: A noiseless patient Spider
Lines: 63
Message-ID: <tc9c81$13df2$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 1 Aug 2022 20:14:57 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9d32d29e45373b9b98f3c5e5d5523fca";
logging-data="1160674"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/YcB1ghMUKQ/dQ1E+Hwj+z"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:b1Fjb16EbOKx6Ts1MpsnIhZHGo0=
In-Reply-To: <2022Aug1.105418@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: BGB - Mon, 1 Aug 2022 20:14 UTC

On 8/1/2022 3:54 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Find a way to reduce the cost of dealing with pipeline stall signals;
>
> It seems to me that OoO helps with that, and the 68080 is OoO.
>

Possible, I would have figured OoO would have been a bit steep for an
FPGA, but they seem to be managing with what looks like "not too
unreasonable" FPGAs, so dunno...

>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>> how to do Reg/Mem ops without introducing significant complexities or
>> penalty cases?...
>
> Intel and AMD are doing it just fine with OoO implementations. The
> 68020 (and consequently the 68080) has the additional complexity of
> memory-indirect addressing, but the main problem I see here is one of
> verification (can you guarantee forward progress in the presence of
> TLB misses and page faults), not performance problems.
>

To support Reg/Mem "in general", seems like it would need mechanisms for:
Perform a Load, then perform the operation (Load-Op);
Ability to perform certain ops directly in the L1 cache.

Latter is assuming that only a subset of operations support memory as a
destination, which appears to be the case in x86 at least
(interestingly, both RISC-V's 'A' extension and SH-2A ended up with some
operations to operate directly on memory in roughly the same scenarios).

In my case, a limited form of LoadOp already exists for the "FMOV.S" and
the "LDTEX" instruction (these perform work on the loaded value in EX3).

Something like Load+ADDS.L or similar is not entirely implausible (could
probably fit within ~ 1 cycle).

However, something like extending EX to 5 stages, or needing "split this
operation into micro-ops" logic, would be a little steep. The former
would adversely affect branch latency (and increase LUT cost), the
latter would cause these instructions to perform rather poorly
(defeating the purpose of adding them).

Other options would likely require rethinking the pipeline (such as
trying to find some way to shove a Load into the ID stages; and all of
the consequences this would entail).

Say, for example, if ID1/ID2 had some special read ports, and special
cases for "Assume op is Load or Load-Op", which could then allow Load-Op
within similar pipeline latency, but would require some additional
interlock-stage handling.

Adding an EX4 and EX5 stage could be argued for on the basis that, while
it would increase branch latency and LUT cost, it could potentially also
be used to allow for fully pipelined FPU instructions (eg: could allow
turning FADD and FMUL from 6C operations into 5L/1T operations).

....

Re: 64 bit 68080 CPU

<memo.20220801212733.11788c@jgd.cix.co.uk>

https://www.novabbs.com/devel/article-flat.php?id=26991&group=comp.arch#26991

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: jgd...@cix.co.uk (John Dallman)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 21:27 +0100 (BST)
Organization: A noiseless patient Spider
Lines: 10
Message-ID: <memo.20220801212733.11788c@jgd.cix.co.uk>
References: <tc6mng$fm5h$1@dont-email.me>
Reply-To: jgd@cix.co.uk
Injection-Info: reader01.eternal-september.org; posting-host="fe8f0c65f8382bceba0ffc412e6dd3af";
logging-data="1163232"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18lPlRBCN/ZPSQzOSqxKxvfk+7dhgt30Ck="
Cancel-Lock: sha1:cdvJI053E/EnY02tl+Gg12ceH0Q=
 by: John Dallman - Mon, 1 Aug 2022 20:27 UTC

In article <tc6mng$fm5h$1@dont-email.me>, ggtgp@yahoo.com (Brett) wrote:

> Mostly Amiga upgrades with antique Apollo workstations also
> mentioned, and probably lots of embedded systems for machinery?

It all looks to be Amiga to me; ApolloOS is a fork of
<https://en.wikipedia.org/wiki/AROS_Research_Operating_System>, which is
a compatible re-creation of AmigaOS 3.1.

John

Re: 64 bit 68080 CPU

<779a18dc-9159-4527-9132-fe02c30fb6acn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26994&group=comp.arch#26994

Newsgroups: comp.arch
X-Received: by 2002:a0c:9a43:0:b0:474:9845:a110 with SMTP id q3-20020a0c9a43000000b004749845a110mr13426780qvd.111.1659389916636;
Mon, 01 Aug 2022 14:38:36 -0700 (PDT)
X-Received: by 2002:ac8:7d8f:0:b0:31f:cea:9bfd with SMTP id
c15-20020ac87d8f000000b0031f0cea9bfdmr15331931qtd.513.1659389916484; Mon, 01
Aug 2022 14:38:36 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 1 Aug 2022 14:38:36 -0700 (PDT)
In-Reply-To: <tc9c81$13df2$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:bc24:120a:ec70:7071;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:bc24:120a:ec70:7071
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at> <tc9c81$13df2$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <779a18dc-9159-4527-9132-fe02c30fb6acn@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 01 Aug 2022 21:38:36 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 4674
 by: MitchAlsup - Mon, 1 Aug 2022 21:38 UTC

On Monday, August 1, 2022 at 3:15:00 PM UTC-5, BGB wrote:
> On 8/1/2022 3:54 AM, Anton Ertl wrote:
> > BGB <cr8...@gmail.com> writes:
> >> Find a way to reduce the cost of dealing with pipeline stall signals;
> >
> > It seems to me that OoO helps with that, and the 68080 is OoO.
> >
> Possible, I would have figured OoO would have been a bit steep for an
> FPGA, but they seem to be managing with what looks like "not too
> unreasonable" FPGAs, so dunno...
> >> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >> how to do Reg/Mem ops without introducing significant complexities or
> >> penalty cases?...
> >
> > Intel and AMD are doing it just fine with OoO implementations. The
> > 68020 (and consequently the 68080) has the additional complexity of
> > memory-indirect addressing, but the main problem I see here is one of
> > verification (can you guarantee forward progress in the presence of
> > TLB misses and page faults), not performance problems.
> >
> To support Reg/Mem "in general", seems like it would need mechanisms for:
> Perform a Load, then perform the operation (Load-Op);
<
For these ISAs, you build the pipeline as::
<
|FETCH |DECODE| AGEN |CACHE|EXECUTE|WRITEB|
<
As I stated above. No OoO is needed, but OoO does not hurt, either.
<
> Ability to perform certain ops directly in the L1 cache.
>
A 16KB L1 cache (small) is already bigger than the register file, forwarding
logic, and all integer execution stuff, and other interfaces this section talks
to. Adding RMW operations to the cache does not add "that much" logic or
"that much" to verification.
>
> Latter is assuming that only a subset of operations support memory as a
> destination, which appears to be the case in x86 at least
> (interestingly, both RISC-V's 'A' extension and SH-2A ended up with some
> operations to operate directly on memory in roughly the same scenarios).
>
> In my case, a limited form of LoadOp already exists for the "FMOV.S" and
> the "LDTEX" instruction (these perform work on the loaded value in EX3).
>
> Something like Load+ADDS.L or similar is not entirely implausible (could
> probably fit within ~ 1 cycle).
>
>
> However, something like extending EX to 5 stages, or needing "split this
> operation into micro-ops" logic, would be a little steep. The former
> would adversely affect branch latency (and increase LUT cost), the
> latter would cause these instructions to rather perform poorly
> (defeating the purpose of adding them).
>
> Other options would likely require rethinking the pipeline (such as
> trying to find some way to shove a Load into the ID stages; and all of
> the consequences this would entail).
>
> Say, for example, if ID1/ID2 had some special read ports, and special
> cases for "Assume op is Load or Load-Op", which could then allow Load-Op
> within similar pipeline latency, but would require some additional
> interlock-stage handling.
>
>
> Adding an EX4 and EX5 stage could be argued for on the basis that, while
> it would increase branch latency and LUT cost, it could potentially also
> be used to allow for fully pipelined FPU instructions (eg: could allow
> turning FADD and FMUL from 6C operations into 5L/1T operations).
>
> ...

Re: 64 bit 68080 CPU

<f048e927-e5cd-4558-9950-e031a2131719n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26995&group=comp.arch#26995

Newsgroups: comp.arch
X-Received: by 2002:ac8:5a48:0:b0:31e:f288:3d68 with SMTP id o8-20020ac85a48000000b0031ef2883d68mr16141955qta.111.1659392320534;
Mon, 01 Aug 2022 15:18:40 -0700 (PDT)
X-Received: by 2002:a05:622a:198f:b0:31f:cc6:b082 with SMTP id
u15-20020a05622a198f00b0031f0cc6b082mr16388484qtc.220.1659392320220; Mon, 01
Aug 2022 15:18:40 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 1 Aug 2022 15:18:40 -0700 (PDT)
In-Reply-To: <Vhx*IxEUy@news.chiark.greenend.org.uk>
Injection-Info: google-groups.googlegroups.com; posting-host=2a0d:6fc2:55b0:ca00:65dd:bb7:48b5:b678;
posting-account=ow8VOgoAAAAfiGNvoH__Y4ADRwQF1hZW
NNTP-Posting-Host: 2a0d:6fc2:55b0:ca00:65dd:bb7:48b5:b678
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <Vhx*IxEUy@news.chiark.greenend.org.uk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f048e927-e5cd-4558-9950-e031a2131719n@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: already5...@yahoo.com (Michael S)
Injection-Date: Mon, 01 Aug 2022 22:18:40 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3716
 by: Michael S - Mon, 1 Aug 2022 22:18 UTC

On Monday, August 1, 2022 at 12:06:23 PM UTC+3, Theo wrote:
> BGB <cr8...@gmail.com> wrote:
> > But, what sort of FPGA, exactly?...
> Cyclone V 5CEFA5F23C, 77k LE:
> http://www.apollo-computer.com/icedrakev4.php
> > Looking stuff up, it would appear that Cyclone 3 stats are in a similar
> > range to the Artix-7 family, albeit it would appear to be balanced
> > towards more logic elements but less block RAM.
> >
> > Not sure much beyond this, relative comparisons between Xilinx and
> > Altera FPGAs is a bit sparse, particularly for the lower-end families.
> Cyclone is Altera's 'cheap' FPGA line, fitting between the MAX CPLDs and the
> bigger Arria and Stratix parts. 'Cheap' means ~$100 list price, so not
> cheap for the rest of us. Cyclone V is the old mainstream part, Cyclone
> 10LP is I think a rebrand of the Cyclone IV, and Cyclone 10GX is higher end
> with transceivers.
>
> Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
> pretty antique as far as Arm cores go). Those are pretty comparable with
> the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
> there's no transceivers and no Arm, hence it's at the cheap end of the line
> (the A5 meaning 77k LE is the middle of the range).
>
> https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html
>
> This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).
> The I/Os are typically good to drive DDR3, which is what the Arm uses for
> DRAM.
>
> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
> their distributor is almost certainly less.
>

Maybe things have changed for the better in recent months, but, say, a year
ago it was practically impossible to buy Cyclone-4E/10LP or Cyclone-5 from
official distributors. I.e., formally they would accept orders, but with lead
times of 50-60 weeks, and even that with no guarantee of delivery.

> I'm not quite sure how it matches up with Xilinx, but I'd expect an Artix is
> probably comparable.

Except that the Xilinx 7 family has more trouble than Cyclone-5 dealing with
"traditional I/O", i.e. anything non-differential and above 1.8V.
In that regard the 28nm Artix-7 is more similar to the 20nm
Arria-10/Cyclone-10GX than to the 28nm Cyclone-5.

>
> With 800Kbyte of BRAM I think you could make some decent caches - after all
> the Amiga 500 only had 512Kbyte DRAM to begin with.
>
> Theo

Re: 64 bit 68080 CPU

<tc9njg$14gll$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26996&group=comp.arch#26996

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 18:28:44 -0500
Organization: A noiseless patient Spider
Lines: 188
Message-ID: <tc9njg$14gll$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at>
<tc9c81$13df2$1@dont-email.me>
<779a18dc-9159-4527-9132-fe02c30fb6acn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 1 Aug 2022 23:28:49 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="90a2979674897695e370541a0bfe4d09";
logging-data="1196725"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18EPGRU1QRTYpcCResqv451"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:W2k5IRJtVpZVjaGTApV8+hyUm1g=
In-Reply-To: <779a18dc-9159-4527-9132-fe02c30fb6acn@googlegroups.com>
Content-Language: en-US
 by: BGB - Mon, 1 Aug 2022 23:28 UTC

On 8/1/2022 4:38 PM, MitchAlsup wrote:
> On Monday, August 1, 2022 at 3:15:00 PM UTC-5, BGB wrote:
>> On 8/1/2022 3:54 AM, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> Find a way to reduce the cost of dealing with pipeline stall signals;
>>>
>>> It seems to me that OoO helps with that, and the 68080 is OoO.
>>>
>> Possible, I would have figured OoO would have been a bit steep for an
>> FPGA, but they seem to be managing with what looks like "not too
>> unreasonable" FPGAs, so dunno...
>>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>>>> how to do Reg/Mem ops without introducing significant complexities or
>>>> penalty cases?...
>>>
>>> Intel and AMD are doing it just fine with OoO implementations. The
>>> 68020 (and consequently the 68080) has the additional complexity of
>>> memory-indirect addressing, but the main problem I see here is one of
>>> verification (can you guarantee forward progress in the presence of
>>> TLB misses and page faults), not performance problems.
>>>
>> To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
> <
> For these ISAs, you build the pipeline as::
> <
> |FETCH |DECODE| AGEN |CACHE|EXECUTE|WRITEB|
> <
> As I stated above. No OoO is needed, but OoO does not hurt, either.
> <

I was thinking of in-order...

In my case, the pipeline is:
~ PF (overlaps with ID1)
IF (Instruction Fetch)
ID1 (Decode)
ID2 (Register Fetch)
EX1
EX2
EX3
~ WB (Pseudo stage)

Could, in theory, move AGU from EX1 to ID2 to buy an extra cycle here,
but probably could not move it to ID1 without causing problems.

This could, in premise, allow Load-Op in EX2 and EX3 without lengthening
the pipeline.

As-is, Binary32 load and LDTEX have their logic shoved into the EX3
stage. Could in theory allow for a few 32-bit ALU ops or similar.

Though, while Load+ADD occurs semi-frequently, it looks like Load+CMP is
a somewhat more common case:
CMPxx Imm, (Rm, Disp)
Would likely be able to "hit" pretty often, if it existed.

So, a few common sequences here:
MOV.L (Rm, Disp), Rs; CMPxx Imm, Rs
MOV.L (Rm1, Disp), Rs; MOV.L (Rm2, Disp), Rt; CMPxx Rt, Rs
MOV.L (Rm, Disp), Rs; ADDS.L Rs, Imm, Rn
MOV.L (Rm, Disp), Rs; ADDS.L Rs, Rt, Rn

I suspect a lot of the Load+CMP sequences are for cases where a loop
counter is used with a "for()" loop or similar, but then ends up being
evicted.

Another common case here seems to be:
MOV.L (Rm, Disp), Rt; ADD Imm, Rt; MOV.L Rt, (Rm, Disp)

This seems to be another common case of the loop counter being evicted.

Then again, I suspect my ranking logic may not be counting the loop
counters as being "inside" the loop (except when referenced within the
loop body).
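For illustration, the sort of C code that tends to emit these sequences (a hypothetical example; taking the counter's address typically forces it to be stack-resident, so each increment becomes a Load/ADD/Store):

```c
/* Opaque use of the pointer: discourages the compiler from keeping
 * 'i' in a register for the whole loop. */
static void observe(int *p) { (void)p; }

int sum_first_n(int n)
{
    int i, s = 0;
    observe(&i);             /* address taken: 'i' lives in memory */
    for (i = 0; i < n; i++)  /* "i++" here can become MOV.L; ADD; MOV.L */
        s += i;
    return s;
}
```

Whether the counter actually stays evicted depends on the compiler's register allocator, which is the ranking-logic question above.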

>> Ability to perform certain ops directly in the L1 cache.
>>
> A 16KB L1 cache (small) is already bigger than the register file, forwarding
> logic, and all integer execution stuff, and other interfaces this section talks
> to. Adding RMW operations to the cache does not add "that much" logic or
> "that much" to verification.

The way L1 works in my case:
AGU happens (externally);
Calculate index and similar to fetch from (low bits of address);
-- (Clock Edge, Fetch goes here)
Check whether or not request missed;
Extract value from cache lines;
-- (Clock Edge)
Generate modified cache line with store value inserted back;
Initiate store of modified cache lines;
-- (Clock Edge, Store happens here)

In theory, could shove a few 32-bit ALU ops in there (between the Load
and Store parts of the logic), and would need this if I wanted to
support the RISC-V 'A' extension, but, "Is it worth it?"...
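Modeled in software, slotting an ALU op between the load and store halves would look roughly like this (illustrative only; the op set loosely mirrors what the RISC-V 'A' extension's AMOs would need):

```c
#include <stdint.h>

enum rmw_op { RMW_ADD, RMW_AND, RMW_OR, RMW_XOR, RMW_XCHG };

/* line: cache line viewed as 32-bit words; idx: word index derived from
 * the low address bits.  Returns the old value, i.e. what a Load-Op
 * would forward to the register file. */
uint32_t l1_rmw_word(uint32_t *line, int idx, enum rmw_op op, uint32_t val)
{
    uint32_t old = line[idx];   /* "Extract value from cache lines"     */
    uint32_t nv;
    switch (op) {               /* the inserted ALU stage               */
    case RMW_ADD: nv = old + val; break;
    case RMW_AND: nv = old & val; break;
    case RMW_OR:  nv = old | val; break;
    case RMW_XOR: nv = old ^ val; break;
    default:      nv = val;       break;  /* XCHG */
    }
    line[idx] = nv;             /* store modified line back             */
    return old;
}
```

A Load-Op would consume the returned old value; a pure RMW store would simply discard it.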

But, then again, if it could deal effectively with:
"i++;" (where 'i' is currently located on the stack)
It could potentially be worthwhile from a performance POV.

If I were to add these instructions, they would probably be as Op64
encodings or similar though.

But, even as such, "i++;" as an Op64 is arguably still more compact than
"i++;" as "Load; ADD; Store" (with 32-bit instruction encodings), and
could be made ~ 1 cycle, vs ~ 6 cycles for the 3-op sequence.

Would still need to think a bit about how these instructions would be
encoded though.
Possibly:
FFw0_0Vpp_F1nm_Zedd ?

Well, or hack it onto the existing RiMOV encoding:
FFw0_Pvdd_F0nm_0eoZ
P: 0..7: Ld/St, XCHG.x, ADD.x, SUB.x, -, AND.x, OR.x, XOR.x
8..F: Same, but, Rn is an Imm6u? (Store Only)
Loads understood as Load-Op, and Stores as RMW.

Such an encoding would theoretically allow for stuff like, say:
SUB.W R39, (R4, R6, 122)
ADD.B 55, (R4, R6, 69)
ADDS.L (R4, R6, 44), R45

While CMPxx is technically more common than ADD/SUB by a quick skim,
this would be harder to add without stepping on some ugly edge cases.

It would also likely be limited to Byte/Word/DWord "for reasons".

Might make sense to model some of this first, to try to see if it would
gain enough to make it worthwhile (well, it is either that, or add it to
the Verilog first to see if it can be added without effectively dropping
a nuke on resource cost or timing...).

Then again, I don't expect costs to be quite as bad as my failed "add a
second Load port" experiment, and this does have "could allow supporting
the RV64 'A' extension and similar" as an upshot (well, and possibly
also help if I wanted to add an x86 JIT compiler, but this would still
likely also need helper logic for EFLAGS emulation and similar).

>>
>> Latter is assuming that only a subset of operations support memory as a
>> destination, which appears to be the case in x86 at least
>> (interestingly, both RISC-V's 'A' extension and SH-2A ended up with some
>> operations to operate directly on memory in roughly the same scenarios).
>>
>> In my case, a limited form of LoadOp already exists for the "FMOV.S" and
>> the "LDTEX" instruction (these perform work on the loaded value in EX3).
>>
>> Something like Load+ADDS.L or similar is not entirely implausible (could
>> probably fit within ~ 1 cycle).
>>
>>
>> However, something like extending EX to 5 stages, or needing "split this
>> operation into micro-ops" logic, would be a little steep. The former
>> would adversely affect branch latency (and increase LUT cost), the
>> latter would cause these instructions to rather perform poorly
>> (defeating the purpose of adding them).
>>
>> Other options would likely require rethinking the pipeline (such as
>> trying to find some way to shove a Load into the ID stages; and all of
>> the consequences this would entail).
>>
>> Say, for example, if ID1/ID2 had some special read ports, and special
>> cases for "Assume op is Load or Load-Op", which could then allow Load-Op
>> within similar pipeline latency, but would require some additional
>> interlock-stage handling.
>>
>>
>> Adding an EX4 and EX5 stage could be argued for on the basis that, while
>> it would increase branch latency and LUT cost, it could potentially also
>> be used to allow for fully pipelined FPU instructions (eg: could allow
>> turning FADD and FMUL from 6C operations into 5L/1T operations).
>>
>> ...

Re: 64 bit 68080 CPU

<4a137c84-b06e-46c3-adff-577f1128dcc6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26997&group=comp.arch#26997

Newsgroups: comp.arch
X-Received: by 2002:a05:620a:13cf:b0:6b5:ed16:fc69 with SMTP id g15-20020a05620a13cf00b006b5ed16fc69mr13853281qkl.416.1659401004338;
Mon, 01 Aug 2022 17:43:24 -0700 (PDT)
X-Received: by 2002:ad4:5ca9:0:b0:474:9143:6ffc with SMTP id
q9-20020ad45ca9000000b0047491436ffcmr15639763qvh.19.1659401004202; Mon, 01
Aug 2022 17:43:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 1 Aug 2022 17:43:24 -0700 (PDT)
In-Reply-To: <tc9njg$14gll$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:bc24:120a:ec70:7071;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:bc24:120a:ec70:7071
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at>
<tc9c81$13df2$1@dont-email.me> <779a18dc-9159-4527-9132-fe02c30fb6acn@googlegroups.com>
<tc9njg$14gll$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4a137c84-b06e-46c3-adff-577f1128dcc6n@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 02 Aug 2022 00:43:24 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 9437
 by: MitchAlsup - Tue, 2 Aug 2022 00:43 UTC

On Monday, August 1, 2022 at 6:28:52 PM UTC-5, BGB wrote:
> On 8/1/2022 4:38 PM, MitchAlsup wrote:
> > On Monday, August 1, 2022 at 3:15:00 PM UTC-5, BGB wrote:
> >> On 8/1/2022 3:54 AM, Anton Ertl wrote:
> >>> BGB <cr8...@gmail.com> writes:
> >>>> Find a way to reduce the cost of dealing with pipeline stall signals;
> >>>
> >>> It seems to me that OoO helps with that, and the 68080 is OoO.
> >>>
> >> Possible, I would have figured OoO would have been a bit steep for an
> >> FPGA, but they seem to be managing with what looks like "not too
> >> unreasonable" FPGAs, so dunno...
> >>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >>>> how to do Reg/Mem ops without introducing significant complexities or
> >>>> penalty cases?...
> >>>
> >>> Intel and AMD are doing it just fine with OoO implementations. The
> >>> 68020 (and consequently the 68080) has the additional complexity of
> >>> memory-indirect addressing, but the main problem I see here is one of
> >>> verification (can you guarantee forward progress in the presence of
> >>> TLB misses and page faults), not performance problems.
> >>>
> >> To support Reg/Mem "in general", seems like it would need mechanisms for:
> >> Perform a Load, then perform the operation (Load-Op);
> > <
> > For these ISAs, you build the pipeline as::
> > <
> > |FETCH |DECODE| AGEN |CACHE|EXECUTE|WRITEB|
> > <
> > As I stated above. No OoO is needed, but OoO does not hurt, either.
> > <
> I was thinking of in-order...
>
> In my case, the pipeline is:
> ~ PF (overlaps with ID1)
> IF (Instruction Fetch)
> ID1 (Decode)
> ID2 (Register Fetch)
> EX1
> EX2
> EX3
> ~ WB (Pseudo stage)
>
> Could, in theory, move AGU from EX1 to ID2 to buy an extra cycle here,
> but probably could not move it to ID1 without causing problems.
>
> This could, in premise, allow Load-Op in EX2 and EX3 without lengthening
> the pipeline.
>
>
> As-is, Binary32 load and LDTEX have their logic shoved into the EX3
> stage. Could in theory allow for a few 32-bit ALU ops or similar.
>
> Though, while Load+ADD occurs semi-frequently, it looks like Load+CMP is
> a somewhat more common case:
> CMPxx Imm, (Rm, Disp)
> Would likely be able to "hit" pretty often, if it existed.
>
> So, a few common sequences here:
> MOV.L (Rm, Disp), Rs; CMPxx Imm, Rs
> MOV.L (Rm1, Disp), Rs; MOV.L (Rm2, Disp), Rt; CMPxx Rt, Rs
> MOV.L (Rm, Disp), Rs; ADDS.L Rs, Imm, Rn
> MOV.L (Rm, Disp), Rs; ADDS.L Rs, Rt, Rn
>
>
> I suspect a lot of the Load+CMP sequences are for cases where a loop
> counter is used with a "for()" loop or similar, but then ends up being
> evicted.
>
> Another common case here seems to be:
> MOV.L (Rm, Disp), Rt; ADD Imm, Rt; MOV.L Rt, (Rm, Disp)
>
> This seems to be another common case of the loop counter being evicted.
>
> Then again, I suspect my ranking logic may not be counting the loop
> counters as being "inside" the loop (except when referenced within the
> loop body).
> >> Ability to perform certain ops directly in the L1 cache.
> >>
> > A 16KB L1 cache (small) is already bigger than the register file, forwarding
> > logic, and all integer execution stuff, and other interfaces this section talks
> > to. Adding RMW operations to the cache does not add "that much" logic or
> > "that much" to verification.
> The way L1 works in my case:
> AGU happens (externally);
> Calculate index and similar to fetch from (low bits of address);
> -- (Clock Edge, Fetch goes here)
> Check whether or not request missed;
> Extract value from cache lines;
> -- (Clock Edge)
> Generate modified cache line with store value inserted back;
> Initiate store of modified cache lines;
<
Drop this into the Store Buffer and opportunistically wait for a WB
cycle into the Cache.
<
> -- (Clock Edge, Store happens here)
>
When there is a ST or no-mem-ref you can commit the pending store
to the data portion of the cache.
>
> In theory, could shove a few 32-bit ALU ops in there (between the Load
> and Store parts of the logic), and would need this if I wanted to
> support the RISC-V 'A' extension, but, "Is it worth it?"...
>
> But, then again, if it could deal effectively with:
> "i++;" (where 'i' is currently located on the stack)
> It could potentially be worthwhile from a performance POV.
>
>
> If I were to add these instructions, they would probably be as Op64
> encodings or similar though.
>
> But, even as such, "i++;" as an Op64 is arguably still more compact than
> "i++;" as "Load; ADD; Store" (with 32-bit instruction encodings), and
> could be made ~ 1 cycle, vs ~ 6 cycles for the 3-op sequence.
>
> Would still need to think a bit about how these instructions would be
> encoded though.
> Possibly:
> FFw0_0Vpp_F1nm_Zedd ?
>
> Well, or hack it onto the existing RiMOV encoding:
> FFw0_Pvdd_F0nm_0eoZ
> P: 0..7: Ld/St, XCHG.x, ADD.x, SUB.x, -, AND.x, OR.x, XOR.x
> 8..F: Same, but, Rn is an Imm6u? (Store Only)
> Loads understood as Load-Op, and Stores as RMW.
>
> Such an encoding would theoretically allow for stuff like, say:
> SUB.W R39, (R4, R6, 122)
> ADD.B 55, (R4, R6, 69)
> ADDS.L (R4, R6, 44), R45
>
>
> While CMPxx is technically more common than ADD/SUB by a quick skim,
> this would be harder to add without stepping on some ugly edge cases.
>
> It would also likely be limited to Byte/Word/DWord "for reasons".
>
>
> Might make sense to model some of this first, to try to see if it would
> gain enough to make it worthwhile (well, it is either that, or add it to
> the Verilog first to see if it can be added without effectively dropping
> a nuke on resource cost or timing...).
>
>
> Then again, I don't expect costs to be quite as bad as my failed "add a
> second Load port" experiment, and this does have "could allow supporting
> the RV64 'A' extension and similar" as an upshot (well, and possibly
> also help if I wanted to add an x86 JIT compiler, but this would still
> likely also need helper logic for EFLAGS emulation and similar).
> >>
> >> Latter is assuming that only a subset of operations support memory as a
> >> destination, which appears to be the case in x86 at least
> >> (interestingly, both RISC-V's 'A' extension and SH-2A ended up with some
> >> operations to operate directly on memory in roughly the same scenarios).
> >>
> >> In my case, a limited form of LoadOp already exists for the "FMOV.S" and
> >> the "LDTEX" instruction (these perform work on the loaded value in EX3).
> >>
> >> Something like Load+ADDS.L or similar is not entirely implausible (could
> >> probably fit within ~ 1 cycle).
> >>
> >>
> >> However, something like extending EX to 5 stages, or needing "split this
> >> operation into micro-ops" logic, would be a little steep. The former
> >> would adversely affect branch latency (and increase LUT cost), the
> >> latter would cause these instructions to rather perform poorly
> >> (defeating the purpose of adding them).
> >>
> >> Other options would likely require rethinking the pipeline (such as
> >> trying to find some way to shove a Load into the ID stages; and all of
> >> the consequences this would entail).
> >>
> >> Say, for example, if ID1/ID2 had some special read ports, and special
> >> cases for "Assume op is Load or Load-Op", which could then allow Load-Op
> >> within similar pipeline latency, but would require some additional
> >> interlock-stage handling.
> >>
> >>
> >> Adding an EX4 and EX5 stage could be argued for on the basis that, while
> >> it would increase branch latency and LUT cost, it could potentially also
> >> be used to allow for fully pipelined FPU instructions (eg: could allow
> >> turning FADD and FMUL from 6C operations into 5L/1T operations).
> >>
> >> ...


Re: 64 bit 68080 CPU

<tcaak9$19kqm$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=26998&group=comp.arch#26998

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Mon, 1 Aug 2022 23:53:25 -0500
Organization: A noiseless patient Spider
Lines: 242
Message-ID: <tcaak9$19kqm$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at>
<tc9c81$13df2$1@dont-email.me>
<779a18dc-9159-4527-9132-fe02c30fb6acn@googlegroups.com>
<tc9njg$14gll$1@dont-email.me>
<4a137c84-b06e-46c3-adff-577f1128dcc6n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 2 Aug 2022 04:53:29 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="90a2979674897695e370541a0bfe4d09";
logging-data="1364822"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19FDQCyYD0YHfoYaU8n9nkP"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:xbVpVgfkMaTRuTjAlTBguBuwETw=
In-Reply-To: <4a137c84-b06e-46c3-adff-577f1128dcc6n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 2 Aug 2022 04:53 UTC

On 8/1/2022 7:43 PM, MitchAlsup wrote:
> On Monday, August 1, 2022 at 6:28:52 PM UTC-5, BGB wrote:
>> On 8/1/2022 4:38 PM, MitchAlsup wrote:
>>> On Monday, August 1, 2022 at 3:15:00 PM UTC-5, BGB wrote:
>>>> On 8/1/2022 3:54 AM, Anton Ertl wrote:
>>>>> BGB <cr8...@gmail.com> writes:
>>>>>> Find a way to reduce the cost of dealing with pipeline stall signals;
>>>>>
>>>>> It seems to me that OoO helps with that, and the 68080 is OoO.
>>>>>
>>>> Possible, I would have figured OoO would have been a bit steep for an
>>>> FPGA, but they seem to be managing with what looks like "not too
>>>> unreasonable" FPGAs, so dunno...
>>>>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>>>>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>>>>>> how to do Reg/Mem ops without introducing significant complexities or
>>>>>> penalty cases?...
>>>>>
>>>>> Intel and AMD are doing it just fine with OoO implementations. The
>>>>> 68020 (and consequently the 68080) has the additional complexity of
>>>>> memory-indirect addressing, but the main problem I see here is one of
>>>>> verification (can you guarantee forward progress in the presence of
>>>>> TLB misses and page faults), not performance problems.
>>>>>
>>>> To support Reg/Mem "in general", seems like it would need mechanisms for:
>>>> Perform a Load, then perform the operation (Load-Op);
>>> <
>>> For these ISAs, you build the pipeline as::
>>> <
>>> |FETCH |DECODE| AGEN |CACHE|EXECUTE|WRITEB|
>>> <
>>> As I stated above. No OoO is needed, but OoO does not hurt, either.
>>> <
>> I was thinking of in-order...
>>
>> In my case, the pipeline is:
>> ~ PF (overlaps with ID1)
>> IF (Instruction Fetch)
>> ID1 (Decode)
>> ID2 (Register Fetch)
>> EX1
>> EX2
>> EX3
>> ~ WB (Pseudo stage)
>>
>> Could, in theory, move AGU from EX1 to ID2 to buy an extra cycle here,
>> but probably could not move it to ID1 without causing problems.
>>
>> This could, in premise, allow Load-Op in EX2 and EX3 without lengthening
>> the pipeline.
>>
>>
>> As-is, Binary32 load and LDTEX have their logic shoved into the EX3
>> stage. Could in theory allow for a few 32-bit ALU ops or similar.
>>
>> Though, while Load+ADD occurs semi-frequently, it looks like Load+CMP is
>> a somewhat more common case:
>> CMPxx Imm, (Rm, Disp)
>> Would likely be able to "hit" pretty often, if it existed.
>>
>> So, a few common sequences here:
>> MOV.L (Rm, Disp), Rs; CMPxx Imm, Rs
>> MOV.L (Rm1, Disp), Rs; MOV.L (Rm2, Disp), Rt; CMPxx Rt, Rs
>> MOV.L (Rm, Disp), Rs; ADDS.L Rs, Imm, Rn
>> MOV.L (Rm, Disp), Rs; ADDS.L Rs, Rt, Rn
>>
>>
>> I suspect a lot of the Load+CMP sequences are for cases where a loop
>> counter is used with a "for()" loop or similar, but then ends up being
>> evicted.
>>
>> Another common case here seems to be:
>> MOV.L (Rm, Disp), Rt; ADD Imm, Rt; MOV.L Rt, (Rm, Disp)
>>
>> This seems to be another common case of the loop counter being evicted.
>>
>> Then again, I suspect my ranking logic may not be counting the loop
>> counters as being "inside" the loop (except when referenced withing the
>> loop body).
>>>> Ability to perform certain ops directly in the L1 cache.
>>>>
>>> A 16KB L1 cache (small) is already bigger than the register file, forwarding
>>> logic, and all integer execution stuff, and other interfaces this section talks
>>> to. Adding RMW operations to the cache does not add "that much" logic or
>>> "that much" to verification.
>> The way L1 works in my case:
>> AGU happens (externally);
>> Calculate index and similar to fetch from (low bits of address);
>> -- (Clock Edge, Fetch goes here)
>> Check whether or not request missed;
>> Extract value from cache lines;
>> -- (Clock Edge)
>> Generate modified cache line with store value inserted back;
>> Initiate store of modified cache lines;
> <
> Drop this into the Store Buffer and opportunistically wait for a WB
> cycle into the Cache.
> <
>> -- (Clock Edge, Store happens here)
>>
> When there is a ST or no-mem-ref you can commit the pending store
> to the data portion of the cache.

Went and added a small ALU into the L1 cache as an experiment, which
should theoretically be able to handle both LoadOp and StoreOp cases.
Cheaper than expected; timing seems to have survived (though it is a
little tight now).

The ALU basically sits between the logic for doing a load, and the logic
for doing a store, and if doing a LoadOp or StoreOp, it calculates the
value and puts it on both the loaded-value and stored-value paths (so it
goes where it needs to go).

New encodings are based on the existing RiMOV encodings:
FFw0_Pvdd_F0nm_0eoZ

Z: Gives the type of the Load/Store:
0..3: ST.B/ST.W/ST.L/ST.Q (Rm, Disp17s)
4..7: ST.B/ST.W/ST.L/ST.Q (Rm, Ro*Sc, Disp9u)
8..B: LD.B/LD.W/LD.L/LD.Q (Rm, Disp17s)
C..F: LD.B/LD.W/LD.L/LD.Q (Rm, Ro*Sc, Disp9u)
e: Sign vs Zero Extend for Loads; turns Store into LEA
v: Encodes scale and similar for Ro.
P: Encodes the Operator.
Ld/St, XCHG, ADD, SUB (Mem-Reg), SUB (Reg-Mem), AND, OR, XOR
8..F: Imm6u Store variants;
Load/LEA/...=Reserved
May be used for "other operations".

Load Operation: Perform the operation, Result goes into Rn
Store Operation: Perform the operation, Result goes into Mem

XCHG:
Encoded like a Load;
Value from Register is stored;
Rn gets filled with the value previously held in memory;
Effectively, the two values "pass by each other" in this case.

While an immediate that is limited to 6 bits in a 64-bit encoding is
possibly "kinda weak", probably still better than "not at all" (previous
case), and still allows the usual "load constant into a register and
store the register" semantics.

It is sufficient for INC/DEC, which is the main use-case it would likely
need to address.

Will need to do some more testing to try to determine its effectiveness
(if compiler support is added).

>>
>> In theory, could shove a few 32-bit ALU ops in there (between the Load
>> and Store parts of the logic), and would need this if I wanted to
>> support the RISC-V 'A' extension, but, "Is it worth it?"...
>>
>> But, then again, if it could deal effectively with:
>> "i++;" (where 'i' is currently located on the stack)
>> It could potentially be worthwhile from a performance POV.
>>
>>
>> If I were to add these instructions, they would probably be as Op64
>> encodings or similar though.
>>
>> But, even as such, "i++;" as an Op64 is arguably still more compact than
>> "i++;" as "Load; ADD; Store" (with 32-bit instruction encodings), and
>> could be made ~ 1 cycle, vs ~ 6 cycles for the 3-op sequence.
>>
>> Would still need to think a bit about how these instructions would be
>> encoded though.
>> Possibly:
>> FFw0_0Vpp_F1nm_Zedd ?
>>
>> Well, or hack it onto the existing RiMOV encoding:
>> FFw0_Pvdd_F0nm_0eoZ
>> P: 0..7: Ld/St, XCHG.x, ADD.x, SUB.x, -, AND.x, OR.x, XOR.x
>> 8..F: Same, but, Rn is an Imm6u? (Store Only)
>> Loads understood as Load-Op, and Stores as RMW.
>>
>> Such an encoding would theoretically allow for stuff like, say:
>> SUB.W R39, (R4, R6, 122)
>> ADD.B 55, (R4, R6, 69)
>> ADDS.L (R4, R6, 44), R45
>>
>>
>> While CMPxx is technically more common than ADD/SUB by a quick skim,
>> this would be harder to add without stepping on some ugly edge cases.
>>
>> It would also likely be limited to Byte/Word/DWord "for reasons".
>>
>>
>> Might make sense to model some of this first, to try to see if it would
>> gain enough to make it worthwhile (well, it is either that, or add it to
>> the Verilog first to see if it can be added without effectively dropping
>> a nuke on resource cost or timing...).
>>
>>
>> Then again, I don't expect costs to be quite as bad as my failed "add a
>> second Load port" experiment, and this does have "could allow supporting
>> the RV64 'A' extension and similar" as an upshot (well, and possibly
>> also help if I wanted to add an x86 JIT compiler, but this would still
>> likely also need helper logic for EFLAGS emulation and similar).
>>>>
>>>> Latter is assuming that only a subset of operations support memory as a
>>>> destination, which appears to be the case in x86 at least
>>>> (interestingly, both RISC-V's 'A' extension and SH-2A ended up with some
>>>> operations to operate directly on memory in roughly the same scenarios).
>>>>
>>>> In my case, a limited form of LoadOp already exists for the "FMOV.S" and
>>>> the "LDTEX" instruction (these perform work on the loaded value in EX3).
>>>>
>>>> Something like Load+ADDS.L or similar is not entirely implausible (could
>>>> probably fit within ~ 1 cycle).
>>>>
>>>>
>>>> However, something like extending EX to 5 stages, or needing "split this
>>>> operation into micro-ops" logic, would be a little steep. The former
>>>> would adversely affect branch latency (and increase LUT cost), the
>>>> latter would cause these instructions to rather perform poorly
>>>> (defeating the purpose of adding them).
>>>>
>>>> Other options would likely require rethinking the pipeline (such as
>>>> trying to find some way to shove a Load into the ID stages; and all of
>>>> the consequences this would entail).
>>>>
>>>> Say, for example, if ID1/ID2 had some special read ports, and special
>>>> cases for "Assume op is Load or Load-Op", which could then allow Load-Op
>>>> within similar pipeline latency, but would require some additional
>>>> interlock-stage handling.
>>>>
>>>>
>>>> Adding an EX4 and EX5 stage could be argued for on the basis that, while
>>>> it would increase branch latency and LUT cost, it could potentially also
>>>> be used to allow for fully pipelined FPU instructions (eg: could allow
>>>> turning FADD and FMUL from 6C operations into 5L/1T operations).
>>>>
>>>> ...


Re: 64 bit 68080 CPU

<2022Aug2.130823@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=27002&group=comp.arch#27002

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Tue, 02 Aug 2022 11:08:23 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 42
Message-ID: <2022Aug2.130823@mips.complang.tuwien.ac.at>
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com> <tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at> <tc9c81$13df2$1@dont-email.me>
Injection-Info: reader01.eternal-september.org; posting-host="b417a187289f7b231031816929ce3a35";
logging-data="1542352"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18EX76oqVRAfV35akbRi6A7"
Cancel-Lock: sha1:7qlnK13XnInNAvIlBR0HFnlz4KY=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Tue, 2 Aug 2022 11:08 UTC

BGB <cr88192@gmail.com> writes:
>To support Reg/Mem "in general", seems like it would need mechanisms for:
> Perform a Load, then perform the operation (Load-Op);

Sure, that's what is done in every implementation.

> Ability to perform certain ops directly in the L1 cache.

I don't know any implementation that does that, although there have
been funky memory subsystems that supported fetch-and-add or other
synchronization primitives; AFAIK they do it in the remote memory
controller, not in the controlled memory, though.

>Latter is assuming that only a subset of operations support memory as a
>destination, which appears to be the case in x86 at least
>(interestingly, both RISC-V's 'A' extension

That's the atomic extension, i.e., what I called "synchronization
primitives" above. These operations are unlikely to be fast relative
to non-atomic operations.

>However, something like extending EX to 5 stages, or needing "split this
>operation into micro-ops" logic, would be a little steep. The former
>would adversely affect branch latency

Use a branch predictor, like the big boys.

>the
>latter would cause these instructions to rather perform poorly
>(defeating the purpose of adding them).

That's the way the 486 and Pentium went. Yes, load-and-op
instructions took just as long as a load and an op. I wonder how a
486 with an additional EX stage would have performed: one load-and-op
per cycle would increase performance, but you would have to wait
another cycle before a conditional branch resolves, you would need
more bypasses and more area overall.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: 64 bit 68080 CPU

<tcc1bn$1n70p$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=27007&group=comp.arch#27007

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Tue, 2 Aug 2022 15:27:31 -0500
Organization: A noiseless patient Spider
Lines: 250
Message-ID: <tcc1bn$1n70p$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at>
<tc9c81$13df2$1@dont-email.me> <2022Aug2.130823@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 2 Aug 2022 20:27:36 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="90a2979674897695e370541a0bfe4d09";
logging-data="1809433"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19omHTBzE+U/BGIAesijFpk"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:OFeygwcYp/VtrU3nDSLMEzfxHdw=
Content-Language: en-US
In-Reply-To: <2022Aug2.130823@mips.complang.tuwien.ac.at>
 by: BGB - Tue, 2 Aug 2022 20:27 UTC

On 8/2/2022 6:08 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
>
> Sure, that's what is done in every implementation.
>
>> Ability to perform certain ops directly in the L1 cache.
>
> I don't know any implementation that does that, although there have
> been funky memory subsystems that supported fetch-and-add or other
> synchronization primitives; AFAIK they do it in the remote memory
> controller, not in the controlled memory, though.
>

I did an experiment where I put a mini-ALU in the L1 cache.

Interestingly, it sorta works:
No significant change to architecture (vs Load/Store);
Resource cost is modest;
Timing Seems to survive;
...

Drawbacks:
Doesn't work for general operations;
A few 'useful' cases (like direct CMP against memory) are left out;
No good way to route a status flag update out of this.

Compare would require doing the flag update in EX3, after the result
arrives, but this wouldn't save much over the 2-op sequence.

This trick wouldn't work for a full ISA (like x86), but is probably OK
for "well, we'll stick a few ALU ops here".

Would also be fully insufficient for an ISA like M68K.

>> Latter is assuming that only a subset of operations support memory as a
>> destination, which appears to be the case in x86 at least
>> (interestingly, both RISC-V's 'A' extension
>
> That's the atomic extension, i.e., what I called "synchronization
> primitives" above. These operations are unlikely to be fast relative
> to non-atomic operations.
>

Possibly true.

But, it does add a limited set of RMW operations.

The extension I added should be more or less able to emulate the
behavior of the 'A' extension, but is nowhere near as far reaching as x86.

Then again, past things like basic ALU operations, compilers tend to
treat x86-64 as if it were Load/Store, so it may not be a huge loss.

Well, and also x86 tends to only have LoadOp forms of most instructions,
with StoreOp limited primarily to things like basic ALU instructions and
similar.

Still not entirely sure how I will use them in my compiler (where
everything was written around the assumption of Load/Store); this will
be another thing to resolve for now.

Maybe would add special-cases to a few of the 3AC ops, where if trying
to do a binary op and the arguments are not in registers (and this
extension is enabled), will do a few extra checks and maybe use the
LoadOp or StoreOp encodings if appropriate.

>> However, something like extending EX to 5 stages, or needing "split this
>> operation into micro-ops" logic, would be a little steep. The former
>> would adversely affect branch latency
>
> Use a branch predictor, like the big boys.
>

I do use a branch predictor, but latency would still be latency on
misprediction.

>> the
>> latter would cause these instructions to rather perform poorly
>> (defeating the purpose of adding them).
>
> That's the way the 486 and Pentium went. Yes, load-and-op
> instructions took just as long as a load and an op. I wonder how a
> 486 with an additional EX stage would have performed: one load-and-op
> per cycle would increase performance, but you would have to wait
> another cycle before a conditional branch resolves, you would need
> more bypasses and more area overall.
>

Yeah, dunno. It is a mystery.

I can wonder, how did stuff back then perform as well as it did? By most
of my current metrics, performance should have been "kinda awful" with
386 and 486 PCs, but they still ran things like Doom and Win95 and
similar pretty well.

Well, also mysteries like how things like JPEG and ZIP were "fast",
where I am currently only getting:
~ 0.4 Mpix/sec from JPEG decoding;
~ 2 MB/s from Deflate (decoding);
...
Which doesn't really seem all that fast.

Well, some era appropriate video codecs also sorta work, but I have to
compress them further because, while I have enough CPU power to play
CRAM video, I don't generally have the IO bandwidth (and by the time one
gets the bitrate low enough, it looks like broken crap). Similar codec
designs with an extra LZ stage thrown on work pretty OK though (can do
320x200 at 30fps).

MPEG is still a little bit of a stretch though (unless I do it at
160x120 or 200x144 or something...).

For a moment, I was having (~ childhood) memories of movies on VCD being
playable on PCs, but then remembered that this was on a Pentium, so
probably doesn't really count for whether or not they would have worked
acceptably on a 486.

Well, I also have memories of things like FMV games and similar from
that era. The game basically being a stack of CDs with poor quality
video, most not offering much in terms of either gameplay or replay value.

In my case, Doom runs in the double-digits, but only rarely gets up near
the 32 fps limit.

Granted, I am using RGB555, drawing to an off-screen buffer (followed by
a "blit"), and using RGB alpha-blending to implement things like screen
color flashes, which possibly adds cost in a few areas.

Well, and for example, games like Hexen and Quake2 had faked
alpha-blending via using lookup tables (rather than using the RGB values).

Doom's original "invisibility" effect was also done using colormap
trickery (whereas my port had switched to doing it via RGB math after
switching stuff over to 16-bit pixels), ...

With my newer TKGDI experiment, it is possible I could revisit the
original idea of doing everything with RGB555, and maybe look at the
possibility of going back to 8-bit indexed color for some things here
(and then convert during "blit").

Mostly this would require me to add indexed-color bitmap support to
TKGDI, say, traditional approach:
BITMAPINFOHEADER:
biBitCount=8, biClrUsed=256, biClrImportant=256, ...
Then one appends the color palette after the end of the BITMAPINFOHEADER
structure (at 32 bits/color for whatever reason).

The blit operation is then responsible for the index-color to RGB555
conversion.

Could maybe go further, and have a color-cell encoder that can operate
on index-color input (with a table of precomputed Y values and similar),
but this would be kinda moot if window backing buffers and the main
screen buffer were all still RGB555.

But, this still leaves a mystery of how things like Win95 and similar
were so responsive on the fairly limited hardware of the time. Though, I
can guess they didn't use per-window backing buffers (but then how does
one do the window stack redraw without windows drawing on top of each
other, ... ?).

Well, also on this era of hardware, they were using raw bitmapped
framebuffers in a hardware native format, rather than trying to feed
everything through a color-cell encoder (*1).

Well, say, because 640x400x16bpp would need 512K and 32MB/s for the
screen-refresh, and there isn't enough bandwidth to pull off the display
refresh in this case (it would turn into a broken mess).

But, OTOH, 640x400 16-color RGBI looks awful, ...

Which is part of why I originally had used a color-cell display to begin
with (as did a lot of the early game consoles and similar).

*1: This takes blocks of 4x4 pixels, figures out a "dark color" and a
"light color" (converts RGB->Y for this), and then generates a 2-bpp
interpolation value per pixel (interpolates between the A and B
endpoints). This process appears to be a performance bottleneck in the
640x400 mode.

Tested several options, the interpolation bits are generated with a
process like:
rcp=896/((ymax-ymin)+1); //cached (shared between all the pixels)
ix=(((pix_y-avg_y)*rcp)>>8)+2; //per pixel
block=(block<<2)|ix;
Generally, with 2 passes over all the pixels:
First pass, figures out the ranges and color endpoints;
Second pass, maps the pixels to 2-bpp values.

This use of a multiply was the faster approach in this case, where I had
also tested, eg:
ix=(pix_y>=avg_y)?((pix_y>=avg_hi_y)?3:2):((pix_y>=avg_lo_y)?1:0);
But, this was slower than the use of a per-pixel multiply.

The use of a multiply here tends to be faster on x86 as well.

For "higher quality" encoders, there might be multiple gamma functions,
and some fancier math for calculating endpoints (cluster averaging), but
this is a bit too slow for real-time encoding on the BJX2 core
(normally, one would want to select a "gamma curve" which maximizes
contrast between the high and low sets; and then calculate endpoints
roughly representing a weighted average of the extremes and the centroid
regions of each set of pixels).

For speed, one mostly has to live with a single gamma function, and
merely using the minimum and maximum values, ...


Re: 64 bit 68080 CPU

<tcc21g$dfj$1@newsreader4.netcologne.de>


https://www.novabbs.com/devel/article-flat.php?id=27008&group=comp.arch#27008

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2001-4dd7-dd1e-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Tue, 2 Aug 2022 20:39:12 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <tcc21g$dfj$1@newsreader4.netcologne.de>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <2022Aug1.105418@mips.complang.tuwien.ac.at>
<tc9c81$13df2$1@dont-email.me> <2022Aug2.130823@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 2 Aug 2022 20:39:12 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dd1e-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dd1e:0:7285:c2ff:fe6c:992d";
logging-data="13811"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 2 Aug 2022 20:39 UTC

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
> BGB <cr88192@gmail.com> writes:
>>To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
>
> Sure, that's what is done in every implementation.
>
>> Ability to perform certain ops directly in the L1 cache.
>
> I don't know any implementation that does that, although there have
> been funky memory subsystems that supported fetch-and-add or other
> synchronization primitives; AFAIK they do it in the remote memory
> controller, not in the controlled memory, though.

The Nova might count, if you consider its memory to be a cache
(well, not really, but it was the closest memory to the CPU, so...).
It had ISZ (increment and skip if zero) and DSZ (decrement and
skip if zero) instructions, which apparently were done in the memory
subsystem. Slow, but it saved a register, and registers were in short
supply.

Re: 64 bit 68080 CPU

<tcc2ub$1ngk6$1@dont-email.me>

  copy mid

https://www.novabbs.com/devel/article-flat.php?id=27009&group=comp.arch#27009

  copy link   Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Tue, 2 Aug 2022 22:54:35 +0200
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <tcc2ub$1ngk6$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me>
<fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 2 Aug 2022 20:54:35 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="7ba9a9f0846e765a60a5e308cd321d97";
logging-data="1819270"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19wo248B9fueI6DRLEtEoGvuiLEmJBJ7lo="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:JB94B+qaMKWcNQoN752eSPb4r6I=
In-Reply-To: <fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
Content-Language: en-US
 by: Marcus - Tue, 2 Aug 2022 20:54 UTC

On 2022-08-01, MitchAlsup wrote:
> On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>>
>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>> how to do Reg/Mem ops without introducing significant complexities or
>> penalty cases?...
>>
> You build a pipeline which has a pre-execute stage (which calculates
> AGEN) and then a stage or 2 of cache access, and then you get to the
> normal execute---writeback part of the pipeline. I have called this the
> 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
> machines used such a pipeline.

Don't you need two data access points along such a pipeline?

Re: 64 bit 68080 CPU

<tcc562$1o6jg$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=27011&group=comp.arch#27011

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Tue, 2 Aug 2022 16:32:46 -0500
Organization: A noiseless patient Spider
Lines: 63
Message-ID: <tcc562$1o6jg$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me>
<fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
<tcc2ub$1ngk6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 2 Aug 2022 21:32:51 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="90a2979674897695e370541a0bfe4d09";
logging-data="1841776"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/k3pqx3BTn++Y7+JGTGqNi"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:dJ5+YpG8+5EJSc5FiEx9zqIMKw0=
Content-Language: en-US
In-Reply-To: <tcc2ub$1ngk6$1@dont-email.me>
 by: BGB - Tue, 2 Aug 2022 21:32 UTC

On 8/2/2022 3:54 PM, Marcus wrote:
> On 2022-08-01, MitchAlsup wrote:
>> On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>>>
>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>>> how to do Reg/Mem ops without introducing significant complexities or
>>> penalty cases?...
>>>
>> You build a pipeline which has a pre-execute stage (which calculates
>> AGEN) and then a stage or 2 of cache access, and then you get to the
>> normal execute---writeback part of the pipeline. I have called this the
>> 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
>> machines used such a pipeline.
>
> Don't you need two data access points along such a pipeline?

It is a mystery.

I guess, assuming LdOp:
IF ID RA MA EX1 EX2 EX3 WB
IF ID RA MA EX1 EX2 EX3 ST
RA: Register Access and AGEN
MA: Memory Access (Load)
WB: Write Back (Register File)
ST: Store to Memory

Load        : IF ID RA MA  +++ +++ +++ WB
Store       : IF ID RA --  --- --- --- ST
Reg Op (1L) : IF ID RA --  EX1 +++ +++ WB
Reg Op (2L) : IF ID RA --  EX1 EX2 +++ WB
Reg Op (3L) : IF ID RA --  EX1 EX2 EX3 WB
StoreOp (2L): IF ID RA MA  EX1 EX2 --- ST
StoreOp (3L): IF ID RA MA  EX1 EX2 EX3 ST
---: Unused Stage (No Forward)
+++: Unused Stage (May Forward)
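The stage table above can be sketched as a tiny timing model (purely a hypothetical illustration of the scheme being discussed, not any real implementation); it reports the cycle at which each op class first has a forwardable result:

```python
# Hypothetical timing model for the 8-stage Reg/Mem pipeline sketched above.
# Stage order follows the table: IF ID RA MA EX1 EX2 EX3 WB/ST.
STAGES = ["IF", "ID", "RA", "MA", "EX1", "EX2", "EX3", "WB"]

# For each op class, the 0-based stage index where its result first becomes
# forwardable (the "+++" slots of the table may forward that ready value).
RESULT_READY = {
    "Load":      STAGES.index("MA"),   # loaded value available after MA
    "RegOp(1L)": STAGES.index("EX1"),  # 1-cycle-latency ALU op
    "RegOp(2L)": STAGES.index("EX2"),
    "RegOp(3L)": STAGES.index("EX3"),
}

def result_ready_cycle(issue_cycle: int, op: str) -> int:
    """Cycle at which `op`, issued at `issue_cycle`, can forward its result."""
    return issue_cycle + RESULT_READY[op]

# A Load issued at cycle 0 forwards from cycle 3 (MA); a 3-latency ALU op
# issued the same cycle forwards from cycle 6 (EX3).
print(result_ready_cycle(0, "Load"))       # 3
print(result_ready_cycle(0, "RegOp(3L)"))  # 6
```

This makes the latency trade-off of the Reg/Mem layout visible: every op pays for the MA slot in pipeline depth even when it does not use it.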

This would likely require tighter coupling between the pipeline and L1
cache though, since these would happen in lockstep albeit with a longer
delay than for Load/Store (this would complicate things like memory
consistency, since it would now be possible for the MA stage of
following instructions to evict resident cache lines before the Store
stage of previous instructions).

It is quite possible that, in such a design, if the 'RA' stage generates
an L1 cache index which collides with a store that is already in-flight,
the pipeline would need to interlock.

This would create an extra penalty, though, as load/store-collision
interlocks are likely to be a serious issue in many cases (this would
add a 3- or 4-cycle penalty whenever an operation tries to access a
cache line with an in-flight store, which is likely to be fairly common
in areas like the stack).

Though, one possibility would be to only cause an interlock if this
access would generate an L1 miss.
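The interlock condition described above might be sketched like this (hypothetical; `line_index` stands in for whatever L1 index bits the RA stage produces):

```python
# Hypothetical check for the load/store-collision interlock described above:
# stall only if a new access collides with the L1 line index of an in-flight
# store AND would miss in the L1 (the relaxed variant suggested above).
from dataclasses import dataclass

@dataclass
class InflightStore:
    line_index: int  # L1 cache index bits of the not-yet-retired store

def must_interlock(access_index: int, would_hit_l1: bool,
                   inflight: list[InflightStore]) -> bool:
    collides = any(st.line_index == access_index for st in inflight)
    # Relaxed rule: a colliding access that hits in L1 may proceed, since a
    # resident line cannot be evicted out from under the pending store.
    return collides and not would_hit_l1

stores = [InflightStore(line_index=0x12)]
print(must_interlock(0x12, would_hit_l1=False, inflight=stores))  # True
print(must_interlock(0x12, would_hit_l1=True,  inflight=stores))  # False
```

Under this relaxation, stack-heavy code (where the same lines are hit repeatedly and stay resident) mostly avoids the stall.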

It is likely that MA would need to be treated like an Execute stage.

....

Re: 64 bit 68080 CPU

<191c8144-23a0-4e34-9246-dffa8c6e6371n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27013&group=comp.arch#27013

Newsgroups: comp.arch
Date: Tue, 2 Aug 2022 14:41:30 -0700 (PDT)
In-Reply-To: <tcc2ub$1ngk6$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me> <fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
<tcc2ub$1ngk6$1@dont-email.me>
Message-ID: <191c8144-23a0-4e34-9246-dffa8c6e6371n@googlegroups.com>
Subject: Re: 64 bit 68080 CPU
From: MitchAl...@aol.com (MitchAlsup)
Lines: 21
 by: MitchAlsup - Tue, 2 Aug 2022 21:41 UTC

On Tuesday, August 2, 2022 at 3:54:38 PM UTC-5, Marcus wrote:
> On 2022-08-01, MitchAlsup wrote:
> > On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
> >>
> >> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >> how to do Reg/Mem ops without introducing significant complexities or
> >> penalty cases?...
> >>
> > You build a pipeline which has a pre-execute stage (which calculates
> > AGEN) and then a stage or 2 of cache access, and then you get to the
> > normal execute---writeback part of the pipeline. I have called this the
> > 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
> > machines used such a pipeline.
<
> Don't you need two data access points along such a pipeline?
<
Sorry, I cannot parse your question.
<
But the AGEN unit at the front of the pipeline is speculative, and all
actual calculations are done after inbound memory references (if any)
have shown up.

Re: 64 bit 68080 CPU

<tcd8rj$23efp$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27017&group=comp.arch#27017

From: m.del...@this.bitsnbites.eu (Marcus)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
Date: Wed, 3 Aug 2022 09:41:38 +0200
Organization: A noiseless patient Spider
Lines: 30
Message-ID: <tcd8rj$23efp$1@dont-email.me>
References: <tc6mng$fm5h$1@dont-email.me>
<0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com>
<tc7rm1$o7cl$1@dont-email.me>
<fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
<tcc2ub$1ngk6$1@dont-email.me>
<191c8144-23a0-4e34-9246-dffa8c6e6371n@googlegroups.com>
In-Reply-To: <191c8144-23a0-4e34-9246-dffa8c6e6371n@googlegroups.com>
 by: Marcus - Wed, 3 Aug 2022 07:41 UTC

On 2022-08-02, MitchAlsup wrote:
> On Tuesday, August 2, 2022 at 3:54:38 PM UTC-5, Marcus wrote:
>> On 2022-08-01, MitchAlsup wrote:
>>> On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>>>>
>>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>>>> how to do Reg/Mem ops without introducing significant complexities or
>>>> penalty cases?...
>>>>
>>> You build a pipeline which has a pre-execute stage (which calculates
>>> AGEN) and then a stage or 2 of cache access, and then you get to the
>>> normal execute---writeback part of the pipeline. I have called this the
>>> 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
>>> machines used such a pipeline.
> <
>> Don't you need two data access points along such a pipeline?
> <
> Sorry cannot parse your question.
> <
> But the AGEN unit at the front of the pipeline is speculative, and all
> actual calculations are done after inbound memory references (if any)
> have shown up.

Sorry, I meant: In the pre-execute stages you need to read memory
operands, no? And later in the pipeline you need to write to (or read
from?) memory? Thus you would need (at least) two concurrent ports to
the L1D$?

/Marcus

Re: 64 bit 68080 CPU

<mrxGK.823150$X_i.231361@fx18.iad>

https://www.novabbs.com/devel/article-flat.php?id=27023&group=comp.arch#27023

From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: 64 bit 68080 CPU
References: <tc6mng$fm5h$1@dont-email.me> <0882927d-1ed2-41f8-a594-745eaa961efbn@googlegroups.com> <tc7rm1$o7cl$1@dont-email.me> <fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
In-Reply-To: <fdd07bcf-4f05-48dd-b809-47c67e0c7eacn@googlegroups.com>
Lines: 31
Message-ID: <mrxGK.823150$X_i.231361@fx18.iad>
Date: Wed, 03 Aug 2022 12:32:55 -0400
 by: EricP - Wed, 3 Aug 2022 16:32 UTC

MitchAlsup wrote:
> On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>> how to do Reg/Mem ops without introducing significant complexities or
>> penalty cases?...
>>
> You build a pipeline which has a pre-execute stage (which calculates
> AGEN) and then a stage or 2 of cache access, and then you get to the
> normal execute---writeback part of the pipeline. I have called this the
> 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
> machines used such a pipeline.

The thing is that a fixed-layout pipeline can only accommodate
the specific situations that it is designed for.
Forwarding allows limited topological rearrangement.
Putting an extra AGEN at an early pipeline stage makes
all uOps perform an extra stage, which adds extra latency
and makes it costlier to fill in bubbles.

A pipeline might dynamically rearrange while maintaining In-Order (InO)
simplicity such that it can fill in bubbles as well as possible.
(I have a mental picture of a dynamic Pert chart.
It is not OoO but it does allow concurrency to fill in bubbles.)

For example, a ST with immediate data and immediate address doesn't
need either a RR Register Read stage or AGEN, and can go straight
from Decode to LSU. That ST can launch concurrent with an earlier
RR-ALU uOp, or a following RR-ALU op can launch concurrent with ST.
ST doesn't need the WB stage so a subsequent uOp can use that stage.
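The stage-skipping idea above can be sketched as a table of per-uOp stage requirements (purely illustrative; the uOp names and stage sets are assumptions, not a real design): two uOps may launch together if their required stage resources are disjoint.

```python
# Hypothetical per-uOp stage requirements for the in-order
# "fill in the bubbles" scheme described above.
NEEDED_STAGES = {
    "ST_imm_imm": {"DEC", "LSU"},                       # imm data + imm addr:
                                                        #   skips RR and AGEN
    "RR_ALU":     {"DEC", "RR", "ALU", "WB"},           # register-register op
    "LD":         {"DEC", "RR", "AGEN", "LSU", "WB"},   # ordinary load
}

def can_launch_together(a: str, b: str) -> bool:
    """True if uOps a and b need disjoint stage resources this cycle
    (decode is excluded, assuming a multi-wide decoder)."""
    shared = (NEEDED_STAGES[a] & NEEDED_STAGES[b]) - {"DEC"}
    return not shared

print(can_launch_together("ST_imm_imm", "RR_ALU"))  # True: no shared stage
print(can_launch_together("LD", "RR_ALU"))          # False: both need RR, WB
```

This captures EricP's example: the immediate-everything ST needs neither RR nor AGEN nor WB, so it can run concurrently with an RR-ALU uOp, while two uOps that both need RR or WB cannot.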
