devel / comp.arch / Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Subject / Author
* Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
+* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|  `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|   +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|   |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|   | `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|   `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|    `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|     `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|      `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|       +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|       |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|       `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
+* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Terje Mathisen
|+- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| | +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| | |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| | | `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| | |  +- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| | |  `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
| | `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Andy
| |  +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.robf...@gmail.com
| |  |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| |  | `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |  `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |   `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Andy
| |    `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |     `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.robf...@gmail.com
| |      `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |       `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.robf...@gmail.com
| |        `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |         `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
| |          `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Terje Mathisen
| |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
| |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Terje Mathisen
| `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Quadibloc
|  +- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|  +* AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Anton Ertl
|  |`* Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)John Dallman
|  | +* Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Thomas Koenig
|  | |`- Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)John Dallman
|  | `* Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Marcus
|  |  +- Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Anton Ertl
|  |  `- Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Terje Mathisen
|  `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|   `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Michael S
|    `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
|     `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Michael S
|      +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
|      |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|      | `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
|      `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
 +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Quadibloc
 |+* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
 ||`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
 || `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
 |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
 `* Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and fast(ish) GPU)Stefan Monnier
  +* Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap androbf...@gmail.com
  |`- Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap andBGB
  +* Re: Power cost of IEEE754Terje Mathisen
  |+* Re: Power cost of IEEE754BGB
  ||+* Re: Power cost of IEEE754MitchAlsup
  |||`- Re: Power cost of IEEE754BGB
  ||`* Re: Power cost of IEEE754MitchAlsup
  || `* Re: Power cost of IEEE754Quadibloc
  ||  +- Re: Power cost of IEEE754MitchAlsup
  ||  +* Re: Power cost of IEEE754Anton Ertl
  ||  |`* Re: Power cost of IEEE754Terje Mathisen
  ||  | `- Re: Power cost of IEEE754MitchAlsup
  ||  `* Re: Power cost of IEEE754Thomas Koenig
  ||   `* Re: Power cost of IEEE754John Dallman
  ||    `* Re: Power cost of IEEE754Michael S
  ||     `* Re: Power cost of IEEE754EricP
  ||      `* Re: Power cost of IEEE754MitchAlsup
  ||       `* Re: Power cost of IEEE754EricP
  ||        `- Re: Power cost of IEEE754EricP
  |`* Re: Power cost of IEEE754Paul A. Clayton
  | +* Re: Power cost of IEEE754MitchAlsup
  | |`* Re: Power cost of IEEE754luke.l...@gmail.com
  | | +- Re: Power cost of IEEE754MitchAlsup
  | | +* Re: Power cost of IEEE754Josh Vanderhoof
  | | |`* Re: Power cost of IEEE754BGB
  | | | `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |  `* Re: Power cost of IEEE754BGB
  | | |   `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |    `* Re: Power cost of IEEE754BGB
  | | |     `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |      `* Re: Power cost of IEEE754BGB
  | | |       `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |        `- Re: Power cost of IEEE754BGB
  | | `- Re: Power cost of IEEE754Terje Mathisen
  | +* Re: Power cost of IEEE754Ivan Godard
  | `* Re: Power cost of IEEE754BGB
  `- Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap andQuadibloc

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<88ccb4a6-0440-4bee-9e2e-92bb6d986549n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26909&group=comp.arch#26909

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Mon, 25 Jul 2022 21:05 UTC

On Monday, July 25, 2022 at 9:44:10 PM UTC+1, BGB wrote:

> It is more like:
> Adders for bits (15: 0)
> Adders for bits (31:16)
> Adders for bits (47:32)
> Adders for bits (63:48)
>
> tValRtV = tInvRt ? ~tValRt : tValRt;
> tAddValA0 = {1'b0, tValRs[15: 0]} + {1'b0, tValRtV[15: 0]} + 17'h0000;
> tAddValA1 = {1'b0, tValRs[15: 0]} + {1'b0, tValRtV[15: 0]} + 17'h0001;
> tAddValB0 = {1'b0, tValRs[31:16]} + {1'b0, tValRtV[31:16]} + 17'h0000;
> tAddValB1 = {1'b0, tValRs[31:16]} + {1'b0, tValRtV[31:16]} + 17'h0001;
> ... Gather up carry-bits for various operations ...
> tAddValRn[15: 0] = tAddSelA ? tAddValA1[15:0] : tAddValA0[15:0];
> tAddValRn[31:16] = tAddSelB ? tAddValB1[15:0] : tAddValB0[15:0];

you'll love this - see this page
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/add/

partition:     P    P    P     (3 bits)
a        : .... .... .... .... (32 bits)
b        : .... .... .... .... (32 bits)
exp-a    : ....P....P....P.... (32+3 bits, P=0 if no partition)
exp-b    : ....0....0....0.... (32 bits plus 3 zeros)
exp-o    : ....xN...xN...xN... (32+3 bits - x to be discarded)
o        : .... N... N... N... (32 bits - x ignored, N is carry-over)

so instead of

Adders for bits (15: 0)
Adders for bits (31:16)
Adders for bits (47:32)
Adders for bits (63:48)

you have a *67* bit adder and...

Adders for bits (15: 0) in bits 15:0
partition bit in 16
Adders for bits (31:16) in bits 32:17
partition bit in 33
Adders for bits (47:32) in bits 49:34
partition bit in 50
Adders for bits (63:48) in bits 66:51

now let's say you want to do a full 64-bit add. guess what you do?

you set bits 16, 33 and 50 to ONE in the A input (and zeros
in the B input).

that's it. that's all there is to it.

let's say you now want to do 4x 16-bit adds. guess what you do?

you set bits 16, 33 and 50 to ZERO in both A and B.

why does this work? because due to the B-input being always zero,
when the partition bits are set to 1, they cause a carry-roll-over
condition into the next partition. ta-daaa

it's so obvious, and so elegant. the beautiful bit is, you can use
standard HDL, you don't even have to do anything special for the
FPGA or VLSI synthesis tools.

the "cost"? you need a 67-bit adder instead of a 64-bit adder.
pffh. big deal. want 8-bit SIMD granularity? use 7 partitions.
want 8-bit SIMD granularity in a 128-bit ALU? use 15 partitions.
sooo difficult :)
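The trick above can be sketched in plain C (scaled down to a 32-bit ALU with two 16-bit lanes, so the 33-bit intermediate fits in a uint64_t; the names here are made up for illustration):

```c
#include <stdint.h>

/* Expand a 32-bit operand into 33-bit "partitioned" form:
   data bits 15:0 stay in 15:0, the partition bit p goes in bit 16,
   data bits 31:16 move up to bits 32:17. */
static uint64_t expand(uint32_t v, unsigned p)
{
    return (uint64_t)(v & 0xFFFFu)
         | ((uint64_t)(p & 1u) << 16)
         | ((uint64_t)(v >> 16) << 17);
}

/* One 33-bit add serves both modes:
   full32=1: partition bit is 1 in A and 0 in B, so a carry out of lane 0
             rolls the partition slot over into lane 1 (one 32-bit add);
   full32=0: partition bit is 0 in both, so the carry dies in the slot
             (two independent 16-bit adds). */
static uint32_t padd(uint32_t a, uint32_t b, int full32)
{
    uint64_t s = expand(a, full32 ? 1 : 0) + expand(b, 0);
    return (uint32_t)((s & 0xFFFFu) | ((s >> 17) << 16));
}
```

The only cost over an ordinary 32-bit adder is the one extra bit position, exactly as described above.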

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<3042c810-fc48-4623-867d-fbab519e639fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26910&group=comp.arch#26910

 by: luke.l...@gmail.com - Mon, 25 Jul 2022 21:07 UTC

On Monday, July 25, 2022 at 9:55:56 PM UTC+1, BGB wrote:

> Then again, looking at it, probably not doing enough cross-products and
> similar to make this worthwhile.

if you have matrix multiply and have a skew-matrix system you get
cross-product effectively for free.

https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix
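For concreteness (a sketch, not from the post): the cross product a×b is exactly the matrix-vector product [a]×·b, where [a]× is the skew-symmetric matrix built from a, so a matrix-multiply unit gets it for free:

```c
/* cross(a, b) computed as [a]x * b, where [a]x is the skew-symmetric
   matrix built from a (see the Wikipedia link above). */
static void skew_cross(const float a[3], const float b[3], float out[3])
{
    const float m[3][3] = {
        {  0.0f, -a[2],  a[1] },
        {  a[2],  0.0f, -a[0] },
        { -a[1],  a[0],  0.0f },
    };
    for (int i = 0; i < 3; i++)
        out[i] = m[i][0]*b[0] + m[i][1]*b[1] + m[i][2]*b[2];
}
```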

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<05978d65-a1eb-4a9f-9812-104102828ea3n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26911&group=comp.arch#26911

 by: MitchAlsup - Mon, 25 Jul 2022 21:53 UTC

On Monday, July 25, 2022 at 3:44:10 PM UTC-5, BGB wrote:

> Yeah, bulk duplicating this stuff for every scenario does not seem like
> a great idea.
>
>
> It is more like:
> Adders for bits (15: 0)
> Adders for bits (31:16)
> Adders for bits (47:32)
> Adders for bits (63:48)
>
> tValRtV = tInvRt ? ~tValRt : tValRt;
> tAddValA0 = {1'b0, tValRs[15: 0]} + {1'b0, tValRtV[15: 0]} + 17'h0000;
> tAddValA1 = {1'b0, tValRs[15: 0]} + {1'b0, tValRtV[15: 0]} + 17'h0001;
> tAddValB0 = {1'b0, tValRs[31:16]} + {1'b0, tValRtV[31:16]} + 17'h0000;
> tAddValB1 = {1'b0, tValRs[31:16]} + {1'b0, tValRtV[31:16]} + 17'h0001;
> ... Gather up carry-bits for various operations ...
> tAddValRn[15: 0] = tAddSelA ? tAddValA1[15:0] : tAddValA0[15:0];
> tAddValRn[31:16] = tAddSelB ? tAddValB1[15:0] : tAddValB0[15:0];
> ...
<
Build the adders in 9-bit quanta. Use this 9th bit to control carries.
00 no carry propagation
01 and 10 carry will propagate
11 carry in
<
>
> One can deal with the SIMD cases while figuring out what the carry bits
> should be.
<
You KNOW what the inputs to the 9-th bit need to be by OpCode.
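A rough C model of the 9-bit-quanta scheme (my reading of the above, scaled to two 8-bit lanes in an 18-bit adder; the control-bit pair picks kill / propagate / force-carry at the lane boundary):

```c
#include <stdint.h>

/* Insert control bit c at bit 8 of a 16-bit operand, giving a 9-bit
   quantum per byte lane: data bits 7:0, control at bit 8, data 16:9. */
static uint32_t quanta(uint16_t v, unsigned c)
{
    return (uint32_t)(v & 0xFFu)
         | ((c & 1u) << 8)
         | ((uint32_t)(v >> 8) << 9);
}

/* (ca,cb) select the carry behaviour at the lane boundary:
   (0,0) -> carry is killed (independent 8-bit adds)
   (1,0) or (0,1) -> carry propagates (one 16-bit add)
   (1,1) -> a carry is injected into the upper lane regardless,
            and any incoming carry dies in the discarded slot. */
static uint16_t add9(uint16_t a, uint16_t b, unsigned ca, unsigned cb)
{
    uint32_t s = quanta(a, ca) + quanta(b, cb);
    return (uint16_t)((s & 0xFFu) | ((s >> 9) << 8));
}
```

As the post says, the decoder knows the (ca,cb) values directly from the opcode.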

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<61db33cb-133c-49fd-9b5d-e686cbcae112n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=26912&group=comp.arch#26912

 by: MitchAlsup - Mon, 25 Jul 2022 21:56 UTC

On Monday, July 25, 2022 at 3:55:56 PM UTC-5, BGB wrote:
> On 7/25/2022 5:46 AM, luke.l...@gmail.com wrote:
> > On Monday, July 25, 2022 at 2:18:26 AM UTC+1, BGB wrote:
> >
> >> Recently, was also left thinking about a possible operation to invert or
> >> flip the sign bits of selected vector elements.
> >>
> >> Say:
> >> 00: Leave element unchanged;
> >> 01: Invert low bits of element;
> >> 10: Invert sign bit of element;
> >> 11: Invert all bits of element.
> >
> > this is integer? what's the benefit / use-case?
> > (you should be able to XOR optionally with 0b100000000..., 0b0111111.....
> > immediates, to achieve the same effect)
> >
> Both integer and floating point (both use the same registers in my case).
<
There are 3 cases {integer, logical, floating point}
{integer and logical} invert the bit pattern
{floating point} invert the sign
{Integer} append a carry in to the inverted bit pattern.
<
This gives you the bit inversion for logical, negation for integer, and negation
for FP, and only costs the inverter (and some decode).
<
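Spelled out in C, those three cases come down to (a hypothetical illustration, not anyone's actual hardware):

```c
#include <stdint.h>
#include <string.h>

/* logical: just invert the bit pattern */
static uint32_t not32(uint32_t x) { return ~x; }

/* integer: invert, then append a carry in -- two's complement negate */
static uint32_t neg32(uint32_t x) { return ~x + 1u; }

/* floating point: only the sign bit is touched */
static float fneg32(float x)
{
    uint32_t b;
    memcpy(&b, &x, sizeof b);
    b ^= 0x80000000u;          /* flip IEEE 754 sign bit */
    memcpy(&x, &b, sizeof x);
    return x;
}
```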

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbns1p$1og63$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=26913&group=comp.arch#26913

 by: BGB - Tue, 26 Jul 2022 04:54 UTC

On 7/25/2022 4:07 PM, luke.l...@gmail.com wrote:
> On Monday, July 25, 2022 at 9:55:56 PM UTC+1, BGB wrote:
>
>> Then again, looking at it, probably not doing enough cross-products and
>> similar to make this worthwhile.
>
> if you have matrix multiply and have a skew-matrix system you get
> cross-product effectively for free.
>
> https://en.m.wikipedia.org/wiki/Skew-symmetric_matrix
>

There is no matrix multiply operation in my case.

What there is, is vector multiply, vector add, and some instructions
that can be used to implement a matrix transpose (turning the matrix
multiply into a series of vector multiplies and adds).

Then again, looking at my runtime code for some of the cross-product
operators:

(3D cross product)
__vnf_v4f_cross:
PSHUFX.L R4, 0x49, R16
PSHUFX.L R6, 0x52, R18
PMULX.F R16, R18, R20
PSHUFX.L R4, 0x92, R16
PSHUFX.L R6, 0x89, R18
PMULX.F R16, R18, R22
PSUBX.F R20, R22, R2
EXTU.L R3
RTS

(2D cross product)
__vnf_v2f_cross:
MOVLHD R5, R5, R7
PMUL.F R4, R7, R6
FLDCF R6, R7
FLDCFH R6, R3
FSUB R7, R3, R2
RTS

And, quaternion multiply:
__vnf_vqf_mul:
PSHUFX.L R4, 0xFF, R16
PMULX.F R6, R16, R22

PSHUFX.L R4, 0x92, R16
PSHUFX.L R6, 0x89, R18
PMULX.F R16, R18, R20
PSUBX.F R22, R20, R2

PSHUFX.L R4, 0x24, R16
PSHUFX.L R6, 0x3F, R18
PMULX.F R16, R18, R20
FNEG R21
PADDX.F R2, R20, R2

PSHUFX.L R4, 0x49, R16
PSHUFX.L R6, 0x52, R18
PMULX.F R16, R18, R20
FNEG R21
PADDX.F R2, R20, R2
RTS

Where, eg:
00=i, 01=j, 10=k, 11=r
And, usual rules:
i^2=j^2=k^2=-1
ij=k, jk=i, ki=j;
ji=-k, kj=-i, ik=-j
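Written out in scalar C, those rules give the usual Hamilton product (a sketch, with lane order matching the 00=i..11=r layout above):

```c
typedef struct { float i, j, k, r; } quat; /* lanes: 00=i, 01=j, 10=k, 11=r */

/* Hamilton product, applying i^2=j^2=k^2=-1, ij=k, jk=i, ki=j,
   and the anti-commuting cases ji=-k, kj=-i, ik=-j. */
static quat qmul(quat a, quat b)
{
    quat o;
    o.r = a.r*b.r - a.i*b.i - a.j*b.j - a.k*b.k;
    o.i = a.r*b.i + a.i*b.r + a.j*b.k - a.k*b.j;
    o.j = a.r*b.j + a.j*b.r + a.k*b.i - a.i*b.k;
    o.k = a.r*b.k + a.k*b.r + a.i*b.j - a.j*b.i;
    return o;
}
```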

It would appear that I had gotten by OK enough without per-element
negate in this case (so, not entirely sure what I was thinking earlier...).

Though, "FNEG R21" in the latter was effectively flipping the sign bit
of the 'r' component.

And, I can note that BGBCC lacks a Binary16 quaternion type; I guess I
could add it if it turned out to be needed.

I guess it is in a similar category to "_Complex short float"...

Or, I guess I can note the vertex-projection function (from my OpenGL
implementation):

TKRA_ProjectVertexB:
MOV.X (R6, 0), R22
MOVLD R4, R4, R20 | MOVLD R4, R4, R21
MOV.X (R6, 16), R18
PMULX.FA R20, R22, R2
MOVHD R4, R4, R20 | MOVHD R4, R4, R21
MOV.X (R6, 32), R22
PMULX.FA R20, R18, R16
PADDX.FA R2, R16, R2
MOVLD R5, R5, R20 | MOVLD R5, R5, R21
MOV.X (R6, 48), R18
PMULX.FA R20, R22, R16
MOVHD R5, R5, R20 | MOVHD R5, R5, R21
PADDX.FA R2, R16, R2
PMULX.FA R20, R18, R16
PADDX.FA R2, R16, R18
FLDCFH R19, R4
FLDCH 0x0A00, R5
FABS R4, R4
FADDA R4, R5, R4
MOV 0x7FE, R5 | MOV 0x3FF, R6
SHLD.Q R5, 52, R5 | MOV 0x400, R3
SHLD.Q R6, 52, R6 | SUB R5, R4, R5
FMULA R5, R4, R7
SHLD.Q R3, 52, R3 | SUB R6, R7, R7
ADD R7, R5, R2
FMULA R2, R4, R6
FSUBA R3, R6, R7
FMULA R2, R7, R2
MOV.X tkra_prj_xyzsc, R22
MOV.X tkra_prj_xyzbi, R20
MOV 0x3F800000, R5 | FSTCF R2, R1
MOVLD R1, R1, R16 | MOVLD R5, R1, R17
PMULX.FA R16, R22, R16
PMULX.FA R18, R16, R18
PADDX.FA R18, R20, R2
RTS

Where the 'FA' operations encode a hint to use lower precision.
FMULA / FSUBA / ... also encode a "low precision" hint.
The projection operation doesn't really need much precision, but does
need to be reasonably fast...

This code fakes FDIV with bit twiddling partly because:
Originally there was no FDIV;
Even with FDIV, the instruction takes significantly longer than faking
it in this case.
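The bit-twiddled reciprocal can be sketched in C roughly like this (my reconstruction of the general technique, not BGB's exact sequence): subtract the operand's bits from a biased-exponent constant to get a crude 1/x, then refine with Newton-Raphson steps:

```c
#include <stdint.h>
#include <string.h>

/* Crude 1/x for x > 0: flipping the exponent via integer subtraction
   gives roughly 12% worst-case error; each Newton-Raphson step then
   squares the relative error. */
static double fake_rcp(double x)
{
    uint64_t bits, rb;
    double r;
    memcpy(&bits, &x, sizeof bits);
    rb = (0x7FEULL << 52) - bits;   /* approximate reciprocal by bit math */
    memcpy(&r, &rb, sizeof r);
    r = r * (2.0 - x * r);          /* first refinement step */
    r = r * (2.0 - x * r);          /* second refinement step */
    return r;
}
```

The appeal for projection is that this is a handful of cheap ALU/FMUL ops with no long-latency divide in the chain.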

Also note that, for performance reasons, my GL implementation uses
affine texturing with dynamic tessellation rather than
perspective-correct texturing (pros/cons here).

Or, its C equivalents:
tkra_vec4f TKRA_ProjectVertex(tkra_vec4f vec, tkra_mat4 mat)
{
  register tkra_vec4f v0x, v0y, v0z, v0w;
  float f;

  v0x=tkra_v4f_xxxx(vec);
  v0y=tkra_v4f_yyyy(vec);
  v0z=tkra_v4f_zzzz(vec);
  v0w=tkra_v4f_wwww(vec);

  v0x=tkra_v4fmul(v0x, mat.row0);
  v0y=tkra_v4fmul(v0y, mat.row1);
  v0z=tkra_v4fmul(v0z, mat.row2);
  v0w=tkra_v4fmul(v0w, mat.row3);

  v0x=tkra_v4fadd(v0x, v0y);
  v0y=tkra_v4fadd(v0z, v0w);
  v0z=tkra_v4fadd(v0x, v0y);

  return(v0z);
}

tkra_vec4f TKRA_ProjectVertexB(tkra_vec4f vec, tkra_mat4 mat)
{
  tkra_vec4f v0xyzw, v0ww, v0p;
  float f0, f1;

  v0xyzw=TKRA_ProjectVertex(vec, mat);
  f0=tkra_v4f_w(v0xyzw);
  f1=tkra_frcpabs(f0);  // f1=1.0/(fabs(f0)+0.001);

  v0ww=tkra_mkvec4f(f1, f1, f1, 1.0);
  v0ww=tkra_v4fmul(v0ww, tkra_prj_xyzsc);
  v0p=tkra_v4fadd(tkra_v4fmul(v0xyzw, v0ww), tkra_prj_xyzbi);

  return(v0p);
}

Where, this code is using wrappers over BGBCC's SIMD system (and also
the "xmmintrin.h" system and similar) mostly to allow it to also be
built using MSVC.

I am not sure if there are any "obvious" areas where further improvement
is possible.

Note that the relative lack of bundling in these code sequences is due
to 128-bit SIMD instructions being unable to be encoded in bundles and
similar (and a lot of the rest either has dependencies, or did not
represent instruction sequences which could be bundled effectively).

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tctk28$m2v$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=27175&group=comp.arch#27175

 by: Andy - Tue, 9 Aug 2022 12:31 UTC

On 8/08/22 09:58, BGB wrote:

> It would mostly help if one is throwing multiple cores at the problem.
> Some of my past experiments with multi-threaded renders had split the
> screen half or into quarters, with one thread working on each part of
> the screen.

Which, call me a pessimist, would not work very well at all, except for
carefully rigged demos.
I'm sure there's a version of Murphy's law for 3d rendering at play.

> Though, my current rasterizer is single threaded.
>
> Originally, it was intended to be dual threaded, but ran into a problem
> when I started exceeding the resource budgets needed for doing dual core
> on the FPGA I am using, and the emphasis shifted to trying to make a
> single core run fast.

Which still strikes me as an impressive effort nonetheless.

Sometimes it's not how well the monkey dances, the fact it can dance at
all is what impresses most.

> Splitting up geometry and then drawing each tile sequentially is not as
> likely to be helpful.

Depends I guess, I've a tendency towards per-polygon texture maps, so
splitting polygons offline at the level building stage to suit max
texture sizes would be a thing for me.

> There is a possibility of a slight advantage to drawing geometry
> Z-buffer-only first, and then going back and drawing surfaces.

No z-buffer in a tile renderer AFAIK, just z-sort front to back, then a
kind of modified painter's algorithm where you get the overwrite out of
the way quickly with a per-polygon constant value.

>> Of course ripping up and changing your existing core probably isn't
>> something you'd happily contemplate, so take my suggestion with a
>> large grain of salt. ;-)
>>
>
> Yeah.
>
> Also it isn't likely to offer a huge advantage in this case (with a
> software renderer).

That potentially could be augmented with custom FPGA resources if it helps?

> Also possibly counter-intuitively, a bigger amount of time is currently
> going into the transform stages than into the raster-drawing parts.
>
>
> So for GLQuake, time budget seems to be, roughly, say:
> ~ 50%, Quake itself;
> ~ 38%, transform stages
> ~ 4%, Edge Walking
> ~ 12%, Span Drawing
>

I'm no expert, but I wouldn't have guessed that half the compute was
going to just Quake engine housekeeping duties. :-)

What on earth is going on in there I wonder?
(no, no, please don't answer that, it's not a real question) :-)

>
> Wasn't aware of any of the code for any of the Tomb Raider games having
> been released, but then again I wasn't really much into Tome Raider.

"Tome-Raider" -- blasphemy if ever I saw it!!! ;-)

And I'm not sure released is quite the right word, but the company that
wrote it - Core Design - pretty much went bankrupt twice, leaving the
source code with no clear legal owner, or so the story goes; of course
trademarks and copyrights for the franchise still exist, their recent
selloff notwithstanding.

>
> Contrary to some people, I suspect the HL/HL2 era is when graphics got
> "good enough", despite newer advances in terms of rendering technology,
> the "better graphics" don't really improve the gameplay experience all
> that much.

Oh, I agree completely, well mostly; more polygons == more pretty
applies at times.

>
> One of the more interesting recent developments is real-time ray-tracing.
>

I'm not so sure on the recent, I remember Silicon Graphics (remember
them?) did a ray tracing demo donkeys years ago with some hideously
expensive multi-core machine entirely on CPU power alone.

And it's been a sort-of holy grail quest for graphics rendering "if only
we had the computing resources to pull it off" ever since.

Quite why that was such a good idea was pretty much left unspoken, given
that hardware accelerated 3D on PCs was a thing by then and PC gaming
had already taken off big time.

> However, it is not so easy to write a ray-tracer that is competitive
> with a raster renderer. I had experimented (not too long ago) with

Not easy? There's an understatement, try impossible!! (until very
recently obviously, and Intel Larrabee demos notwithstanding), there
being so many orders of magnitude difference in the calculation costs
between the two.

> implementing a software ray-tracer in the Quake engine, but it fell well
> short of usable performance.
>

Not surprising at all, given even a Nvidia 3090 GPU struggles with
acceptable frame rates while raytracing, and technically cheating as
well with hardware enhanced image noise reduction and resolution
up-scaling even then.

Maybe the 4000 series fares better, but that's only if you can afford
the things in the first place.

>
> IME, line-tracing over a regular grid structure (or an octree) tends to
> be more efficient than doing so via a recursive BSP walk (an octree
> based engine likely being more efficient if one wants to implement a
> ray-cast or ray-tracing renderer).

I pretty much ignored the deluge years of SIGGRAPH papers on new and
improved ray-tracing acceleration methods, so what the state of the art
is these days I couldn't say.

But, Internet!, I'm sure somebody does!

>
> But, OTOH, doing a modified version of Quake where I rebuild all of the
> maps from the map source (with custom tools) using an octree and similar
> rather than a BSP, is probably "not really worth it".
>

I wouldn't be so quick to dismiss the idea outright, I mean how do you
really know if you're not flogging a dead horse with regards to
everything else, and it's the data structures that are holding back that
last bit of performance improvement to be had?

Perhaps our man from super mega hand optimized assembler programming -
Terje Mathisen has some sage words of advice on discovering where the
limits of optimization lie?

>>> I could make stuff look a lot better, but this would require:
>>> Using RGBA32 buffers and textures;

Yep, millions of colors is more betterist.

>
> Probably, I am using RGB555A, which can sorta mimic RGBA32 and (on
> average) looks better than RGBA4444 or similar.
>
>
> RGBA32 for a framebuffer can look better, but using it would be kind of
> a waste when the output framebuffer is using RGB555. And, on the FPGA
> board I am using, the VGA output only has 4 bits per component, so even
> the RGB555 output is effectively using a Bayer dither in this case.

Damn those pesky demo boards, just a taste of what could be done, but
never quite doing a decent job of it, often despite the rather high
price tags to boot!

I guess if money is an object, then moving to a better board with
24-bit output, or better still HDMI/DisplayPort and the high-speed
serial drivers to drive them at full speed, is out of the question?

And probably the worst time in recent memory to be shopping for new
electronics boards anyway, it's truly the end times to be sure, for
realzies this time!!!

> It is possible, though if I were to try to fit TKRA-GL to it, it would
> likely mean cores that were more like:
> 2-wide with 64-bit Packed-Integer SIMD;
> Probably still needing a 32-bit address space.

You don't seem to do 'small' much do you! ;-)

Small can be beautiful too you know!
Watch the Lord of the Rings trilogy, sometimes the smallest of Hobbits
can achieve the mightiest of deeds. :-)

> Though, my attempts at RISC-V cores have thus far ended up more expensive
> than ideal (in my attempts, a full RV64G core would end up costing
> *more* than another BJX2 core), and even making it single-issue isn't
> really enough to compensate for this.

At this point I feel I should start calling your design 'Sybil Core',
(don't worry it's a 'dad' joke)

> A bigger amount of savings would likely be possible with a redesigned
> API design, possibly:
> Rendering state configuration is folded into "objects";
> Front-end interface mostly uses fixed-point or similar;
> ...

Your FPGA only has integer DSP multipliers, with floating-point units
having to be synthesized manually?

Might be worth a shot, but do you remember a 3D rendering library called
'Pixomatic', an all-integer polygon rendering tool from way back when,
by RAD Game Tools I think?

Don't know what its state of existence is these days, but if it has by
chance gone open source, it could be a veritable gold mine of source
code to port, or to mine for ideas for a lean, mean, hand-optimized
rendering engine that might make a perfect fit for an integer FPGA.
YMMV of course.

>
> Kinda curious that they did this back then, whereas many later games
> (Half-Life 2, Doom 3, etc) didn't bother with these sorts of effects
> (they probably could have if they wanted as special cases of ragdoll
> within their skeletal animation systems).


Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Message-ID: <589c23de-bb5a-4d55-b671-de4f35534d54n@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=27179&group=comp.arch#27179
Newsgroups: comp.arch
 by: robf...@gmail.com - Tue, 9 Aug 2022 20:18 UTC

Have you seen the graphics accelerator core on opencores.org? orsoc_graphics_accelerator.
It can do basic drawing operations, draw triangles, and fill areas. It takes about 15k LUTs. I have
modified it and used it for my own purposes. For some operations it would be hard to beat with
a computing core.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Message-ID: <34f2d4e1-619a-4d0b-a056-f8193255e08fn@googlegroups.com>
https://www.novabbs.com/devel/article-flat.php?id=27180&group=comp.arch#27180
Newsgroups: comp.arch
 by: luke.l...@gmail.com - Tue, 9 Aug 2022 20:44 UTC

On Tuesday, August 9, 2022 at 9:18:46 PM UTC+1, robf...@gmail.com wrote:
> Have you seen the graphics accelerator core on opencores.org? orsoc_graphics_accelerator.
> It can do basic drawing operations, draw triangles, and fill areas. It takes about 15k LUTs. I have
> modified it and used it for my own purposes. For some operations it would be hard to beat with
> a computing core.

it's what's termed a "Fixed Shader" Engine, and consequently
is useless for modern GPU compute which is 100% "Shader"
based (i.e. Vulkan).

GPLGPU likewise is a "Fixed Shader"
https://github.com/asicguy/gplgpu

but Jeff Bush the author of Nyuzi did an analysis and
showed that actually GPLGPU internally has functions
remarkably similar to modern GPU instructions needed
for modern Shader design
https://jbush001.github.io/

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Message-ID: <tcvce3$1m1ek$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=27182&group=comp.arch#27182
Newsgroups: comp.arch
 by: BGB - Wed, 10 Aug 2022 04:32 UTC

On 8/9/2022 7:31 AM, Andy wrote:
> On 8/08/22 09:58, BGB wrote:
>
>
> > It would mostly help if one is throwing multiple cores at the problem.
> > Some of my past experiments with multi-threaded renders had split the
> > screen half or into quarters, with one thread working on each part of
> > the screen.
>
> Which, call me a pessimist, would not work very well at all, except for
> carefully rigged demos.
> I'm sure there's a version of Murphy's law for 3D rendering at play.
>

It was faster than a single-threaded renderer.

The front-end though does need to divide up geometry based on where it
is on screen, which would be of limited advantage if rendering is
transform limited rather than fill-rate limited.

It wouldn't work well with TKRA-GL on my BJX2 core, which is mostly
transform limited.

>
> > Though, my current rasterizer is single threaded.
> >
> > Originally, it was intended to be dual threaded, but ran into a problem
> > when I started exceeding the resource budgets needed for doing dual core
> > on the FPGA I am using, and the emphasis shifted to trying to make a
> > single core run fast.
>
> Which still strikes me as an impressive effort none the less.
>
> Sometimes it's not how well the monkey dances, the fact it can dance at
> all is what impresses most.
>

I had been recently experimenting with a stripped down "GPU Profile" for
the BJX2 core, which ended up being primarily using "parameter" settings
for disabling some parts of the core (idea being to have one core remain
"full featured", and the other is more stripped down).

General idea for the GPU Profile being:
Has WEX and XGPR
Omits TLB and the main FPU;
Omits various other functionality;
...

Then have now ended up doing some things to try to reduce costs in other
areas:
Eliminated DLR, DHR, and LR writable side-channels;
These may now only be written to via the register ports.

Replaced the logic for swapping SP and SSP in the pipeline, to doing it
in the decoder (it now effectively causes these registers to switch
places at decode-time when executing inside an interrupt; rather than by
having a side-channels mechanism to swap the registers when entering or
leaving).

The former had an obvious LUT savings.

The latter appeared to cause LUT use to increase slightly, but given
both timing and power have improved, this seems to be a case of "logic
is looser so now Vivado uses more LUTs".

I have still yet to shave down LUT usage enough to fit two cores on the
FPGA though.

>
>
>
> > Splitting up geometry and then drawing each tile sequentially is not as
> > likely to be helpful.
>
> Depends I guess, I've a tendency towards per-polygon texture maps, so
> splitting polygons offline at the level building stage to suit max
> texture sizes would be a thing for me.
>

OK.

Can note that officially, TKRA-GL claims in the API to support up to
256x256; the actual technical limit at present is closer to 4096x4096,
but in this case I don't actually have enough RAM to make textures this
size viable.

>
> > There is a possibility of a slight advantage to drawing geometry
> > Z-buffer-only first, and then going back and drawing surfaces.
>
> No z-buffer in a tile renderer AFAIK, just z-sort front to back, then a
> kind of a modified painters algorithm where you get the overwrite out of
> the way quickly with a per polygon constant value.
>

Getting rid of a Z-Buffer would be a problem for OpenGL, since some
parts of the API are built around the assumption of a Z-Buffer existing
(as opposed to Painter's Algorithm or similar).

Similarly, OpenGL tends to also assume things like that you can stop the
scene mid-render, use glReadPixels, and fetch a copy of whatever has
been drawn thus far (including both the contents of the color buffer and
Z-buffer), ...

>
>
> >> Of course ripping up and changing your existing core probably isn't
> >> something you'd happily contemplate, so take my suggestion with a
> >> large grain of salt. ;-)
> >>
> >
> > Yeah.
> >
> > Also it isn't likely to offer a huge advantage in this case (with a
> > software renderer).
>
> That potentially could be augmented with custom FPGA resources if it helps?
>

There is some amount of ISA level support:
Helper ops to pack and unpack RGB555 / etc;
Operations to help with compressed-texture blocks;
An "LDTEX" instruction to help speed up texture-map rendering (*);
...

The LDTEX instruction basically turns the texture fetch into a special
addressing mode, with a texture block decoder shoved onto the end of the
load instruction. It doesn't do any texture filtering, so is basically a
"Nearest" fetch, but it is possible to encode which pixel position one
wants to help with implementing bilinear filtering or similar.
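The addressing side of such a fetch can be sketched in plain C. This is not BGB's actual LDTEX semantics (which also folds in compressed-block decode), just an illustration of the nearest-neighbor index math that such an instruction collapses into one operation; the function name and fixed-point format are my own assumptions:

```c
#include <stdint.h>

/* Illustrative nearest-neighbor fetch from a power-of-two texture:
   wraps the 16.16 fixed-point (s,t) coordinates and indexes the texel
   array, roughly the address computation a combined load+decode op
   like LDTEX performs in one instruction (minus block decode). */
uint16_t tex_fetch_nearest(const uint16_t *texels,
                           int log2_w, int log2_h,
                           uint32_t s_fx, uint32_t t_fx)
{
    uint32_t u = (s_fx >> 16) & ((1u << log2_w) - 1); /* wrap S */
    uint32_t v = (t_fx >> 16) & ((1u << log2_h) - 1); /* wrap T */
    return texels[(v << log2_w) | u]; /* e.g. an RGB555 texel */
}
```

Done in software, the shifts, masks, and the dependent load are several instructions in the span-drawing inner loop, which is why hardware help here pays off even without filtering.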

>
> > Also possibly counter-intuitively, a bigger amount of time is currently
> > going into the transform stages than into the raster-drawing parts.
> >
> >
> > So for GLQuake, time budget seems to be, roughly, say:
> >    ~ 50%, Quake itself;
> >    ~ 38%, transform stages
> >    ~  4%, Edge Walking
> >    ~ 12%, Span Drawing
> >
>
> I'm no expert, but I wouldn't have guessed that half the compute was
> going to just Quake engine housekeeping duties. :-)
>
> What on earth is going on in there I wonder?
> (no, no, please don't answer that, it's not a real question)  :-)
>

CPU is slow enough in this case that much of the time ends up going into
things like audio mixing and the main BSP walk (R_RecursiveWorldNode,
etc), and a fair bit of time into RecursivePointLight and similar as
well (eg, figuring out the light-level at a given spot in the map).

The "BoxOnPlaneSide" function also eats a good chunk of time, ...
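For readers unfamiliar with it: BoxOnPlaneSide classifies an axis-aligned bounding box against a plane during the BSP walk. A rough sketch of the idea (not the actual Quake source, which special-cases axial planes and uses a signbits lookup; names here are my own) is:

```c
/* Classify an axis-aligned box against a plane: pick the two box
   corners furthest along and against the plane normal, and test each
   against the plane distance. Returns 1 (fully in front), 2 (fully
   behind), or 3 (spanning the plane). */
typedef struct { float normal[3]; float dist; } plane_t;

int box_on_plane_side(const float mins[3], const float maxs[3],
                      const plane_t *p)
{
    float dnear = 0, dfar = 0;
    for (int i = 0; i < 3; i++) {
        if (p->normal[i] >= 0) { dfar  += p->normal[i] * maxs[i];
                                 dnear += p->normal[i] * mins[i]; }
        else                   { dfar  += p->normal[i] * mins[i];
                                 dnear += p->normal[i] * maxs[i]; }
    }
    int sides = 0;
    if (dfar  >= p->dist) sides |= 1;  /* some part in front */
    if (dnear <  p->dist) sides |= 2;  /* some part behind   */
    return sides;
}
```

Six multiply-adds plus two compares per node doesn't sound like much, but it runs for every node visited on every frame, which is how it ends up visible in a profile on a slow core.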

I am mixing sound at 16 kHz; one can get a slight speed increase by
mixing at 8 kHz, but it sounds bad enough that it isn't really worth it.

Part of the audio mixing overhead is due to Quake playing near constant
looping ambient sound effects (albeit usually at a fairly low volume),
but this "loses something".
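The per-sample cost being paid here is easy to see in a sketch. This is not Quake's actual mixer (which mixes into a paintbuffer with separate stereo channels), just a minimal illustration, with my own names, of the work each looping ambient channel adds per output sample:

```c
#include <stdint.h>

/* Minimal mixing-loop sketch: accumulate one looping mono source into
   a 16-bit output buffer with volume scaling and saturation. A game
   pays this loop once per active channel, per output sample. */
void mix_channel(int16_t *out, int n,
                 const int16_t *src, int src_len,
                 int *pos, int vol /* 0..256 */)
{
    for (int i = 0; i < n; i++) {
        int32_t s = out[i] + src[*pos] * vol / 256;
        if (s >  32767) s =  32767;   /* saturate rather than wrap */
        if (s < -32768) s = -32768;
        out[i] = (int16_t)s;
        *pos = (*pos + 1) % src_len;  /* loop the ambient source */
    }
}
```

Halving the mix rate halves `n`, which is where the speedup from 8 kHz mixing comes from; the quality cost is that everything above 4 kHz is simply gone.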

>
> >
> > Wasn't aware of any of the code for any of the Tomb Raider games having
> > been released, but then again I wasn't really much into Tome Raider.
>
"Tome-Raider" -- blasphemy if ever I saw it!!!   ;-)
>

Typo.

> And I'm not sure released is quite the right word, but the company that
> wrote it - Core Design, pretty much went bankrupt twice, leaving the
> source code with no clear legal owners, so the story goes, but of course
> trademarks and copyrights for the franchise still exist, their
> recent selloff notwithstanding.
>

OK.

I guess this happens sometimes.

Sometimes things can't continue because the original company is now
defunct. Or, sometimes the company splits and then the parts get bought
by different companies, and no one can agree who owns the rights to the
original media, ...

>
>
> >
> > Contrary to some people, I suspect the HL/HL2 era is when graphics got
> > "good enough", despite newer advances in terms of rendering technology,
> > the "better graphics" don't really improve the gameplay experience all
> > that much.
>
> Oh, I agree completely, well mostly; more polygons == more pretty
> applies at times.
>

Probably true; for example, games like Minecraft can push pretty massive
numbers of polygons despite the "simplistic" graphics.

>
> >
> > One of the more interesting recent developments is real-time
> ray-tracing.
> >
>
> I'm not so sure on the recent, I remember Silicon Graphics (remember
> them?) did a ray tracing demo donkeys years ago with some hideously
> expensive multi-core machine entirely on CPU power alone.
>
> And it's been a sort-of holy grail quest for graphics rendering "if only
> we had the computing resources to pull it off" ever since.
>
> Quite why that was such a good idea was pretty much left unspoken, given
> that hardware accelerated 3D on PCs was a thing by then and PC gaming
> had already taken off big time.
>

Practicality is more hit or miss.

>
>
> > However, it is not so easy to write a ray-tracer that is competitive
> > with a raster renderer. I had experimented (not too long ago) with
>
> Not easy? There's an understatement; try impossible!! (until very
> recently, obviously, and Intel Larrabee demos notwithstanding), there
> being so many orders of magnitude difference in the calculation costs
> between the two.
>
>
> > implementing a software ray-tracer in the Quake engine, but it fell well
> > short of usable performance.
> >
>
> Not surprising at all, given that even an Nvidia 3090 GPU struggles to
> reach acceptable frame rates while ray-tracing, and is technically
> cheating as well, with hardware-enhanced image noise reduction and
> resolution up-scaling, even then.
>
> Maybe the 4000 series fares better, but that's only if you can afford
> the things in the first place.
>


Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Message-ID: <td13i7$1u1u7$1@dont-email.me>
https://www.novabbs.com/devel/article-flat.php?id=27196&group=comp.arch#27196
Newsgroups: comp.arch
 by: BGB - Wed, 10 Aug 2022 20:13 UTC

On 8/9/2022 3:44 PM, luke.l...@gmail.com wrote:
> On Tuesday, August 9, 2022 at 9:18:46 PM UTC+1, robf...@gmail.com wrote:
>> Have you seen the graphics accelerator core on opencores.org? orsoc_graphics_accelerator.
>> It can do basic drawing operations, draw triangles, and fill areas. It takes about 15k LUTs. I have
>> modified it and used it for my own purposes. For some operations it would be hard to beat with
>> a computing core.
>
> it's what's termed a "Fixed Shader" Engine, and consequently
> is useless for modern GPU compute which is 100% "Shader"
> based (i.e. Vulkan).
>
> GPLGPU likewise is a "Fixed Shader"
> https://github.com/asicguy/gplgpu
>
> but Jeff Bush the author of Nyuzi did an analysis and
> showed that actually GPLGPU internally has functions
> remarkably similar to modern GPU instructions needed
> for modern Shader design
> https://jbush001.github.io/
>

FWIW, there isn't any hard-limit preventing running shaders on the BJX2
core, it is mostly a software issue (needing to write a shader compiler
and similar, ...).

One other limit at the moment is that trying to have a second "GPU core"
(modified BJX2 core lacking TLB and FPU, but still having WEX+Jumbo96
and XGPR and 128-bit FP-SIMD and similar) currently needs ~ 25 kLUT.

Lane 3 is partially disabled in the GPU core (its ability to run
instructions was disabled, but it can still provide register ports for
128-bit SIMD ops and similar).

This mode also omits the Shift-Add MUL/DIV unit, along with most of the
96-bit address-space stuff (I turned this off for now trying to see if I
can make everything fit).

For now, the "GPU core" still has the ALUX and BCD extensions as well
(128-bit ALU and BCD ADD/SUB), could probably add logic to exclude these
if needed (but doing so isn't likely to save much).

Main core (with most features still enabled), using ~ 40 kLUT.

Externally, the DDR+L2 costs 4.0k, and the display output costs 3.3k.

This currently goes a bit outside the total LUT budget (63k).

Looks like (GPU core):
L1 Caches : ~ 6.0k
ALUs : ~ 5.3k (1/2)
Reg-File : ~ 3.1k
Int32 MUL : ~ 1.5k
LFP SIMD : ~ 1.7k
Rest : ~ 8.0k

Within the main core:
L1 Caches : ~ 6.0k
TLB : ~ 3.0k
ALUs: : ~ 6.7k (1/2/3)
Reg-File: : ~ 4.8k (extra plumbing)
Int32 MUL : ~ 1.5k
LFP SIMD : ~ 1.7k
Binary64 FPU: ~ 4.5k (FADD/FMUL units)
Shift-Add : ~ 1.0k (64-bit MUL/DIV, FDIV)
Rest : ~ 10.8k

This is with the internal store forwarding disabled in the L1 D$, which
increases LUT cost of the L1 D$ from 5k to 11k (but saves 1 cycle if a
store is directly followed by a load from the same cache line).

The LDTEX extension seems to cost around 2k, but is fairly relevant to
the 'GPU' use-case.

Disabling XGPR has very little effect on cost; also, having a 1-wide
core as a GPU would have a significant adverse effect on rasterizer
performance. Fast Int32 multiply and similar are also kinda important as
well, ...

I had also disabled the recent experimental LoadOp extension and similar
as well for these tests.

Also had to reduce L1 cache sizes down to 16K, otherwise the two cores
also blew out the Block RAM budget...

Still a bit over-budget though...

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Message-ID: <td6vs6$e33$1@gioia.aioe.org>
https://www.novabbs.com/devel/article-flat.php?id=27238&group=comp.arch#27238
Newsgroups: comp.arch
 by: Andy - Sat, 13 Aug 2022 01:47 UTC

On 10/08/22 16:32, BGB wrote:

> I had been recently experimenting with a stripped down "GPU Profile" for
> the BJX2 core, which ended up being primarily using "parameter" settings
> for disabling some parts of the core (idea being to have one core remain
> "full featured", and the other is more stripped down).

Yep, Big & Little cores, it's the wave of the future! (or the present)

>
> I have still yet to shave down LUT usage enough to fit two cores on the
> FPGA though.

Kinda why I suggested a minimal 16(18?)bit core might be a good idea,
once you've got a LUT count for the most minimal but functional core,
you'd be in a better position to decide how many cores you could expect
to fit on your FPGA.

> Getting rid of a Z-Buffer would be a problem for OpenGL, since some
> parts of the API are built around the assumption of a Z-Buffer existing
> (as opposed to Painter's Algorithm or similar).

Yeah true, but I'm not sure what your goal is here: higher frame rates
in Quake through any means, fair or foul?
Or are you just using Quake to illustrate the performance of your
standards-compliant OpenGL renderer?

> I am mixing sound at 16 kHz; one can get a slight speed increase by
> mixing at 8 kHz, but it sounds bad enough that it isn't really worth it.
>
> Part of the audio mixing overhead is due to Quake playing near constant
> looping ambient sound effects (albeit usually at a fairly low volume),
> but this "loses something".

Well, looking at this from the Commodore Amiga way of doing things,
maybe a simple co-processor here and there, to handle the simple but
real-time-constrained parts of the game so the main core can just 'set
and forget' things like this, is a worthwhile investment of some number
of LUTs? YMMV

> I had considered getting a better board, but then sat on the fence some,
> and now they are all sold out.
>
>
>> And probably the worst time in recent memory to be shopping for new
>> electronics boards anyway, it's truly the end times to be sure, for
>> realzies this time!!!
>>
>
> Yeah, pretty much all the FPGA boards are out of stock, and those that
> remain are in dwindling numbers.
>
> People can fight for the last remaining Nexys A7 and Arty boards, ...

Yep, waiting out the current mess, and seeing what turns up when the
dust settles seems like a good plan.

I wonder though; there are startups designing and selling cheap
bare-bones FPGA boards for makers and the like, so I would have thought
someone would be selling a basic everything-you-need board for a
stand-alone soft-core computer at a reasonable price.

After all it seems like there is no shortage of comp.sci students and
hobbyists who would love such a thing.

> I have doubts that a 16-bitter could do much in terms of running an
> OpenGL backend. The task is mostly shoveling data around with some
> amount of math, and 64-bits is better for the "shoveling a lot of data
> around in relatively few clock cycles".
>
> Partly this was also based on realizing that making the core "bigger"
> has generally been cheaper than having more cores.
>
> So, from past experience, the FPGA I am using could fit roughly:
>   8x 16 bit cores (16 or 16-pair-32);
>   4x 32-bit cores (roughly SH2 or RV32I level);
>   2x 64 bit cores (plain 64-bit);
>   1x "big" 64 bit core (with 64 and 64-pair-128-bit operations).
>
> There is a "modest" cost difference between 1, 2, and 3 wide. Along with
> a modest cost difference due to wider or narrower registers
>
> But, going from a 1-wide RISC to 3-wide VLIW costs less than going from
> 1 core to 2 cores.
>
> Though, it is non-linear, going single core to dual core costs less than
> trying to go from 3-wide to 6-wide.

Ah okay, if you've been down every road before and know the tradeoffs,
then ignore me I'm just throwing ideas 'out there'.

Still, it's an interesting question I think, what size and number of
cores delivers the most performance per unit resource? - two strong
oxen, or a thousand chickens?

Seymour Cray seems to be on the losing end of this argument at this
point in time I think, now that even PCs and laptops can be bought with
a dozen or more cores.

And then on the other end of the scale there is the wafer-scale
Cerebras, with '850,000 AI-optimized cores'.

The chickens are winning big time I think, at least for machine learning
applications.

But how long until the compiler techniques Cerebras invented trickle
down to open source software, and anyone with a few hundred cores can
throw them via a compiler at whatever problem they might have?

Seems like you're in a good position with your FPGA cores to experiment
your way to an answer for the hardware question at least.

>> At this point I feel I should start calling your design 'Sybil Core',
>> (don't worry it's a 'dad' joke)
>>
>
> Had to look this up, I am guessing this is a movie reference?...
> Had never seen that movie.

Not surprised, it was before our time I think..

It was the name of the patient in books and films about the first widely
publicized case of multiple personality disorder, now largely
debunked, but still good for the odd joke or two. ;-)

> Half-Life 2 also had ragdolls, but an entity would only be either be
> either fully under the control of skeletal animation, or fully ragdoll.
>
> IIRC, it also had cases where different sequences could run at the same
> time, say with one animation loop controlling the upper torso and
> another the legs, etc.

In Tomb Raider, the shooting mechanic locks Lara's guns and arms onto a
target regardless of what the rest of her body animation is doing, so
the independent bone animation system is a big win for that sort of thing.

> Didn't really have a feature, say, where some bones would be keyframe
> animated and some ragdoll, or maybe have "inertial" bones with elastic
> effects, ... Then again, remembering some of the scenes in Portal and
> Portal 2, I suspect it is possible they may have added something like
> this. Not really looked into it though.

Pretty standard for a game engine these days AFAIK.

<snip>

> Decided not to try to explain the Mega-Man series timeline...

Okay then.

Last game I played was the simple point-and-click adventure game 'I Have
No Mouth, and I Must Scream', based on Harlan Ellison's short story of
the same name; I'd recommend it for the voice-over work by Mr Ellison
himself as the evil AI in the game (funny and twistedly evil).

It's kinda spooky on two levels, 1) the story's world building itself
and the situation it presents.

And 2) the kicker is that it has Russia and China and the US at war with
each other, they each made an AI machine to control their war efforts,
the machines secretly hookup and decide to kill off almost the entire
human race.

Except for the last part, does it remind you of anything going on in the
real world perhaps?

Scary stuff, I hate it when life imitates art, especially when the
artist was trying to deliver a dire warning, not an instruction manual!

> I guess the bigger issue with making an arbitrarily large world within
> Quake-style map technology, would be having to be the person to build
> such a world in a map editor...

Yeep, par for the course I'm afraid, although writing a 3D-world
translator from something pre-existing might be an option too.

>
> My second and 3rd engines had mostly fallen back to 2D sprite graphics
> (3D modeling and animation is a pain).

And many indie game devs would seem to agree with you, going by the
latest game being spammed on YouTube, 'Cult of the Lamb', which looks
very much like 2D sprites in a virtual 3D setup; makes sense if you can
paint and draw, I guess.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<td97d0$2u54f$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=27249&group=comp.arch#27249

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Sat, 13 Aug 2022 17:08:30 -0500
Organization: A noiseless patient Spider
Lines: 430
Message-ID: <td97d0$2u54f$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com>
<tblg10$13pon$1@dont-email.me>
<2c0104c2-0515-4d14-9e75-6e3525eea9f7n@googlegroups.com>
<tctk28$m2v$1@gioia.aioe.org> <tcvce3$1m1ek$1@dont-email.me>
<td6vs6$e33$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 13 Aug 2022 22:08:32 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="9825c1cd6ceb1df7683f3d0004ac55b7";
logging-data="3085455"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+jZDzsfSk7I3e/WHq4UPjn"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.12.0
Cancel-Lock: sha1:CDyqhjRgu0hUEd2P53dHLeiDupk=
In-Reply-To: <td6vs6$e33$1@gioia.aioe.org>
Content-Language: en-US
 by: BGB - Sat, 13 Aug 2022 22:08 UTC

On 8/12/2022 8:47 PM, Andy wrote:
> On 10/08/22 16:32, BGB wrote:
>
>
>> I had been recently experimenting with a stripped down "GPU Profile"
>> for the BJX2 core, which ended up being primarily using "parameter"
>> settings for disabling some parts of the core (idea being to have one
>> core remain "full featured", and the other is more stripped down).
>
> Yep, Big & Little cores, it's the wave of the future! (or the present)
>
>
>>
>> I have still yet to shave down LUT usage enough to fit two cores on
>> the FPGA though.
>
> Kinda why I suggested a minimal 16(18?)bit core might be a good idea,
> once you've got a LUT count for the most minimal but functional core,
> you'd be in a better position to decide how many cores you could expect
> to fit on your FPGA.
>

I managed to get a lot closer, but is it worth it?...

I have managed to shave things enough that the main core went from 82%
of the FPGA's LUT budget to 68%, though in the process some bugs have
appeared that I am still trying to sort out (such as the low-precision
SIMD unit seemingly not working reliably).

But, a 16-bit GPU wouldn't be sufficient:
The data it needs to deal with is too big for this.

Generally the rasterizer and blending operations involve shoving around
64 bits at a time. While I could potentially shave RGBA vectors down to
32 bits (using packed-byte rather than packed-word), this leaves *no*
bits for fractional steps.

Say, for drawing a span of length N, one needs around ceil(log2(N)+1)
bits below the effective ULP. Less than this, and the interpolation is
prone to break.

If say, the max span length is around 64 pixels, then a 16-bit
intermediate value is sufficient for stepping the RGBA vector (treating
each element as 8.8 fixed point).

And, 4x 16 = 64 bits.

Longer spans don't really happen as by this point the primitive will
have been tessellated into smaller pieces.

If one did the math with discrete 16-bit elements, this would be slow
(and one would also need around 128 or 256 registers to deal with all
the working state inside the raster loop).

Generally, the smallest cores I was able to make were 16/32 cores, which
generally weighed in at around 5k LUT.

These were using very simplistic L1 caches, namely a single cache line,
with no misaligned accesses.

Allowing for a RAM interface (but little else), I could maybe fit 10 of
them in the FPGA I am using.

>
>
>> Getting rid of a Z-Buffer would be a problem for OpenGL, since some
>> parts of the API are built around the assumption of a Z-Buffer
>> existing (as opposed to Painter's Algorithm or similar).
>
> Yeah true, but I'm not sure what your goal is here, higher frame rates
> in Quake through any means fair or foul?
> Or you're just using Quake to illustrate the performance of your
> standards compliant OpenGL renderer?
>

It doesn't need to be standards compliant, but I do want typical code to
work.

Something which fundamentally breaks the way the API works, or the
ability to use glReadPixels, is probably no-go.

>
>
>> I am mixing sound at 16 kHz, one can get a slight speed increase by
>> mixing at 8 kHz, but it sounds bad enough that it really isn't worth it.
>>
>> Part of the audio mixing overhead is due to Quake playing near
>> constant looping ambient sound effects (albeit usually at a fairly low
>> volume), but this "loses something".
>
> Well, looking at this from the Commodore Amiga way of doing things,
> maybe a simple co-processor here and there to handle the simple but
> real-time constrained parts of the game so the main core can just 'set
> and forget' things like this - is a worthwhile investment of some number
> of LUTS? YMMV
>

Possibly.

My PCM audio output device is a smallish MMIO mapped buffer which plays
in a loop with a few control register fields (such as selecting the
sample rate and format and mono-vs-stereo and similar).

If you leave something in the buffer and then stop updating it, it does
the "jackhammer" artifact.

I also have an FM module, which has several programmable FM channels.
Its operation is that it will basically step through the channels
calculating and adding their current output value into an accumulator
and updating the step, and when it gets to the end, it resets back to
the start and resets the accumulator, with the final accumulator value
driving a PCM output signal.

Sadly, it sounds pretty bad compared with an OPL2/3 chipset.
Would likely need to more accurately mimic the parameters and behavior
of the OPL2.

While a programmable mixer is possible, the Block-RAM requirements to
make this usable are harder, and giving it a ringbus interface to allow
it to use a DRAM backed buffer would be "kinda expensive".

>
>> I had considered getting a better board, but then sat on the fence
>> some, and now they are all sold out.
>>
>>
>>> And probably the worst time in recent memory to be shopping for new
>>> electronics boards anyway, it's truly the end times to be sure, for
>>> realzies this time!!!
>>>
>>
>> Yeah, pretty much all the FPGA boards are out of stock, and those that
>> remain are in dwindling numbers.
>>
>> People can fight for the last remaining Nexys A7 and Arty boards, ...
>
> Yep, waiting out the current mess, and seeing what turns up when the
> dust settles seems like a good plan.
>
> I wonder though, there are startups designing and selling cheap
> bare-bones FPGA boards for makers and the like, so I would have thought
> someone's selling a basic-everything you need for a stand-alone
> soft-core computer for a reasonable price.
>
> After all it seems like there is no shortage of comp.sci students and
> hobbyists who would love such a thing.
>

There are the Numato Labs boards, but most were more expensive for less
capability than the comparable Digilent options.

Similar with ALINX, where for a similar FPGA the board would be around
twice the cost of one from Digilent, and they seem to be (mostly)
focusing on FPGA boards which go into PCIe slots, rather than standalone
boards with an SDcard and VGA port and similar.

A lot of the hobbyist-built boards are using iCE40 parts, where
generally a 16-bit CPU is about all that is going to fit on the FPGA.

>
>
>> I have doubts that a 16-bitter could do much in terms of running an
>> OpenGL backend. The task is mostly shoveling data around with some
>> amount of math, and 64-bits is better for the "shoveling a lot of data
>> around in relatively few clock cycles".
>>
>> Partly this was also based on realizing that making the core "bigger"
>> has generally been cheaper than having more cores.
>>
>> So, from past experience, the FPGA I am using could fit roughly:
>>    8x 16 bit cores (16 or 16-pair-32);
>>    4x 32-bit cores (roughly SH2 or RV32I level);
>>    2x 64 bit cores (plain 64-bit);
>>    1x "big" 64 bit core (with 64 and 64-pair-128-bit operations).
>>
>> There is a "modest" cost difference between 1, 2, and 3 wide. Along
>> with a modest cost difference due to wider or narrower registers
>>
>> But, going from a 1-wide RISC to 3-wide VLIW costs less than going
>> from 1 core to 2 cores.
>>
>> Though, it is non-linear, going single core to dual core costs less
>> than trying to go from 3-wide to 6-wide.
>
> Ah okay, if you've been down every road before and know the tradeoffs,
> then ignore me I'm just throwing ideas 'out there'.
>

> Still, it's an interesting question I think, what size and number of
> cores delivers the most performance per unit resource? - two strong
> oxen, or a thousand chickens?
>

The problem is that if your options are one or two "strong" cores, or 4
or 8 "pathetically weak" cores, the strong cores generally win out here.

The cost of an L1 cache varies considerably:
Single line vs cache-line array (1);
Aligned-only vs allowing unaligned access (2);
...

1: Single-cache-line allows getting rid of a whole bunch of stuff.
However, now the vast majority of memory accesses result in an L1 miss.

Works "sorta OK" for fixed-length 16-bit instructions with an I$
(roughly 8 cycles between each I$ miss).

2: Aligned only can effectively halve the internal "width" of the cache,
since now nothing may cross a cache line boundary.

Register size and pipeline width only have a "modest" effect on CPU core
cost.

Some of my 5k LUT cores were basically using this design.


Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<2609b76b-c113-46c9-9405-960bc4100295n@googlegroups.com>

 by: robf...@gmail.com - Mon, 15 Aug 2022 05:21 UTC

Spent part of the day studying the Nyuzi core by Jeff Bush. It synthesizes to
about 88,000 LUTs in an Artix-7 xc7a200 part. But that is for 16 lanes. It could
probably be reduced in size by reducing the number of lanes. I also found this
article about using a rasterizer in addition to a GPU like core.

http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf
nyuziraster.pdf (binghamton.edu)

Sounds like a rasterizer may be a good addition to have in place.
I read through the NVIDIA GPU docs a while ago. Also worth a read.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tdcpbq$3h22d$1@dont-email.me>

 by: BGB - Mon, 15 Aug 2022 06:33 UTC

On 8/15/2022 12:21 AM, robf...@gmail.com wrote:
> Spent part of the day studying the Nyuzi core by Jeff Bush. It synthesizes to
> about 88,000 LUTs in an Artix-7 xc7a200 part. But that is for 16 lanes. It could
> probably be reduced in size by reducing the number of lanes. I also found this
> article about using a rasterizer in addition to a GPU like core.
>

If reduced to 1/8 or 1/16 that size, it might be OK.

> http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf
> nyuziraster.pdf (binghamton.edu)
>
> > Sounds like a rasterizer may be a good addition to have in place.
> > I read through the NVIDIA GPU docs a while ago. Also worth a read.

Maybe.

Quick skim, seems like it works very differently from the BJX2 core...

My experiment with the GPU profile for a tweaked BJX2 core was that I
still couldn't get stuff small enough to fit it alongside the main core.

But, can't go much smaller, as then I am not likely to have any real
advantage over the existing software rasterizer (and, even if I disable
a bunch of features, it doesn't save enough LUTs to make a significant
difference).

My biggest LUT savings was from modifying the workings of my interrupt
mechanism and similar.

I had also gained a little bit of performance in my OpenGL rasterizer
(along with a reduction in distortion) by adding dedicated code for
rasterizing quads.

Drawing quads is more complicated, however, one needs to draw fewer of
them, so it works out as a net win. Also, they seem to distort slightly
less due to affine warping, so they can be a little bigger while still
looking OK.

Also made some minor tweaks to how vertex projection is handled during
dynamic tessellation (to reduce the number of vertices being projected).

And, the code for how to tessellate now operates in world-space
coordinates rather than screen space coordinates (has fewer artifacts).

I guess a question is if there is some way to get things smaller without
compromising performance.

Something I can fit in 10k to 15k LUTs would be nice. Ideally, should
still be fully programmable (sufficient to run both C and GLSL).

Would be nice to have a fully pipelined FPU. This is at least sorta
doable at Binary32 precision without needing an unreasonably long
pipeline.

This is sorta what the Low-Precision FPU aims to do in my case, except
that it is apparently "not working" for some reason at the moment (while
trying to run GLQuake on my main core).

SIMD unit output is hard to figure out just by looking at it; I have
been gathering large chunks of output like:
(A): Rs=4394000043940000 Rt=000000003bcccccd Rn=000000003feccc00 Ixt=067
(B): Rs=41c0000041c00000 Rt=0000000000000000 Rn=0000000000000000 Ixt=067
(A): Rs=41c0000041c00000 Rt=bc23d70a00000000 Rn=be75c28000000000 Ixt=067
(B): Rs=0000000000000000 Rt=0000000000000000 Rn=0000000000000000 Ixt=065
(A): Rs=000000003feccc00 Rt=be75c28000000000 Rn=be75c2803feccc00 Ixt=065
(B): Rs=0000000000000000 Rt=00000000b727c61a Rn=0000000000000000 Ixt=067
(A): Rs=0000000000000000 Rt=0000000000000000 Rn=0000000000000000 Ixt=067
(B): Rs=0000000000000000 Rt=0000000000000000 Rn=0000000000000000 Ixt=065
(A): Rs=be75c2803feccc00 Rt=0000000000000000 Rn=be75c2803feccc00 Ixt=065
(B): Rs=3f8000003f800000 Rt=3f80000000000000 Rn=3f80000000000000 Ixt=067
(A): Rs=3f8000003f800000 Rt=3f800000bf800000 Rn=3f800000bf800000 Ixt=067
(B): Rs=0000000000000000 Rt=3f80000000000000 Rn=3f80000000000000 Ixt=065
(A): Rs=be75c2803feccc00 Rt=3f800000bf800000 Rn=3f428f003f599700 Ixt=065
(B): Rs=3f8000003f7ff401 Rt=3f80000046dffe00 Rn=3f80000046dff380 Ixt=067
(A): Rs=3f7ff4013f7ff401 Rt=42c8000043200000 Rn=42c7f680431ff880 Ixt=067
(B): Rs=3f80000000000000 Rt=3f80000046dff380 Rn=3f80000000000000 Ixt=067
(A): Rs=3f428f003f599700 Rt=42c7f680431ff880 Rn=4297f8004307f800 Ixt=067
(B): Rs=3f80000000000000 Rt=0000000000000000 Rn=3f80000000000000 Ixt=065
(A): Rs=4297f8004307f800 Rt=42c8000043200000 Rn=432ffc004393fc00 Ixt=065

Which may, theoretically, contain a hint as to why it isn't working
(067=Packed FMUL, 065=Packed FADD). The (A/B) indicates which lane is
printing (and the general lack of "random garbage" seems to imply timing
is OK).

But, deciphering floating point numbers in hexadecimal form is a pain
(there seems to be a fault with FADD here, but basic tests pass in the
unit tests and in a few sanity-check cases).

I have spent several days thus far trying to figure out what is going on
here...

I guess one thing I could try would be to (more exactly) mimic the
behavior of the low-precision SIMD operators in the emulator, and see if
the bug appears in the emulator... (It doesn't tend to produce exactly
the same results as one would get by truncating the Binary32 values).

....

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<74068459-d5ab-4929-bad5-71a71f41d1a7n@googlegroups.com>

 by: robf...@gmail.com - Mon, 15 Aug 2022 20:05 UTC

On Monday, August 15, 2022 at 2:33:34 AM UTC-4, BGB wrote:
> On 8/15/2022 12:21 AM, robf...@gmail.com wrote:
<snip>
> I guess one thing I could try would be to (more exactly) mimic the
> behavior of the low-precision SIMD operators in the emulator, and see if
> the bug appears in the emulator... (It doesn't tend to produce exactly
> the same results as one would get by truncating the Binary32 values).
>
> ...
I would run tests against random test cases and see what falls out. For
my decimal float multiply, I found one in 5,000 random cases that did
not work. It turned out to be an issue with carries in the Karatsuba
multiplier. And, I did not run enough test cases the first time through.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tdffj8$c20$1@dont-email.me>

 by: BGB - Tue, 16 Aug 2022 07:05 UTC

On 8/15/2022 3:05 PM, robf...@gmail.com wrote:
> On Monday, August 15, 2022 at 2:33:34 AM UTC-4, BGB wrote:
>> On 8/15/2022 12:21 AM, robf...@gmail.com wrote:
<snip>
> I would run tests against random test cases and see what falls out. For
> my decimal float multiply I found one in 5,000 random cases did not work.
> It turned out to be an issue with carries in the Karatsuba multiplier.
> And, I did not run enough test cases the first time through.

Turns out it wasn't actually the math that was failing here, but rather
another edge case related to a previously observed (and not fully
resolved) issue:
Previously, if a branch directly followed a memory operation, it would
cause the operation to fail;
Turns out, this applies not just to memory loads, but seemingly to any
other 3-stage operation (apparently including the low-precision FPU and
also integer multiply and similar).

The workaround was to extend the interlock checks for this case to also
cover the low-precision FPU, integer multiply, and similar...

Also possible would be to go on a bug hunt and try to figure out what is
actually going on here (a branch should not affect the EX3 stage),
though when a predicted branch initiates it will flush whatever was in
EX1 (the instruction following the branch op).

It also doesn't appear to affect operations where EX3 merely forwards a
result from a previous stage (the result value having been generated in
EX2 or EX1), but rather seems to affect operations where the result
arrives in EX3 (in these cases, the next step is for the value to be
written back to the register file).

There was a special case check here, which adds an interlock stall,
mostly because previous attempts to figure out the cause of this issue
were unsuccessful.
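Since deciphering the hex-encoded packed floats in traces like the ones
above is a pain by hand, a small host-side helper can do the decoding (a
hypothetical standalone C utility, not part of the core; function names
are made up):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 Binary32 value. */
static float f32_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Decode a 64-bit trace word as two packed Binary32 lanes. */
static void decode_packed(const char *name, uint64_t word) {
    printf("%s: hi=%g lo=%g\n", name,
           f32_from_bits((uint32_t)(word >> 32)),
           f32_from_bits((uint32_t)word));
}
```

For example, Rs=4394000043940000 decodes to 296 in both lanes, and
0x3bcccccd to roughly 0.00625, consistent with the FMUL producing
0x3feccc00 (roughly 1.85).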

....

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<77573561-b8dc-4d86-9d52-558980ae5546n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27303&group=comp.arch#27303

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
 by: MitchAlsup - Tue, 16 Aug 2022 16:04 UTC

On Tuesday, August 16, 2022 at 2:05:15 AM UTC-5, BGB wrote:
> On 8/15/2022 3:05 PM, robf...@gmail.com wrote:
> > On Monday, August 15, 2022 at 2:33:34 AM UTC-4, BGB wrote:

> > I would run tests against random test cases and see what falls out. For
> > my decimal float multiply I found one in 5,000 random cases did not work.
> > It turned out to be an issue with carries in the Karatsuba multiplier.
> > And, I did not run enough test cases the first time through.
> Turns out it wasn't actually the math that was failing here, but rather
> another edge case related to a previously observed (and not fully
> resolved) issue:
> Previously, if a branch directly followed a memory operation, it would
> cause the operation to fail;
> Turns out, this also applies not just to memory loads, but seemingly any
> other 3-stage operation (apparently includes Low-precision FPU and also
> Integer Multiply and similar).
<
Why not just check the pipeline flip-flops to see if a register you need
in the issuing instruction is still in the pipeline. {This is how MIPS 2000
performed forwarding}
>
> The workaround was apparently to extend the interlock-checks for this
> case to also include checking for the low-precision FPU and also for
> integer multiply and similar...
>
If the operand registers are interlocked, why do you need more ?
>

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tdglui$46lv$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27309&group=comp.arch#27309

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
 by: BGB - Tue, 16 Aug 2022 17:59 UTC

On 8/16/2022 11:04 AM, MitchAlsup wrote:
> On Tuesday, August 16, 2022 at 2:05:15 AM UTC-5, BGB wrote:
>> On 8/15/2022 3:05 PM, robf...@gmail.com wrote:
>>> On Monday, August 15, 2022 at 2:33:34 AM UTC-4, BGB wrote:
>
>>> I would run tests against random test cases and see what falls out. For
>>> my decimal float multiply I found one in 5,000 random cases did not work.
>>> It turned out to be an issue with carries in the Karatsuba multiplier.
>>> And, I did not run enough test cases the first time through.
>> Turns out it wasn't actually the math that was failing here, but rather
>> another edge case related to a previously observed (and not fully
>> resolved) issue:
>> Previously, if a branch directly followed a memory operation, it would
>> cause the operation to fail;
>> Turns out, this also applies not just to memory loads, but seemingly any
>> other 3-stage operation (apparently includes Low-precision FPU and also
>> Integer Multiply and similar).
> <
> Why not just check the pipeline flip-flops to see if a register you need
> in the issuing instruction is still in the pipeline. {This is how MIPS 2000
> performed forwarding}

This doesn't seem to be a forwarding issue.

But, yeah, there is "check register is in the pipeline" logic:
  Register is in pipeline, Held=0:
    We can forward the result from this register (back to ID2).
  Register is in pipeline, Held=1:
    Result doesn't exist yet (need to interlock).

The value-forwarding part is kinda expensive though:
it effectively needs to route 9 possible stage outputs back to ID2,
each of which may in turn be routed to 6 different register ports.

Could save some LUTs by using interlocks rather than forwarding, but
this would cause pretty much every instruction to now have a 3 cycle
latency.
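As a rough behavioral model of the check described above (plain C with
invented structure and field names; the real logic is Verilog), the
per-operand decision looks something like:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_FWD_STAGES 9  /* stage outputs that can feed back to ID2 */

typedef struct {
    bool     valid;  /* stage holds an in-flight result               */
    bool     held;   /* result not yet produced (e.g. 3-stage EX3 op) */
    uint8_t  reg;    /* destination register number                   */
    uint64_t value;  /* result value, meaningful only if !held        */
} FwdStage;

typedef enum { SRC_REGFILE, SRC_FORWARD, SRC_INTERLOCK } SrcKind;

/* Decide where an ID2 operand comes from: the youngest matching stage
   wins; a matching-but-held stage forces an interlock stall. */
static SrcKind resolve_operand(const FwdStage *st, uint8_t reg,
                               uint64_t *value_out) {
    for (int i = 0; i < NUM_FWD_STAGES; i++) {
        if (st[i].valid && st[i].reg == reg) {
            if (st[i].held)
                return SRC_INTERLOCK;   /* stall until value exists */
            *value_out = st[i].value;
            return SRC_FORWARD;         /* forward back to ID2      */
        }
    }
    return SRC_REGFILE;                 /* no in-flight producer    */
}
```

In hardware this loop is a priority mux; the cost comment above comes
from replicating it per register port.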

I guess it could be possible to forward from the stage inputs rather
than the outputs, which could potentially be cheaper and a little better
for timing. For the pipeline, values are forwarded along using
flip-flops (presumably there is no other option here?...).

But, would effectively add 1-cycle of latency to everything:
EXTU.B R4, R6
ADD R6, R7

So, EX1 for EXTU.B would happen at ID2 for ADD, but could not forward in
this case (would need to treat this as an interlock). Effectively,
raising the latency from 1 cycle to 2 cycles (and causing an interlock
stall in the above example).

But, "everything is now 2 or 3" cycles wouldn't be much better than
"everything is now 3 cycles".

In the scenario that came up recently, the PADDX.FA instruction would
have already left the pipeline by the time its result was being used.

PADDX.FA R2, R5, R2
RTS
BREAK

EX1a EX2a EX3a WBa~ **** PADDX.FA
---- EX1b EX2b EX3b **** RTS (Decoded as "JMP LR")
---- ---- EX1c EX2c EX3c BREAK (Flushed)
---- ---- ---- EX1~ EX2~ (*2)
---- ^^^^ ---- (Branch Address Calculated)
**** **** ^^^^ (Branch Happens Here)

So, when the branch takes place, the Fetch PC is overridden.
EX1c is Flushed (forced to become NOP), otherwise it would behave like a
delay slot.

The PC override (and pipeline flush) is skipped if the branch was
handled by the branch predictor (this appears to be the scenario where
the bug in question is happening).

However, if EX3a is doing something (receiving a result from another
FU), it seems to get stomped, for some as-of-yet undetermined reason.

*2: If the branch was predicted, this is the first instruction from the
new PC. If a full branch happened, this stage (and the next several that
follow) are flushed.

In this scenario, the first instruction following the RTS would be a
"MOVX R2, Rn" or similar (accessing the register from the WB stage).

Previously, I had thought the issue was somehow affecting the L1 D$, but
based on the observations with PADDX.FA, it appears to be somehow
affecting EX3 in some other way.

If it were disrupting the EX3->WB interface (or incorrectly flushing
EX3), I would expect it to disrupt *any* instruction preceding the
branch. Likewise, disruption of the WB->ID2 path would likely be "very
obvious", ...

I guess, it could be possible to try to look into whether stalls from
possible L1 I$ misses are quietly causing WB to fail or something, since
a stall from a branch-related L1 I$ miss would happen on this clock cycle.

It doesn't seem to be on the input side to EX3 (from the FUs), as the
debug prints were not showing either garbage or the debug values (for
testing, had rigged up a mechanism so that the LFP-FPU will output
0xAAAAAAAAAAAAAAAA or 0x5555555555555555 if it is misaligned; everything
looked fine from debug prints of the values arriving at EX3).

But, as noted, this is a mystery that I have not figured out as of yet.
But, I guess now I know that it is something to do with EX3, rather than
something related to the L1 D$.

>>
>> The workaround was apparently to extend the interlock-checks for this
>> case to also include checking for the low-precision FPU and also for
>> integer multiply and similar...
>>
> If the operand registers are interlocked, why do you need more ?

In this case, "make it work" hacks...

In theory, an interlock for "ID2 contains a branch and EX1 contains a
Load/etc" should be unnecessary.

But, it exists mostly because previously I ran into a bug and this was
the only way I could figure out at the time to "make it work".

Doesn't help that apparently it extends to other cases beyond loads.

In my case, my boot ROM contains some sanity checks for branches (some
"branch mazes" and mixtures of branches and some other types of
instructions), mostly to detect if the original form of the bug
reappears (along with a few other cases, like a lot of BREAK
instructions for if the branch suddenly decides to gain a delay slot, or
if the target PC gets misaligned somehow, ...).

This is along with a bunch of ALU sanity tests, memory Load/Store sanity
checks, and for various other instructions. Does only minimal checks for
WEX, and hardly anything for SIMD, mostly because there is only so much
space in the Boot ROM.

The 'OS'/Shell also runs some additional sanity checks.

Sadly, my ISA is a little out of scope of what can be adequately
addressed with sanity checks (would likely need impractical levels of
sanity checking code).

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<3d8a28d0-afb7-4650-b59c-34fc8a92a637n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27323&group=comp.arch#27323

Newsgroups: comp.arch
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Wed, 17 Aug 2022 06:35 UTC

On Sunday, July 24, 2022 at 4:28:47 AM UTC-6, luke.l...@gmail.com wrote:

> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
> with IEEE754 FP accuracy will punish you with a 400% power/area
> penalty compared to modern 3D-optimised GPUs which, as explicitly
> spelled out and allowed in the Vulkan(tm) Spec by the Khronos
> Group, are permitted significant accuracy reductions. Mitch
> will (or will have already) filled you in on that.

Well, if your purpose is building a device that is helpful to scientists
wishing to perform large-scale computations on FP64 numbers, you'll
just have to pay that penalty. After paying that penalty, though, something
based on a GPU architecture instead of that of a conventional non-vector
CPU is *still* vastly more powerful.

And we have the NEC Aurora TSUBASA as an example of a Cray-style
vector ISA. At a significantly higher cost, it was capable of about half
of the FP64 FLOPS that a video card with high FP64 capabilities made
on the same technology could provide. But because this architecture is
more flexible, one could use it for a higher proportion of the floating-point
calculations in typical scientific programs.

That is a tradeoff that might be well worth it.

John Savard

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<88646c7a-3e1d-406e-b3ec-7171cfd4e235n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27324&group=comp.arch#27324

Newsgroups: comp.arch
From: jsav...@ecn.ab.ca (Quadibloc)
 by: Quadibloc - Wed, 17 Aug 2022 06:39 UTC

On Sunday, July 24, 2022 at 4:41:19 PM UTC-6, luke.l...@gmail.com wrote:

> AVX-512 *shudder*

And here I was wondering why Intel hadn't been rushing to
put AVX-512 on its consumer CPUs, so that they would
squash AMD with vastly superior games and graphic
performance.

John Savard

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tdi70u$d6lk$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27326&group=comp.arch#27326

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
 by: BGB - Wed, 17 Aug 2022 07:57 UTC

On 8/17/2022 1:39 AM, Quadibloc wrote:
> On Sunday, July 24, 2022 at 4:41:19 PM UTC-6, luke.l...@gmail.com wrote:
>
>> AVX-512 *shudder*
>
> And here I was wondering why Intel hadn't been rushing to
> put AVX-512 on its consumer CPUs, so that they would
> squash AMD with vastly superior games and graphic
> performance.
>

I can make a prediction: For most workloads, if it gets used at all, it
probably won't make things much faster, or very possibly, will cause
things to get slower.

Quite possibly it would power through things like "multiply this big
matrix with this other big matrix", but then suck pretty hard for pretty
much everything else.

Or, sort of like 256-bit AVX, but even more so...

Or, say, it is something that would require special compiler support and
funky coding practices to get much benefit from (say, as traditional
auto vectorization continues to fall on its face, but now even more so...).

> John Savard

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tdi9co$ddqi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=27327&group=comp.arch#27327

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
 by: BGB - Wed, 17 Aug 2022 08:37 UTC

On 8/17/2022 1:35 AM, Quadibloc wrote:
> On Sunday, July 24, 2022 at 4:28:47 AM UTC-6, luke.l...@gmail.com wrote:
>
>> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
>> with IEEE754 FP accuracy will punish you with a 400% power/area
>> penalty compared to modern 3D-optimised GPUs which, as explicitly
>> spelled out and allowed in the Vulkan(tm) Spec by the Khronos
>> Group, are permitted significant accuracy reductions. Mitch
>> will (or will have already) filled you in on that.
>
> Well, if your purpose is building a device that is helpful to scientists
> wishing to perform large-scale computations on FP64 numbers, you'll
> just have to pay that penalty. After paying that penalty, though, something
> based on a GPU architecture instead of that of a conventional non-vector
> CPU is *still* vastly more powerful.
>
> And we have the NEC Aurora TSUBASA as an example of a Cray-style
> vector ISA. At a significantly higher cost, it was capable of about half
> of the FP64 FLOPS that a video card with high FP64 capabilities made
> on the same technology could provide. But because this architecture is
> more flexible, one could use it for a higher proportion of the floating-point
> calculations in typical scientific programs.
>
> That is a tradeoff that might be well worth it.
>

Pretty much, there are reasons I am mostly going for DAZ + FTZ and FPU
designs which don't give exact 0.5 ULP rounding. These make the FPU so
much cheaper.
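At the bit level, DAZ/FTZ just means treating any Binary32 with a zero
exponent field as a signed zero (on inputs for DAZ, on results for FTZ),
which is where the subnormal-handling logic gets saved; a minimal sketch
(the helper name is made up):

```c
#include <stdint.h>

/* Flush a Binary32 bit pattern's subnormals to signed zero.
   Applied to operands this models DAZ; applied to results, FTZ. */
static uint32_t f32_daz(uint32_t bits) {
    if ((bits & 0x7F800000u) == 0)   /* exponent field all-zero */
        return bits & 0x80000000u;   /* keep only the sign bit  */
    return bits;
}
```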

For GPU-like uses, it is possible there could be a use-case for a
further reduced-precision Binary64 variant, say:
S.E11.F32.Z20

This is partly because there are still a few cases where a 16-bit
mantissa is not sufficient (and where needing to keep the main FPU
around is also fairly expensive).

Namely, there is the point in my rasterizer where it goes from
single-precision to 32-bit fixed-point.

On the floating-point side, it can use low precision calculations just
fine. On the integer side, it is 16.16 fixed point. However, the
conversion step from floating point to fixed point integer exceeds the
accuracy requirements of both the truncated-single and full single
precision (so, would effectively need something with a 32-bit mantissa
to deal with this).
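The gap is easy to quantify: a 16.16 result can need up to 32
significant bits, while Binary32 carries only 24, so above 256.0 the
float's step (2^-15 at that magnitude) is already coarser than the
2^-16 fixed-point step. A quick illustration (plain C, hypothetical
helper names, nothing BJX2-specific):

```c
#include <stdint.h>

/* Convert to 16.16 fixed point via Binary32 vs. Binary64. */
static int32_t fx16_via_f32(float x)  { return (int32_t)(x * 65536.0f); }
static int32_t fx16_via_f64(double x) { return (int32_t)(x * 65536.0); }

/* At magnitude 300 a Binary32 ulp is 2^-15, so a 2^-17 increment
   (under half an ulp) disappears outright:
       300.0f + 1.0f/131072.0f == 300.0f
   and a single 16.16 step (2^-16) can no longer be represented. */
```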

This may be added as another possible sub-feature of the low-precision
unit, namely scalar FADD/FMUL units with a 32-bit mantissa, which would
also need to support integer conversion. Mostly to serve as a cheaper
stand-in for the main FPU.

Though, it is possible the "GPU core" idea might be dead for now, given
I still can't quite make stuff cheap enough.

Well, either that, or try to cheap out on the FP->Int conversion process
and add some additional "low cost" SIMD converters (2x FP32 <-> Packed
Int32). Say, if one adds a bias of 1024, this is enough to get approx
10.6 output, which can then be shifted to give values in the 16.16
format (but would also require doing this step via a specialized blob of
ASM; at present this part of the rasterizer is still written in C).
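The bias trick works because adding 1024.0f pins the exponent: for x in
[0, 1024), x + 1024 lands in [1024, 2048), where the 23-bit mantissa
field directly encodes x in 10.13 fixed point (a truncated 16-bit
mantissa would similarly give the 10.6 mentioned above). A sketch in
full Binary32 (helper name is made up):

```c
#include <stdint.h>
#include <string.h>

/* Bias trick: for 0 <= x < 1024, x + 1024.0f has exponent 2^10, so
   the mantissa field holds x as 10.13 fixed point; shifting left by
   3 gives 16.16. Exact for x with at most 13 fractional bits; finer
   fractions get rounded by the addition. */
static int32_t fx16_bias_trick(float x) {
    float y = x + 1024.0f;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    return (int32_t)(bits & 0x007FFFFFu) << 3;
}
```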

I guess this could potentially side-step the need for a higher precision
unit (but would limit maximum viewport size).

> John Savard

AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)

<2022Aug17.101938@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=27328&group=comp.arch#27328

Newsgroups: comp.arch
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
 by: Anton Ertl - Wed, 17 Aug 2022 08:19 UTC

Quadibloc <jsavard@ecn.ab.ca> writes:
>And here I was wondering why Intel hadn't been rushing to
>put AVX-512 on its consumer CPUs, so that they would
>squash AMD with vastly superior games and graphic
>performance.

Intel engineering has. They put AVX512 in Cannon Lake, Ice Lake,
Tiger Lake, Rocket Lake and Alder Lake. Thanks to Intel's 10nm woes,
Cannon Lake did not appear in 2016 as the tick-tock plan would have
had it, but in 2018 in very small numbers and with disabled GPU, Ice
Lake and Tiger Lake only in laptops and NUCs, so the desktop saw it
only in Rocket Lake in 2021 (and that appeared only in high-priced
SKUs). For Alder Lake, the AVX512-capable Golden Cove cores are
combined with the non-AVX512 Gracemont cores, and because OSs cannot
(yet?) deal with cores that support different instruction set
extensions, AVX512 was disabled (even if the Gracemont cores are
disabled).

There are also the low-cost cores (Silvermont, Goldmont, Goldmont+,
Tremont, Gracemont), which don't execute AVX512 instructions (and,
except for Gracemont, not AVX, either). AMD showed that you can
implement AVX with 128-bit functional units, and I think you can also
do that with AVX512. AMD showed with Jaguar and Puma that this can
also be done in low-cost cores. If Intel actually wanted to
support AVX512 everywhere, they could have followed this example.

And then you have Intel marketing, which disables AVX512 in SKUs for
chips that actually support AVX512 (and likewise disable AVX in some
SKUs), ensuring that most programmers will steer clear of AVX for many
years to come and even longer for AVX512.

If they do this disabling and the non-support of these extensions on
the low-cost cores for market segmentation reasons, they don't
understand their job: Their actions ensure that AVX and AVX512 are
used by so little software that few people will see AVX and AVX512 as
product differentiators. It would be better for them if they had
cores (and maybe also SKUs) with fast and slow AVX and AVX512
implementations.

If they do this to salvage chips with broken high parts of their
AVX/AVX512 units, another way to deal with this is to build the
support for synthesizing AVX/AVX512 from narrower functional units
into cores that actually have wide units. They could achieve both
goals by disabling the high part of the AVX/AVX512 units and enabling
the synthesizing on cheaper SKUs; then there would be SKUs with fast
AVX512 and SKUs with slow AVX512, a situation that is much more
likely to lead to widespread AVX512 usage in software.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and fast(ish) GPU)

<jwvlerncc0p.fsf-monnier+comp.arch@gnu.org>

https://www.novabbs.com/devel/article-flat.php?id=27336&group=comp.arch#27336

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: monn...@iro.umontreal.ca (Stefan Monnier)
Newsgroups: comp.arch
Subject: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and fast(ish) GPU)
Date: Wed, 17 Aug 2022 08:07:10 -0400
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <jwvlerncc0p.fsf-monnier+comp.arch@gnu.org>
References: <t6gush$p5u$1@dont-email.me>
<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Info: reader01.eternal-september.org; posting-host="002c913128fd21c3b4e5b18a78c4b1c5";
logging-data="475682"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX183CMzALBrmfD/qHltYehkn"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/29.0.50 (gnu/linux)
Cancel-Lock: sha1:oRToKlZc0jImI4jnx5Sj3djH4Gg=
sha1:fSR/b1iPcNTRx322w1MgiL7hAU8=
 by: Stefan Monnier - Wed, 17 Aug 2022 12:07 UTC

> 3) RVV or any other Cray-Style Vector ISA requiring strict compliance
> with IEEE754 FP accuracy will punish you with a 400% power/area
> penalty compared to modern 3D-optimised GPUs which, as explicitly

What is the origin of this extra cost? IOW, where's the meat of the
savings (I mean: which part of the computation of the last few bits
costs so much)?
Does the saving vary significantly between instructions?

Stefan

Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and fast(ish) GPU)

<4b919cba-b547-427a-9151-554139e3cca4n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=27338&group=comp.arch#27338

Newsgroups: comp.arch
From: robfi...@gmail.com (robf...@gmail.com)
 by: robf...@gmail.com - Wed, 17 Aug 2022 13:13 UTC

>For GPU-like uses, it is possible there could be a use-case for a
>further reduced-precision Binary64 variant, say:
>S.E11.F32.Z20

>This is partly because there are still a few cases where a 16-bit
>mantissa is not sufficient (and where needing to keep the main FPU
>around is also fairly expensive).

I have been thinking of going with 21-bit floats S1.E5.F15 because
three can fit into 64 bits, so extra vector lanes can be had cheaply.

I like the low precision idea. Keeping the significand under 18 bits
means just a single DSP can be used for the multiply.
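Three such 21-bit lanes occupy the low 63 bits of a 64-bit word; a
sketch of the packing (the lane layout, one lane per 21 bits with the
top bit unused, is an assumption):

```c
#include <stdint.h>

/* Pack three 21-bit S1.E5.F15 bit patterns into one 64-bit word,
   lane i occupying bits [21*i+20 : 21*i]; bit 63 is left unused. */
static uint64_t pack3_f21(uint32_t a, uint32_t b, uint32_t c) {
    return ((uint64_t)(a & 0x1FFFFFu))
         | ((uint64_t)(b & 0x1FFFFFu) << 21)
         | ((uint64_t)(c & 0x1FFFFFu) << 42);
}

/* Extract lane 0, 1, or 2 back out as a 21-bit pattern. */
static uint32_t unpack_f21(uint64_t v, int lane) {
    return (uint32_t)(v >> (21 * lane)) & 0x1FFFFFu;
}
```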
