Rocksolid Light

devel / comp.arch / Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

Subject / Author
* Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
+* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|  `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|   +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|   |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|   | `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|   `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|    `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|     `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|      `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|       +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
|       |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|       `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
+* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Terje Mathisen
|+- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| | +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| | |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| | | `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| | |  +- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| | |  `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
| | `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Andy
| |  +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.robf...@gmail.com
| |  |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
| |  | `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |  `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |   `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Andy
| |    `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |     `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.robf...@gmail.com
| |      `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |       `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.robf...@gmail.com
| |        `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| |         `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
| |          `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Terje Mathisen
| |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
| +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
| |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Terje Mathisen
| `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Quadibloc
|  +- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
|  +* AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Anton Ertl
|  |`* Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)John Dallman
|  | +* Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Thomas Koenig
|  | |`- Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)John Dallman
|  | `* Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Marcus
|  |  +- Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Anton Ertl
|  |  `- Re: AVX512 (was: Misc: Idle thoughts for cheap and fast(ish) GPU.)Terje Mathisen
|  `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|   `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Michael S
|    `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
|     `* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Michael S
|      +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
|      |`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
|      | `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
|      `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Anton Ertl
`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.luke.l...@gmail.com
 +* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.Quadibloc
 |+* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
 ||`* Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
 || `- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.BGB
 |`- Re: Misc: Idle thoughts for cheap and fast(ish) GPU.MitchAlsup
 `* Power cost of IEEE754 (was: Misc: Idle thoughts for cheap and fast(ish) GPU)Stefan Monnier
  +* Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap androbf...@gmail.com
  |`- Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap andBGB
  +* Re: Power cost of IEEE754Terje Mathisen
  |+* Re: Power cost of IEEE754BGB
  ||+* Re: Power cost of IEEE754MitchAlsup
  |||`- Re: Power cost of IEEE754BGB
  ||`* Re: Power cost of IEEE754MitchAlsup
  || `* Re: Power cost of IEEE754Quadibloc
  ||  +- Re: Power cost of IEEE754MitchAlsup
  ||  +* Re: Power cost of IEEE754Anton Ertl
  ||  |`* Re: Power cost of IEEE754Terje Mathisen
  ||  | `- Re: Power cost of IEEE754MitchAlsup
  ||  `* Re: Power cost of IEEE754Thomas Koenig
  ||   `* Re: Power cost of IEEE754John Dallman
  ||    `* Re: Power cost of IEEE754Michael S
  ||     `* Re: Power cost of IEEE754EricP
  ||      `* Re: Power cost of IEEE754MitchAlsup
  ||       `* Re: Power cost of IEEE754EricP
  ||        `- Re: Power cost of IEEE754EricP
  |`* Re: Power cost of IEEE754Paul A. Clayton
  | +* Re: Power cost of IEEE754MitchAlsup
  | |`* Re: Power cost of IEEE754luke.l...@gmail.com
  | | +- Re: Power cost of IEEE754MitchAlsup
  | | +* Re: Power cost of IEEE754Josh Vanderhoof
  | | |`* Re: Power cost of IEEE754BGB
  | | | `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |  `* Re: Power cost of IEEE754BGB
  | | |   `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |    `* Re: Power cost of IEEE754BGB
  | | |     `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |      `* Re: Power cost of IEEE754BGB
  | | |       `* Re: Power cost of IEEE754Josh Vanderhoof
  | | |        `- Re: Power cost of IEEE754BGB
  | | `- Re: Power cost of IEEE754Terje Mathisen
  | +* Re: Power cost of IEEE754Ivan Godard
  | `* Re: Power cost of IEEE754BGB
  `- Re: Power cost of IEEE754 (was: Misc: Idle thoughts for cheap andQuadibloc

Misc: Idle thoughts for cheap and fast(ish) GPU.

<t6gush$p5u$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25457&group=comp.arch#25457

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 23 May 2022 16:38:19 -0500
Organization: A noiseless patient Spider
Lines: 225
Message-ID: <t6gush$p5u$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 23 May 2022 21:38:25 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="eeaaa3b404a5ae8eb247a19ed7e0cb7b";
logging-data="25790"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ueB7fTkEfFVP6FKjRmhwc"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:DG87HotpOIq/C/6YnqsM27nQY/o=
Content-Language: en-US
 by: BGB - Mon, 23 May 2022 21:38 UTC

So, it came up elsewhere, but I can note in my efforts that:
Getting GLQuake above 10fps on a 50MHz CPU core is non-trivial.

A dedicated GPU is possible, but I have a practical resource limit (on
the XC7A100T) of around 15-20 kLUT for something like this.

To get the sort of frame-rates I want for Quake, it looks like I am
going to need a fill-rate of around 5 megapixels/second (or more, more
is better...), or around 10 clock-cycles per-pixel assuming a 50MHz GPU.

Some of this seems like a stretch...
Not quite impossible, but still a bit much.
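The fill-rate arithmetic here can be sanity-checked with a trivial C sketch (numbers from the paragraph above; the per-frame figure is just fill-rate divided by frame-rate):

```c
/* Cycles available per pixel at a given clock and target fill-rate. */
static double cycles_per_pixel(double clock_hz, double fill_rate_pps) {
    return clock_hz / fill_rate_pps;
}

/* Pixels that can be filled per frame at a given frame-rate. */
static double pixels_per_frame(double fill_rate_pps, double fps) {
    return fill_rate_pps / fps;
}
```

At 50 MHz and 5 Mpix/s this gives the 10-cycle-per-pixel budget; at 10 fps it leaves roughly 500k filled pixels per frame (overdraw included).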

So, high-level thoughts for a GPU ISA:
32x 128-bit registers, also addressable as 64x 64-bit (R0..R63);
These function as both SIMD registers and GPRs;
One or two operations per cycle, depending on operation type;
An ALU op may pair with another op;
Most other ops are one-off.
Would have 2 ALUs, and a low-precision FP-SIMD unit.
FP-SIMD would be Binary16 and truncated Binary32;
Would ignore the low 8 bits of Binary32 to make FPUs faster/cheaper;
DAZ + FTZ + Truncate Only.
Would omit Binary64 support entirely (too expensive).
Load/Store:
Typical sizes (mostly the same as BJX2);
Would support a LDTEX instruction:
Mashes texture addressing and UTX2 decoding into a single op (1).
No MMU or similar.
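A rough C model of the truncated-Binary32 idea (my own sketch; "low 8 bits" taken as the low 8 fraction bits, leaving S.E8.F15 worth of precision, with DAZ/FTZ flushing to signed zero):

```c
#include <stdint.h>
#include <string.h>

/* Model of a truncated Binary32: ignore the low 8 fraction bits and
 * treat zero-exponent values (denormals) as zero. Truncate only. */
static float trunc_f32(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    if ((u & 0x7F800000u) == 0)  /* zero/denormal exponent: DAZ/FTZ */
        u &= 0x80000000u;        /* keep only the sign bit */
    u &= 0xFFFFFF00u;            /* truncate the low 8 fraction bits */
    memcpy(&x, &u, sizeof x);
    return x;
}
```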

Or, basically, optimizing it for GL rasterization tasks, even if it
sucks for pretty much anything else.

Probable target language would be for a C-like superset of GLSL (would
add things like pointer-based memory addressing and similar).

SIMT could be possible, but is unlikely to fit within the resource budget.

LDTEX:
AGU would use the high bits of the texture address to figure out how
to map ST coords.

(47: 0): Address
(51:48): X Size (0|15=Morton)
(56:52): X+Y (Size/Mask)
(59:57): Pixel format:
000=RGB555A
001=UTX2

Address:
Extract Block/Texel ST
If Morton:
Interleave ST bits;
Else:
Shift T by X bits, copying low-bits from S.
Likely with an implicit size-limit on the texture.
Mask by XY size;
Shift and Add to Base-Address.
For RGB555A: Shift left 1 bit and add to Rb;
For UTX2: Shift right 1 bit, zero low 3 bits, add to Rb.
Low 4 bits of texel address swizzled into high bits.
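The addressing steps above could be modeled roughly as follows (a sketch: the descriptor field layout is taken from the listing above, everything else is an assumption):

```c
#include <stdint.h>

/* Spread the low 16 bits of x out to the even bit positions. */
static uint32_t part1by1(uint32_t x) {
    x &= 0xFFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton (Z-order) interleave of the S and T texel coordinates. */
static uint32_t morton_interleave(uint32_t s, uint32_t t) {
    return part1by1(s) | (part1by1(t) << 1);
}

/* Decode of the texture-descriptor fields given above:
 * (47:0)=Address, (51:48)=X Size, (56:52)=X+Y, (59:57)=Pixel format. */
typedef struct {
    uint64_t base;    /* (47: 0) */
    unsigned xsize;   /* (51:48), 0|15 = Morton */
    unsigned xysize;  /* (56:52) */
    unsigned pixfmt;  /* (59:57), 000=RGB555A, 001=UTX2 */
} tex_desc;

static tex_desc decode_desc(uint64_t d) {
    tex_desc td;
    td.base   = d & 0xFFFFFFFFFFFFull;
    td.xsize  = (unsigned)((d >> 48) & 0xF);
    td.xysize = (unsigned)((d >> 52) & 0x1F);
    td.pixfmt = (unsigned)((d >> 57) & 0x7);
    return td;
}
```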

The UTX decoder could potentially be shoved into the EX3 stage.
Or into the L1, with a mandatory alignment on texture blocks.
Not sure which would be "less awful" for timing reasons.

This instruction would only do NEAREST, LINEAR would need to be done
manually.

Say:
LDTEX (R8, R10, 0), R4
LDTEX (R8, R10, 1), R5
LDTEX (R8, R10, 2), R6
LDTEX (R8, R10, 3), R7
MOVU.L (R13), R3 //Z-Buffer
BLERPS.W R4, R10, R16
BLERPS.W R6, R10, R17
BLERPT.W R16, R10, R2
PMULHU.W R2, R24, R2
RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
MOV.W?T R2, (R12) //Color Write
MOV.L?T R14, (R13) //Z Write
Step and Loop

Though, even this would already more than blow out the 10-cycle-per-pixel
target if one assumes GL_LINEAR.

Though, there is a trick here of using:
GL_LINEAR, GL_NEAREST_MIPMAP_NEAREST
Which only uses LINEAR for things up close to the camera, leaving it as
a small minority of the on-screen pixels (except when the player has
their face up close to a wall).
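In standard OpenGL terms, that filtering split is just the following (a config fragment; this is the stock API, not anything specific to this design):

```c
/* LINEAR only on magnification (up close); NEAREST within the nearest
 * mip level on minification (most of the screen). */
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                GL_NEAREST_MIPMAP_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
```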

Trying to cram the entire interpolation process into a single
instruction is likely to be a little too expensive.

Copy/Pasting bits from elsewhere:

The ppp field encoding an instruction mode, say:
* 000: Pred?T
* 001: Pred?F
* 010: Pred?T, WEX
* 011: Pred?F, WEX
* 100: Normal / Unconditional
* 101: Special / Unconditional
* 110: Normal, WEX
* 111: Jumbo

Would use Jumbo encodings like in BJX2, but would have a slightly bigger
space for Jumbo encodings.

Would allow expanding Imm9 forms to Imm33.
Would allow some special-forms to encode a few Imm48 cases.
Such as a 2x Float24 or 4x Float12 Vector Load.
May consider a 96-bit Jumbo Load mostly to allow for Imm64 encodings.

Say, Possible:
1111-nnnn-iiii-iiii iiii-iiii-iiii-iiii
1101-00nn-iiii-iiii iiii-iiii-iiii-iiii
Load Imm48s, Rn // (0000iiiiiiiiiiii / FFFFiiiiiiiiiiii)
1101-01nn-iiii-iiii iiii-iiii-iiii-iiii
Load Imm48hi, Rn // (iiiiiiiiiiii0000)

1111-nnnn-iiii-iiii iiii-iiii-iiii-iiii
1101-10nn-iiii-iiii iiii-iiii-iiii-iiii
Load Imm2xFp24, Rn // (iiiiii0011111100)
1111-nnnn-iiii-iiii iiii-iiii-iiii-iiii
1101-11nn-iiii-iiii iiii-iiii-iiii-iiii
Load Imm4xFp12, Rn // (iii0iii0iii0iii0)

1111-0000-iiii-iiii iiii-iiii-iiii-iiii
1111-0000-iiii-iiii iiii-iiii-iiii-iiii
1ppp-1110-00nn-nnnn iiii-iiii-iiii-iiii
Load Imm64, Rn

Say, partial listing (still "idea stage" for now):

* 1ppp-0000-vvnn-nnnn usss-sss0-iitt-tttt STx Rn, (Rs, Rt*(2^ii))
* 1ppp-0000-vvnn-nnnn usss-sss1-iitt-tttt LEA (Rs, Rt*(2^ii)), Rn / ...
* 1ppp-0001-vvnn-nnnn usss-sss0-iitt-tttt LDx (Rs, Rt*(2^ii)), Rn
* 1ppp-0001-vvnn-nnnn usss-sss1-iitt-tttt LDTEX (Rs, Rt, ii), Rn

* 1ppp-0010-vvnn-nnnn usss-sssi-iiii-iiii STx Rn, (Rs, Imm9)
* 1ppp-0011-vvnn-nnnn usss-sssi-iiii-iiii LDx (Rs, Imm9), Rn

uvv: 3b field: 000=SB, 001=SW, 010=SD, 011=SQ, 100=UB, 101=UW, 110=UD, 111=X

Except LDTEX, where uvv relates to the texture mode.
High bits of Rs would encode the texture type and resolution.
The ii field would indicate which texel to sample.
Namely, whether to round the S or T coord down or up.
The Rt register in this case would be a SIMD vector encoding the S
and T coordinates.

Note that LDTEX would unpack the texel into a SIMD vector format.
uvv:
000=2x Int32, 001=2x Binary32, 010=4x Int16, 011=4x Binary16
100= -, 101= -, 110=4x Int32, 111=4x Binary32
Assumes a texture-format compatible with the vector type.

1ppp-0100-vvnn-nnnn usss-sss0-00tt-tttt ALU Ops (3R)
* 1ppp-0101-vvnn-nnnn usss-sssi-iiii-iiii ALU Rs, Imm9, Rn
** uvv: 000=ADD, 001=SUB, 010=MULS, 011=MULU, 100=Shift, 101=AND,
110=OR, 111=XOR
** Shift would encode sub-type via high bits of immediate:
*** 000w: Logical Left (2x32/4x16), 001w=Logical Right (2x32/4x16)
*** 010w: Arithmetic Left (2x32/4x16), 011w=Arithmetic Right (2x32/4x16)
*** 100: Logical Left (64), 101=Logical Right (64)
*** 110: Arithmetic Left (64), 111=Arithmetic Right (64)

* 1ppp-0100-vvnn-nnnn usss-sss1-zzzz-zzzz ALU Ops (2R Space)
** ( MOV-RR, CMPxx, TEST, ... go here )
** For 2R SIMD ops, uvv encodes vector format
*** Other ops: interpreted as opcode bits.

* 1ppp-0110-vvnn-nnnn usss-sssu-uutt-tttt 64-bit SIMD Ops (3R)
* 1ppp-0111-vvnn-nnnz usss-sszu-uutt-tttz 128-bit SIMD Ops (3R)
** vv: 00=2xI, 01=2xF, 10=4xI, 11=4xF
** uuuu:
*** 0000=ADD/FADD, 0001=SUB/FSUB, 0010=MULS/FMUL, 0011=MULU/FMAC
*** 0100=MULHS, 0101=MULHU, 0110=- , 0111=FMSC
*** 1000=MOVLL, 1001=MOVLH, 1010=MOVHL, 1011=MOVHH
**** ( Selects elements from source registers )
**** ( Can be used for things like matrix transpose, 32b shuffle, ... )
*** ...

* 1ppp-1110-00nn-nnnn iiii-iiii-iiii-iiii MOV Imm16u, Rn
* 1ppp-1110-01nn-nnnn iiii-iiii-iiii-iiii MOV Imm16n, Rn
* 1ppp-1110-10nn-nnnn iiii-iiii-iiii-iiii ADD Imm16u, Rn
* 1ppp-1110-11nn-nnnn iiii-iiii-iiii-iiii ADD Imm16n, Rn

* 1ppp-1111-00nn-nnnn izzz-zzzi-iiii-iiii ALU Imm10s, Rn
** ( CMPxx, TEST, ... go here )

* 1ppp-1111-10nn-nnnn 0sss-sss0-iiii-iiii SHUF.W Rs, Imm8, Rn
* 1ppp-1111-10nn-nnnn 0sss-sss1-iiii-iiii SHUF.L Rs, Imm8, Rn
** Shuffle SIMD elements.

* 1ppp-1111-1110-dddd dddd-dddd-dddd-dddd BRA Disp20 (Branch)
* 1ppp-1111-1111-dddd dddd-dddd-dddd-dddd BSR Disp20 (Call)

These encodings basically eat all the encoding space, but for the
use-case this is probably acceptable.
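As an aside, the MOVLL/MOVLH/MOVHL/MOVHH element selects above could be modeled like this (a sketch over 2x32 lanes; the exact lane ordering is my assumption):

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } v2x32;  /* one 64-bit SIMD value */

/* MOVab: take element 'a' of Rs and element 'b' of Rt (L=low, H=high). */
static v2x32 movll(v2x32 s, v2x32 t) { return (v2x32){ s.lo, t.lo }; }
static v2x32 movlh(v2x32 s, v2x32 t) { return (v2x32){ s.lo, t.hi }; }
static v2x32 movhl(v2x32 s, v2x32 t) { return (v2x32){ s.hi, t.lo }; }
static v2x32 movhh(v2x32 s, v2x32 t) { return (v2x32){ s.hi, t.hi }; }
```

E.g. a 2x2 transpose: rows (a,b) and (c,d) become columns via movll(r0,r1)=(a,c) and movhh(r0,r1)=(b,d).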

I have yet to fully spec out this idea.

Any thoughts? ...

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25458&group=comp.arch#25458

X-Received: by 2002:a05:620a:2402:b0:6a5:3b28:d726 with SMTP id d2-20020a05620a240200b006a53b28d726mr1129080qkn.500.1653350918257;
Mon, 23 May 2022 17:08:38 -0700 (PDT)
X-Received: by 2002:ac8:7fcc:0:b0:2f9:183e:da30 with SMTP id
b12-20020ac87fcc000000b002f9183eda30mr14902706qtk.22.1653350918149; Mon, 23
May 2022 17:08:38 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!2.eu.feeder.erje.net!feeder.erje.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 23 May 2022 17:08:38 -0700 (PDT)
In-Reply-To: <t6gush$p5u$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6924:db9b:e195:3505;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6924:db9b:e195:3505
References: <t6gush$p5u$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 24 May 2022 00:08:38 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Tue, 24 May 2022 00:08 UTC

On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:

> This instruction would only do NEAREST, LINEAR would need to be done
> manually.
>
> Say:
> LDTEX (R8, R10, 0), R4
> LDTEX (R8, R10, 1), R5
> LDTEX (R8, R10, 2), R6
> LDTEX (R8, R10, 3), R7
> MOVU.L (R13), R3 //Z-Buffer
> BLERPS.W R4, R10, R16
> BLERPS.W R6, R10, R17
> BLERPT.W R16, R10, R2
> PMULHU.W R2, R24, R2
<
Why are the 9 above instructions not a single instruction ?
<
> RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
> MOV.W?T R2, (R12) //Color Write
> MOV.L?T R14, (R13) //Z Write
> Step and Loop
--------------------------------
>
> Any thoughts? ...

See above.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<t6hgk5$5hg$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25459&group=comp.arch#25459

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 23 May 2022 21:41:04 -0500
Organization: A noiseless patient Spider
Lines: 97
Message-ID: <t6hgk5$5hg$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 24 May 2022 02:41:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4b225c29e5615bf2924203db3e119084";
logging-data="5680"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/9pKoXrskQ1mauMXOhtieE"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:m/QOpQcd6ql7K9GyCEbbtA6YnrQ=
In-Reply-To: <66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 24 May 2022 02:41 UTC

On 5/23/2022 7:08 PM, MitchAlsup wrote:
> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
>
>> This instruction would only do NEAREST, LINEAR would need to be done
>> manually.
>>
>> Say:
>> LDTEX (R8, R10, 0), R4
>> LDTEX (R8, R10, 1), R5
>> LDTEX (R8, R10, 2), R6
>> LDTEX (R8, R10, 3), R7
>> MOVU.L (R13), R3 //Z-Buffer
>> BLERPS.W R4, R10, R16
>> BLERPS.W R6, R10, R17
>> BLERPT.W R16, R10, R2
>> PMULHU.W R2, R24, R2
> <
> Why are the 9 above instructions not a single instruction ?

But... How exactly would I be expected to implement this in a way where
it works consistently and can still pass timing?...

The accesses will not necessarily fall within the same texture block,
nor necessarily into adjacent blocks in memory.

So, at best I either have:
Seams at block-edges (if only interpolating intra-block texels);
4x the resource cost for the texture fetch;
...

LERP latency isn't exactly zero either:
Simple LERP: E=(A*(~S))+(B*S)
Bi-LERP: E=((A*(~S))+(B*S))*(~T) + ((C*(~S))+(D*S))*T;

So, it is multiple ops mostly because I don't think I can make it much
shorter than this.
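For reference, one channel of the above in plain fixed-point C (a sketch; 8-bit weights, with (255 - s) standing in for ~S):

```c
#include <stdint.h>

/* Simple LERP between a and b with 8-bit weight s. */
static uint32_t lerp8(uint32_t a, uint32_t b, uint32_t s) {
    return (a * (255 - s) + b * s) / 255;
}

/* Bi-LERP over texels a,b (top) and c,d (bottom) with weights s,t. */
static uint32_t bilerp8(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                        uint32_t s, uint32_t t) {
    uint32_t top = lerp8(a, b, s);  /* roughly BLERPS on the top pair */
    uint32_t bot = lerp8(c, d, s);  /* roughly BLERPS on the bottom pair */
    return lerp8(top, bot, t);      /* roughly BLERPT across the rows */
}
```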

As-is, on BJX2, the dependency chain is basically:
PMORT //shuffle
SHLD //shift
AND //mask by texture size
MOV.Q //load
BLKUTX2 //extract texel from block
... //blending math

Collapsing the first 5 into a single operation should at least be an
improvement.

Well, along with needing to reduce the number of clock-cycles needed for
PADDX.F and PMULX.F, ...

My current thinking is that if ~ 70-80% of the scene is quietly being
rendered with NEAREST filtering, it is faster, and "not super obvious".

There are "slightly noticeable" boundaries where stuff moves
between minification and magnification cases, or transitions between
mip-levels, but alas. Full trilinear would help here, but I am not
really going to have the clock-cycle budget to afford it.

Hope is more just to be able to get GLQuake above single-digit framerates.

Which as-is, looks like it will effectively require being able to push
megapixel-per-second fill-rates.

> <
>> RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
>> MOV.W?T R2, (R12) //Color Write
>> MOV.L?T R14, (R13) //Z Write
>> Step and Loop
> --------------------------------
>>
>> Any thoughts? ...
>
> See above.

My design was partially a counterpoint, where people were thinking:
Yeah, RISC-V with the 'V' extension, totally makes sense as a GPU;
Me thinking: That's probably gonna *suuuck*.

Well, unless one imagines the GPU as basically a glorified coprocessor
for running TensorFlow or similar; for that, it would probably work
pretty well.

But, for actual 3D rendering tasks, I suspect the 'V' extension is
probably not the right tool for the job.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<t6htd8$q8c$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=25460&group=comp.arch#25460

Path: i2pn2.org!i2pn.org!aioe.org!EhtdJS5E9ITDZpJm3Uerlg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Tue, 24 May 2022 08:19:25 +0200
Organization: Aioe.org NNTP Server
Message-ID: <t6htd8$q8c$1@gioia.aioe.org>
References: <t6gush$p5u$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="26892"; posting-host="EhtdJS5E9ITDZpJm3Uerlg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.12
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 24 May 2022 06:19 UTC

BGB wrote:
> So, it came up elsewhere, but I can note in my efforts that:
>   Getting GLQuake above 10fps on a 50MHz CPU core is non-trivial.
>
> A dedicated GPU is possible, but I have a practical resource limit (on
> the XC7A100T) of around 15-20 kLUT for something like this.
>
>
> To get the sort of frame-rates I want for Quake, it looks like I am
> going to need a fill-rate of around 5 megapixels/second (or more, more
> is better...), or around 10 clock-cycles per-pixel assuming a 50MHz GPU.
>
> Some of this seems like a stretch...
>   Not quite impossible, but still a bit much.
>
>
> So, high-level thoughts for a GPU ISA:
>   32x 128-bit registers, also addressable as 64x 64-bit (R0..R63);
>   These function as both SIMD registers and GPRs;
>   One or two operations per cycle, depending on operation type;
>     An ALU op may pair with another op;
>     Most other ops are one-off.
>   Would have 2 ALUs, and a low-precision FP-SIMD unit.
>     FP-SIMD would be Binary16 and truncated Binary32;
>     Would ignore the low 8 bits of Binary32 to make FPUs faster/cheaper;
>     DAZ + FTZ + Truncate Only.
>     Would omit Binary64 support entirely (too expensive).
>   Load/Store:
>     Typical sizes (mostly the same as BJX2);
>     Would support a LDTEX instruction:
>       Mashes texture addressing and UTX2 decoding into a single op (1).
>   No MMU or similar.

Have you taken a good, long look at Larrabee (and its list of descendants)?

This was Intel + a few games programmers believing they could start with
a Pentium-class core, improve the second pipe to be much more general
and add a full set of cache-line wide regs for graphics operations.

Presto, a fully programmable cpu that could compete neck&neck with
dedicated GPUs!

It turned out that they couldn't really do this, at least not in a
performance/watt competitive manner, but it does provide a nice
blueprint for what's needed to at least get close.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<t6i553$t7r$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=25462&group=comp.arch#25462

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Tue, 24 May 2022 03:31:25 -0500
Organization: A noiseless patient Spider
Lines: 148
Message-ID: <t6i553$t7r$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 24 May 2022 08:31:31 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="4b225c29e5615bf2924203db3e119084";
logging-data="29947"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX193BhirDcYGayI7DDX/qA0W"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.9.0
Cancel-Lock: sha1:MinOOPRcdxQMsuyxinzx9Zutcew=
In-Reply-To: <t6htd8$q8c$1@gioia.aioe.org>
Content-Language: en-US
 by: BGB - Tue, 24 May 2022 08:31 UTC

On 5/24/2022 1:19 AM, Terje Mathisen wrote:
> BGB wrote:
>> So, it came up elsewhere, but I can note in my efforts that:
>>    Getting GLQuake above 10fps on a 50MHz CPU core is non-trivial.
>>
>> A dedicated GPU is possible, but I have a practical resource limit (on
>> the XC7A100T) of around 15-20 kLUT for something like this.
>>
>>
>> To get the sort of frame-rates I want for Quake, it looks like I am
>> going to need a fill-rate of around 5 megapixels/second (or more, more
>> is better...), or around 10 clock-cycles per-pixel assuming a 50MHz GPU.
>>
>> Some of this seems like a stretch...
>>    Not quite impossible, but still a bit much.
>>
>>
>> So, high-level thoughts for a GPU ISA:
>>    32x 128-bit registers, also addressable as 64x 64-bit (R0..R63);
>>    These function as both SIMD registers and GPRs;
>>    One or two operations per cycle, depending on operation type;
>>      An ALU op may pair with another op;
>>      Most other ops are one-off.
>>    Would have 2 ALUs, and a low-precision FP-SIMD unit.
>>      FP-SIMD would be Binary16 and truncated Binary32;
>>      Would ignore the low 8 bits of Binary32 to make FPUs faster/cheaper;
>>      DAZ + FTZ + Truncate Only.
>>      Would omit Binary64 support entirely (too expensive).
>>    Load/Store:
>>      Typical sizes (mostly the same as BJX2);
>>      Would support a LDTEX instruction:
>>        Mashes texture addressing and UTX2 decoding into a single op (1).
>>    No MMU or similar.
>
> Have you taken a good, long look at Larrabee (and its list of descendants)?
>
> This was Intel + a few games programmers believing they could start with
> a Pentium-class core, improve the second pipe to be much more general
> and add a full set of cache-line wide regs for graphics operations.
>
> Presto, a fully programmable cpu that could compete neck&neck with
> dedicated GPUs!
>
> It turned out that they couldn't really do this, at least not in a
> performance/watt competitive manner, but it does provide a nice
> blueprint for what's needed to at least get close.
>

AFAIK, the Larrabee project mostly gifted the world with the (later
canceled) Xeon Phi cards, and also AVX512?...

While I suspect a 4-wide SIMT core (with 256-bit registers behaving like
64-bit registers at the ISA level) could be "better" as a GPU, this
would most likely not fit with the LUT budget I am imagining.

In any case, one thing I am pretty sure it is not, is a RISC-V core with
the V extension...

What I am imagining here would be in some ways narrower than the BJX2
core, but would omit almost everything much outside of stuff related to
3D rendering tasks.

Goal being also to make SIMD fast at the expense of accuracy:
Instead of what I currently have, namely 10-cycle FP-SIMD ops, the goal
would be 3-cycle fully pipelined FP-SIMD ops.

I guess the alternative is that, rather than try to design a GPU, I
could just try to glue some of this stuff onto the BJX2 core (even if
kinda expensive).

This would most likely take the form of the fast-but-low-precision
FP-SIMD ops, and maybe the LDTEX instruction (since I can't think up a
good way to implement it piecewise).

So, I will need some way to encode the PADDX.F and PMULX.F instructions
with a modifier that says "do it fast". This would be unnecessary for
PADD.H or PMUL.H, since their accuracy wouldn't get any worse than it is
already (they would merely get faster).

The "new" SIMD unit would behave mostly the same as SIMD via the
existing FPU, so it wouldn't really affect the semantics.

Most likely option for this would be sticking an Op64 prefix on these
instructions, and then defining that it behaves roughly the same as on
FMUL/FADD, but then maybe redefining the "Single Precision" rounding
modes to mean "Go fast, screw accuracy."

Looking at it, there are several spots I could put a LDTEX in BJX2:
* FFw0_0Vii-F0nm_0GoB LDTEX 1 ?
* FFw0_0Vii-F0nm_0GoF LDTEX 2 ?

Say:
* FFw0_0Vii-F0nm_0GoB LDTEX (Rm, Ro, sc, Imm9), Rn

With the scale-field serving as a pixel-selector, and the Imm9 encoding
an access format or vector-mode or similar. Though, the Rm field could
still encode the texture size in the high-order bits (since this is a
run-time property).

Don't know how much I could really gain though (since a lot of time will
still be spent running the Quake engine itself).

Did recently at least get a small gain in performance in GLQuake by
adding logic to discard any minuscule polygons (with an area less than 8
units), since these seemed to be mostly irrelevant to "actually
rendering the map" (vs, say, leaving obvious holes in the geometry).

(Where in Quake, a "unit" is approx 1.2 inches).
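The rejection test itself is cheap. A sketch (assuming the polygon has already been projected to 2D map units; the shoelace formula gives area in square units, and the names here are illustrative):

```c
#include <math.h>

/* Shoelace area of a 2D polygon with n vertices. */
static double poly_area(const double *x, const double *y, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        int j = (i + 1) % n;
        sum += x[i] * y[j] - x[j] * y[i];
    }
    return fabs(sum) * 0.5;
}

/* Keep only polygons with at least 3 vertices and non-minuscule area. */
static int keep_polygon(const double *x, const double *y, int n) {
    return n >= 3 && poly_area(x, y, n) >= 8.0;
}
```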

Given Quake already had logic here to filter out colinear vertices, and
drop polygons with fewer than 3 vertices, and there were still a
non-zero number of polygons with (nearly) zero area (which I added some
logic to filter out), I am suspecting QBSP didn't exactly do a whole lot
of "clean up" of the results of its CSG clipping process.

Also experimented with moving part of the renderer over to Binary16 /
"short float" as a footprint-reducing experiment. After some debugging,
it "sorta works". Since most of the maps tend to be relatively close to
the origin, the distortion isn't that noticeable.

Though, for larger maps, for areas further away from the origin, the ULP
is 2 units (so bits of the world geometry would be a bit mashed). Also,
some things look "a little mangled" for reasons I haven't figured out yet.

Then again, "2 inch ULP and crappy dynamic range makes stuff look kinda
mashed and wonky" probably shouldn't be that much of a surprise.

...

Though, S.E8.F15 would likely fare a little better, as the ULP would be
at around 1/16 inch rather than around 2 inch.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=25471&group=comp.arch#25471

X-Received: by 2002:a05:620a:a9c:b0:6a3:8411:6c78 with SMTP id v28-20020a05620a0a9c00b006a384116c78mr7047016qkg.689.1653406399488;
Tue, 24 May 2022 08:33:19 -0700 (PDT)
X-Received: by 2002:ad4:5cc2:0:b0:461:e291:9766 with SMTP id
iu2-20020ad45cc2000000b00461e2919766mr21630681qvb.48.1653406399363; Tue, 24
May 2022 08:33:19 -0700 (PDT)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 24 May 2022 08:33:19 -0700 (PDT)
In-Reply-To: <t6hgk5$5hg$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:f189:f730:a180:c250;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:f189:f730:a180:c250
References: <t6gush$p5u$1@dont-email.me> <66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
<t6hgk5$5hg$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 24 May 2022 15:33:19 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 5234
 by: MitchAlsup - Tue, 24 May 2022 15:33 UTC

On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
> On 5/23/2022 7:08 PM, MitchAlsup wrote:
> > On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
> >
> >> This instruction would only do NEAREST, LINEAR would need to be done
> >> manually.
> >>
> >> Say:
> >> LDTEX (R8, R10, 0), R4
> >> LDTEX (R8, R10, 1), R5
> >> LDTEX (R8, R10, 2), R6
> >> LDTEX (R8, R10, 3), R7
> >> MOVU.L (R13), R3 //Z-Buffer
> >> BLERPS.W R4, R10, R16
> >> BLERPS.W R6, R10, R17
> >> BLERPT.W R16, R10, R2
> >> PMULHU.W R2, R24, R2
> > <
> > Why are the 9 above instructions not a single instruction ?
<
> But... How exactly would I be expected to implement this in a way where
> it works consistently and can still pass timing?...
>
GPUs do it and they do it 32 threads wide.
>
> The accesses will not necessarily fall within the same texture block,
> nor necessarily into adjacent blocks in memory.
<
Yes, I am aware of that. I am also aware that texture is inherently a
memory reference where the address is floating point: the integer
part is used to access the data, and the fractional part is used to LERP
the various datums into a single result.
>
> So, at best I either have:
> Seams at block-edges (if only interpolating intra-block texels);
> 4x the resource cost for the texture fetch;
<
It is more like 9× if you want 3D textures.
> ...
>
>
> LERP latency isn't exactly zero either:
> Simple LERP: C=(A*(~S))+(B*S)
<
If you look at the above correctly, the LERP can be done in 1 pass
through the multiplier. If a bit is set in S you add B otherwise you add
A for all bits in S.
<
> Bi-LERP: C=((A*(~S))+(B*S))*(~T) + ((C*(~S))+(D*S))*T;
>
>
> So, it is multiple ops mostly because I don't think I can make it much
> shorter than this.
>
> As-is, on BJX2, the dependency chain is basically:
> PMORT //suffle
> SHLD //shift
> AND //mask by texture size
> MOV.Q //load
> BLKUTX2 //extract texel from block
> ... //blending math
>
> Collapsing the first 5 into a single operation should at least be an
> improvement.
>
> Well, along with needing to reduce the number of clock-cycles needed for
> PADDX.F and PMULX.F, ...
>
>
>
>
>
> My current thinking is that if ~ 70-80% of the scene is quietly being
> rendered with NEAREST filtering, it is faster, and "not super obvious".
>
> There are "slightly noticeable" boundaries where where stuff moves
> between minimization and magnification cases, or transitions between
> mip-levels, but alas. Full trilinear would help here, but I am not
> really going to have the clock-cycle budget with this to afford trilinear..
>
> Hope is more just to be able to get GLQuake above single-digit framerates..
>
>
> Which as-is, looks like it will effectively require being able to push
> megapixel-per-second fill-rates.
> > <
> >> RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
> >> MOV.W?T R2, (R12) //Color Write
> >> MOV.L?T R14, (R13) //Z Write
> >> Step and Loop
> > --------------------------------
> >>
> >> Any thoughts? ...
> >
> > See above.
> My design was partially a counter point where people were thinking:
> Yeah, RISC-V with the 'V' extension, totally makes sense as a GPU;
> Me thinking: That's probably gonna *suuuck*.
>
CPUs with general purpose ISA do suuuuckkkkk at graphics.
>
> Well, unless one imagines the GPU as basically a glorified coprocessor
> for running TensorFlow or similar, it would probably work pretty good
> for this.
>
> But, for actual 3D rendering tasks, I suspect the 'V' extension is
> probably not the right tool for the job.
<
GPUs are designed the way they are so they don't suuuccckkk at graphics.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<t6j9gs$kta$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25475&group=comp.arch#25475

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Tue, 24 May 2022 13:52:06 -0500
In-Reply-To: <25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>
References: <t6gush$p5u$1@dont-email.me>
 <66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
 <t6hgk5$5hg$1@dont-email.me>
 <25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>
 by: BGB - Tue, 24 May 2022 18:52 UTC

On 5/24/2022 10:33 AM, MitchAlsup wrote:
> On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
>> On 5/23/2022 7:08 PM, MitchAlsup wrote:
>>> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
>>>
>>>> This instruction would only do NEAREST, LINEAR would need to be done
>>>> manually.
>>>>
>>>> Say:
>>>> LDTEX (R8, R10, 0), R4
>>>> LDTEX (R8, R10, 1), R5
>>>> LDTEX (R8, R10, 2), R6
>>>> LDTEX (R8, R10, 3), R7
>>>> MOVU.L (R13), R3 //Z-Buffer
>>>> BLERPS.W R4, R10, R16
>>>> BLERPS.W R6, R10, R17
>>>> BLERPT.W R16, R10, R2
>>>> PMULHU.W R2, R24, R2
>>> <
>>> Why are the 9 above instructions not a single instruction ?
> <
>> But... How exactly would I be expected to implement this in a way where
>> it works consistently and can still pass timing?...
>>
> GPUs do it and they do it 32 threads wide.

But, most don't need to also fit on an XC7A100T or similar while still
leaving room for a CPU core.

Could probably do a little more with an XC7A200T.

Will state that a GPU core is probably not viable on either an XC7A35T
or XC7S50, as pretty much all these LUTs are likely to be much better
spent on the main CPU.

>>
>> The accesses will not necessarily fall within the same texture block,
>> nor necessarily into adjacent blocks in memory.
> <
> Yes, I am aware of that, I am also aware that texture is inherently a
> memory reference where the address is floating point, the integer
> part is used to access the data, the fractional part is used to LERP
> the various datums into a single result.

If they were all in the same block, then fetch could be cheaper:
Fetch 1 block, then decode multiple texels from the same block.

But, needing multiple fetches kinda ruins things here.

In this case, a version which has a seam every 4 texels would be worse
than one which only does nearest interpolation.

>>
>> So, at best I either have:
>> Seams at block-edges (if only interpolating intra-block texels);
>> 4x the resource cost for the texture fetch;
> <
> It is more like 9× if you want 3D textures.

Pretty much no one used these AFAIK.

My thinking was, say, if one wanted hardware that could do:
GL_LINEAR
And:
GL_LINEAR_MIPMAP_NEAREST

My thinking at the moment is that, effectively, LDTEX would do
GL_NEAREST with an extra performance cost for GL_LINEAR and similar.

For the GLQuake port, the filtering mode can be set as:
GL_LINEAR, GL_NEAREST_MIPMAP_NEAREST
Which is computationally cheaper.

>> ...
>>
>>
>> LERP latency isn't exactly zero either:
>> Simple LERP: C=(A*(~S))+(B*S)
> <
> If you look at the above correctly, the LERP can be done in 1 pass
> through the multiplier. If a bit is set in S you add B otherwise you add
> A for all bits in S.

Could work, but not with DSP48's.

Doing it with discrete adders is not likely to scale very well, but it
could maybe be done if S/T were reduced to around 2 or 3 bits (like with
the 3/8 + 5/8 blending used in UTX2, since 3/8 and 5/8 are "close
enough" to 1/3 and 2/3).

DSP48's sorta work well for what they do, but have a fairly high
latency, so trying to daisy-chain multiple DSP48's within a single cycle
isn't good.
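
The 3/8 + 5/8 blend mentioned above is cheap precisely because it reduces
to shifts and a small adder tree, with no multiplier (and hence no DSP48)
in the path. A rough per-channel sketch, with an illustrative function name:

```c
#include <stdint.h>

/* blend = (3*a + 5*b) / 8, computed with shifts and adds only:
   3*a = (a<<1) + a, and 5*b = (b<<2) + b. In hardware this is a
   few adders rather than a multiply, so it fits in one cycle. */
static uint32_t blend38(uint32_t a, uint32_t b)
{
    return ((a << 1) + a + (b << 2) + b) >> 3;
}
```

This is the sense in which 3/8 and 5/8 stand in for 1/3 and 2/3: the ratio
is close, but the denominator is a power of two, so the divide is a shift.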

> <
>> Bi-LERP: C=((A*(~S))+(B*S))*(~T) + ((C*(~S))+(D*S))*T;
>>

Note, I could do this, just the instruction would be more expensive and
would likely need to have a multi-cycle latency.

There is also a cost cutting trick to reduce the number of texel
fetches, say:
LDTEX (R8, R10, 0), R4
LDTEX (R8, R10, 1), R5
LDTEX (R8, R10, 2), R7
MOVU.L (R13), R3 //Z-Buffer
MOV R4, R6 ||
BLERPS.W R4, R10, R16
BLERPT.W R6, R10, R17
PAVG.W R16, R17, R2
...

Or:
PADD.W R16, R17, R2
PSUB.W R2, R4, R2

But, this has a problem, namely that it can overflow and cause obvious
artifacts.

One option here I guess could be to consider either:
A combined instruction which has higher dynamic range and clamps the result;
Or, an instruction which effectively implements a Paeth filter.

Or, not bother, and stick with using an average.
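
The overflow problem with the PADD/PSUB variant can be seen with plain
integer math. Per channel, the three-texel trick computes
lerp(A,B,s) + lerp(A,C,t) - A, which matches a full bilinear result exactly
when the fourth texel satisfies D = B + C - A, but the sum can leave the
0..255 channel range. This is a hypothetical per-channel sketch, not the
actual BJX2 SIMD semantics (and it assumes non-negative intermediates, so
the >> 8 is safe):

```c
/* 8-bit linear interpolation, s in 0..255. */
static int lerp8(int a, int b, int s)
{
    return a + (((b - a) * s) >> 8);
}

/* Three-texel approximation via add/sub: can overflow the channel. */
static int blerp_addsub(int a, int b, int c, int s, int t)
{
    return lerp8(a, b, s) + lerp8(a, c, t) - a;
}

/* Three-texel approximation via average: stays in range, matching the
   PAVG.W variant above, at the cost of halving each axis contribution. */
static int blerp_avg(int a, int b, int c, int s, int t)
{
    return (lerp8(a, b, s) + lerp8(a, c, t)) >> 1;
}
```

With A=0, B=C=255, s=t=255, the add/sub form yields 508, well outside
0..255 (hence the obvious artifacts), while the averaged form yields 254.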

>>
>> So, it is multiple ops mostly because I don't think I can make it much
>> shorter than this.
>>
>> As-is, on BJX2, the dependency chain is basically:
>> PMORT //suffle
>> SHLD //shift
>> AND //mask by texture size
>> MOV.Q //load
>> BLKUTX2 //extract texel from block
>> ... //blending math
>>
>> Collapsing the first 5 into a single operation should at least be an
>> improvement.
>>
>> Well, along with needing to reduce the number of clock-cycles needed for
>> PADDX.F and PMULX.F, ...
>>
>>
>>
>>
>>
>> My current thinking is that if ~ 70-80% of the scene is quietly being
>> rendered with NEAREST filtering, it is faster, and "not super obvious".
>>
>> There are "slightly noticeable" boundaries where where stuff moves
>> between minimization and magnification cases, or transitions between
>> mip-levels, but alas. Full trilinear would help here, but I am not
>> really going to have the clock-cycle budget with this to afford trilinear.
>>
>> Hope is more just to be able to get GLQuake above single-digit framerates.
>>
>>
>> Which as-is, looks like it will effectively require being able to push
>> megapixel-per-second fill-rates.
>>> <
>>>> RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
>>>> MOV.W?T R2, (R12) //Color Write
>>>> MOV.L?T R14, (R13) //Z Write
>>>> Step and Loop
>>> --------------------------------
>>>>
>>>> Any thoughts? ...
>>>
>>> See above.
>> My design was partially a counter point where people were thinking:
>> Yeah, RISC-V with the 'V' extension, totally makes sense as a GPU;
>> Me thinking: That's probably gonna *suuuck*.
>>
> CPUs with general purpose ISA do suuuuckkkkk at graphics.

Yeah.

Though, they can be helped along with helper ops for pixel pack/unpack,
compressed texture decode, and interpolation/blending operations.

I suspect this is partly how, despite being "not very fast", the BJX2
core's OpenGL rasterization gets nearly 3x the fill-rate per clock cycle
when compared to running similar logic on my Ryzen...

By most other metrics, the Ryzen beats it hard on per-clock-cycle
performance (the Ryzen seems to get roughly 6x the performance,
per-clock-cycle, for more general-purpose benchmarks).

>>
>> Well, unless one imagines the GPU as basically a glorified coprocessor
>> for running TensorFlow or similar, it would probably work pretty good
>> for this.
>>
>> But, for actual 3D rendering tasks, I suspect the 'V' extension is
>> probably not the right tool for the job.
> <
> GPUs are designed the way they are so they don't suuuccckkk at graphics.

Granted...

Or, if I were to do it for RISC-V, I would do it a bit differently:
First off, add scaled-index Load/Store;
Add the relevant helper instructions;
RGBA Pack/Unpack;
Compressed texture stuff;
...
SIMD, lots of SIMD.

Streaming floating-point vectors to/from memory (with a stride) through
the FPU via pipelining, that is something, but it just does not fit what
is needed for a rasterizer.

Doesn't matter how big the "vector registers" are internally, it isn't
going to do what is needed for 3D rendering tasks.

.... Like, do they just sorta think that the GPU is sitting around doing
giant matrix multiplies ?...

I don't really get it.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<29370e99-7b21-4473-8076-0509a6c54f9dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=25476&group=comp.arch#25476

Newsgroups: comp.arch
From: MitchAl...@aol.com (MitchAlsup)
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Tue, 24 May 2022 12:13:11 -0700 (PDT)
In-Reply-To: <t6j9gs$kta$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
 <t6hgk5$5hg$1@dont-email.me> <25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>
 <t6j9gs$kta$1@dont-email.me>
 by: MitchAlsup - Tue, 24 May 2022 19:13 UTC

On Tuesday, May 24, 2022 at 1:52:16 PM UTC-5, BGB wrote:
> On 5/24/2022 10:33 AM, MitchAlsup wrote:
> > On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
> >> On 5/23/2022 7:08 PM, MitchAlsup wrote:
> >>> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
> >>>
> >>>> This instruction would only do NEAREST, LINEAR would need to be done
> >>>> manually.
> >>>>
> >>>> Say:
> >>>> LDTEX (R8, R10, 0), R4
> >>>> LDTEX (R8, R10, 1), R5
> >>>> LDTEX (R8, R10, 2), R6
> >>>> LDTEX (R8, R10, 3), R7
> >>>> MOVU.L (R13), R3 //Z-Buffer
> >>>> BLERPS.W R4, R10, R16
> >>>> BLERPS.W R6, R10, R17
> >>>> BLERPT.W R16, R10, R2
> >>>> PMULHU.W R2, R24, R2
> >>> <
> >>> Why are the 9 above instructions not a single instruction ?
> > <
> >> But... How exactly would I be expected to implement this in a way where
> >> it works consistently and can still pass timing?...
> >>
> > GPUs do it and they do it 32 threads wide.
> But, most don't need to also fit on an XC7A100T or similar while still
> leaving room for a CPU core.
>
> Could probably do a little more with an XC7A200T.
>
>
> Will state that a GPU core is probably not viable on either an XC7A35T
> or XC7S50, as pretty much all these LUTs are likely to be much better
> spent on the main CPU.
> >>
> >> The accesses will not necessarily fall within the same texture block,
> >> nor necessarily into adjacent blocks in memory.
> > <
> > Yes, I am aware of that, I am also aware that texture is inherently a
> > memory reference where the address is floating point, the integer
> > part is used to access the data, the fractional part is used to LERP
> > the various datums into a single result.
> If they were all the same block, then fetch could be cheaper:
> Fetch 1 block, then decode multiple textels from the same block.
>
> But, needing multiple fetches kinda ruins things here.
<
The Samsung GPU I worked on had a Texture unit that could perform
9 fetches per cycle to feed the 3D LERPer. It also had an interpolator
that performed the (up to) 29 FP operations per cycle (pipelined), and
a "front end" that would read the input stream and be able to create
a new "warp" every 8 cycles.
<
All on a "per instance" basis, and we were using 8 instances.
<
Imagine a CPU performing 32 texel fetches every 8 cycles !!
Just try to imagine !!
>
> In this case, a version which has a seam every 4 texels would be worse
> than one which only does nearest interpolation.
> >>
> >> So, at best I either have:
> >> Seams at block-edges (if only interpolating intra-block texels);
> >> 4x the resource cost for the texture fetch;
> > <
> > It is more like 9× if you want 3D textures.
> Pretty much no one used these AFAIK.
<
Maybe in DOOM and Quake... but several graphics programs used textures in
the vertex shader to do "interesting physics".
>
> My thinking was, say, if one wanted hardware that could do:
> GL_LINEAR
> And:
> GL_LINEAR_MIPMAP_NEAREST
>
> My thinking at the moment is that, effectively, LDTEX would do
> GL_NEAREST with an extra performance cost for GL_LINEAR and similar.
>
> For the GLQuake port, the filtering mode can be set as:
> GL_LINEAR, GL_NEAREST_MIPMAP_NEAREST
> Which is computationally cheaper.
> >> ...
> >>
> >>
> >> LERP latency isn't exactly zero either:
> >> Simple LERP: C=(A*(~S))+(B*S)
> > <
> > If you look at the above correctly, the LERP can be done in 1 pass
> > through the multiplier. If a bit is set in S you add B otherwise you add
> > A for all bits in S.
> Could work, but not with DSP48's.
<
See what happens when you don't have access to the raw gates !!
>
> Doing it with discrete adders but this in is not likely to scale very
> well, but could maybe be done if S/T were reduced to around 2 or 3 bits
> (like with the 3/8 + 5/8 blending used in UTX2, since 3/8 and 5/8 are
> "close enough" to 1/3 and 2/3).
>
> DSP48's sorta work well for what they do, but have a fairly high
> latency, so trying to daisy-chain multiple DSP48's within a single cycle
> isn't good.
>
>
> > <
> >> Bi-LERP: C=((A*(~S))+(B*S))*(~T) + ((C*(~S))+(D*S))*T;
> >>
>
> Note, I could do this, just the instruction would be more expensive and
> would likely need to have a multi-cycle latency.
>
>
> There is also a cost cutting trick to reduce the number of texel
> fetches, say:
> LDTEX (R8, R10, 0), R4
> LDTEX (R8, R10, 1), R5
> LDTEX (R8, R10, 2), R7
> MOVU.L (R13), R3 //Z-Buffer
> MOV R4, R6 ||
> BLERPS.W R4, R10, R16
> BLERPT.W R6, R10, R17
> PAVG.W R16, R17, R2
> ...
>
> Or:
> PADD.W R16, R17, R2
> PSUB.W R2, R4, R2
>
> But, this has a problem, namely that it can overflow and cause obvious
> artifacts.
>
> One option here I guess could be to consider either:
> A combined instruction which has higher dynamic range and clamps the result;
> Or, an instruction which effectively implements a Paeth filter.
>
> Or, not bother, and stick with using an average.
> >>
> >> So, it is multiple ops mostly because I don't think I can make it much
> >> shorter than this.
> >>
> >> As-is, on BJX2, the dependency chain is basically:
> >> PMORT //suffle
> >> SHLD //shift
> >> AND //mask by texture size
> >> MOV.Q //load
> >> BLKUTX2 //extract texel from block
> >> ... //blending math
> >>
> >> Collapsing the first 5 into a single operation should at least be an
> >> improvement.
> >>
> >> Well, along with needing to reduce the number of clock-cycles needed for
> >> PADDX.F and PMULX.F, ...
> >>
> >>
> >>
> >>
> >>
> >> My current thinking is that if ~ 70-80% of the scene is quietly being
> >> rendered with NEAREST filtering, it is faster, and "not super obvious"..
> >>
> >> There are "slightly noticeable" boundaries where where stuff moves
> >> between minimization and magnification cases, or transitions between
> >> mip-levels, but alas. Full trilinear would help here, but I am not
> >> really going to have the clock-cycle budget with this to afford trilinear.
> >>
> >> Hope is more just to be able to get GLQuake above single-digit framerates.
> >>
> >>
> >> Which as-is, looks like it will effectively require being able to push
> >> megapixel-per-second fill-rates.
> >>> <
> >>>> RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
> >>>> MOV.W?T R2, (R12) //Color Write
> >>>> MOV.L?T R14, (R13) //Z Write
> >>>> Step and Loop
> >>> --------------------------------
> >>>>
> >>>> Any thoughts? ...
> >>>
> >>> See above.
> >> My design was partially a counter point where people were thinking:
> >> Yeah, RISC-V with the 'V' extension, totally makes sense as a GPU;
> >> Me thinking: That's probably gonna *suuuck*.
> >>
> > CPUs with general purpose ISA do suuuuckkkkk at graphics.
> Yeah.
>
> Though, they can be helped along with helper ops for pixel pack/unpack,
> compressed texture decode, and interpolation/blending operations.
<
Why do these as instructions ?? Why not do them as a coprocessor
optimized for the task at hand ??
>
> I suspect this is partly how, despite being "not very fast", OpenGL
> rasterization gets nearly 3x the fill-rate per-clock-cycle when compared
> to running similar logic on my Ryzen...
>
>
> By most other metrics, the Ryzen beats it hard on per-clock-cycle
> performance (the Ryzen seems to get roughly 6x the performance,
> per-clock-cycle, for more general-purpose benchmarks).
> >>
> >> Well, unless one imagines the GPU as basically a glorified coprocessor
> >> for running TensorFlow or similar, it would probably work pretty good
> >> for this.
> >>
> >> But, for actual 3D rendering tasks, I suspect the 'V' extension is
> >> probably not the right tool for the job.
> > <
> > GPUs are designed the way they are so they don't suuuccckkk at graphics..
> Granted...
>
>
> Or if I were to do it for RISC-V, I would do it a bit different:
> First off, add scaled-index Load/Store;
> Add the relevant helper instructions;
> RGBA Pack/Unpack;
> Compressed texture stuff;
> ...
> SIMD, lots of SIMD.
<
SIMT, not SIMD.
>
>
> Streaming floating-point vectors to/from memory (with a stride) through
> the FPU via pipelining, that is something, just does not fit what is
> needed for a rasterizer.
<
Which is why GPUs are not vector processors.
>
> Doesn't matter how big the "vector registers" are internally, it isn't
> going to do what is needed for 3D rendering tasks.
>
>
> ... Like, do they just sorta think that the GPU is sitting around doing
> giant matrix multiplies ?...
>
> I don't really get it.
<
½ of the FP ops/sec in a real GPU are in the rasterizer, interpolator, and
texture units. The programmable engine has the other ½.


Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<t6k5bc$984$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=25480&group=comp.arch#25480

Newsgroups: comp.arch
From: cr88...@gmail.com (BGB)
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Tue, 24 May 2022 21:47:06 -0500
In-Reply-To: <29370e99-7b21-4473-8076-0509a6c54f9dn@googlegroups.com>
References: <t6gush$p5u$1@dont-email.me>
 <66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
 <t6hgk5$5hg$1@dont-email.me>
 <25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>
 <t6j9gs$kta$1@dont-email.me>
 <29370e99-7b21-4473-8076-0509a6c54f9dn@googlegroups.com>
 by: BGB - Wed, 25 May 2022 02:47 UTC

On 5/24/2022 2:13 PM, MitchAlsup wrote:
> On Tuesday, May 24, 2022 at 1:52:16 PM UTC-5, BGB wrote:
>> On 5/24/2022 10:33 AM, MitchAlsup wrote:
>>> On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
>>>> On 5/23/2022 7:08 PM, MitchAlsup wrote:
>>>>> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
>>>>>
>>>>>> This instruction would only do NEAREST, LINEAR would need to be done
>>>>>> manually.
>>>>>>
>>>>>> Say:
>>>>>> LDTEX (R8, R10, 0), R4
>>>>>> LDTEX (R8, R10, 1), R5
>>>>>> LDTEX (R8, R10, 2), R6
>>>>>> LDTEX (R8, R10, 3), R7
>>>>>> MOVU.L (R13), R3 //Z-Buffer
>>>>>> BLERPS.W R4, R10, R16
>>>>>> BLERPS.W R6, R10, R17
>>>>>> BLERPT.W R16, R10, R2
>>>>>> PMULHU.W R2, R24, R2
>>>>> <
>>>>> Why are the 9 above instructions not a single instruction ?
>>> <
>>>> But... How exactly would I be expected to implement this in a way where
>>>> it works consistently and can still pass timing?...
>>>>
>>> GPUs do it and they do it 32 threads wide.
>> But, most don't need to also fit on an XC7A100T or similar while still
>> leaving room for a CPU core.
>>
>> Could probably do a little more with an XC7A200T.
>>
>>
>> Will state that a GPU core is probably not viable on either an XC7A35T
>> or XC7S50, as pretty much all these LUTs are likely to be much better
>> spent on the main CPU.
>>>>
>>>> The accesses will not necessarily fall within the same texture block,
>>>> nor necessarily into adjacent blocks in memory.
>>> <
>>> Yes, I am aware of that, I am also aware that texture is inherently a
>>> memory reference where the address is floating point, the integer
>>> part is used to access the data, the fractional part is used to LERP
>>> the various datums into a single result.
>> If they were all the same block, then fetch could be cheaper:
>> Fetch 1 block, then decode multiple textels from the same block.
>>
>> But, needing multiple fetches kinda ruins things here.
> <
> The Samsung GPU I worked on had a Texture unit that could perform
> 9 fetches per cycle to feed the 3D LERPer. It also had an interpolator
> that performed the (up to) 29 FP operations per cycle (pipelined), and
> a "front end" that would read the input stream and be able to create
> a new "warp" every 8 cycles.
> <
> All on a "per instance" basis, and we were using 8 instances.
> <
> Imagine a CPU performing 32 excels() every 8 cycles !!
> Just try to imagine !!

Can be done, would be expensive...

I have decided to try adding an LDTEX instruction to my BJX2 core.

It will still only do a single NEAREST fetch, probably Morton only for now.

Then I realized, while starting to write the Verilog, that I could in
fact support a subset of non-square textures with Morton:
Say, 512x256 is the same as 512x512 but with one less bit in the bit mask.

The same trick could support 256x512 textures via flipping the S/T
coordinates (likely while setting up for drawing individual primitives;
say there is a per texture "Flip S/T" flag, along with the "this texture
is Morton" flag).

This is probably good to note, as these cases represent the vast
majority of non-square textures (long skinny textures being comparably
rare). For the era of OpenGL I am aiming to support here,
non-power-of-2 textures were not yet a thing.
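
The non-square trick above can be sketched in C: interleave S into the even
bits and T into the odd bits, then mask the result with (width*height - 1).
For 512x512 the mask keeps 18 bits; dropping one mask bit clears the top bit
of T, which is exactly a 512x256 layout, and 256x512 is handled by flipping
S/T first. A sketch with illustrative helper names (the bit-spreading steps
are the standard "binary magic numbers" interleave):

```c
#include <stdint.h>

/* Spread the low 16 bits of x so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t x)
{
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* Morton index: S in the even bits, T in the odd bits. */
static uint32_t morton2(uint32_t s, uint32_t t)
{
    return part1by1(s) | (part1by1(t) << 1);
}

/* Texel index for a w x h texture with h == w or h == w/2 (powers of 2).
   The mask w*h-1 is the run-time "texture size" property: masking one
   fewer bit silently wraps T, giving the half-height layout. */
static uint32_t texel_index(uint32_t s, uint32_t t, uint32_t w, uint32_t h)
{
    return morton2(s, t) & (w * h - 1);
}
```

So a single mask register covers both the square and the W x W/2 cases, which
is why only the "Flip S/T" flag is needed on top of the Morton flag.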

>>
>> In this case, a version which has a seam every 4 texels would be worse
>> than one which only does nearest interpolation.
>>>>
>>>> So, at best I either have:
>>>> Seams at block-edges (if only interpolating intra-block texels);
>>>> 4x the resource cost for the texture fetch;
>>> <
>>> It is more like 9× if you want 3D textures.
>> Pretty much no one used these AFAIK.
> <
> Maybe in DOOM and Quake......but several graphics programs used textures in
> the vertex shader to do "interesting physics".

OK.

I don't yet have any immediate plans for shaders, or if I do support
them, they are likely to be via an ugly hack of using offline batch
compilation.

Partly this is that I have concerns that having a GLSL compiler would
eat too much RAM, at least for anything "competent" (a compiler which
spits out call-threaded-code or similar would be unsuitable for this
use-case).

For the time being, OpenGL 1.x would be sufficient...

Well, and there is pretty much no way Doom 3 or similar would have any
real hope of running on my BJX2 core anyways (also Doom 3 would require
a compiler with C++ support, ...).

>>
>> My thinking was, say, if one wanted hardware that could do:
>> GL_LINEAR
>> And:
>> GL_LINEAR_MIPMAP_NEAREST
>>
>> My thinking at the moment is that, effectively, LDTEX would do
>> GL_NEAREST with an extra performance cost for GL_LINEAR and similar.
>>
>> For the GLQuake port, the filtering mode can be set as:
>> GL_LINEAR, GL_NEAREST_MIPMAP_NEAREST
>> Which is computationally cheaper.
>>>> ...
>>>>
>>>>
>>>> LERP latency isn't exactly zero either:
>>>> Simple LERP: C=(A*(~S))+(B*S)
>>> <
>>> If you look at the above correctly, the LERP can be done in 1 pass
>>> through the multiplier. If a bit is set in S you add B otherwise you add
>>> A for all bits in S.
>> Could work, but not with DSP48's.
> <
> See what happens when you don't have access to the raw gates !!

I can't change what the FPGA is doing here...

I either live with the inflexibility of the DSP48 here, or try to define
it full-custom in Verilog, and then have it turn out to be considerably
more expensive and slower.

I have still been paying for not wanting to just give in and use
Vivado's MIG and talk to VRAM via an AXI bus.

And then I get annoyed that most of the information around seems
intended more to "sell" the idea of why someone should use an AXI bus
than to be of any actual help in actually using it. Or, if they do say
anything, it is more along the lines of "Generate components using the
IP wizard from the IP catalog" type stuff, which is kinda not useful for
someone who is writing a CPU core in Verilog, rather than doing what
Xilinx apparently intends, and using the FPGA mostly like a collection
of high-level prefab lego blocks.

>>
>> Doing it with discrete adders but this in is not likely to scale very
>> well, but could maybe be done if S/T were reduced to around 2 or 3 bits
>> (like with the 3/8 + 5/8 blending used in UTX2, since 3/8 and 5/8 are
>> "close enough" to 1/3 and 2/3).
>>
>> DSP48's sorta work well for what they do, but have a fairly high
>> latency, so trying to daisy-chain multiple DSP48's within a single cycle
>> isn't good.
>>
>>
>>> <
>>>> Bi-LERP: C=((A*(~S))+(B*S))*(~T) + ((C*(~S))+(D*S))*T;
>>>>
>>
>> Note, I could do this, just the instruction would be more expensive and
>> would likely need to have a multi-cycle latency.
>>
>>
>> There is also a cost cutting trick to reduce the number of texel
>> fetches, say:
>> LDTEX (R8, R10, 0), R4
>> LDTEX (R8, R10, 1), R5
>> LDTEX (R8, R10, 2), R7
>> MOVU.L (R13), R3 //Z-Buffer
>> MOV R4, R6 ||
>> BLERPS.W R4, R10, R16
>> BLERPT.W R6, R10, R17
>> PAVG.W R16, R17, R2
>> ...
>>
>> Or:
>> PADD.W R16, R17, R2
>> PSUB.W R2, R4, R2
>>
>> But, this has a problem, namely that it can overflow and cause obvious
>> artifacts.
>>
>> One option here I guess could be to consider either:
>> A combined instruction which has higher dynamic range and clamps the result;
>> Or, an instruction which effectively implements a Paeth filter.
>>
>> Or, not bother, and stick with using an average.
>>>>
>>>> So, it is multiple ops mostly because I don't think I can make it much
>>>> shorter than this.
>>>>
>>>> As-is, on BJX2, the dependency chain is basically:
>>>> PMORT //shuffle
>>>> SHLD //shift
>>>> AND //mask by texture size
>>>> MOV.Q //load
>>>> BLKUTX2 //extract texel from block
>>>> ... //blending math
>>>>
>>>> Collapsing the first 5 into a single operation should at least be an
>>>> improvement.
>>>>
>>>> Well, along with needing to reduce the number of clock-cycles needed for
>>>> PADDX.F and PMULX.F, ...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> My current thinking is that if ~ 70-80% of the scene is quietly being
>>>> rendered with NEAREST filtering, it is faster, and "not super obvious".
>>>>
>>>> There are "slightly noticeable" boundaries where stuff moves
>>>> between minimization and magnification cases, or transitions between
>>>> mip-levels, but alas. Full trilinear would help here, but I am not
>>>> really going to have the clock-cycle budget with this to afford trilinear.
>>>>
>>>> Hope is more just to be able to get GLQuake above single-digit framerates.
>>>>
>>>>
>>>> Which as-is, looks like it will effectively require being able to push
>>>> megapixel-per-second fill-rates.
>>>>> <
>>>>>> RGB5PCK R2, R2 | GMPGT R14, R3 //Z test
>>>>>> MOV.W?T R2, (R12) //Color Write
>>>>>> MOV.L?T R14, (R13) //Z Write
>>>>>> Step and Loop
>>>>> --------------------------------
>>>>>>
>>>>>> Any thoughts? ...
>>>>>
>>>>> See above.
>>>> My design was partially a counterpoint to people thinking:
>>>> Yeah, RISC-V with the 'V' extension, totally makes sense as a GPU;
>>>> Me thinking: That's probably gonna *suuuck*.
>>>>
>>> CPUs with general purpose ISA do suuuuckkkkk at graphics.
>> Yeah.
>>
>> Though, they can be helped along with helper ops for pixel pack/unpack,
>> compressed texture decode, and interpolation/blending operations.
> <
> Why do these as instructions ?? Why not do them as a CoProcessor ??
> optimized for the task at hand.


Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<0a9765d5-f885-4109-9ba8-69430513d05fn@googlegroups.com>

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Sun, 24 Jul 2022 10:28 UTC

On Monday, May 23, 2022 at 10:38:28 PM UTC+1, BGB wrote:

> Any thoughts? ...

1) nyuzi. jeff bush's work is amazingly clear

https://jbush001.github.io/
https://github.com/jbush001/NyuziProcessor
http://www.cs.binghamton.edu/~millerti/nyuziraster.pdf
https://github.com/jbush001/ChiselGPU

jeff was extraordinarily helpful in making sure i did not
make a hell of a lot of mistakes. reading his work and
blog is pretty much crucial. he also specifically went
to the trouble of reaching out to the Larrabee team for
insights.

2) you'll find that even a 48-bit ISA is hopelessly inadequate.
between 10 and 30% of shader binaries involve swizzle
(XYZW) which, on 2 source operands, takes a minimum of
16 immediate bits.

in SVP64 we compromised with a sv.mv.swiz instruction
https://libre-soc.org/openpower/sv/mv.swizzle/

this avoids the need for the same explicit double-swizzle
immediates on 64-bit source operations, at a price.
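the 16-bit figure follows directly from the encoding: a full XYZW swizzle needs 2 bits per destination lane, so 8 bits per source operand, hence 16 bits for two swizzled sources. a rough sketch in Python (the packing order and names here are illustrative, not SVP64's actual encoding):

```python
def swizzle(vec4, sel):
    """Apply an XYZW swizzle encoded as 2 bits per destination lane.

    sel packs four 2-bit lane selectors (8 bits total), lane 0 in
    the low bits; 0b11100100 selects W,Z,Y,X -> the identity."""
    return [vec4[(sel >> (2 * i)) & 0b11] for i in range(4)]

IDENTITY = 0b11100100   # lane selectors 3,2,1,0 -> no change
WZYX     = 0b00011011   # lane selectors 0,1,2,3 -> full reverse

v = ['x', 'y', 'z', 'w']
assert swizzle(v, IDENTITY) == ['x', 'y', 'z', 'w']
assert swizzle(v, WZYX) == ['w', 'z', 'y', 'x']

# Two independently swizzled source operands: 2 * 8 = 16 immediate bits.
```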

3) RVV or any other Cray-Style Vector ISA requiring strict compliance
with IEEE754 FP accuracy will punish you with a 400% power/area
penalty compared to modern 3D-optimised GPUs which, as explicitly
spelled out and allowed in the Vulkan(tm) Spec by the Khronos
Group, are permitted significant accuracy reductions. Mitch
will (or will have already) filled you in on that.

4) read as many 3D GPU ISAs as you can get your hands on.
here's the ones i know of, if you find any more please add them
because that's a publicly-editable wiki, you don't need "permission"
to edit it:

https://libre-soc.org/openpower/sv/vector_isa_comparison/

skipping forwards to the latest message(s) i do note with some
relief that it's already been observed "Vector ISA != GPU ISA".

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com>

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Sun, 24 Jul 2022 22:41 UTC

On Tuesday, May 24, 2022 at 7:19:24 AM UTC+1, Terje Mathisen wrote:

> Have you taken a good, long look at Larrabee (and its list of descendants)?

AVX-512 *shudder*

this is an absolutely brilliant, funny, and insightful talk:
https://media.handmade-seattle.com/tom-forsyth/

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<166769ba-7b37-4b0d-837e-da5c4ffd5f25n@googlegroups.com>

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Sun, 24 Jul 2022 22:49 UTC

On Tuesday, May 24, 2022 at 7:52:16 PM UTC+1, BGB wrote:
> On 5/24/2022 10:33 AM, MitchAlsup wrote:
> > On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
> >> On 5/23/2022 7:08 PM, MitchAlsup wrote:
> >>> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
> >>>
> >>>> This instruction would only do NEAREST, LINEAR would need to be done
> >>>> manually.
> >>>>
> >>>> Say:
> >>>> LDTEX (R8, R10, 0), R4
> >>>> LDTEX (R8, R10, 1), R5
> >>>> LDTEX (R8, R10, 2), R6
> >>>> LDTEX (R8, R10, 3), R7
> >>>> MOVU.L (R13), R3 //Z-Buffer
> >>>> BLERPS.W R4, R10, R16
> >>>> BLERPS.W R6, R10, R17
> >>>> BLERPT.W R16, R10, R2
> >>>> PMULHU.W R2, R24, R2
> >>> <
> >>> Why are the 9 above instructions not a single instruction ?
> > <
> >> But... How exactly would I be expected to implement this in a way where
> >> it works consistently and can still pass timing?...
> >>
> > GPUs do it and they do it 32 threads wide.
> But, most don't need to also fit on an XC7A100T or similar while still
> leaving room for a CPU core.
>
> Could probably do a little more with an XC7A200T.

nexys video is pretty damn good
https://www.mouser.co.uk/new/digilent/digilent-nexys-board/

i've got a couple of these so as to avoid wasting 25% of the LUTs on DDR3
https://1bitsquared.de/products/pmod-hyperram

i'll be adding nexys_video to nmigen as soon as i have time
https://gitlab.com/nmigen/nmigen-boards/-/issues/1

should work perfectly with nextpnr-xilinx although our install
scripts need some extra lines to compile the xc7200t.bba

https://libre-soc.org/HDL_workflow/nextpnr-xilinx/

confirmed nextpnr-xilinx is perfectly functional with the Arty A7-100t
although do avoid using CARRY4 blocks for now (synth_xilinx -nocarry
in yosys)

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<cb920ad4-945d-437b-9dc1-e13407cddb15n@googlegroups.com>

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Sun, 24 Jul 2022 22:54 UTC

On Wednesday, May 25, 2022 at 3:47:12 AM UTC+1, BGB wrote:

> I don't yet have any immediate plans for shaders, or if I do support
> them, they are likely to be via an ugly hack of using offline batch
> compilation.

as you appear to be following pretty much exactly the path
that we envisioned 3+ years ago (Hybrid CPU-VPU-GPU) it
may be worthwhile looking up what we planned in this area.

https://bugs.libre-soc.org/show_bug.cgi?id=140

the sort-of-software-only-but-not-really MESA 3D driver
was started

https://git.libre-soc.org/?p=mesa.git;a=shortlog;h=refs/heads/libresoc_dev

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbkr0u$v2et$1@dont-email.me>

Newsgroups: comp.arch
 by: BGB - Mon, 25 Jul 2022 01:18 UTC

On 7/24/2022 5:49 PM, luke.l...@gmail.com wrote:
> On Tuesday, May 24, 2022 at 7:52:16 PM UTC+1, BGB wrote:
>> On 5/24/2022 10:33 AM, MitchAlsup wrote:
>>> On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
>>>> On 5/23/2022 7:08 PM, MitchAlsup wrote:
>>>>> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
>>>>>
>>>>>> This instruction would only do NEAREST, LINEAR would need to be done
>>>>>> manually.
>>>>>>
>>>>>> Say:
>>>>>> LDTEX (R8, R10, 0), R4
>>>>>> LDTEX (R8, R10, 1), R5
>>>>>> LDTEX (R8, R10, 2), R6
>>>>>> LDTEX (R8, R10, 3), R7
>>>>>> MOVU.L (R13), R3 //Z-Buffer
>>>>>> BLERPS.W R4, R10, R16
>>>>>> BLERPS.W R6, R10, R17
>>>>>> BLERPT.W R16, R10, R2
>>>>>> PMULHU.W R2, R24, R2
>>>>> <
>>>>> Why are the 9 above instructions not a single instruction ?
>>> <
>>>> But... How exactly would I be expected to implement this in a way where
>>>> it works consistently and can still pass timing?...
>>>>
>>> GPUs do it and they do it 32 threads wide.
>> But, most don't need to also fit on an XC7A100T or similar while still
>> leaving room for a CPU core.
>>
>> Could probably do a little more with an XC7A200T.
>
> nexys video is pretty damn good
> https://www.mouser.co.uk/new/digilent/digilent-nexys-board/
>

I had wanted to get one before, but held off due to it being fairly
expensive (and I don't really have a source of income right now).

Now this board is sold out pretty much everywhere (along with lots of
other FPGA boards).

When I noticed that they were nearly sold out, had tried to transfer
over some money to PayPal, but by the time it got transferred, they were
sold out entirely.

Now I guess at this point, it is more seeing if/when they will come back
on the market...

> i've got a couple of these so as to avoid wasting 25% of the LUTs on DDR3
> https://1bitsquared.de/products/pmod-hyperram
>
> i'll be adding nexys_video to nmigen as soon as i have time
> https://gitlab.com/nmigen/nmigen-boards/-/issues/1
>
> should work perfectly with nextpnr-xilinx although our install
> scripts need some extra lines to compile the xc7200t.bba
>
> https://libre-soc.org/HDL_workflow/nextpnr-xilinx/
>
> confirmed nextpnr-xilinx is perfectly functional with the Arty A7-100t
> although do avoid using CARRY4 blocks for now (synth_xilinx -nocarry
> in yosys)
>

Still mostly using the Nexys A7 with Vivado, doing most offline
simulation using Verilator.

Since I had posted this, I have added both the LDTEX instruction and
3-cycle low-precision FP-SIMD ops to my BJX2 core.

So, LDTEX:
Loads a texel from a texture with 3L/1T.

Was originally planned to be "Nearest or Corners", but ended up dropping
the rounding on Nearest, since most of the rest of the code already
added a 1/2 texel bias for nearest filtering (so the Nearest modes used
Down/Down rounding, and LDTEX now does Down/Down as its default mode).

The low-precision FP-SIMD ops are also 3L/1T operating on either:
4x Binary16
2x | 4x Binary32, with an internal S.E8.F16.P7 format
(low 7 bits ignored, truncate only).

I ended up doing the internal mantissa negation with ~X rather than -X,
because ~X is both faster/cheaper, and also tended to give less error on
average for some reason.
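As an illustration of the internal S.E8.F16.P7 format described above (this is just a software model of the truncation, not the actual hardware): a Binary32 value keeps its sign and 8-bit exponent, but only the top 16 mantissa bits survive, with the low 7 dropped by truncation rather than rounding.

```python
import struct

def truncate_binary32(x, drop_bits=7):
    """Model an S.E8.F16.P7-style low-precision value: keep the sign
    and 8-bit exponent of a Binary32, clear the low mantissa bits
    (truncate only, no rounding)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= ~((1 << drop_bits) - 1)    # zero the low 7 mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

assert truncate_binary32(1.0) == 1.0          # zero mantissa: unaffected
assert truncate_binary32(3.14159274) < 3.14159274   # truncation shrinks magnitude
```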

Possible operations (don't yet exist):
4x Int32 <-> 4x Binary32
4x Int16 <-> 4x Binary16

These would likely end up needing several sub-variants for how to
interpret the fixed-point integer values (Mostly for selecting between
signed and unsigned and adjusting the location of the decimal point).

Also possible:
RGB555 <-> 4x Binary16

....

Recently, was also left thinking about a possible operation to invert or
flip the sign bits of selected vector elements.

Say:
00: Leave element unchanged;
01: Invert low bits of element;
10: Invert sign bit of element;
11: Invert all bits of element.

There are some operations where "shuffle and then negate certain
elements" is relevant, and as-is, the per-element negation would need to
be handled with XOR.
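The four cases above all reduce to XOR with a per-element constant, which is why a dedicated op would mostly be an encoding win over a separate XOR. A sketch for 16-bit elements (the 2-bit control encoding follows the list above; the packing order is my own choice for illustration):

```python
MASKS = {0b00: 0x0000,   # leave element unchanged
         0b01: 0x7FFF,   # invert low bits (all but the sign)
         0b10: 0x8000,   # invert sign bit only (FP negate)
         0b11: 0xFFFF}   # invert all bits

def flip_elements(vec, ctrl):
    """Apply a 2-bit-per-element invert selector to 16-bit lanes;
    ctrl packs one selector per lane, lane 0 in the low bits."""
    return [v ^ MASKS[(ctrl >> (2 * i)) & 0b11] for i, v in enumerate(vec)]

# Sign-flip lanes 1 and 3 of a 4x Binary16 vector of 1.0 (0x3C00):
assert flip_elements([0x3C00] * 4, 0b10001000) == \
       [0x3C00, 0xBC00, 0x3C00, 0xBC00]
```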

Then again, possible options:
Dedicated ops for sign-flipping and similar;
Op96: XOR Imm64, Rn //possible, but unlikely to add ATM
Op64: SHUF.W Rm, Imm16, Rn
Could extend the normal SHUF.W operator with sign-flipping.

....

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tblg10$13pon$1@dont-email.me>

Newsgroups: comp.arch
 by: BGB - Mon, 25 Jul 2022 07:16 UTC

On 7/24/2022 5:41 PM, luke.l...@gmail.com wrote:
> On Tuesday, May 24, 2022 at 7:19:24 AM UTC+1, Terje Mathisen wrote:
>
>> Have you taken a good, long look at Larrabee (and its list of descendants)?
>
> AVX-512 *shudder*
>
> this is an absolutely brilliant, funny, and insightful talk:
> https://media.handmade-seattle.com/tom-forsyth/

Interesting...

Though, it still seems like it would be pretty difficult to work with,
and well outside what is likely to fit into something like an XC7A100T.

What sort of FPGA would one need for this? I don't know, probably pretty
big, given the giant registers.

In my case, I have instruction-level predication, but not SIMD element
predication (with the partial exception of Packed-Compare and
Packed-Select instructions).

As noted, my current SIMD looks like:
64 or 128 bit vectors.
128-bit cases by using paired register ports.
4x Binary16
2x | 4x Binary32
1x | 2x Binary64

At present, effectively with two FPUs:
Main FPU:
Does Binary64 Natively;
Can do Binary32 or Binary16 SIMD via internal pipelining.
Typically, 10 cycle latency for SIMD ops.
aaaaaa----
-bbbbbb---
--cccccc--
---dddddd-
(Last cycle for forwarding output)
Vec4SF FPU (Low-Precision Vector Unit):
Four parallel FADD and FMUL units (low precision);
Does Binary16 and Binary32 natively:
Binary32 reduced to a 16-bit mantissa.
Can do 1x and 2x Binary64 (optional):
Still with a 16-bit mantissa;
Mostly because Binary64 is used for all scalar values by ABI (*).
Has a 3 cycle latency.

*: In effect, this leads to a funky edge case of values with 11 bits of
exponent but only 16 bits of mantissa. It exists for 2x Binary64 mostly
as a side-effect of how this FPU is implemented (with only 2 of the 4
units being used in this case). Note that while "float" uses Binary32 in
memory, it uses Binary64 in registers (with format conversion on
Load/Store).

The Main FPU:
Has an FADD and FMUL unit;
Currently separate units (vs FMA)
Mostly for cost and latency reasons;
Can be glued for FMAC ops, at a latency cost.

Low-Precision Vector Unit:
Also has FADD and FMUL;
For latency reasons, FMA would have needed ~ 4 or 5 cycles.

Currently, doing operations with the low-precision FPU (for types other
than Binary16, which is the default if this unit is enabled), requires
encoding the instructions with a special rounding mode, which basically
tells the CPU to use the low-precision FPU. Currently this is only
available via a statically encoded rounding mode.

At present, encoding a rounding mode for FPU or SIMD operations requires
using a 64-bit instruction format, which has its own drawbacks (namely
that it can't be bundled). But, these cases are niche enough to not
likely justify adding dedicated encodings in the 32-bit space.

Any attempts to use FMAC would currently be handled by the main FPU
regardless of rounding mode.

Within the confines of standard C, there is basically no way to use the
low-precision FPU directly (pretty much all ways to use it would involve
language extensions; though this is true of the SIMD support in general).

It also adds no fundamentally new capabilities, other than allowing for
faster FP operations (if the unit is not enabled, these instructions
will fall back to the main FPU).

It does appear though that much of Quake can survive with FP precision
being dropped, so it may make sense to come up with some way (in C) to
signal that float variables can be treated as reduced-precision.

Say, for example:
#define lowprec __declspec((low_precision))
lowprec float x; //hint to use reduced precision (still Binary32)
Would differ from "short float" :
short float x; //uses Binary16 in memory

....

I guess I will see how long it is until stuff changes.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<16853e25-8f7e-40f4-9f3d-680efac0cd73n@googlegroups.com>

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Mon, 25 Jul 2022 10:46 UTC

On Monday, July 25, 2022 at 2:18:26 AM UTC+1, BGB wrote:

> Recently, was also left thinking about a possible operation to invert or
> flip the sign bits of selected vector elements.
>
> Say:
> 00: Leave element unchanged;
> 01: Invert low bits of element;
> 10: Invert sign bit of element;
> 11: Invert all bits of element.

this is integer? what's the benefit / use-case?
(you should be able to XOR optionally with 0b100000000..., 0b0111111.....
immediates, to achieve the same effect)

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<2c0104c2-0515-4d14-9e75-6e3525eea9f7n@googlegroups.com>

Newsgroups: comp.arch
 by: luke.l...@gmail.com - Mon, 25 Jul 2022 10:56 UTC

On Monday, July 25, 2022 at 8:16:51 AM UTC+1, BGB wrote:

> As noted, my current SIMD looks like:
> 64 or 128 bit vectors.
> 128-bit cases by using paired register ports.
> 4x Binary16
> 2x | 4x Binary32
> 1x | 2x Binary64

and you designed - and maintain - all of these "by hand"?

this is an approach that we considered to produce far
too much HDL to be able to maintain. instead went
for a dynamic SIMD approach (completed at the
top level)

https://libre-soc.org/3d_gpu/architecture/dynamic_simd/

any operation x+y requires the variables x and y to be
handed (the same) bit-mask context which tells them
where their partitioning bits are.

thus by opening up the partitions between 8-bit adders
you get a carry-over which allows creation of 16-bit
32-bit 64-bit and 128-bit adds.
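the carry-over trick can be modelled directly in software (a minimal Python model of the idea as described, not Libre-SOC's actual HDL): each lane boundary carries a partition bit; leaving it open propagates the carry into the next 8-bit lane (joining lanes into a wider add), closing it breaks the carry chain (independent SIMD lanes).

```python
def partitioned_add(a_lanes, b_lanes, open_partition):
    """Model a dynamically partitioned adder over 8-bit lanes.

    Lanes are little-endian. open_partition[i] == True lets the
    carry out of lane i propagate into lane i+1 (one wider add);
    False blocks it (independent per-lane adds)."""
    out, carry = [], 0
    for i, (a, b) in enumerate(zip(a_lanes, b_lanes)):
        s = a + b + carry
        out.append(s & 0xFF)
        carry = (s >> 8) if (i < len(a_lanes) - 1 and open_partition[i]) else 0
    return out

# Partition closed: two independent 8-bit adds, 0xFF + 1 wraps per lane.
assert partitioned_add([0xFF, 0x00], [0x01, 0x00], [False]) == [0x00, 0x00]
# Partition open: the same operands behave as one 16-bit add.
assert partitioned_add([0xFF, 0x00], [0x01, 0x00], [True]) == [0x00, 0x01]
```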

further investigation showed that for eq you can do the
exact same thing, perform 8-bit eqs then join those sub-results
together dynamically.

further investigation showed the same thing for gt and lt

further investigation showed the same thing for shift.

multiply we had also done by using 8x8 blocks and
performing a series of Dadda adds, each Dadda add also
having the same partitioning breaks.

the next phase: move on to language abstraction, and
by the time you are done you literally have the ability
to make high-level-language constructs "if x > y then x = z"
which works by accumulating decision-bits on a per-8-bit
lane basis, individually into Muxes.

why the hell go to all this trouble?

because, putting it very straight: we considered maintaining
5 sets of SIMD HDL to be flat-out insane, and attempting
explicit dynamic partitioning even worse.

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbmcla$h0r$2@gioia.aioe.org>

Newsgroups: comp.arch
 by: Terje Mathisen - Mon, 25 Jul 2022 15:25 UTC

luke.l...@gmail.com wrote:
> On Tuesday, May 24, 2022 at 7:19:24 AM UTC+1, Terje Mathisen wrote:
>
>> Have you taken a good, long look at Larrabee (and its list of descendants)?
>
> AVX-512 *shudder*
>
> this is an absolutely brilliant, funny, and insightful talk:
> https://media.handmade-seattle.com/tom-forsyth/
>
Tom Forsyth & Mike Abrash were the sw architects behind Larrabee, so he
definitely knows what he's talking about. :-)

I'll set off some time to watch the video, thanks for the link!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<2022Jul25.181202@mips.complang.tuwien.ac.at>

Newsgroups: comp.arch
 by: Anton Ertl - Mon, 25 Jul 2022 16:12 UTC

"luke.l...@gmail.com" <luke.leighton@gmail.com> writes:
>AVX-512 *shudder*
>
>this is an absolutely brilliant, funny, and insightful talk:
>https://media.handmade-seattle.com/tom-forsyth/

An interesting talk.

Some non-central things are also interesting: They indeed used the
P54C (Pentium) as their core on which they tacked the AVX-512 units.
They considered Bonnell (the core of the first Atom generation, also
two-wide in-order), but went with P54C to reduce risk (and because the
Bonnell people would not have time for helping them). One minor thing
I wonder is why he mentioned the P54C (which is the 0.6um shrink of
the Pentium) instead of the original 0.8um P5.

The 45nm Knight's Ferry ran at 1.2GHz, while the fastest 45nm Bonnell
ran at 1.83GHz. Given that Bonnell has 16-19 stages, while the P54C
has 5 stages, I wonder how much they changed the P54C in the process.
The P54C also had 8KB+8KB I+D-cache, while Knights Ferry has 32KB L1
cache. I guess that they increased the number of stages (at least for
D-cache access and for instruction fetch, probably also decode) in
order to achieve that clock rate.

The shrink to 22nm did not increase the clock rate.

They switched to the Airmont core for Knights Landing (in 14nm). The
highest clock rate of Knight's Landing is 1.5GHz, while the fastest
clock rate of other Airmont CPUs is 2.56GHz. My guess is that they
kept the number of gate levels per pipeline stage for the AVX512 units
similar to Knight's Ferry, so they could not benefit from the
higher possible clock rate of the Airmont.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbmn8v$1cpm4$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=26903&group=comp.arch#26903

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 25 Jul 2022 13:26:33 -0500
Organization: A noiseless patient Spider
Lines: 283
Message-ID: <tbmn8v$1cpm4$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com>
<tblg10$13pon$1@dont-email.me>
<2c0104c2-0515-4d14-9e75-6e3525eea9f7n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Jul 2022 18:26:39 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="17dfea8260c097f7bb3c88934754c72b";
logging-data="1468100"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX194QsPgfx65Sb82zZaT0eZA"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:tUtlyf0hcSeXLiwPmLMuFEAKXF0=
Content-Language: en-US
In-Reply-To: <2c0104c2-0515-4d14-9e75-6e3525eea9f7n@googlegroups.com>
 by: BGB - Mon, 25 Jul 2022 18:26 UTC

On 7/25/2022 5:56 AM, luke.l...@gmail.com wrote:
> On Monday, July 25, 2022 at 8:16:51 AM UTC+1, BGB wrote:
>
>> As noted, my current SIMD looks like:
>> 64 or 128 bit vectors.
>> 128-bit cases by using paired register ports.
>> 4x Binary16
>> 2x | 4x Binary32
>> 1x | 2x Binary64
>
> and you designed - and maintain - all of these "by hand"?
>

Same mechanisms can deal with multiple cases.

So, it is not like I need duplicate Verilog for every case;
The FPUs typically convert things to a normalized internal format for
performing an operation.

Quick check, my CPU core currently weighs in at around 108k lines:
85k Main core
23k Ringbus + L1/L2 Caches

For the pipeline, there is a 6R3W register file, which does one of
several configurations internally:
3x 2x1
OP3 Rx, Ry, Z, Rc | OP2 Ru, Rv, Z, Rb | OP1 Rs, Rt, Z, Ra
2x 3x1
OP2 Ru, Rv, Ry, Rb | OP1 Rs, Rt, Rx, Ra
1x 3x1
OP1 Rs, Rt, Rx, Ra
SIMD 3x1:
OP1 Ru:Rs, Rv:Rt, Ry:Rx, Rb:Ra

The 128-bit SIMD ops basically use 3 lanes worth of register ports, but
spread over 2 lanes.

Both the main FPU and low-precision FPU span two lanes.

For both the first and second lanes, also the ALUs and Shift units are
basically able to combine in order to perform 128-bit ALU and Shift
operations.

For some operations, one routes the carry bits from the output of one
ALU into the input of the other (still passes timing, so works I guess).
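As a software cross-check of that trick, here is a minimal C model (my illustration, not BGB's actual Verilog) of two 64-bit adders chained into one 128-bit add via the routed carry:

```c
#include <stdint.h>

/* Model of the carry routing described above: the carry-out of the
 * low 64-bit ALU feeds the carry-in of the high 64-bit ALU, yielding
 * a 128-bit add from two 64-bit units. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);  /* carry-out of the low adder */
    r.hi = a.hi + b.hi + carry;      /* routed into the high adder */
    return r;
}
```

In hardware the two halves sit in parallel lanes; the carry wire between them is the only coupling, which is why the combination can still pass timing.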

As for lanes:
Lane 1 can run basically every instruction in the ISA;
Lane 2 is more limited than Lane 1, but not as limited as Lane 3;
Lane 3 is mostly limited to ALU and Conv ops.

The shifters are basically using a funnel shifter approach, with special
cases to deal with 128-bit shifts. Ironically, the funnel shifters were
cheaper (in terms of LUTs) than the original barrel shifter design.
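For readers unfamiliar with the approach: a funnel shifter conceptually concatenates two words and slides a window across them, with ordinary shifts and rotates falling out as operand-selection special cases. A rough C sketch (my illustration, not the BJX2 implementation):

```c
#include <stdint.h>

/* Funnel shift: view {hi, lo} as a 128-bit value and extract the
 * 64 bits starting at bit position sh. */
static uint64_t funnel64(uint64_t hi, uint64_t lo, unsigned sh) {
    sh &= 63;
    if (sh == 0) return lo;               /* avoid the (hi << 64) case */
    return (lo >> sh) | (hi << (64 - sh));
}

/* shr(x, s)  == funnel64(0, x, s)
 * rotr(x, s) == funnel64(x, x, s)
 * shl(x, s)  == funnel64(x, 0, 64 - s)   (for s in 1..63) */
```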

It can be noted that bundling does affect what sorts of things are allowed:
128-bit SIMD ops may not be bundled;
3-input instructions are only allowed in 1 and 2 op bundles;
...

So, eg:
ALU | ALU | ALU
ALU | ALU | LOAD
ALU | ALU | FPU
...
ALU | STORE //Store limited to 2-wide bundles

The current core allows:
ALU | FPU | LOAD
FPU | LOAD
FPU | STORE

Though, this is considered a different WEX profile from the current
default (though, I could consider making this the default in the
compiler if both WEX3 and 128-bit FP-SIMD are enabled).

Another recent experiment would have added:
LOAD | LOAD //Two parallel memory loads (independent locations)
LOAD | STORE //Load in parallel with a store (independent locations)

But, this is currently not enabled, as my attempt increased LUT cost by
around 6% and also ruined timing, in addition to not yet being stable.

More so, measured performance gains (via emulator) were smaller than
were hoped for, making it "probably not worth the cost".

In the current configurations, only 1 memory access is allowed per clock
cycle.

> this is an approach that we considered to produce far
> too much HDL to be able to maintain. instead went
> for a dynamic SIMD approach (completed at the
> top level)
>
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/
>
> any operation x+y requires the variables x and y to be
> handed (the same) bit-mask context which tells them
> where their partitioning bits are.
>
> thus by opening up the partitions between 8-bit adders
> you get a carry-over which allows creaation of 16-bit
> 32-bit 64-bit and 128-bit adds.
>

The integer SIMD in BJX2 uses the same basic approach.
Integer SIMD is routed through the ALU, and breaks the carry chains.
These were special cases in the carry-select Adder logic.
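The effect of breaking the carry chains can be modeled in plain C with the usual SWAR trick (illustrative only; the real logic gates the carries inside the carry-select adder rather than masking):

```c
#include <stdint.h>

/* One 64-bit adder doing a 4x16 packed add: mask off the top bit of
 * each lane, add, then patch the top bits back in so no carry ever
 * crosses a lane boundary. */
static uint64_t padd16x4(uint64_t a, uint64_t b) {
    const uint64_t M = 0x7FFF7FFF7FFF7FFFull;  /* all but each lane MSB */
    uint64_t sum = (a & M) + (b & M);          /* carries stop at bit 14 */
    return sum ^ ((a ^ b) & ~M);               /* restore the lane MSBs */
}
```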

Total Verilog for both FPUs:
~ 8 kLOC.

This includes both the main FPU, low precision FPU, and all the FP-SIMD
mechanisms.

For comparison:
The instruction decoder logic is around 12k lines:
2.2k 16-bit BJX2 decoder
5.0k 32-bit BJX2 decoder
Also deals with 64b and 96b ops
1.9k RISC-V decoder (RV64IM for now)
1.3k bundle decoder (decoder front-end)
Glues the other decoders together.

Some of the rest is misc stuff (~ 2k), like branch predictor and the
(unused and incomplete) RISC-V superscalar checker.

Roughly 30 kLOC goes to stuff related to the EX stages.
~ 4.9k, EX1/EX2/EX3 handlers;
~ 2.6k, ALU;
~ 2.1k, Various carry-select adders;
~ 1.5k, Integer multipliers and similar;
~ 1.1k, Converters;
~ 0.8k, Shifters;
~ 0.7k, AGU;
...

So:
~ 30k, Execute stages (various stuff)
~ 23k, Memory Caches (Ringbus)
~ 12k, Decode
~ 8k, FPU & FP-SIMD
~ 7k, Peripheral devices (Screen, Audio, SDcard SPI, ...)
~ 7k, Old bus (original/slower L1/L2 caches)
~ 7k, DDR Controller
~ 6k, Register File (GPR and CR)
~ 5k, Definitions and config parameters.

> further investigation showed that for eq you can do the
> exact same thing, perform 8-bit eqs then join those sub-results
> together dynamically.
>
> further investigation showed the same thing for gt and lt
>

ADD/SUB/CMP are all basically the same logic in my case (all run through
a shared carry-select adder).

FPU compare is routed through the ALU, with a few special cases:
Need to handle sign specially;
Otherwise, it is basically an unsigned integer compare.
Need to detect and special-case NaN values.

The FP SIMD compares are also handled as special cases along with the
packed-integer compare.
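The sign-handling trick can be shown in C: for non-NaN values, remapping the sign bit turns IEEE-754 ordering into plain unsigned integer ordering. This is a sketch of the general technique, not BJX2's exact datapath, and note the remap makes -0.0 order strictly below +0.0:

```c
#include <stdint.h>
#include <string.h>

/* Remap a double's bits so unsigned integer order matches FP order:
 * flip all bits if negative, else just set the sign bit. */
static uint64_t fp_key(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return (u & 0x8000000000000000ull) ? ~u : (u | 0x8000000000000000ull);
}

static int fp_lt(double a, double b) {
    if (a != a || b != b) return 0;   /* NaN: unordered, compares false */
    return fp_key(a) < fp_key(b);
}
```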

> further investigation showed the same thing for shift.
>

I don't have per-element SIMD shift as of yet.

I have a 64b/128b funnel shift.
Mostly fakes 32b as well.

> multiply we had also done by using 8x8 blocks and
> performing a series of Dadda adds, each Dadda add also
> having the same partitioning breaks.
>

In my case, the multiplier works with 16-bit chunks, because this is
what the DSP48 blocks allow for.

Currently, there are several integer multipliers:
32x32=>64 integer multiplier;
4x 16x16=>32 SIMD multiplier;
64x64=>128 Shift/Add multiplier:
64-bit integer multiply
32 and 64 bit integer divide and modulo
FPU divide (FDIV).
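The 16-bit-chunk decomposition is the standard schoolbook split: the DSP blocks supply the 16x16=>32 pieces and the fabric sums the shifted partial products. Sketched in C (my illustration of the decomposition, not the actual datapath):

```c
#include <stdint.h>

/* 32x32=>64 multiply built from four 16x16=>32 partial products,
 * shifted into place and summed. */
static uint64_t mul32x32(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;
    uint64_t ll = (uint64_t)al * bl;          /* bits  0..31 */
    uint64_t lh = (uint64_t)al * bh;          /* bits 16..47 */
    uint64_t hl = (uint64_t)ah * bl;          /* bits 16..47 */
    uint64_t hh = (uint64_t)ah * bh;          /* bits 32..63 */
    return ll + ((lh + hl) << 16) + (hh << 32);
}
```

A 64x64 multiply decomposes the same way into sixteen such pieces, which is presumably why it is routed through the shift/add unit instead.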

I mostly skipped out on packed-byte cases, because:
In general, they added too much complexity to be worthwhile;
Cheaper and easier to route packed-byte through the packed-word paths
via converters;
....

A few packed byte formats exist via converters:
4x signed byte
4x unsigned byte
4x signed microfloat (E4.F3.S, *)
4x unsigned microfloat (E4.F4)

*: Microfloats put the sign in the LSB as this was noticeably cheaper
than putting it in the MSB (partly because these microfloat converters
ended up used in multiple pathways, and the same converters need to deal
with both sub-formats).

There are also converters for RGB555 and a few other special formats as
well.
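To make the microfloat formats concrete, here is a hedged C decode of the unsigned E4.F4 case; the bias (7 here) and the treatment of a zero exponent are my assumptions, and the actual BJX2 converters may well differ:

```c
#include <stdint.h>
#include <math.h>

/* Illustrative decode of an unsigned E4.F4 microfloat to float.
 * Assumptions: exponent bias 7, exponent 0 decodes to zero (no
 * denormals). */
static float fp8u_to_float(uint8_t v) {
    int e = (v >> 4) & 0xF;           /* 4-bit exponent */
    int f = v & 0xF;                  /* 4-bit fraction */
    if (e == 0) return 0.0f;          /* assumed: no denormals */
    float m = 1.0f + f / 16.0f;       /* implicit leading 1 */
    return ldexpf(m, e - 7);          /* apply the assumed bias */
}
```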

Along with decoders for compressed texture blocks:
The BLKUTXn instructions (connected to the ALU);
The LDTEX instruction (connected to EX3);
This case also deals with RGB555, RGBA32, and FP8 pixel formats (*).

*: Had considered a joint-exponent format (RGBE4444), but this turned
out to be "unreasonably expensive" to unpack given the denormalized
mantissas. There are both 4x-Binary16 and 4x-FP8 texture formats though.

Had considered a 3x FP10 texture format (E5.F5), but it lost to 4x FP8
(E4.F4), since the latter is "more versatile" and the perceptual quality
difference wasn't that large (FP8 is roughly comparable to RGB555, but
with a larger dynamic range).

> the next pase, move on to language abstraction and
> by the time you are done you literally have the ability
> to make high-level-language constructs "if x > y then x = z"
> which works by accumulating decision-bits on a per-8-bit
> lane basis, individually into Muxes.
>
> why the hell go to all this trouble?
>
> because, putting it very straight: we considered maintaining
> 5 sets of SIMD HDL to be flat-out insane, and attempting
> explicit dynamic partitioning even worse.
>

Didn't seem too unmanageable in my case.

Though, besides the Verilog for my CPU core, my project also involves
around 1.5 MLOC of C.

So:
C Compiler: ~ 220 kLOC
Emulator: ~ 49 kLOC
TestKern + C runtime: ~ 129 kLOC
...

A fair chunk of the C code though is stuff I had ported to my ISA,
mostly old games and similar being used as test cases.

....

My core is basically fast enough to get "semi usable" framerates in
software-rasterized GLQuake when running at 50MHz (but, sadly, still
well below what would be considered acceptable for general gameplay).

But, this sort of thing proves difficult within the limits of the
XC7A100T (though, I can still do a little more here than is possible to
fit on an XC7S50).


Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbmolj$1d4p0$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=26904&group=comp.arch#26904

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 25 Jul 2022 13:50:13 -0500
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <tbmolj$1d4p0$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com>
<tbmcla$h0r$2@gioia.aioe.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Jul 2022 18:50:27 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="17dfea8260c097f7bb3c88934754c72b";
logging-data="1479456"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18Fgx4t218gMITZbXis50Cv"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:MbMYZcfejwRua53x8yrY9m7xj/Y=
Content-Language: en-US
In-Reply-To: <tbmcla$h0r$2@gioia.aioe.org>
 by: BGB - Mon, 25 Jul 2022 18:50 UTC

On 7/25/2022 10:25 AM, Terje Mathisen wrote:
> luke.l...@gmail.com wrote:
>> On Tuesday, May 24, 2022 at 7:19:24 AM UTC+1, Terje Mathisen wrote:
>>
>>> Have you taken a good, long look at Larrabee (and its list of
>>> descendants)?
>>
>> AVX-512 *shudder*
>>
>> this is an absolutely brilliant, funny, and insightful talk:
>> https://media.handmade-seattle.com/tom-forsyth/
>>
> Tom Forsyth & Mike Abrash were the sw architects behind Larrabee, so he
> definitely knows what he's talking about. :-)
>
> I'll set off some time to watch the video, thanks for the link!
>

Yeah, was an interesting talk about the history of things.

Not sure how to judge the relative cost comparisons, as x86 still seems
like an unreasonably awkward architecture to try to deal with in hardware.

In my case, I am just someone figuring it out as I go along, but seem
to have done "reasonably OK", I think.

At least, haven't seen any other FPGA cores which "blow my stuff out of
the water" in terms of major stats, so probably not too unreasonable.

One of the major things limiting clock speed is mostly things like the
L1 cache sizes and pipeline-stall-handling. But, the L1 caches need to
be reasonably large to perform well (shrinking the L1 hurts worse than
the gains from a higher MHz).

At present, it looks like I would be pulling solid double-digit
framerates in GLQuake if I could have both 100MHz and decently high
memory bandwidth.

....

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<f230ca5e-45ea-4040-9879-e796a9628294n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=26905&group=comp.arch#26905

Newsgroups: comp.arch
X-Received: by 2002:a0c:e14e:0:b0:474:1fcf:6828 with SMTP id c14-20020a0ce14e000000b004741fcf6828mr11250521qvl.96.1658775266347;
Mon, 25 Jul 2022 11:54:26 -0700 (PDT)
X-Received: by 2002:a37:6cd:0:b0:6b5:bdbd:e70 with SMTP id 196-20020a3706cd000000b006b5bdbd0e70mr10039146qkg.423.1658775266182;
Mon, 25 Jul 2022 11:54:26 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!border-1.nntp.ord.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 25 Jul 2022 11:54:25 -0700 (PDT)
In-Reply-To: <tbmn8v$1cpm4$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.199.50; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.199.50
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com> <tblg10$13pon$1@dont-email.me>
<2c0104c2-0515-4d14-9e75-6e3525eea9f7n@googlegroups.com> <tbmn8v$1cpm4$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <f230ca5e-45ea-4040-9879-e796a9628294n@googlegroups.com>
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Mon, 25 Jul 2022 18:54:26 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 33
 by: luke.l...@gmail.com - Mon, 25 Jul 2022 18:54 UTC

On Monday, July 25, 2022 at 7:26:42 PM UTC+1, BGB wrote:
> On 7/25/2022 5:56 AM, luke.l...@gmail.com wrote:
> > On Monday, July 25, 2022 at 8:16:51 AM UTC+1, BGB wrote:
> >
> >> As noted, my current SIMD looks like:
> >> 64 or 128 bit vectors.
> >> 128-bit cases by using paired register ports.
> >> 4x Binary16
> >> 2x | 4x Binary32
> >> 1x | 2x Binary64
> >
> > and you designed - and maintain - all of these "by hand"?
> >
> Same mechanisms can deal with multiple cases.

apologies i should have been clearer. do you *instantiate* all
of those at once? is there any sharing or micro-coding?

the other main reason for the dynamic SIMD work is that although it
adds appx 50% overhead this is far less than having multiple separate SIMD
ALUs, one dedicated to 8x8, one to 4x16, one to 2x32, even worse
if you have 128-bit, which needs:

LUT4s for 16x8
LUT4s for 8x16
LUT4s for 4x32
LUT4s for 2x64
LUT4s for 1x128

even just for add that's 5x the number of adders!

150% vs 500% is an easy decision.

l.

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbmsq9$1nj9$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=26906&group=comp.arch#26906

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!aioe.org!10O9MudpjwoXIahOJRbDvA.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 25 Jul 2022 22:01:13 +0200
Organization: Aioe.org NNTP Server
Message-ID: <tbmsq9$1nj9$1@gioia.aioe.org>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com>
<2022Jul25.181202@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="56937"; posting-host="10O9MudpjwoXIahOJRbDvA.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.13
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Mon, 25 Jul 2022 20:01 UTC

Anton Ertl wrote:
> "luke.l...@gmail.com" <luke.leighton@gmail.com> writes:
>> AVX-512 *shudder*
>>
>> this is an absolutely brilliant, funny, and insightful talk:
>> https://media.handmade-seattle.com/tom-forsyth/
>
> An interesting talk.
>
> Some non-central things are also interesting: They indeed used the
> P54C (Pentium) as their core on which they tacked the AVX-512 units.
> They considered Bonnell (the core of the first Atom generation, also
> two-wide in-order), but went with P54C to reduce risk (and because the
> Bonnell people would not have time for helping them). One minor thing
> I wonder is why he mentioned the P54C (which is the 0.6um shrink of
> the Pentium) instead of the original 0.8um P5.

I don't know the answer to that one, except that they probably cleaned
up some speed paths during that shrink?

I think they picked the Pentium because with that very short pipeline,
and 4-way barrel scheduling, Larrabee could run back-to-back FP SIMD
operations in every cycle of every thread, and the branch miss penalty
could be kept at zero or a very low number.

I know much less about what was changed for the later iterations.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbmvan$1esgs$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=26907&group=comp.arch#26907

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 25 Jul 2022 15:44:01 -0500
Organization: A noiseless patient Spider
Lines: 124
Message-ID: <tbmvan$1esgs$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me> <t6htd8$q8c$1@gioia.aioe.org>
<ae97dee0-35fc-4294-be1f-aca37367c1c8n@googlegroups.com>
<tblg10$13pon$1@dont-email.me>
<2c0104c2-0515-4d14-9e75-6e3525eea9f7n@googlegroups.com>
<tbmn8v$1cpm4$1@dont-email.me>
<f230ca5e-45ea-4040-9879-e796a9628294n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Jul 2022 20:44:07 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="17dfea8260c097f7bb3c88934754c72b";
logging-data="1536540"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19TQnUdIiQWEMNXx8ICkXFC"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:LxFVjJc/gDj4Zwf4lE2iwtTiJ08=
Content-Language: en-US
In-Reply-To: <f230ca5e-45ea-4040-9879-e796a9628294n@googlegroups.com>
 by: BGB - Mon, 25 Jul 2022 20:44 UTC

On 7/25/2022 1:54 PM, luke.l...@gmail.com wrote:
> On Monday, July 25, 2022 at 7:26:42 PM UTC+1, BGB wrote:
>> On 7/25/2022 5:56 AM, luke.l...@gmail.com wrote:
>>> On Monday, July 25, 2022 at 8:16:51 AM UTC+1, BGB wrote:
>>>
>>>> As noted, my current SIMD looks like:
>>>> 64 or 128 bit vectors.
>>>> 128-bit cases by using paired register ports.
>>>> 4x Binary16
>>>> 2x | 4x Binary32
>>>> 1x | 2x Binary64
>>>
>>> and you designed - and maintain - all of these "by hand"?
>>>
>> Same mechanisms can deal with multiple cases.
>
> apologies i should have been clearer. do you *instantiate* all
> of those at once? is there any sharing or micro-coding?
>

For FPU, originally there was only one FADD and one FMUL unit; all
FP-SIMD cases were pipelined through it via an internal state machine.

There are some internal state machines, but no microcode.

The newer SIMD unit instantiates 4 low-precision FP ADD/MUL units, with
a bunch of plumbing for routing and format conversions.

Mostly the latter is a big mess of:
wire[63:0] tWhatever;
assign tWhatever = opIsHalf ? tHalf : opIsDbl ? tDouble : tSingle;

This mostly being a choice to spend some extra LUTs to reduce FP-SIMD
latency (10 cycles to 3 cycles), partly as this was relevant for
performance in GLQuake and similar.

Seems to cost ~ 2%, or around 1.3k LUT.

There is roughly a 5x cost difference between the Binary64 FADD+FMUL
and each low-precision (~ S.E11.F16) FADD+FMUL (though, given this FPU
has 4 of these low-precision units, it roughly doubles FPU-related LUT
costs).

Note that originally, the addition of SIMD had negligible cost increase
over the scalar FPU.

Though, 128-bit SIMD, and the ability to invoke the FPU from Lane 2,
added a relatively minor cost increase (around 1%-ish territory). I
suspect most of this was due to internal plumbing and similar...

> the other main reason for the dynamic SIMD work is that although it
> adds appx 50% overhead this is far less than having multiple separate SIMD
> ALUs, one dedicated to 8x8, one to 4x16, one to 2x32, even worse
> if you have 128-bit which is
>
> LUT4s for 16x8
> LUT4s for 8x16
> LUT4s for 4x32
> LUT4s for 2x64
> LUT4s for 1x128
>
> even just for add that's 5x the number of adders!
>
> 150% vs 500% is an easy decision.
>

Yeah, bulk duplicating this stuff for every scenario does not seem like
a great idea.

It is more like:
Adders for bits (15: 0)
Adders for bits (31:16)
Adders for bits (47:32)
Adders for bits (63:48)

tValRtV = tInvRt ? ~tValRt : tValRt;
tAddValA0 = {1'b0, tValRs[15: 0]} + {1'b0, tValRtV[15: 0]} + 17'h0000;
tAddValA1 = {1'b0, tValRs[15: 0]} + {1'b0, tValRtV[15: 0]} + 17'h0001;
tAddValB0 = {1'b0, tValRs[31:16]} + {1'b0, tValRtV[31:16]} + 17'h0000;
tAddValB1 = {1'b0, tValRs[31:16]} + {1'b0, tValRtV[31:16]} + 17'h0001;
// ... gather up carry bits for the various operations ...
tAddValRn[15: 0] = tAddSelA ? tAddValA1[15:0] : tAddValA0[15:0];
tAddValRn[31:16] = tAddSelB ? tAddValB1[15:0] : tAddValB0[15:0];
// ...

One can deal with the SIMD cases while figuring out what the carry bits
should be.
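A behavioral C model of that carry-select scheme (illustrative only): each 16-bit chunk is summed twice, once per possible carry-in, and a mux picks per chunk; SIMD mode simply forces every inter-chunk select to zero, which is the "break the carry chain" case.

```c
#include <stdint.h>

/* Carry-select 64-bit add from 16-bit chunks.  packed16 != 0 forces
 * the inter-chunk carry selects to zero, giving a 4x16 packed add. */
static uint64_t add64_csel(uint64_t a, uint64_t b, int packed16) {
    uint64_t r = 0;
    uint32_t c = 0;                    /* carry into current chunk */
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (16 * i)) & 0xFFFF;
        uint32_t y = (b >> (16 * i)) & 0xFFFF;
        uint32_t s0 = x + y;           /* 17-bit sum, carry-in = 0 */
        uint32_t s1 = x + y + 1;       /* 17-bit sum, carry-in = 1 */
        uint32_t s  = c ? s1 : s0;     /* the select mux */
        r |= (uint64_t)(s & 0xFFFF) << (16 * i);
        c = packed16 ? 0 : (s >> 16);  /* SIMD mode breaks the chain */
    }
    return r;
}
```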

As noted, I didn't actually bother with packed-byte SIMD (operating on
packed byte needs to be done using conversions and packed word operations).

Partly this was because:
Packed byte is less commonly needed than packed word;
It more commonly needs saturating arithmetic;
The carry-select worked "best" at around 16 or 20 bits (*1);
I didn't want to bother with saturating arithmetic.

So, no dedicated packed byte ops...

*1: Under 16-bits, LUT cost increases;
Over 16 bits, timing delay increases;
The CARRY4's operate on 4 bits at a time;
...

There is no dedicated 128-bit packed word op, but it is possible to fake
this:
PADD.W R5, R7, R3 | PADD.W R4, R6, R2

With each ALU operating on half of the vector.

Each ALU is 64-bits internally (with any 128-bit ALU ops working by
using two ALUs in parallel).

....

Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<tbn00p$1f27n$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=26908&group=comp.arch#26908

Newsgroups: comp.arch
Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Idle thoughts for cheap and fast(ish) GPU.
Date: Mon, 25 Jul 2022 15:55:47 -0500
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <tbn00p$1f27n$1@dont-email.me>
References: <t6gush$p5u$1@dont-email.me>
<66967492-b252-4e76-8f06-5fde2c8a1cc4n@googlegroups.com>
<t6hgk5$5hg$1@dont-email.me>
<25ec150c-9ba5-497f-864e-1895365dc27cn@googlegroups.com>
<t6j9gs$kta$1@dont-email.me>
<166769ba-7b37-4b0d-837e-da5c4ffd5f25n@googlegroups.com>
<tbkr0u$v2et$1@dont-email.me>
<16853e25-8f7e-40f4-9f3d-680efac0cd73n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Jul 2022 20:55:53 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="17dfea8260c097f7bb3c88934754c72b";
logging-data="1542391"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+0nBSagGeJAwJNyzdGEMdz"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.11.0
Cancel-Lock: sha1:0XG9kgcKFeIdAerKTYJXwBo0SXY=
Content-Language: en-US
In-Reply-To: <16853e25-8f7e-40f4-9f3d-680efac0cd73n@googlegroups.com>
 by: BGB - Mon, 25 Jul 2022 20:55 UTC

On 7/25/2022 5:46 AM, luke.l...@gmail.com wrote:
> On Monday, July 25, 2022 at 2:18:26 AM UTC+1, BGB wrote:
>
>> Recently, was also left thinking about a possible operation to invert or
>> flip the sign bits of selected vector elements.
>>
>> Say:
>> 00: Leave element unchanged;
>> 01: Invert low bits of element;
>> 10: Invert sign bit of element;
>> 11: Invert all bits of element.
>
> this is integer? what's the benefit / use-case?
> (you should be able to XOR optionally with 0b100000000..., 0b0111111.....
> immediates, to achieve the same effect)
>

Both integer and floating point (both use the same registers in my case).

It could be relevant for operators like:
Cross product;
Complex multiply;
Quaternion multiply;
Pauli matrices;
...

But, yes, XOR could also do this, but I lack an operator for:
XOR Imm64, Rn

One can do, say:
MOV Imm64, Rt
XOR Rm, Rt, Rn

But, this adds an extra instruction.
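For the packed-float case, that two-instruction sequence amounts to the following (a C model of flipping the sign bits of a 2x Binary32 vector held in a 64-bit register):

```c
#include <stdint.h>

/* Negate both Binary32 elements of a packed 2x float vector by
 * XOR-ing against a constant with only the per-element sign bits
 * set (the MOV Imm64 + XOR sequence, in effect). */
static uint64_t pneg2f(uint64_t v2f) {
    return v2f ^ 0x8000000080000000ull;
}
```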

Something like a combined shuffle-and-negate could be semi-relevant to
reducing the clock-cycle cost of calculating vector cross-products.

Then again, looking at it, probably not doing enough cross-products and
similar to make this worthwhile.
