novaBBS - comp.arch - Re: Misc: Idle thoughts for cheap and fast(ish) GPU.

<29370e99-7b21-4473-8076-0509a6c54f9dn@googlegroups.com>

https://www.novabbs.com/computers/article-flat.php?id=25476&group=comp.arch#25476

by: MitchAlsup - Tue, 24 May 2022 19:13 UTC

On Tuesday, May 24, 2022 at 1:52:16 PM UTC-5, BGB wrote:
> On 5/24/2022 10:33 AM, MitchAlsup wrote:
> > On Monday, May 23, 2022 at 9:41:13 PM UTC-5, BGB wrote:
> >> On 5/23/2022 7:08 PM, MitchAlsup wrote:
> >>> On Monday, May 23, 2022 at 4:38:28 PM UTC-5, BGB wrote:
> >>>
> >>>> This instruction would only do NEAREST, LINEAR would need to be done
> >>>> manually.
> >>>>
> >>>> Say:
> >>>> LDTEX (R8, R10, 0), R4
> >>>> LDTEX (R8, R10, 1), R5
> >>>> LDTEX (R8, R10, 2), R6
> >>>> LDTEX (R8, R10, 3), R7
> >>>> MOVU.L (R13), R3 //Z-Buffer
> >>>> BLERPS.W R4, R10, R16
> >>>> BLERPS.W R6, R10, R17
> >>>> BLERPT.W R16, R10, R2
> >>>> PMULHU.W R2, R24, R2
> >>> <
> >>> Why are the 9 above instructions not a single instruction ?
> > <
> >> But... How exactly would I be expected to implement this in a way where
> >> it works consistently and can still pass timing?...
> >>
> > GPUs do it and they do it 32 threads wide.
> But, most don't need to also fit on an XC7A100T or similar while still
> leaving room for a CPU core.
>
> Could probably do a little more with an XC7A200T.
>
>
> Will state that a GPU core is probably not viable on either an XC7A35T
> or XC7S50, as pretty much all these LUTs are likely to be much better
> spent on the main CPU.
> >>
> >> The accesses will not necessarily fall within the same texture block,
> >> nor necessarily into adjacent blocks in memory.
> > <
> > Yes, I am aware of that, I am also aware that texture is inherently a
> > memory reference where the address is floating point, the integer
> > part is used to access the data, the fractional part is used to LERP
> > the various datums into a single result.
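A minimal sketch of the point quoted above, reduced to one dimension: the integer part of a floating-point coordinate selects the texels, and the fractional part is the LERP weight. The function names, the wrap-around addressing, and the sample data are illustrative assumptions, not anything from either poster's hardware.

```python
import math

def split_texcoord(u, size):
    """Split a floating-point texel coordinate into (index, fraction).

    Wrap-around addressing is an assumption made for this sketch.
    """
    base = math.floor(u)
    frac = u - base
    return base % size, frac

def lerp(a, b, f):
    """Blend a toward b by the fraction f in [0, 1]."""
    return a + (b - a) * f

def sample_1d(texture, u):
    """Fetch the two neighbouring texels and blend them by the fraction."""
    size = len(texture)
    i0, frac = split_texcoord(u, size)
    i1 = (i0 + 1) % size          # second fetch; may land in another block
    return lerp(texture[i0], texture[i1], frac)

texture = [0.0, 10.0, 20.0, 30.0]   # made-up 1D texture
```

In 2D the same split happens on both coordinates, which is where the four fetches per sample come from.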
> If they were all the same block, then fetch could be cheaper:
> Fetch 1 block, then decode multiple texels from the same block.
>
> But, needing multiple fetches kinda ruins things here.
<
The Samsung GPU I worked on had a Texture unit that could perform
9 fetches per cycle to feed the 3D LERPer. It also had an interpolator
that performed the (up to) 29 FP operations per cycle (pipelined), and
a "front end" that would read the input stream and be able to create
a new "warp" every 8 cycles.
<
All on a "per instance" basis, and we were using 8 instances.
<
Imagine a CPU performing 32 texels() every 8 cycles !!
Just try to imagine !!
>
> In this case, a version which has a seam every 4 texels would be worse
> than one which only does nearest interpolation.
> >>
> >> So, at best I either have:
> >> Seams at block-edges (if only interpolating intra-block texels);
> >> 4x the resource cost for the texture fetch;
> > <
> > It is more like 9× if you want 3D textures.
> Pretty much no one used these AFAIK.
<
Maybe in DOOM and Quake......but several graphics programs used textures in
the vertex shader to do "interesting physics".
>
> My thinking was, say, if one wanted hardware that could do:
> GL_LINEAR
> And:
> GL_LINEAR_MIPMAP_NEAREST
>
> My thinking at the moment is that, effectively, LDTEX would do
> GL_NEAREST with an extra performance cost for GL_LINEAR and similar.
>
> For the GLQuake port, the filtering mode can be set as:
> GL_LINEAR, GL_NEAREST_MIPMAP_NEAREST
> Which is computationally cheaper.
> >> ...
> >>
> >>
> >> LERP latency isn't exactly zero either:
> >> Simple LERP: C=(A*(~S))+(B*S)
> > <
> > If you look at the above correctly, the LERP can be done in 1 pass
> > through the multiplier. If a bit is set in S you add B otherwise you add
> > A for all bits in S.
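The single-pass claim can be checked in software. Because ~S has a bit set exactly where S does not, each partial-product row of the multiplier adds either B (bit set) or A (bit clear), shifted by the bit position. The sketch below assumes an 8-bit S and uses hypothetical function names; it models the arithmetic only, not any real multiplier array.

```python
def lerp_reference(a, b, s, bits=8):
    """C = A*(~S) + B*S with S treated as an unsigned 'bits'-wide weight."""
    mask = (1 << bits) - 1
    return a * (~s & mask) + b * s

def lerp_single_pass(a, b, s, bits=8):
    """One walk over the bits of S: add B where the bit is set, else add A."""
    acc = 0
    for i in range(bits):
        addend = b if (s >> i) & 1 else a
        acc += addend << i        # one partial-product row per bit of S
    return acc
```

Both forms return the blend scaled by 2^bits - 1 (255 for 8 bits), so a final normalisation step is still needed.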
> Could work, but not with DSP48's.
<
See what happens when you don't have access to the raw gates !!
>
> Could do it with discrete adders, but this is not likely to scale very
> well; it could maybe be done if S/T were reduced to around 2 or 3 bits
> (like with the 3/8 + 5/8 blending used in UTX2, since 3/8 and 5/8 are
> "close enough" to 1/3 and 2/3).
>
> DSP48's sorta work well for what they do, but have a fairly high
> latency, so trying to daisy-chain multiple DSP48's within a single cycle
> isn't good.
>
>
> > <
> >> Bi-LERP: C=((A*(~S))+(B*S))*(~T) + ((C*(~S))+(D*S))*T;
> >>
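For reference, the Bi-LERP expression above written out as integer arithmetic. The 4-bit weight width, the rounding step, and all names are assumptions chosen for the sketch; A, B are taken as the top pair of texels and C, D as the bottom pair, matching the formula.

```python
BITS = 4                    # assumed weight width, not taken from the thread
MASK = (1 << BITS) - 1

def lerp_fx(a, b, s):
    """A*(~S) + B*S; the result is scaled by MASK because ~S + S == MASK."""
    return a * (~s & MASK) + b * s

def bilerp_fx(a, b, c, d, s, t):
    """((A*~S)+(B*S))*~T + ((C*~S)+(D*S))*T, scaled by MASK*MASK."""
    top = lerp_fx(a, b, s)
    bottom = lerp_fx(c, d, s)
    return lerp_fx(top, bottom, t)

def bilerp_texel(a, b, c, d, s, t):
    """Normalise back to texel range with round-to-nearest."""
    total = MASK * MASK
    return (bilerp_fx(a, b, c, d, s, t) + total // 2) // total
```

The two inner LERPs are independent, but the T blend has to wait for both, which is the source of the extra latency discussed here.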
>
> Note, I could do this, just the instruction would be more expensive and
> would likely need to have a multi-cycle latency.
>
>
> There is also a cost cutting trick to reduce the number of texel
> fetches, say:
> LDTEX (R8, R10, 0), R4
> LDTEX (R8, R10, 1), R5
> LDTEX (R8, R10, 2), R7
> MOVU.L (R13), R3 //Z-Buffer
> MOV R4, R6 ||
> BLERPS.W R4, R10, R16
> BLERPT.W R6, R10, R17
> PAVG.W R16, R17, R2
> ...
>
> Or:
> PADD.W R16, R17, R2
> PSUB.W R2, R4, R2
>
> But, this has a problem, namely that it can overflow and cause obvious
> artifacts.
>
> One option here I guess could be to consider either:
> A combined instruction which has higher dynamic range and clamps the result;
> Or, an instruction which effectively implements a Paeth filter.
>
> Or, not bother, and stick with using an average.
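A rough model of the two three-texel variants quoted above, reading the listings as: texel A at the base position, B its neighbour along S, C its neighbour along T. That reading, the floating-point weights, and the 0..255 clamp range are assumptions of this sketch.

```python
def lerp(a, b, f):
    return a + (b - a) * f

def clamp(x, lo=0.0, hi=255.0):
    return max(lo, min(hi, x))

def three_texel_avg(a, b, c, s, t):
    """PAVG-style variant: average of the S blend and the T blend."""
    return 0.5 * (lerp(a, b, s) + lerp(a, c, t))

def three_texel_plane(a, b, c, s, t):
    """PADD/PSUB-style variant: A + (B-A)*s + (C-A)*t; can leave the range."""
    return lerp(a, b, s) + lerp(a, c, t) - a

def three_texel_plane_clamped(a, b, c, s, t):
    """Same as above with the result clamped back into texel range."""
    return clamp(three_texel_plane(a, b, c, s, t))
```

With A=0 and B=C=255 at s=t=1 the unclamped plane form yields 510, which is the overflow case. The averaging form never overflows, but it only reaches halfway to B when s=1 and t=0.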
> >>
> >> So, it is multiple ops mostly because I don't think I can make it much
> >> shorter than this.
> >>
> >> As-is, on BJX2, the dependency chain is basically:
> >> PMORT //shuffle
> >> SHLD //shift
> >> AND //mask by texture size
> >> MOV.Q //load
> >> BLKUTX2 //extract texel from block
> >> ... //blending math
> >>
> >> Collapsing the first 5 into a single operation should at least be an
> >> improvement.
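A software sketch of what the first five steps of that chain compute, under several assumptions that are not stated in the post: the shuffle is a Morton (bit-interleave) order, coordinates are 16.16 fixed point, textures are square powers of two, and blocks hold 4x4 texels. The function names are hypothetical and the load itself is left out.

```python
def morton2(x, y, bits=32):
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    m = 0
    for i in range(bits):
        m |= ((x >> i) & 1) << (2 * i)
        m |= ((y >> i) & 1) << (2 * i + 1)
    return m

def texel_address(s, t, log2_size, frac_bits=16):
    """Return (block index, texel index within an assumed 4x4 block)."""
    m = morton2(s, t)                     # shuffle
    m >>= 2 * frac_bits                   # shift away the fractional bits
    m &= (1 << (2 * log2_size)) - 1       # mask by texture size (wraps)
    block = m >> 4                        # which 16-texel block to load
    texel = m & 15                        # which texel to extract from it
    return block, texel
```

With this layout the low four bits of the interleaved index are the low two bits of each coordinate, so block selection and in-block extraction fall out of a single shift and mask.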
> >>
> >> Well, along with needing to reduce the number of clock-cycles needed for
> >> PADDX.F and PMULX.F, ...
> >>
> >>
> >>
> >>
> >>
> >> My current thinking is that if ~ 70-80% of the scene is quietly being
> >> rendered with NEAREST filtering, it is faster, and "not super obvious"..
> >>
> >> There are "slightly noticeable" boundaries where stuff moves
> >> between minimization and magnification cases, or transitions between
> >> mip-levels, but alas. Full trilinear would help here, but I am not
> >> really going to have the clock-cycle budget with this to afford trilinear.
> >>
> >> Hope is more just to be able to get GLQuake above single-digit framerates.
> >>
> >>
> >> Which as-is, looks like it will effectively require being able to push
> >> megapixel-per-second fill-rates.
> >>> <
> >>>> RGB5PCK R2, R2 | CMPGT R14, R3 //Z test
> >>>> MOV.W?T R2, (R12) //Color Write
> >>>> MOV.L?T R14, (R13) //Z Write
> >>>> Step and Loop
> >>> --------------------------------
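The tail of the quoted loop (pack, Z test, predicated colour and Z writes) corresponds roughly to the per-pixel step below. The RGB555 bit layout and the "smaller Z is nearer" convention are assumptions made for this sketch; the listing does not state either.

```python
def pack_rgb555(r, g, b):
    """Pack 8-bit channels into 15 bits (assumed layout: R high, B low)."""
    return ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3)

def plot(color_buf, z_buf, i, rgb, z):
    """Write colour and depth only if the fragment passes the Z test."""
    if z_buf[i] > z:                 # assumed convention: smaller Z is nearer
        color_buf[i] = pack_rgb555(*rgb)
        z_buf[i] = z
        return True
    return False

color_buf = [0] * 4                  # tiny made-up span
z_buf = [1000] * 4
```

Usage: `plot(color_buf, z_buf, 0, (255, 0, 0), 500)` writes the pixel, and a later call at the same index with z=800 is rejected.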
> >>>>
> >>>> Any thoughts? ...
> >>>
> >>> See above.
> >> My design was partially a counter point where people were thinking:
> >> Yeah, RISC-V with the 'V' extension, totally makes sense as a GPU;
> >> Me thinking: That's probably gonna *suuuck*.
> >>
> > CPUs with general purpose ISA do suuuuckkkkk at graphics.
> Yeah.
>
> Though, they can be helped along with helper ops for pixel pack/unpack,
> compressed texture decode, and interpolation/blending operations.
<
Why do these as instructions ?? Why not do them as a CoProcessor
optimized for the task at hand ??
>
> I suspect this is partly how, despite being "not very fast", OpenGL
> rasterization gets nearly 3x the fill-rate per-clock-cycle when compared
> to running similar logic on my Ryzen...
>
>
> By most other metrics, the Ryzen beats it hard on per-clock-cycle
> performance (the Ryzen seems to get roughly 6x the performance,
> per-clock-cycle, for more general-purpose benchmarks).
> >>
> >> Well, unless one imagines the GPU as basically a glorified coprocessor
> >> for running TensorFlow or similar, it would probably work pretty good
> >> for this.
> >>
> >> But, for actual 3D rendering tasks, I suspect the 'V' extension is
> >> probably not the right tool for the job.
> > <
> > GPUs are designed the way they are so they don't suuuccckkk at graphics..
> Granted...
>
>
> Or if I were to do it for RISC-V, I would do it a bit different:
> First off, add scaled-index Load/Store;
> Add the relevant helper instructions;
> RGBA Pack/Unpack;
> Compressed texture stuff;
> ...
> SIMD, lots of SIMD.
<
SIMT, not SIMD.
>
>
> Streaming floating-point vectors to/from memory (with a stride) through
> the FPU via pipelining, that is something, just does not fit what is
> needed for a rasterizer.
<
Which is why GPUs are not vector processors.
>
> Doesn't matter how big the "vector registers" are internally, it isn't
> going to do what is needed for 3D rendering tasks.
>
>
> ... Like, do they just sorta think that the GPU is sitting around doing
> giant matrix multiplies ?...
>
> I don't really get it.
<
½ of the FP ops/sec in a real GPU are in the rasterizer, interpolator, and
texture units. The programmable engine has the other ½.
