novaBBS - comp.arch - Re: Self-debate: Extending the BJX2 GPR space (XGPR)?

On 6/6/2021 6:25 PM, MitchAlsup wrote:
> On Sunday, June 6, 2021 at 5:49:46 PM UTC-5, BGB wrote:
>> On 6/6/2021 4:28 PM, MitchAlsup wrote:
>>> On Sunday, June 6, 2021 at 4:11:56 PM UTC-5, BGB wrote:
>>>> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
>>>>
>>>> Then recently I started running into a few cases where having more GPRs
>>>> could be useful (mostly involving big complicated loops working with a
>>>> lot of parameters at the same time, such as tend to occur in my OpenGL
>>>> rasterizer).
>>> <
>>> As I stated in the other thread:: I consider a rasterizer and interpolator to be
>>> the kind of "function" that deserves dedicated HW. you feed it triangles and
>>> it feeds you vectors of pixels and coordinates. The GPUs build these things
>>> to produce 32 pixels (RGBA) and 32 sets of coordinates (XYZW) per cycle.
>>> This would be easy to scale down to 1 set per cycle.
>>> <
>>> Texture deserves its own dedicated HW, also.
>> Yeah, if you want to work more than one pixel at a time, the amount of
>> in-flight state can get kinda absurd.
>>
>>
>> When I looked into GPUs before, it seemed like a lot of them basically
>> boiled down to being VLIW+SIMD machines with large vectors and lower
>> clock speeds (relative to CPUs).
> <
> Not VLIW versus SIMD, it is SIMD plus dedicated wide hardware.
> <
> Imagine the width necessary to do 8 3-D texture lookups and interpolations
> per cycle !
> <
> These are not programmable devices, they are hardware programmed (fixed),
> deeply pipelined, units designed to feed the SIMD programmable kernels.

Yeah, I can't really do this...

In my case, the process might look more like:
Morton shuffle ST coord, A;
Add 1 pixel to ST.S and Morton Shuffle, B;
Add 1 pixel to ST.T and Morton Shuffle, C;
Add 1 pixel to ST.ST and Morton Shuffle, D;
Mask A, B, C, and D, by the texture-size mask;
Shift A/B/C/D right 4 bits, giving E/F/G/H;
Fetch Blocks E/F/G/H;
Extract Pixels from E/F/G/H (via compressed-texture instruction);
Interpolate E and F based on ST.S, E1;
Interpolate G and H based on ST.S, F1;
Interpolate E1 and F1 based on ST.T, D1;
Multiply D1 by the current modulation color;
Fetch Z Buffer pixel;
Compare Z step with fetched pixel (Less-Than);
Store D1 color and Z-step to destination if true;
Add various step values to the various coordinates;
...
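
As a rough sketch of the addressing and blending steps listed above (Python purely for illustration, not the actual ASM): this assumes integer texel coordinates, a square power-of-two texture stored as 4x4-texel blocks in Morton order, and 4 fractional bits for the blend weights. The names `morton_shuffle`, `texel_addresses`, `lerp`, and `bilinear` are made up for the example, and the final blend uses the T fraction as in ordinary bilinear filtering.

```python
def morton_shuffle(s: int, t: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of s and t (s lands in the even bits)."""
    out = 0
    for i in range(bits):
        out |= ((s >> i) & 1) << (2 * i)
        out |= ((t >> i) & 1) << (2 * i + 1)
    return out


def texel_addresses(s: int, t: int, size_log2: int):
    """Return (block_index, texel_in_block) for the 2x2 footprint A, B, C, D.

    Assumption: with 4x4 blocks in Morton order, the low 4 bits of the
    Morton index pick the texel inside a block, the rest pick the block.
    """
    mask = (1 << (2 * size_log2)) - 1            # texture-size mask
    corners = [(s, t), (s + 1, t), (s, t + 1), (s + 1, t + 1)]
    result = []
    for cs, ct in corners:
        m = morton_shuffle(cs, ct) & mask        # masking wraps at the edges
        result.append((m >> 4, m & 15))
    return result


def lerp(a: int, b: int, frac: int, frac_bits: int = 4) -> int:
    """Fixed-point linear interpolation between a and b."""
    return a + (((b - a) * frac) >> frac_bits)


def bilinear(p00: int, p10: int, p01: int, p11: int, frac_s: int, frac_t: int) -> int:
    """Blend along S twice, then blend the two results along T."""
    top = lerp(p00, p10, frac_s)
    bottom = lerp(p01, p11, frac_s)
    return lerp(top, bottom, frac_t)
```

Masking the Morton index (rather than S and T separately) wraps both coordinates with a single AND, which is presumably why the mask comes after the shuffle.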

Then structure the loop such that the fetch/interpolation/... for the
next pixel begins before the last pixel gets stored to the destination.

So, rasterizer function:
Set up variables;
Start doing first pixel;
Do main loop body;
Loop as long as more pixels remain;
Finish storing last pixel;
Cleanup / Return.
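
The loop shape above can be sketched roughly as follows (Python for illustration only). `fetch` and `store` are hypothetical stand-ins for the fetch/interpolate work and the Z-test/store work; the point is only the ordering, where the next pixel is started before the previous one is stored.

```python
def draw_span(fetch, store, count: int) -> None:
    """Software-pipelined span loop: fetch(i) returns a pixel, store(i, p) writes it."""
    if count <= 0:
        return
    pending = fetch(0)                 # prologue: start doing the first pixel
    for i in range(1, count):          # main loop body
        nxt = fetch(i)                 # begin the next pixel first...
        store(i - 1, pending)          # ...then store the previous one
        pending = nxt
    store(count - 1, pending)          # epilogue: finish storing the last pixel


# Small demonstration that records the order of operations.
events = []
stored = {}


def demo_fetch(i):
    events.append(("fetch", i))
    return i * 10


def demo_store(i, pixel):
    events.append(("store", i))
    stored[i] = pixel


draw_span(demo_fetch, demo_store, 3)
```

The cost of this structure is that both the "pending" and the "next" pixel state are live at once, which is where a fair bit of the register pressure comes from.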

This then exists as a bunch of large/unwieldy blobs of ASM, each
specialized for a particular combination of settings.

This stuff works, but as noted, can have a fairly large amount of
register pressure.

For some parts of the rasterization process, I ended up needing to spill
registers to memory, which isn't really ideal for this sort of thing.

There is a helper instruction which, given a compressed texture block
and an index, fetches a pixel (single cycle). Compressed texture blocks
are preferable (despite adding a few more instructions per fetch)
because they have a significantly lower cache-miss rate than
uncompressed textures.

I recently also figured out a way to do small-block compression which
works OK for audio.

16 samples in 32 bits:
6 bits, start point of a line (S.3.2 microfloat);
6 bits, end point of a line (S.3.2 microfloat);
4 bits, Line Sigma (RMSE or MAE, 3.1 microfloat);
16 bits, selector bits (1 bit/sample).

Extracting a sample involves:
Unpack values internally into PCM (12-bit);
Linearly interpolate between the start and end sample;
The sample position within the block is the interpolation index;
Add and subtract sigma from interpolated sample to get a min and max;
Use selector bit for this sample to select min or max.
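
A minimal sketch of the extraction steps above, in Python for illustration. Quite a lot here is assumed rather than taken from the actual format: the bit layout (start in bits 0..5, end in bits 6..11, sigma in bits 12..15, selectors in bits 16..31), the microfloat decoding (hidden bit, denormals at exponent 0), the `<< 2` scaling into a 12-bit range, dividing the position by 16, and a set selector bit meaning "max".

```python
def unpack_microfloat(code: int, exp_bits: int, mant_bits: int, signed: bool) -> int:
    """Decode [sign][exponent][mantissa] into an integer (assumed encoding)."""
    mant = code & ((1 << mant_bits) - 1)
    exp = (code >> mant_bits) & ((1 << exp_bits) - 1)
    sign = ((code >> (mant_bits + exp_bits)) & 1) if signed else 0
    if exp == 0:
        mag = mant                                   # denormal: no hidden bit
    else:
        mag = ((1 << mant_bits) | mant) << (exp - 1)
    return -mag if sign else mag


def decode_block32(block: int):
    """Decode 16 samples from one 32-bit block (hypothetical bit layout)."""
    start = unpack_microfloat(block & 0x3F, 3, 2, True) << 2
    end = unpack_microfloat((block >> 6) & 0x3F, 3, 2, True) << 2
    sigma = unpack_microfloat((block >> 12) & 0xF, 3, 1, False) << 2
    selectors = (block >> 16) & 0xFFFF
    samples = []
    for i in range(16):
        # The sample position within the block is the interpolation index.
        line = start + (((end - start) * i) >> 4)
        value = line + sigma if (selectors >> i) & 1 else line - sigma
        samples.append(max(-2048, min(2047, value)))  # clamp to 12-bit PCM
    return samples
```

Each sample needs only one multiply by a 4-bit index, a shift, and an add or subtract, which is what makes a 1 or 2 cycle hardware decode look plausible.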

A similar scheme for 16 samples in 64 bits was also defined (with bigger
minifloat values and 2-bit selectors), which gave better audio quality.

Where, 2-bit values were:
00: - 0.333 * Sigma
01: - 1.000 * Sigma
10: + 0.333 * Sigma
11: + 1.000 * Sigma

Endpoints were 8-bit (S.3.4), as in my tests there wasn't much advantage
going bigger than this (the loss from the interpolation was larger than
the loss from the minifloat format).
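
For illustration, the 2-bit selector table can be written as a small function (Python, names hypothetical). Reading the table, bit 1 acts as the sign (set = positive) and bit 0 picks the full step over the one-third step. Packing selector i into bits 2*i and 2*i+1 is an assumption, and floats are used here only to keep the sketch short.

```python
def selector_offset(sel: int, sigma: float) -> float:
    """Map a 2-bit selector to a signed offset, following the table above."""
    magnitude = sigma if (sel & 1) else 0.333 * sigma   # bit 0: 1.000 vs 0.333
    return magnitude if (sel & 2) else -magnitude       # bit 1: + vs -


def apply_selectors(line, selectors: int, sigma: float):
    """Offset each already-interpolated sample by its 2-bit selector.

    Assumption: selector i sits in bits 2*i and 2*i+1 of `selectors`.
    """
    return [v + selector_offset((selectors >> (2 * i)) & 3, sigma)
            for i, v in enumerate(line)]
```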

Here, the linear part deals with low-frequency components, whereas the
selector bits and sigma deal with high frequencies.

While this is generally worse (quality-wise) than ADPCM, small-block
schemes have tended to have difficulties in this area, so the threshold
here isn't so much "good" as "doesn't sound like broken garbage" and
"cheap enough to potentially be decoded in hardware in 1 or 2 clock cycles".

I considered this mostly for doing Module/Tracker style music playback,
as the PCM data for the files tends to be somewhat larger than the L2
cache and is prone to suffer from a relatively high L2 miss rate when
mixing audio samples.

Never mind that it is a little unclear whether this is needed, as the
theoretical bandwidth estimates for Mod playback are well within the DRAM
bandwidth limits of my CPU core, so dunno.

>>
>> My initial thoughts at how to scale these back to "something I could
>> actually implement" resulted in something not too far from where BJX2
>> was at the time, so it seemed like I could just adapt the CPU core to
>> also handle GPU tasks.
> <
> One channel of a Texture lookup and blending per cycle. Where there are up
> to 4 channels.

Note the "scaling things back" part:
I can't exactly do something with massively parallel texture fetch and
512-bit vectors and similar on this thing...

But, I could do 64-bit and 128-bit SIMD vectors...

I can't afford a dedicated texture fetch and interpolate unit, but I
could afford helper instructions to extract texel values from compressed
texture blocks, perform Morton shuffle, ...

>>
>>
>> Luckily, this part isn't too horrible at this at least.
>>>>
>>>>
>>>> Previously, I had an extension specifically for 128-bit SIMD ops, which
>>>> extended the register space. However, the only way to access these
>>>> registers was via 128-bit SIMD ops, MOV.X, or via a few other special
>>>> instructions (eg: MOV), severely limiting their utility.
>>> <
>>> Yes, the separate register file gets clunky at certain boundaries.
> <
>> R0..R31 were shared with GPR ops, but R32..R63 were in a sort of
>> "technically exist but can't be addressed" limbo.
>>
>> Given my ISA doesn't have a redundant set of operations for SIMD ops,
>> and the 64-bit SIMD ops were still limited to R0..R31, this "kinda sucked".
>>
>> The newer 'XGPR' is an attempt to remedy this situation.
>> Current leaning though is that XGPR encodings of 128-bit SIMD ops will
>> not be allowed.
> <
> The field to be investigated is deep and long.
>>>>
>
>>>>
>>> Stealing from the REX prefix in x86-64.
>> Stealing from REX, twice...
> <
> it is a good trick, steal away!
>>
>> The E field:
>> Eq: 'Quad' bit (rarely actually used this way);
>> En: Bit 4 of Rn
>> Em: Bit 4 of Rm
>> Eo: Bit 4 of Ro (or Bit 8 for Disp9 encodings).
>>
>> So, the m/n/o fields are 4-bit hex nybbles.
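
As a small illustration of the E-field scheme quoted above (Python, hypothetical helper names): each E bit supplies bit 4 of a register number whose low 4 bits come from the corresponding nybble, and for Disp9 encodings Eo supplies bit 8 of the displacement. Where the low 8 displacement bits live in the instruction word, and whether the result is sign-extended, is not stated here, so the second helper just returns the raw 9-bit value.

```python
def reg_index(e_bit: int, nybble: int) -> int:
    """Combine an E bit (bit 4) with a 4-bit register nybble into R0..R31."""
    return ((e_bit & 1) << 4) | (nybble & 0xF)


def disp9_raw(eo_bit: int, low8: int) -> int:
    """Combine Eo (bit 8) with 8 low displacement bits; raw, not sign-extended."""
    return ((eo_bit & 1) << 8) | (low8 & 0xFF)
```
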
> <snip>
>>>> Well, or potentially, I use the existing XGPR encoding for unconditional
>>>> ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
>>>> the predicated ops may not be bundled (grr...).
>>>>
>>>> And, all this is a bit more messy than I would prefer.
>>> <
>>> As far as the 3 dimensions of expansion are concerned (separate files, bigger
>>> register specifiers and the encoding mess, or dedicated HW) I prefer
>>> the dedicated HW approach.
>> OK.
>>>>
>>>>
>>>> ...
>>>>
>>>>
>>>> Any thoughts?...
>>> <
>>> You found all the options, you get to grind through the paths towards
>>> reasonable solutions.
>>>
>> It is at the moment a question of whether to try to expand this out and try to make
>> it semi-orthogonal in some way, or leave a lot of this more as a "break
>> glass in case of emergency" thing and otherwise mostly ignore that it
>> exists.
>>
>> XGPR, as it exists, is probably sufficient for what it is.
>> The lack of orthogonality here would more matter for the codegen than it
>> does for ASM coding.
>>
>> ...

Subject	Replies	Author
Self-debate: Extending the BJX2 GPR space (XGPR)? By: BGB on Sun, 6 Jun 2021	11	BGB
