Re: RISC-V U74 on Starfive Visionfive V1

From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V U74 on Starfive Visionfive V1
Date: Sun, 27 Feb 2022 21:48:02 -0600
Message-ID: <svhglj$ku4$1@dont-email.me>
In-Reply-To: <87c1bfeb-984d-46aa-89d6-8b269dc82aa9n@googlegroups.com>

On 2/27/2022 6:06 PM, MitchAlsup wrote:
> On Sunday, February 27, 2022 at 4:59:55 PM UTC-6, BGB wrote:
>> On 2/27/2022 8:36 AM, Anton Ertl wrote:
>>>
>>> Reading up on it, yes it did have hardware BitBLT indeed. And it
>>> seems that XFree86 used these acceleration features. But still, it
>>> had a 32-bit memory interface with whatever slow memory was current in
>>> 1993; the U74 should be able to run rings around it when blitting
>>> (even though its DRAM controller could do better).
>>>
>>>> Now, your RISC-V system is probably trying to do all these things through
>>>> a software emulation of OpenGL.
>>>
>>> That may be the reason for the sluggishness.
>>>
>> Also, the design of RISC-V seems like it would "kinda suck" for OpenGL
>> emulation performance (at least, limited to the parts of the ISA I am
>> currently aware of).
> <
> Gomer Pyle would be proud........
> <

Not sure how to interpret this.

But, short of adding a bunch of specialized SIMD operations, indexed
load/store, ..., it seems like a bit of a stretch that RISC-V would be
particularly well suited to OpenGL emulation.

....

>>>> Application software libraries now do a lot more in the client and often
>>>> just send bitmaps to the server to display.
>>>
>>> That should be easy for the server. But of course if it first has to
>>> do some OpenGL scaling etc., that could be slow on a pure software
>>> implementation of OpenGL.
>>>
>> All the fancy "Compositing Window Manager" stuff is kinda pointless
>> fluff IMO.
>>
>> There hasn't really been any practical advancement in these areas much
>> past Win2K and similar.
>>
>> Nor is there much one can really do to move things forward, at least
>> given that full 3D VR UIs, or desktop backgrounds with spinning 3D
>> models, are mostly something that didn't make it much past 80s and 90s
>> era scifi movies (there not being much point IRL in having a spinning
>> Utah Teapot or similar as one's desktop background).
>>
>> Scifi gives us needless spinning stuff, and people doing surfing motions
>> while wearing gloves and a VR helmet, and also presumably they find some
>> way to avoid the users invariably getting motion sickness as a result of
>> all this.
>>
> I, personally, cannot stay on an internet page that uses moving advertisements.
> I either turn off JavaScript or leave.

Yeah. I also have the issue that I seem to be fairly susceptible to
motion sickness. I can't really play games for more than short bursts,
and watching videos of gameplay also tends to cause issues.

For VR, the motion sickness is almost immediate. I can deal with 3D
glasses a little better, though pretty much no one is developing
home-use "make 3D glasses not suck" technology.

>>
>> Reality gives us translucent border effects and lag...
>>
>> Like, oh wow, the window border looks like frosted glass, and now the
>> whole OS is lagging trying to deliver this effect until the user can
>> find the option in Control Panel to turn it off.
>>
>>
>> OS Updates:
>> We're gonna switch your Power-Button setting back from Hibernate to
>> Sleep, turn GUI effects back on, ...
>> Also random "Your battery is Low, Plug in your Device soon."
>> notifications, me, "FFS Windows, I am running a Ye Olde Desktop PC.
>> There is no battery in this thing."
>>>> The RISC-V market needs a better story on how they are going to handle
>>>> graphics,
>>>
>>> Only if it wants to make inroads in the desktop and mobile markets.
>>> For a server, there are other places that need to be addressed first
>>> (memory controller, I/O). They also need to fix these for the
>>> desktop.
>>>
>>>> trying to use PowerVR for the GPU when people will be mostly
>>>> using Linux seems wrong to me.
>>>
>>> Yes.
>>>
>> Makes me half wonder what the "minimum viable GPU" could be.
>>
>> I guess the main task it would have would be:
>> Ability to copy pixels from one place in VRAM to another;
>> Ability to copy pixels while applying modulation or blending.
>> Ability to perform a texture fetch and ST stepping (3D / Advanced 2D)
>> Ability to walk edges of a primitive triangle or quad (3D)
> <
> Minimum viable GPU is the backend == Rendering. The above is a list
> of things the backend can do. Blitting is simply 'texture' to triangles
> normal to the viewing surface. Blending is simply sub-features of
> 'texture'.

I was figuring it could plausibly have special cases for operations
which function like a memcpy or rectangular blit (common in GUI tasks).

>>
>> Ideally, it should be able to do these:
>> Faster than doing them on the CPU;
>> Cheaper than throwing a specialized CPU core at it.
>> (Eg: CPU core with ISA tweaks for common GPU tasks).
> <
> It can do these faster than CPU only when the number of things requested
> takes GPU way more time than the latency of getting started and stopped.
> {Generally in the 1,000-10,000 triangles range.} GPUs need embarrassing
> levels of parallelism to work well.

This likely depends on the type of GPU.
I was not imagining anything like a modern GPU.

>>
>> Seems like one could do something like this with a few specialized
>> caches and registers:
>> Framebuffer Cache(s), represent logical screen pixels Read/Write;
>> Texture Cache, holds texture blocks, Read-Only.
> <
> It might be time to integrate 'texture' into CPU memory reference pipeline
> and integrate access into memory reference instructions.

This is one possibility.

The Texture-Cache could expose something resembling a memory port,
except that it has the ability during fetch to be indexed using texture
coordinates (rather than a linear index) with a mask based on the size
of the texture, and the ability to decode block texture compression. It
would return the result as a color vector.

It would probably exclude bilinear interpolation or similar, since this
would likely be too expensive to be performed directly within a single
fetch (main issue being to deal with fetches which cross the edge of a
texture block).

But, say we have an op like:
  LDUTEX (R14, R8, 0), R4   //Fetch first pixel  (S+0,T+0)
  LDUTEX (R14, R8, 1), R5   //Fetch second pixel (S+1,T+0)
  LDUTEX (R14, R8, 2), R6   //Fetch third pixel  (S+0,T+1)
  LDUTEX (R14, R8, 3), R7   //Fetch fourth pixel (S+1,T+1)
  (Pad Cycle)               //Do something else here, *
  (Pad Cycle)               //Do something else here
  BILERPX R4, R6, R8, R10   //Bilinear Interpolate (4R)
  ...
*: Could do math here, or fetch the value from the Z-Buffer, ...
Need to do something here so BILERPX doesn't interlock on the loads.

With R14 giving the texture descriptor:
  (47:0) : Address of Texture
  (59:48): Encodes the format and size of the texture
    (59:56): Pixel Format
    (55:52): Y Size (log2)
    (51:48): X Size (log2)
R8 giving the ST coords (2x 16.16 fixed-point).
The displacement encodes a pixel displacement:
  0..3: Which corner of an interpolated pixel.
  4..7: Round-Nearest(?)

These instructions would fall within the scope of what BJX2 can
currently encode in the Op64 and Op40x2 encodings.
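
As a rough C sketch of what the descriptor decode amounts to (the
struct and function names are invented here for illustration; nothing
defined by BJX2):

  #include <stdint.h>

  /* Hypothetical decode of the texture-descriptor register (R14 above). */
  typedef struct {
      uint64_t addr;      /* (47:0)  base address of the texture */
      int      fmt;       /* (59:56) pixel format                */
      int      ysz_log2;  /* (55:52) Y size, log2                */
      int      xsz_log2;  /* (51:48) X size, log2                */
  } tex_desc;

  static tex_desc decode_tex(uint64_t r)
  {
      tex_desc d;
      d.addr     = r & 0x0000FFFFFFFFFFFFull;
      d.fmt      = (int)((r >> 56) & 15);
      d.ysz_log2 = (int)((r >> 52) & 15);
      d.xsz_log2 = (int)((r >> 48) & 15);
      return d;
  }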

Though, the cheaper option here would be to only support square
Morton-order textures (my existing TKRA GL is primarily using Morton
ordering, and non-square textures are uncommon).
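
For reference, the Morton index for a square power-of-2 texture is just
the S and T bits interleaved; a minimal sketch in C (not implying this
is TKRA GL's actual code):

  #include <stdint.h>

  /* Spread the low 16 bits of x so a zero bit follows each one. */
  static uint32_t part1by1(uint32_t x)
  {
      x &= 0x0000FFFF;
      x = (x | (x << 8)) & 0x00FF00FF;
      x = (x | (x << 4)) & 0x0F0F0F0F;
      x = (x | (x << 2)) & 0x33333333;
      x = (x | (x << 1)) & 0x55555555;
      return x;
  }

  /* Texel (s,t) of a 2^n x 2^n texture lives at base + morton2(s,t). */
  static uint32_t morton2(uint32_t s, uint32_t t)
  {
      return part1by1(s) | (part1by1(t) << 1);
  }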

The BILERPX would ignore the "integer part" of the ST coords (with the
texture fetches effectively doing a "truncate and mask" fetch).
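
Roughly what BILERPX would compute, written out per channel in C
(taking 8 fraction bits from the 16.16 coords; the exact precision is
my guess):

  /* Blend the four fetched texels with the S/T fractions (0..255),
     one color channel at a time: lerp across S, then across T.     */
  static int bilerp_channel(int p00, int p10, int p01, int p11,
                            int sfrac, int tfrac)
  {
      int top = p00 + (((p10 - p00) * sfrac) >> 8);
      int bot = p01 + (((p11 - p01) * sfrac) >> 8);
      return top + (((bot - top) * tfrac) >> 8);
  }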

Unclear here is the timing cost of sticking a block-texture decode on
the tail-end of a Load operation.

The Framebuffer Cache would basically be a normal L1 D-Cache.
The main special feature would be built-in pixel repacking.

Say, for example, during drawing, pixels are represented as a 64-bit
vector format (aaaa-rrrr-gggg-bbbb), Texture Fetch returns pixels in
this format, and Framebuffer Store repacks these into RGBA32 or RGB555
or RGBA4444 or similar.
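
A sketch of the repack step in C, assuming 16 bits per channel within
the 64-bit vector (the channel width is my assumption):

  #include <stdint.h>

  /* Repack a 64-bit aaaa-rrrr-gggg-bbbb pixel down to RGB555 on store,
     keeping the top 5 bits of each 16-bit channel (alpha dropped).    */
  static uint16_t repack_rgb555(uint64_t px)
  {
      unsigned r = (px >> 32) & 0xFFFF;
      unsigned g = (px >> 16) & 0xFFFF;
      unsigned b =  px        & 0xFFFF;
      return (uint16_t)(((r >> 11) << 10) | ((g >> 11) << 5) | (b >> 11));
  }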

It is possible Framebuffer ops could also include raster-indexed
Load/Store, likely with a predefined table of Y strides, *, ...

*: 256/384/512/768/... This avoids the need for a multiplier, while
not wasting as much memory as padding the size up to a power of 2 (so,
320x200->384x200, 640x480->768x480, ...).
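
Each allowed stride being 2^n or 3*2^(n-1), the row address reduces to
a shift or a shift-add; a minimal sketch:

  #include <stdint.h>

  /* Row base (in pixels) for a few allowed strides; a shift-add
     replaces the general multiply.                              */
  static uint32_t row_base(uint32_t y, int stride_sel)
  {
      switch (stride_sel) {
      case 0: return  y << 8;              /* 256           */
      case 1: return (y << 8) + (y << 7);  /* 384 = 256+128 */
      case 2: return  y << 9;              /* 512           */
      case 3: return (y << 9) + (y << 8);  /* 768 = 512+256 */
      default: return 0;
      }
  }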

At least compared with BJX2 as-is, these could save a decent number of
clock-cycles during 3D rendering tasks.

>>
>> Would need a bunch of registers to represent the edge-walking state of
>> the primitive being drawn.
> <
> Are you talking about the rasterizer and interpolator ?

I was half imagining the raster-drawing loops themselves being moved
into dedicated hardware.

But, yeah, for rasterization, there are typically two levels of loops:
  Walk down the primitive in the Y axis, stepping along each edge;
    One interpolates by stepping a fixed amount each scanline.
  Step from the left edge to the right edge (X axis).
    Texel fetch, bilinear interpolation, modulation, ... goes here.
    Interpolation also happens via fixed steps.

This is assuming the use of affine texturing.
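
In C form, the general shape of those two loops looks something like
this (a skeleton only: names are invented, and clipping, fill rules,
Z-buffering, etc. are omitted):

  #include <stdint.h>

  extern uint16_t fetch_texel(int32_t s, int32_t t);  /* placeholder */

  /* Per-triangle-half state; roughly the registers the loops eat. */
  typedef struct {
      int32_t y_top, y_bot;          /* scanline range                 */
      int32_t x_l, x_r;              /* edge X positions, 16.16        */
      int32_t dxdy_l, dxdy_r;        /* per-scanline edge steps        */
      int32_t s_l, t_l, s_r, t_r;    /* texcoords at both edges, 16.16 */
      int32_t dsdy_l, dtdy_l, dsdy_r, dtdy_r;
  } edge_state;

  void draw_half(uint16_t *fb, int stride, edge_state *e)
  {
      for (int32_t y = e->y_top; y < e->y_bot; y++) {
          int32_t xs = e->x_l >> 16, xe = e->x_r >> 16;
          int32_t s = e->s_l, t = e->t_l, ds = 0, dt = 0;
          if (xe > xs) {                     /* per-span fixed steps */
              ds = (e->s_r - e->s_l) / (xe - xs);
              dt = (e->t_r - e->t_l) / (xe - xs);
          }
          for (int32_t x = xs; x < xe; x++) {
              fb[y * stride + x] = fetch_texel(s, t);  /* modulate etc. */
              s += ds; t += dt;
          }
          e->x_l += e->dxdy_l; e->x_r += e->dxdy_r;    /* walk both edges */
          e->s_l += e->dsdy_l; e->t_l += e->dtdy_l;
          e->s_r += e->dsdy_r; e->t_r += e->dtdy_r;
      }
  }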

This stuff eats enough registers that it was one of my major use cases
for expanding the GPR space to 64 GPRs.

But, say, rather than needing to use loops and draw pixels on the CPU,
one throws ~ 500B worth of state at a specialized coprocessor (via MMIO
or similar), which causes a triangle to (asynchronously) appear within a
framebuffer. The coprocessor is then drip-fed triangles as each completes.

Or, sort of vaguely like an S3 ViRGE or similar...
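
As a sketch of what the submission side might look like (every address,
offset, and flag here is invented for illustration):

  #include <stdint.h>

  #define GPU_MMIO_BASE  0xF0100000u   /* made-up device address    */
  #define GPU_REG_STATUS 127           /* made-up status word index */
  #define GPU_STAT_BUSY  1u

  static volatile uint32_t *const gpu = (volatile uint32_t *)GPU_MMIO_BASE;

  /* Drip-feed: wait out the previous triangle, copy in the ~500B of
     setup state, then kick off the (asynchronous) draw.            */
  void submit_triangle(const uint32_t *state, int nwords)
  {
      while (gpu[GPU_REG_STATUS] & GPU_STAT_BUSY)
          ;
      for (int i = 0; i < nwords; i++)
          gpu[i] = state[i];
      gpu[GPU_REG_STATUS] = GPU_STAT_BUSY;
  }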

Well, this, or using a BJX2 core as a GPU. That was previously the
idea, but it requires multi-threading, and is partly derailed by my
BJX2 core currently being too heavyweight to go multi-core on the
XC7A100T.

Using a BJX2 core as a GPU could make sense though, given it already
seems to be doing moderately well at this task (and this was originally
how I intended TKRA GL to work, rather than running it as a single
thread as I am currently doing).

However, if I had a "Nexys Video" or similar (XC7A200T), as-is I could
probably go quad-core; but, alas... (I don't have the money for this,
the board being ~ $500 ...).

Also, using a BJX2 core as a GPU could allow for shaders, assuming a
GLSL compiler were written; or, use some cheesy hack, like compiling
the shaders to native code beforehand via BGBCC or similar (one would
then need to add a frontend to compile GLSL, or write a GLSL->C
translator).

Though, the OpenGL API doesn't really make any provision for this.
Also the naive strategy (internally calling a GLSL fragment shader
per-pixel via a function pointer, passing/returning state via global
variables) would likely still perform kinda like crap (ideally, one
would want the fragment shader to be inlined within the body of the
span-drawing loop, ...).
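
For illustration, the naive strategy amounts to something like the
following (all names invented), where the indirect call and global
traffic per pixel are exactly what inlining would eliminate:

  #include <stdint.h>

  /* Shader inputs/outputs passed via globals (the naive approach). */
  int32_t  gl_frag_s, gl_frag_t;
  uint32_t gl_frag_color;

  void (*frag_shader)(void);   /* entry point of the compiled shader */

  /* One indirect call per fragment, inside the span-drawing loop. */
  void draw_span(uint32_t *dst, int n,
                 int32_t s, int32_t t, int32_t ds, int32_t dt)
  {
      for (int i = 0; i < n; i++) {
          gl_frag_s = s; gl_frag_t = t;
          frag_shader();           /* call overhead on every pixel */
          dst[i] = gl_frag_color;
          s += ds; t += dt;
      }
  }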

>>
>> Maybe the GPU throws interrupts at whatever core is managing the
>> geometry transform?... It is that or use polling. Doing a GPU which runs
>> the full transform seems like a harder problem than one which runs a
>> rasterizer.
>>
>> These would come with a big obvious drawback: They would preclude the
>> use of shaders.
>>
>>
>>
>> The "able to run shaders" requirement would likely turn the "cheapest"
>> option into throwing one or more specialized CPU cores at it.
>>
>> Maybe the GPU cores could run RISC-V, but RISC-V would still kinda suck
>> as the basis for a GPU ISA.
>>> The discussion about Blitting inspired me to check out write bandwidth:
>>>
>>> Here we malloc() 1GB and use memset() to fill it with 1s:
>>>
>>> [fedora-starfive:~/nfstmp/gforth-riscv:76808] time gforth-fast -e "1000000000 allocate throw 1000000000 1 fill bye"
>>>
>>> real 0m6.362s
>>> user 0m0.243s
>>> sys 0m6.095s
>>>
>>> Funny user/system balance. My explanation is that the system COWs
>>> each page on the first write access, and then that page is in the
>>> cache, and the memset runs fast (that's why we see such a low user
>>> time) up to the next page fault.
>>>
>>> To reduce the system overhead, now write the 1GB by allocating a 10MB
>>> block and memset()ting it 100 times, with values 0-99:
>>>
>>> [fedora-starfive:~/nfstmp/gforth-riscv:76809] time gforth-fast -e "10000000 allocate throw constant a : foo 100 0 do a 10000000 i fill loop ; foo bye"
>>>
>>> real 0m4.055s
>>> user 0m3.902s
>>> sys 0m0.126s
>>>
>>> So it seems that the U74 writes at about 250MB/s. That's not great,
>>> but it probably exceeds the CL5428 by a good margin.
>>>
>> Well, it is a lot faster at this than my BJX2 core running on a Nexys A7
>> at least...
>>
>>
>>> - anton
