"What I've done, of course, is total garbage." -- R. Willard, Pure Math 430a

On 5/15/2022 4:15 PM, MitchAlsup wrote:
> On Sunday, May 15, 2022 at 5:26:11 PM UTC-5, BGB wrote:
>> On 5/15/2022 3:40 PM, MitchAlsup wrote:
>
>>>> I am currently in the "everything goes in GPRs camp":
>>>> If one has 64-bit GPRs, then pretty much everything can go into them;
>>>> If one needs 128-bit SIMD, one can use two GPRs per SIMD vector.
>>> <
>>> My 66000 does all SIMD stuff as vectorized loops. No need for a
>>> SIMD register file, either.
>> No SIMD file in my case, only GPRs.
>>
>> SIMD operations exist, in one of several "flavors":
>> Those that operate on 64 bits, and use a single GPR;
>> Those that operate on 128-bits, and use a pair.
> <
> Yes, but I am looking both forwards and backwards at the same time.
> I have SIMD of 64×2^k for some integer k determined by the implementation.
> So, on lowest end machines, VVM runs 1 iteration per cycle, in middle range
> implementations VVM runs 2 or 4 iterations per cycle, and in higher end
> implementations, VVM runs 4-8-16 iterations per cycle.
> <
> Especially note: there are 0 SIMD OpCodes in the ISA.
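
To make the 64×2^k point concrete, here is a rough sketch of the arithmetic. It assumes one loop iteration per 64-bit lane and ignores loop startup and memory stalls. The function names are mine, not anything from My 66000.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: data-path width in bits for implementation parameter k. */
uint64_t simd_width_bits(unsigned k) {
    assert(k < 16);                      /* keep the shift well-defined */
    return (uint64_t)64 << k;
}

/* Idealized cycle count for a loop of n iterations when the hardware
   runs 2^k iterations per cycle (assumption: one iteration per 64-bit lane). */
uint64_t ideal_loop_cycles(uint64_t n, unsigned k) {
    assert(k < 16);
    uint64_t per_cycle = (uint64_t)1 << k;
    return (n + per_cycle - 1) / per_cycle;   /* ceiling division */
}
```

The same loop (same n, same code) just gets a different k on a wider machine.
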
>>
>>
>> Which in turn has other effects:
>> Operations on 64-bit vectors can in many cases be bundled;
>> Operations on 128-bit vectors can't be bundled.
> <
> This is the static look I mentioned above. You want an architecture that at the
> lowest end runs 1 instruction per cycle throughout the pipeline. At the other
> end you want the pipeline to run 6-8-10 instructions per cycle--ALL FROM THE
> SAME instruction stream.
>>
>> Mostly because the latter cases eat multiple lanes.
> <
> As Ivan would say:: Fix It !
>>>>
>>>>
>>>> This is even with 'int' and friends still being 32-bit, where most older
>>>> code doesn't make particularly effective use of an ISA which has 64-bit
>>>> registers.
>>> <
>>> # define int int64_t
>> One could do this...
>>
>> Wouldn't achieve much for most older code beyond wasting memory and
>> probably causing it to no longer work correctly.
>>>>
>>>> One could potentially argue for limiting 'int' operations to 32 bits,
>>>> ignoring the high half of the register in these cases, but this would be
>>>> "kinda lame".
>>>>
>>>> One could potentially have such an ISA which natively implements
>>>> arithmetic in a SIMD like manner, but then turns 64-bit integer ops into
>>>> a multi-instruction sequence (with manual carry propagation).
>>>>
>>> Vectorization, instead.
>> I was thinking where instead of having a 64-bit ADD, one used a 2x
>> 32-bit ADD, and then had additional instructions to update the high-word
>> based on the preceding instructions' result in the low-word.
> <
> And this is used "how often" ???
>>
>> But, yes, from an ISA design POV this would be worse than just having a
>> 64-bit ADD instruction.
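
For reference, the 2x 32-bit ADD plus carry fix-up idea looks roughly like this in C. This is only a sketch of the general technique. The function name is made up, and a real ISA would do the fix-up with dedicated instructions rather than a compare.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit add built from two independent 32-bit adds plus an explicit
   carry fix-up on the high word (manual carry propagation). */
uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo = a_lo + b_lo;        /* low halves, wraps mod 2^32 */
    uint32_t hi = a_hi + b_hi;        /* high halves, computed independently */
    uint32_t carry = (lo < a_lo);     /* 1 if the low add wrapped around */
    hi += carry;                      /* update high word from low-word result */

    uint64_t r = ((uint64_t)hi << 32) | lo;
    assert(r == a + b);               /* sanity check against the native add */
    return r;
}
```
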
>>
>>
>> Meanwhile: IRL, I ended up going the other direction and having a
>> 128-bit ADDX instruction (sometimes useful).
>>>>
>>>>
>>>> But, at this point I would probably still prefer an ISA with a single
>>>> larger register space (and a few asinine special cases) to the
>>>> traditional 3-way split between GPRs, FPU, and SIMD registers.
>>>>
>>>>
>>>> Though, code which works with a lot of 128-bit vectors might still push
>>>> the limits of a 32 GPR design when each vector uses 2 GPRs.
>>>>
>>>> One could argue for one of:
>>>> One has 64 GPRs, which is awkward for encoding reasons;
>>>> One has 32 GPRs, leading to register-pressure issues with 128b SIMD;
>>> <
>>> Not with VVM. PLUS: when you widen up the machine capabilities
>>> you don't need new register resources, nor do you even need to
>>> change the code to access these new wider data paths.
>>> <
>> Doing SIMD the way I did it seemed like the simplest/cheapest option.
> <
> With no (== little) look to future implementations, where you have 10× the
> resources you have today.........
>>
>> No current plans to go beyond 128 bits.
>>>> One has 32 regs, with half the space being "SIMD only".
>>>>
>>>>
>>>> In the latter case, say:
>>>> X0 -> R1:R0
>>>> X2 -> R3:R2
>>>> ...
>>>> X30 -> R31:R30
>>> <
>>> Not sure you want the HW to allow using the SP and/or FP as SIMD registers
>>> (assuming you want SIMD registers)
> <
>> A few of the registers are basically "undefined" if accessed as 128-bit:
>> (R1:R0), May be encoded, but effectively undefined at present;
>> (R15:R14), Contains SP, also undefined.
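
A small sketch of the pairing described above (Xn -> R(n+1):R(n)), with the two pairs listed as undefined flagged. The struct and function names are hypothetical. This only mirrors the mapping as stated in the post, not the actual decoder logic.

```c
#include <assert.h>
#include <stdbool.h>

struct gpr_pair { int lo; int hi; };

/* Map an even-numbered 128-bit register Xn onto the GPR pair R(n+1):R(n). */
struct gpr_pair xreg_to_pair(int n) {
    assert(n >= 0 && n <= 30 && (n & 1) == 0);   /* X0, X2, ..., X30 only */
    struct gpr_pair p = { n, n + 1 };
    return p;
}

/* Pairs the post lists as undefined: R1:R0, and R15:R14 (contains SP). */
bool xreg_is_undefined(int n) {
    return n == 0 || n == 14;
}
```
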
> <
> Strange place to put SP and FP........
>>
>> If the emulator sees these cases, it will turn the instruction into a
>> breakpoint.
>>
>> If the CPU core encounters this, it will likely access the GPR space
>> "behind" these registers, as the way the SPRs are implemented is as
>> special register IDs that are overlaid on top of the GPR space by the
>> decoder, and the logic for SIMD registers would basically side-step this
>> remapping.
>>
>> In the RISC-V mode, similar remapping is also done, though a few of the
>> registers are mapped to different locations.
>>> <
>>>> X1,X3,...: Only existing as 128b SIMD registers.
>>>>
>>>> But, this kinda sucks, as one can no longer use any operation on any
>>>> register.
>>> <
>>> kinda is a serious understatement.
>> This is why I later reworked this into the XGPR extension...
>>
>> There are some cracks in the design, but being able to (more or less)
>> access all of the registers in a consistent way is much preferable to
>> having asymmetric access to half of the register space by only certain
>> types of instructions.
>>
>> If XGPR is not enabled, we can assume that R32..R63 do not exist (and
>> the low bit of the register for 128-bit operations is "Must Be Zero").
> <
> What do you do for code compiled assuming XGPR exists and you want to
> run on THIS implementation ??
>>>>
>>>> I was kinda doing it this way initially, but then later added some
>>>> encoding hacks to allow much of the rest of the ISA to have access to
>>>> these registers.
>
>>>> Ironically, saving and using more than the "locally optimal" number of
>>>> registers for a given function will tend to actually make performance
>>>> worse, since any potential savings in terms of "fewer register spills"
>>>> is offset by the cost of saving/restoring more registers in the function
>>>> prolog and epilog (and, saving/restoring registers but then never using
>>>> them, is counter-productive).
>>> <
>>> When the save/restore is allowed to use full cache access width (while
>>> normal LDs and STs only use data-path widths) prologue and epilogue
>>> sequences are faster than register spills.
>> Probably true.
>>
>> I was mostly using Load/Store pair (which operate 128 bits at a time).
>>
>> But, a register saved/restored but not used, still costs more than not
>> saving/restoring the register.
> <
> Certainly, but that is why the prologues and epilogues are not canned,
> the compiler has a choice on how many get saved and how many get
> restored--all cleverly orchestrated with ABI rules in mind.
> <
> The compiler can save as many or as few registers between R16 and R29
> as it desires, can save and update FP as desired, can save and restore SP
> if desired (seldom) or just update and backdate SP as desired. If desired
> Registers R1-R8 (arguments) can be saved--and here they get concatenated
> with the stack passed arguments for ease of varargs.
> <
> So, the compiler gets to choose how many and if.
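
As I read it, the varargs part amounts to the following. This is a sketch with my own names, and it assumes the saved R1-R8 end up directly ahead of the stack-passed arguments in one contiguous area; the real ABI layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_ARG_REGS = 8 };   /* R1..R8 carry arguments, per the post */

/* Build one contiguous argument area: saved register arguments first,
   then the stack-passed arguments (layout order is an assumption). */
size_t build_arg_area(const uint64_t regs[NUM_ARG_REGS],
                      const uint64_t *stack_args, size_t n_stack,
                      uint64_t *area, size_t area_len) {
    assert(area_len >= NUM_ARG_REGS + n_stack);
    for (size_t i = 0; i < NUM_ARG_REGS; i++)
        area[i] = regs[i];
    for (size_t i = 0; i < n_stack; i++)
        area[NUM_ARG_REGS + i] = stack_args[i];
    return NUM_ARG_REGS + n_stack;
}

/* With one contiguous area, fetching argument i is a plain indexed load,
   whether it arrived in a register or on the stack. */
uint64_t nth_arg(const uint64_t *area, size_t len, size_t i) {
    assert(i < len);
    return area[i];
}
```
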
>>
>>
>> But, "getting it right" isn't necessarily a given, if one assumes a
>> register allocator which tries to "round robin" the register allocation
>> in an attempt to increase usable ILP.
> <
> Agreed, it took Brian a "while" to get My 66000 ABI fully integrated in his
> LLVM port. But after he did, the code looks fabulous.
>>
>>
>> So, ended up using heuristics to divide up the register space, and
>> enabling parts of the register space based on register pressure.
>>
>> So, say:
>> R8 ..R14: Always enabled
>> R24..R31: Enabled for high-pressure functions.
>> R4 ..R7 : Enabled for leaf functions.
>> R18..R23: Enabled for high-pressure leaf functions.
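
If I follow the table above, the heuristic boils down to something like this. It is a sketch only: the function names are invented, and I am assuming the classes are cumulative (a high-pressure leaf function gets all four ranges).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bitmask with bits first..last set, one bit per GPR. */
uint32_t reg_range(unsigned first, unsigned last) {
    assert(first <= last && last < 32);
    uint32_t mask = 0;
    for (unsigned r = first; r <= last; r++)
        mask |= (uint32_t)1 << r;
    return mask;
}

/* Registers the allocator may use, following the ranges in the post. */
uint32_t allocatable_regs(bool high_pressure, bool leaf) {
    uint32_t mask = reg_range(8, 14);                           /* always enabled */
    if (high_pressure)         mask |= reg_range(24, 31);       /* high pressure */
    if (leaf)                  mask |= reg_range(4, 7);         /* leaf functions */
    if (high_pressure && leaf) mask |= reg_range(18, 23);       /* both */
    return mask;
}
```
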
> <
> Not sure what you are getting at here.
>>
>> If I enabled R32..R63 in the main C ABI:
>> R40..R47: Enabled for very high register pressure.
>> R56..R63: Enabled for very high register pressure.
>> R32..R39: Enabled for very high pressure leaf functions.
>> R48..R55: Enabled for very high pressure leaf functions.
>>
> <
> With only 32 registers (1 being SP and 1 optionally being FP) I am finding
> very little spill/fill codes, so whatever LLVM did and whatever Brian's
> My 66000 port does, I don't seem to run into the mentioned problems.

Idle curiosity: what do you do for a function that has more arguments
than registers?

Also: how do you do VARARGS?

Subject	Replies	Author
Mixed EGU/EGO floating-point By: Quadibloc on Fri, 13 May 2022	116	Quadibloc