novaBBS - comp.arch - Re: Self-debate: Extending the BJX2 GPR space (XGPR)?

For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).

Then recently I started running into a few cases where having more GPRs
could be useful (mostly involving big complicated loops working with a
lot of parameters at the same time, such as tend to occur in my OpenGL
rasterizer).

Previously, I had an extension specifically for 128-bit SIMD ops, which
extended the register space. However, the only way to access these
registers was via 128-bit SIMD ops, MOV.X, or via a few other special
instructions (eg: MOV), severely limiting their utility.

I had more recently started floating ideas for a more general set of
extensions, which would allow a larger part of the ISA to use them.

I have implemented an expanded subset, which adds support for R0..R63
for normal instructions.

As currently implemented:
7wnm-ZeoZ (3R / 2R)
9wnm-Zeii (3RI / 2RI, Ld/St Disp9)

Which reuses the space I had previously used for 24-bit instructions,
since 24-bit instructions are "basically useless" on a
performance-oriented core, and a microcontroller is rather unlikely to
need XGPR (actually, even Op24 is debatable, given that its effect on
code density is fairly underwhelming for how much hair it adds).

I am still not particularly happy with this encoding:
It breaks the established pattern for instruction lengths;
Can only encode a subset of the ISA;
It can't encode predicated instructions.

However:
It retains binary compatibility with existing code;
Can be used in WEX bundles in the same way as Ezzz/Fzzz encodings.

Where 'W':
Ww: WEX bit;
Wn: Bit 5 of Rn
Wm: Bit 5 of Rm
Wo: Bit 5 of Ro (or an F2/F1 selector)

Another considered alternative could be using Op48 space, however:
If I wanted to support WEX with them, the bundle mechanism would
effectively need to be redesigned (and probably a lot more expensive).

Both Op48 and Op64 encodings, as-is, would preclude using them in WEX
bundles.

Yet another possibility, I could add a mode flag and then effectively
eat the entire 16-bit encoding space for 32-bit ops, where:
0zzz..Dzzz: Expanded space with predication and more GPRs;
Ezzz/ Fzzz: Existing 32-bit encoding space.

Or:
pwnm-ZeoZ
pwnm-Zeii

Say, where 'p' partly encodes both the predication mode and encoding
block, say:
0000: F0, PredT (SR.T)
0001: F1/F2, PredT (SR.T)
0010: F0, PredT (SR.S)
0011: F1/F2, PredT (SR.S)
0100: F0, PredF (SR.T)
0101: F1/F2, PredF (SR.T)
0110: F0, PredF (SR.S)
0111: F1/F2, PredF (SR.S)
1000: F0, Always
1001: F1/F2, Always
1010: F8, Always
1011: F3, Always
1100: -
1101: -
1110: Ezzz, Same as Before
1111: Fzzz, Same as Before
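
As a rough sketch of how the 'p' nibble in the table above could be decoded (the function name and the tuple layout are made up for illustration, and none of this reflects an actual decoder implementation):

```python
# Hypothetical decoder for the proposed 'p' nibble.
# The mapping is transcribed from the table above; the function name and
# the (block, predicate, flag) tuple layout are assumptions for illustration.

ALWAYS_BLOCKS = ["F0", "F1/F2", "F8", "F3"]  # p = 1000..1011


def decode_p(p):
    """Return (block, predicate, flag) for a 4-bit p value, or None if unassigned."""
    if not 0 <= p <= 0xF:
        raise ValueError("p must be a 4-bit value")
    if p < 0b1000:
        block = "F1/F2" if (p & 1) else "F0"
        flag = "SR.S" if (p & 2) else "SR.T"
        pred = "PredF" if (p & 4) else "PredT"
        return (block, pred, flag)
    if p < 0b1100:
        return (ALWAYS_BLOCKS[p & 3], "Always", None)
    if p < 0b1110:
        return None  # 1100 and 1101 are left unassigned in the table
    # 1110 / 1111 keep the existing 32-bit Ezzz / Fzzz encodings
    return ("Ezzz" if p == 0b1110 else "Fzzz", "legacy", None)


if __name__ == "__main__":
    for p in range(16):
        print(format(p, "04b"), decode_p(p))
```

For the predicated half of the table, the three low bits fall out as independent selectors (block, status flag, true/false sense), which is why the sketch can use bit tests rather than a lookup table.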

However, this would also suck:
Requires a mode flag for fetch and decode to do its thing;
Would require explicit mode-change instructions;
Would likely be implemented as a special-purpose branch op;
Doesn't allow disassembly without knowing the mode;
...

However, this would only have a fairly minor impact on the existing
pipeline (as most of the significant changes had already been done as
part of the existing extensions).

An addition here would be to be able to use SR.S as a predicate (eg: to
allow overlapping two different if branches or similar).

Similarly, it seems that there is relatively little cost difference
between having R0..R31 and R0..R63, so the extra registers could be
worthwhile if they offer a performance advantage.

I am left feeling unsure if such an expanded GPR space is even really a
good idea though, since arguably only a minority of code will have
enough register pressure to benefit much from an expanded GPR space.
Likewise, these sorts of things are in the "once you add them and put
them into general use, you are basically stuck with them" category.

One other tradeoff is that they do mean saving a lot more registers for
setjmp/longjmp, interrupts, or for context switches (roughly 512B worth
of registers), but this doesn't seem too terrible.
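
For reference, the arithmetic behind that figure can be sketched as below. The 8-byte register width is an assumption (64-bit GPRs), and this ignores any control or status registers that would also be saved:

```python
# Back-of-envelope save-area size, assuming 64-bit (8-byte) GPRs.
# REG_BYTES is an assumption; control/status registers are not counted.

REG_BYTES = 8


def save_area_bytes(num_gprs):
    """Bytes needed to save num_gprs general-purpose registers."""
    return num_gprs * REG_BYTES


if __name__ == "__main__":
    print(save_area_bytes(32))  # R0..R31
    print(save_area_bytes(64))  # R0..R63
```

Under that assumption, the full R0..R63 file comes to 512 bytes, of which 256 bytes are the newly added R32..R63.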

For now, either way, it is likely that the codegen will mostly ignore
the existence of the extended GPRs (until such a point as it is
"actually competent" at doing stuff more effectively). Instead, they
would be left (for now) mostly in the domain of ASM code.

As-is, the compiler needs to use heuristics to decide whether to use
32 GPRs for a given function or limit itself to 16, since for small
functions using a larger GPR space can hurt more than it helps (if
the cost of saving/restoring registers to/from the stack frame exceeds
the cost of the register spills).
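
A toy version of that tradeoff might look like the following. This is not the actual compiler heuristic; the cost model (one store plus one load per extra callee-saved register, one memory access per avoided spill access) and all names are invented for illustration:

```python
# Toy cost model for "use the wider GPR set or not" in a single function.
# Everything here is hypothetical: the real heuristic is not described in
# this thread beyond "save/restore cost vs. spill cost".

def save_restore_cost(extra_saved_regs):
    """Each extra callee-saved register costs one store (prolog) and one load (epilog)."""
    return 2 * extra_saved_regs


def prefer_wide_gprs(extra_saved_regs, spill_accesses_avoided):
    """True if the wider register set is estimated to save more memory ops than it costs."""
    return spill_accesses_avoided > save_restore_cost(extra_saved_regs)


if __name__ == "__main__":
    # Small function: 6 extra saved registers, but only 4 spill accesses avoided.
    print(prefer_wide_gprs(6, 4))
    # Big loop: 10 extra saved registers, ~200 spill accesses avoided.
    print(prefer_wide_gprs(10, 200))
```

A real heuristic would presumably weight spill accesses by loop depth, since a spill inside a hot loop costs far more than one save/restore pair at function entry and exit.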

One could argue though, why not keep everything in scratch registers and
thus avoid saving things to the stack? This could be done, in theory,
but would require a bit more work on the compiler (such as classifying
whether or not a function is a pure leaf function, and effectively
replacing the current register allocator).

However, "pure leaf" functions could potentially benefit from a larger
GPR space (a lot more scratch registers to assign values into), but
XGPRs complicate things if they can't be encoded with predicates (the
determination over whether to use predication or a branch is made before
it can be known which registers are being used).

This would likely mean that whether or not such a "pure leaf" approach
could be used, and whether or not to enable XGPRs, would depend on
whether or not particular branches had been predicated, and which
registers the variables had been assigned to. All of this could require
using multiple passes over a function to disambiguate (well, or figure
out some other way for XGPR ops to be predicated).

Well, or potentially, I use the existing XGPR encoding for unconditional
ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
the predicated ops may not be bundled (grr...).

And, all this is a bit more messy than I would prefer.

....

Any thoughts?...

On Sunday, June 6, 2021 at 4:11:56 PM UTC-5, BGB wrote:
> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
>
> Then recently I started running into a few cases where having more GPRs
> could be useful (mostly involving big complicated loops working with a
> lot of parameters at the same time, such as tend to occur in my OpenGL
> rasterizer).
<
As I stated in the other thread:: I consider a rasterizer and interpolator to be
the kind of "function" that deserves dedicated HW. You feed it triangles and
it feeds you vectors of pixels and coordinates. The GPUs build these things
to produce 32 pixels (RGBA) and 32 sets of coordinates (XYZW) per cycle.
This would be easy to scale down to 1 set per cycle.
<
Texture deserves its own dedicated HW, also.
>
>
> Previously, I had an extension specifically for 128-bit SIMD ops, which
> extended the register space. However, the only way to access these
> registers was via 128-bit SIMD ops, MOV.X, or via a few other special
> instructions (eg: MOV), severely limiting their utility.
<
Yes, the separate register file gets clunky at certain boundaries.
>
> I had more recently started floating ideas for a more general set of
> extensions, which would allow a larger part of the ISA to use them.
>
>
> I have implemented an expanded subset, which adds support for R0..R63
> for normal instructions.
>
> As currently implemented:
> 7wnm-ZeoZ (3R / 2R)
> 9wnm-Zeii (3RI / 2RI, Ld/St Disp9)
>
>
> Which reuses the space I had previously used for 24-bit instructions,
> since 24 bit instructions are "basically useless" on a
> performance-oriented core, and a microcontroller is rather unlikely to
> need XGPR (actually, even Op24 is debatable, given its effect on
> code-density is fairly underwhelming for how much hair it adds).
>
> I am still not particularly happy with this encoding:
> It breaks the established pattern for instruction lengths;
> Can only encode a subset of the ISA;
> It can't encode predicated instructions.
>
> However:
> It retains binary compatibility with existing code;
> Can be used in WEX bundles in the same way as Ezzz/Fzzz encodings.
>
> Where 'W':
> Ww: WEX bit;
> Wn: Bit 5 of Rn
> Wm: Bit 5 of Rm
> Wo: Bit 5 of Ro (or an F2/F1 selector)
>
Stealing from the REX prefix in x86-64.
>
>
> Another considered alternative could be using Op48 space, however:
> If I wanted to support WEX with them, the bundle mechanism would
> effectively need to be redesigned (and probably a lot more expensive).
>
> Both Op48 and Op64 encodings, as-is, would preclude using them in WEX
> bundles.
>
>
>
> Yet another possibility, I could add a mode flag and then effectively
> eat the entire 16-bit encoding space for 32-bit ops, where:
> 0zzz..Dzzz: Expanded space with predication and more GPRs;
> Ezzz/ Fzzz: Existing 32-bit encoding space.
>
> Or:
> pwnm-ZeoZ
> pwnm-Zeii
>
> Say, where 'p' partly encodes both the predication mode and encoding
> block, say:
> 0000: F0, PredT (SR.T)
> 0001: F1/F2, PredT (SR.T)
> 0010: F0, PredT (SR.S)
> 0011: F1/F2, PredT (SR.S)
> 0100: F0, PredF (SR.T)
> 0101: F1/F2, PredF (SR.T)
> 0110: F0, PredF (SR.S)
> 0111: F1/F2, PredF (SR.S)
> 1000: F0, Always
> 1001: F1/F2, Always
> 1010: F8, Always
> 1011: F3, Always
> 1100: -
> 1101: -
> 1110: Ezzz, Same as Before
> 1111: Fzzz, Same as Before
>
>
> However, this would also suck:
> Requires a mode flag for fetch and decode to do its thing;
> Would require explicit mode-change instructions;
> Would likely be implemented as a special-purpose branch op;
> Doesn't allow disassembly without knowing the mode;
> ...
>
> However, this would only have a fairly minor impact on the existing
> pipeline (as most of the significant changes had already been done as
> part of the existing extensions).
>
> An addition here would be to be able to use SR.S as a predicate (eg: to
> allow overlapping two different if branches or similar).
>
>
>
> Similarly, it seems that there is relatively little cost difference
> between having R0..R31 and R0..R63, so they can be good if they can
> potentially offer a performance advantage.
>
>
> I am left feeling unsure if such an expanded GPR space is even really a
> good idea though, since arguably only a minority of code will have
> enough register pressure to benefit much from an expanded GPR space.
> Likewise, these sorts of things are in the "once you add them and put
> them into general use, you are basically stuck with them" category.
>
> One other tradeoff is that they do mean saving a lot more registers for
> setjmp/longjmp, interrupts, or for context switches (roughly 512B worth
> of registers), but this doesn't seem too terrible.
>
>
> For now, either way, it is likely that the codegen will mostly ignore
> the existence of the extended GPRs (until at such a point it is
> "actually competent" at doing stuff more effectively). Instead, they
> would mostly be left (for now) mostly in the domain of ASM code.
>
>
> As-is, the compiler needs to use heuristics to decide between whether or
> not to use 32 GPRs for a given function or limit itself to 16, since for
> small functions using a larger GPR space can hurt more than it helps (if
> the cost of saving/restoring registers to/from the stack frame exceeds
> the cost of the register spills).
>
>
> One could argue though, why not keep everything in scratch registers and
> thus avoid saving things to the stack? This could be done, in theory,
> but would require a bit more work on the compiler (such as classifying
> whether or not a function is a pure leaf function, and effectively
> replacing the current register allocator).
>
> However, "pure leaf" functions could potentially benefit from a larger
> GPR space (a lot more scratch registers to assign values into), but
> XGPR's complicate things if they can't be encoded with predicates (the
> determination over whether to use predication or a branch is made before
> it can be known which registers are being used).
>
> This would likely mean that whether or not such a "pure leaf" approach
> could be used, and whether or not to enable XGPRs, would depend on
> whether or not particular branches had been predicated, and which
> registers the variables had been assigned to. All of this could require
> using multiple passes over a function to disambiguate (well, or figure
> out some other way for XGPR ops to be predicated).
>
>
> Well, or potentially, I use the existing XGPR encoding for unconditional
> ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
> the predicated ops may not be bundled (grr...).
>
> And, all this is a bit more messy than I would prefer.
<
As far as the 3 dimensions of expansion are concerned (separate files,
bigger register specifiers and the encoding mess, or dedicated HW), I
prefer the dedicated HW approach.
>
>
> ...
>
>
> Any thoughts?...
<
You found all the options, you get to grind through the paths towards
reasonable solutions.

On 6/6/2021 4:28 PM, MitchAlsup wrote:
> On Sunday, June 6, 2021 at 4:11:56 PM UTC-5, BGB wrote:
>> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
>>
>> Then recently I started running into a few cases where having more GPRs
>> could be useful (mostly involving big complicated loops working with a
>> lot of parameters at the same time, such as tend to occur in my OpenGL
>> rasterizer).
> <
> As I stated in the other thread:: I consider a rasterizer and interpolator to be
> the kind of "function" that deserves dedicated HW. you feed it triangles and
> it feeds you vectors of pixels and coordinates. The GPUs build these things
> to produce 32 pixels (RGBA) and 32 sets of coordinates (XYZW) per cycle.
> This would be easy to scale down to 1 set per cycle.
> <
> Texture deserves its own dedicated HW, also.

Yeah, if you want to work more than one pixel at a time, the amount of
in-flight state can get kinda absurd.

When I looked into GPUs before, it seemed like a lot of them basically
boiled down to being VLIW+SIMD machines with large vectors and lower
clock speeds (relative to CPUs).

My initial thoughts at how to scale these back to "something I could
actually implement" resulted in something not too far from where BJX2
was at the time, so it seemed like I could just adapt the CPU core to
also handle GPU tasks.

Luckily, the core isn't too horrible at this, at least.

R0..R31 were shared with GPR ops, but R32..R63 were in a sort of
"technically exist but can't be addressed" limbo.

Given my ISA doesn't have a redundant set of operations for SIMD ops,
and the 64-bit SIMD ops were still limited to R0..R31, this "kinda sucked".

The newer 'XGPR' is an attempt to remedy this situation.
Current leaning though is that XGPR encodings of 128-bit SIMD ops will
not be allowed.

>>
>> I had more recently started floating ideas for a more general set of
>> extensions, which would allow a larger part of the ISA to use them.
>>
>>
>> I have implemented an expanded subset, which adds support for R0..R63
>> for normal instructions.
>>
>> As currently implemented:
>> 7wnm-ZeoZ (3R / 2R)
>> 9wnm-Zeii (3RI / 2RI, Ld/St Disp9)
>>
>>
>> Which reuses the space I had previously used for 24-bit instructions,
>> since 24 bit instructions are "basically useless" on a
>> performance-oriented core, and a microcontroller is rather unlikely to
>> need XGPR (actually, even Op24 is debatable, given its effect on
>> code-density is fairly underwhelming for how much hair it adds).
>>
>> I am still not particularly happy with this encoding:
>> It breaks the established pattern for instruction lengths;
>> Can only encode a subset of the ISA;
>> It can't encode predicated instructions.
>>
>> However:
>> It retains binary compatibility with existing code;
>> Can be used in WEX bundles in the same way as Ezzz/Fzzz encodings.
>>
>> Where 'W':
>> Ww: WEX bit;
>> Wn: Bit 5 of Rn
>> Wm: Bit 5 of Rm
>> Wo: Bit 5 of Ro (or an F2/F1 selector)
>>
> Stealing from the REX prefix in x86-64.

Stealing from REX, twice...

The E field:
Eq: 'Quad' bit (rarely actually used this way);
En: Bit 4 of Rn
Em: Bit 4 of Rm
Eo: Bit 4 of Ro (or Bit 8 for Disp9 encodings).

So, the m/n/o fields are 4-bit hex nybbles.
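
Putting the W and E fields together, a 6-bit register number can be sketched as below. Only the roles of the bits come from the descriptions above (W supplies bit 5, E supplies bit 4, the nybble supplies bits 3..0); the function name and argument order are made up for illustration:

```python
# Compose a 6-bit register index (R0..R63) from the three pieces described
# above: the W bit (bit 5), the E bit (bit 4), and the 4-bit hex nybble.
# The helper name and argument order are assumptions for illustration.

def reg_index(w_bit, e_bit, nybble):
    """Return the register number selected by (W bit, E bit, 4-bit nybble)."""
    if w_bit not in (0, 1) or e_bit not in (0, 1):
        raise ValueError("w_bit and e_bit must be 0 or 1")
    if not 0 <= nybble <= 0xF:
        raise ValueError("nybble must be a 4-bit value")
    return (w_bit << 5) | (e_bit << 4) | nybble


if __name__ == "__main__":
    print(reg_index(0, 0, 0x5))  # W=0, E=0: R0..R15 range
    print(reg_index(0, 1, 0x1))  # E bit set: R16..R31 range
    print(reg_index(1, 0, 0x8))  # W bit set: R32..R47 range
    print(reg_index(1, 1, 0xF))  # both set: R48..R63 range
```

The same composition would apply to each of Rn, Rm, and Ro, apart from the cases noted above where the 'o' bits are repurposed (F2/F1 selector, or Bit 8 for Disp9).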

>>
>>
>> Another considered alternative could be using Op48 space, however:
>> If I wanted to support WEX with them, the bundle mechanism would
>> effectively need to be redesigned (and probably a lot more expensive).
>>
>> Both Op48 and Op64 encodings, as-is, would preclude using them in WEX
>> bundles.
>>
>>
>>
>> Yet another possibility, I could add a mode flag and then effectively
>> eat the entire 16-bit encoding space for 32-bit ops, where:
>> 0zzz..Dzzz: Expanded space with predication and more GPRs;
>> Ezzz/ Fzzz: Existing 32-bit encoding space.
>>
>> Or:
>> pwnm-ZeoZ
>> pwnm-Zeii
>>
>> Say, where 'p' partly encodes both the predication mode and encoding
>> block, say:
>> 0000: F0, PredT (SR.T)
>> 0001: F1/F2, PredT (SR.T)
>> 0010: F0, PredT (SR.S)
>> 0011: F1/F2, PredT (SR.S)
>> 0100: F0, PredF (SR.T)
>> 0101: F1/F2, PredF (SR.T)
>> 0110: F0, PredF (SR.S)
>> 0111: F1/F2, PredF (SR.S)
>> 1000: F0, Always
>> 1001: F1/F2, Always
>> 1010: F8, Always
>> 1011: F3, Always
>> 1100: -
>> 1101: -
>> 1110: Ezzz, Same as Before
>> 1111: Fzzz, Same as Before
>>
>>
>> However, this would also suck:
>> Requires a mode flag for fetch and decode to do its thing;
>> Would require explicit mode-change instructions;
>> Would likely be implemented as a special-purpose branch op;
>> Doesn't allow disassembly without knowing the mode;
>> ...
>>
>> However, this would only have a fairly minor impact on the existing
>> pipeline (as most of the significant changes had already been done as
>> part of the existing extensions).
>>
>> An addition here would be to be able to use SR.S as a predicate (eg: to
>> allow overlapping two different if branches or similar).
>>
>>
>>
>> Similarly, it seems that there is relatively little cost difference
>> between having R0..R31 and R0..R63, so they can be good if they can
>> potentially offer a performance advantage.
>>
>>
>> I am left feeling unsure if such an expanded GPR space is even really a
>> good idea though, since arguably only a minority of code will have
>> enough register pressure to benefit much from an expanded GPR space.
>> Likewise, these sorts of things are in the "once you add them and put
>> them into general use, you are basically stuck with them" category.
>>
>> One other tradeoff is that they do mean saving a lot more registers for
>> setjmp/longjmp, interrupts, or for context switches (roughly 512B worth
>> of registers), but this doesn't seem too terrible.
>>
>>
>> For now, either way, it is likely that the codegen will mostly ignore
>> the existence of the extended GPRs (until at such a point it is
>> "actually competent" at doing stuff more effectively). Instead, they
>> would mostly be left (for now) mostly in the domain of ASM code.
>>
>>
>> As-is, the compiler needs to use heuristics to decide between whether or
>> not to use 32 GPRs for a given function or limit itself to 16, since for
>> small functions using a larger GPR space can hurt more than it helps (if
>> the cost of saving/restoring registers to/from the stack frame exceeds
>> the cost of the register spills).
>>
>>
>> One could argue though, why not keep everything in scratch registers and
>> thus avoid saving things to the stack? This could be done, in theory,
>> but would require a bit more work on the compiler (such as classifying
>> whether or not a function is a pure leaf function, and effectively
>> replacing the current register allocator).
>>
>> However, "pure leaf" functions could potentially benefit from a larger
>> GPR space (a lot more scratch registers to assign values into), but
>> XGPR's complicate things if they can't be encoded with predicates (the
>> determination over whether to use predication or a branch is made before
>> it can be known which registers are being used).
>>
>> This would likely mean that whether or not such a "pure leaf" approach
>> could be used, and whether or not to enable XGPRs, would depend on
>> whether or not particular branches had been predicated, and which
>> registers the variables had been assigned to. All of this could require
>> using multiple passes over a function to disambiguate (well, or figure
>> out some other way for XGPR ops to be predicated).
>>
>>
>> Well, or potentially, I use the existing XGPR encoding for unconditional
>> ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
>> the predicated ops may not be bundled (grr...).
>>
>> And, all this is a bit more messy than I would prefer.
> <
> Of the 3 dimensions of expansion are concerned (separate files, bigger
> register specifiers and the encoding mess, or dedicated HW) I prefer
> the dedicated HW approach.


On Sunday, June 6, 2021 at 5:49:46 PM UTC-5, BGB wrote:
> On 6/6/2021 4:28 PM, MitchAlsup wrote:
> > On Sunday, June 6, 2021 at 4:11:56 PM UTC-5, BGB wrote:
> >> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
> >>
> >> Then recently I started running into a few cases where having more GPRs
> >> could be useful (mostly involving big complicated loops working with a
> >> lot of parameters at the same time, such as tend to occur in my OpenGL
> >> rasterizer).
> > <
> > As I stated in the other thread:: I consider a rasterizer and interpolator to be
> > the kind of "function" that deserves dedicated HW. you feed it triangles and
> > it feeds you vectors of pixels and coordinates. The GPUs build these things
> > to produce 32 pixels (RGBA) and 32 sets of coordinates (XYZW) per cycle.
> > This would be easy to scale down to 1 set per cycle.
> > <
> > Texture deserves its own dedicated HW, also.
> Yeah, if you want to work more than one pixel at a time, the amount of
> in-flight state can get kinda absurd.
>
>
> When I looked into GPUs before, it seemed like a lot of them basically
> boiled down to being VLIW+SIMD machines with large vectors and lower
> clock speeds (relative to CPUs).
<
Not VLIW versus SIMD, it is SIMD plus dedicated wide hardware.
<
Imagine the width necessary to do 8 3-D texture lookups and interpolations
per cycle !
<
These are not programmable devices, they are hardware programmed (fixed),
deeply pipelined, units designed to feed the SIMD programmable kernels.
>
> My initial thoughts at how to scale these back to "something I could
> actually implement" resulted in something not too far from where BJX2
> was at the time, so it seemed like I could just adapt the CPU core to
> also handle GPU tasks.
<
One channel of texture lookup and blending per cycle, where there are up
to 4 channels.
>
>
> Luckily, this part isn't too horrible at this at least.
> >>
> >>
> >> Previously, I had an extension specifically for 128-bit SIMD ops, which
> >> extended the register space. However, the only way to access these
> >> registers was via 128-bit SIMD ops, MOV.X, or via a few other special
> >> instructions (eg: MOV), severely limiting their utility.
> > <
> > Yes, the separate register file gets clunky at certain boundaries.
<
> R0..R31 were shared with GPR ops, but R32..R63 were in a sort of
> "technically exist but can't be addressed" limbo.
>
> Given my ISA doesn't have a redundant set of operations for SIMD ops,
> and the 64-bit SIMD ops were still limited to R0..R31, this "kinda sucked".
>
> The newer 'XGPR' is an attempt to remedy this situation.
> Current leaning though is that XGPR encodings of 128-bit SIMD ops will
> not be allowed.
<
The field to be investigated is deep and long.
> >>

> >>
> > Stealing from the REX prefix in x86-64.
> Stealing from REX, twice...
<
It is a good trick, steal away!
>
> The E field:
> Eq: 'Quad' bit (rarely actually used this way);
> En: Bit 4 of Rn
> Em: Bit 4 of Rm
> Eo: Bit 4 of Ro (or Bit 8 for Disp9 encodings).
>
> So, the m/n/o fields are 4-bit hex nybbles.
<snip>
> >> Well, or potentially, I use the existing XGPR encoding for unconditional
> >> ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
> >> the predicated ops may not be bundled (grr...).
> >>
> >> And, all this is a bit more messy than I would prefer.
> > <
> > Of the 3 dimensions of expansion are concerned (separate files, bigger
> > register specifiers and the encoding mess, or dedicated HW) I prefer
> > the dedicated HW approach.
> OK.
> >>
> >>
> >> ...
> >>
> >>
> >> Any thoughts?...
> > <
> > You found all the options, you get to grind through the paths towards
> > reasonable solutions.
> >
> It is, at the moment, a question of whether to try to expand this out
> and try to make it semi-orthogonal in some way, or leave a lot of this
> more as a "break glass in case of emergency" thing and otherwise mostly
> ignore that it exists.
>
> XGPR, as it exists, is probably sufficient for what it is.
> The lack of orthogonality here would more matter for the codegen than it
> does for ASM coding.
>
> ...

On 6/6/2021 2:10 PM, BGB wrote:
> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
>
> <snip>
>
> Any thoughts?...

One thought that I have mentioned earlier, but actually works better
when going from 32 to 64 than from 16 to 32 is the following:

I am assuming you currently have three register specifiers in the
instruction, two sources and one destination, each requiring 5 bits, so
you are using a total of 15 bits. You could change this such that the
two sources are 6 bits each, allowing 64 GPRs, while the destination
specifier is now 3 bits. Instead of directly specifying a register,
it is a value added to the value of the first source register specifier
to give the destination specifier.

For example, an add instruction might be encoded as Add 3,R17,R40. The
value 3 would be added to the value 17 to get the destination register,
namely R20.

This does mean taking a little longer to "decode" the address of the
destination register, but you have the time, as it isn't needed until
the operation is complete. You also lose some flexibility in register
allocation, and it means modifying the compiler's register allocator,
but you still have the freedom of up to 8 potential destination
registers for each instruction, and if the operation is commutative, the
compiler can swap the two source operand specifiers to gain potentially
another 8 destination possibilities.
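
A minimal sketch of the offset scheme in C, to make the arithmetic concrete. The names (dest_reg, pick_offset) and the wrap-around modulo 64 are assumptions made for illustration, not part of any existing encoding:

```c
#include <stdint.h>

/* Hypothetical: destination = first source + 3-bit offset.
 * Wrapping modulo 64 is an assumption; an ISA could equally saturate
 * or treat overflow as reserved. */
static uint8_t dest_reg(uint8_t rs1, uint8_t doff)
{
    return (uint8_t)((rs1 + (doff & 7u)) & 63u);
}

/* What a register allocator would ask: can rd be reached from rs1?
 * Returns the 3-bit offset, or -1 if rd is out of reach. */
static int pick_offset(uint8_t rd, uint8_t rs1)
{
    unsigned d = ((unsigned)rd + 64u - (unsigned)rs1) & 63u;
    return d < 8u ? (int)d : -1;
}

/* For a commutative op, either source may serve as the base register. */
static int reachable(uint8_t rd, uint8_t rs1, uint8_t rs2, int commutative)
{
    if (pick_offset(rd, rs1) >= 0)
        return 1;
    return commutative && pick_offset(rd, rs2) >= 0;
}
```

With the example above, dest_reg(17, 3) gives R20, and swapping the sources of a commutative op lets R40..R47 be reached as well.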

It isn't pretty, but it might be better than the alternatives.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

On 6/6/2021 6:25 PM, MitchAlsup wrote:
> On Sunday, June 6, 2021 at 5:49:46 PM UTC-5, BGB wrote:
>> On 6/6/2021 4:28 PM, MitchAlsup wrote:
>>> On Sunday, June 6, 2021 at 4:11:56 PM UTC-5, BGB wrote:
>>>> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
>>>>
>>>> Then recently I started running into a few cases where having more GPRs
>>>> could be useful (mostly involving big complicated loops working with a
>>>> lot of parameters at the same time, such as tend to occur in my OpenGL
>>>> rasterizer).
>>> <
>>> As I stated in the other thread:: I consider a rasterizer and interpolator to be
>>> the kind of "function" that deserves dedicated HW. you feed it triangles and
>>> it feeds you vectors of pixels and coordinates. The GPUs build these things
>>> to produce 32 pixels (RGBA) and 32 sets of coordinates (XYZW) per cycle.
>>> This would be easy to scale down to 1 set per cycle.
>>> <
>>> Texture deserves its own dedicated HW, also.
>> Yeah, if you want to work more than one pixel at a time, the amount of
>> in-flight state can get kinda absurd.
>>
>>
>> When I looked into GPUs before, it seemed like a lot of them basically
>> boiled down to being VLIW+SIMD machines with large vectors and lower
>> clock speeds (relative to CPUs).
> <
> Not VLIW versus SIMD, it is SIMD plus dedicated wide hardware.
> <
> Imagine the width necessary to do 8 3-D texture lookups and interpolations
> per cycle !
> <
> These are not programmable devices, they are hardware programmed (fixed),
> deeply pipelined, units designed to feed the SIMD programmable kernels.

Yeah, I can't really do this...

In my case, the process might look more like:
Morton shuffle ST coord, A;
Add 1 pixel to ST.S and Morton Shuffle, B;
Add 1 pixel to ST.T and Morton Shuffle, C;
Add 1 pixel to ST.ST and Morton Shuffle, D;
Mask A, B, C, and D, by the texture-size mask;
Shift A/B/C/D right 4 bits, giving E/F/G/H;
Fetch Blocks E/F/G/H;
Extract Pixels from E/F/G/H (via compressed-texture instruction);
Interpolate E and F based on ST.S, E1;
Interpolate G and H based on ST.S, F1;
Interpolate E1 and F1 based on ST.T, D1;
Multiply D1 by the current modulation color;
Fetch Z Buffer pixel;
Compare Z step with fetched pixel (Less-Than);
Store D1 color and Z-step to destination if true;
Add various step values to the various coordinates;
...
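
As a rough illustration of the first few steps (Morton shuffle, mask, shift right 4), here is a sketch in plain C. It assumes 16-bit S and T coordinates, S on the even bits, and 16 texels per block (inferred from the 4-bit shift); the function names are made up for the example, and in practice the shuffle would be a helper instruction rather than this bit-twiddling:

```c
#include <stdint.h>

/* Spread the low 16 bits of x so that bit i lands at bit 2*i. */
static uint32_t spread16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton shuffle: interleave S (even bits) and T (odd bits).
 * Which coordinate takes the even bits is an assumption. */
static uint32_t morton2(uint32_t s, uint32_t t)
{
    return spread16(s) | (spread16(t) << 1);
}

typedef struct {
    uint32_t block;  /* index of the compressed block to fetch */
    uint32_t texel;  /* texel index within that block, 0..15   */
} texel_addr;

/* Mask by the texture-size mask, then split into block and texel index. */
static texel_addr texel_locate(uint32_t s, uint32_t t, uint32_t size_mask)
{
    uint32_t idx = morton2(s, t) & size_mask;
    texel_addr a = { idx >> 4, idx & 15u };
    return a;
}
```

For the four-texel fetch, this would be called for (S,T), (S+1,T), (S,T+1) and (S+1,T+1).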

Then structure the loop such that the fetch/interpolation/... for the
next pixel begins before the last pixel gets stored to the destination.

So, rasterizer function:
Set up variables;
Start doing first pixel;
Do main loop body;
Loop as long as more pixels remain;
Finish storing last pixel;
Cleanup / Return.
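
The loop shape above, sketched in C with a stand-in per-pixel function. Here shade() is a placeholder for the whole fetch/interpolate/modulate sequence, and the names are hypothetical; the point is only the ordering of "start next pixel" versus "store previous pixel":

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the per-pixel work (fetch, interpolate, modulate). */
static uint32_t shade(uint32_t texel)
{
    return texel + 1u;
}

/* Software-pipelined span: work on pixel i starts before the result
 * for pixel i-1 has been stored. */
static void span_pipelined(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;
    uint32_t cur = shade(src[0]);          /* start doing first pixel */
    for (size_t i = 1; i < n; i++) {
        uint32_t next = shade(src[i]);     /* begin next pixel        */
        dst[i - 1] = cur;                  /* store previous pixel    */
        cur = next;
    }
    dst[n - 1] = cur;                      /* finish storing last pixel */
}
```

Both "cur" and "next" (and all of their intermediate values) are live at once, which is where the extra register pressure comes from.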

This then exists as a bunch of large/unwieldy blobs of ASM, each
specialized for a particular combination of settings.

This stuff works, but as noted, can have a fairly large amount of
register pressure.

For some parts of the rasterization process, I ended up needing to spill
registers to memory, which isn't really ideal for this sort of thing.

There is a helper instruction which, given a compressed texture block
and an index, fetches a pixel (single cycle). Compressed texture blocks
are preferable (despite adding a few more instructions per fetch)
because they have a significantly lower cache-miss rate than
uncompressed textures.

I recently also figured out a way to do small-block compression which
works OK for audio.

16 samples in 32 bits:
6 bits, Line Start Point of a line (S.3.2 microfloat);
6 bits, Line End Point of a line (S.3.2 microfloat);
4 bits, Line Sigma (RMSE or MAE, 3.1 microfloat);
16 bits, selector bits (1 bit/sample).

Extracting a sample involves:
Unpack values internally into PCM (12-bit);
Linearly interpolate between the start and end sample;
The sample position within the block is the interpolation index;
Add and subtract sigma from interpolated sample to get a min and max;
Use selector bit for this sample to select min or max.
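
A sketch of the extraction steps in C. Several details here are guesses made purely for illustration: the field positions within the 32-bit word, the microfloat-to-PCM scaling, dividing by 15 so that samples 0 and 15 land on the endpoints, and a selector bit of 1 meaning "max":

```c
#include <stdint.h>

/* S.3.2 microfloat to signed PCM. The (4|mant) << exp scaling is assumed. */
static int unpack_s32(unsigned v)
{
    int mag = (int)((4u | (v & 3u)) << ((v >> 2) & 7u));
    return (v & 0x20u) ? -mag : mag;
}

/* Unsigned 3.1 microfloat for sigma. Scaling is likewise assumed. */
static int unpack_u31(unsigned v)
{
    return (int)((2u | (v & 1u)) << ((v >> 1) & 7u));
}

/* Assumed layout: [31:26] start, [25:20] end, [19:16] sigma, [15:0] selectors. */
static int decode_sample(uint32_t blk, unsigned i)
{
    int start = unpack_s32((blk >> 26) & 0x3Fu);
    int end   = unpack_s32((blk >> 20) & 0x3Fu);
    int sigma = unpack_u31((blk >> 16) & 0x0Fu);

    /* Linear interpolation, using the sample position as the index. */
    int line = start + ((end - start) * (int)(i & 15u)) / 15;

    /* Selector bit picks max (line + sigma) or min (line - sigma). */
    return ((blk >> (i & 15u)) & 1u) ? line + sigma : line - sigma;
}
```

With these scalings the result stays within a 12-bit signed range (at most 896 + 384 in magnitude).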

A similar scheme for 16 samples in 64 bits was also defined (with bigger
minifloat values and 2 bit selectors). Which gave better audio quality.

Where, 2-bit values were:
00: - 0.333 * Sigma
01: - 1.000 * Sigma
10: + 0.333 * Sigma
11: + 1.000 * Sigma
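
The 2-bit table above maps onto a sign bit and a magnitude bit. A small C sketch (the function name is made up, and sigma / 3 is an integer approximation of 0.333 * Sigma):

```c
/* 2-bit selector to offset: bit 1 is the sign (1 = positive),
 * bit 0 is the magnitude (1 = full sigma, 0 = one third of sigma). */
static int sel2_offset(unsigned sel, int sigma)
{
    int mag = (sel & 1u) ? sigma : sigma / 3;
    return (sel & 2u) ? mag : -mag;
}
```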

Endpoints were 8-bit (S.3.4), as in my tests there wasn't much advantage
going bigger than this (the loss from the interpolation was larger than
the loss from the minifloat format).

Here, the linear part deals with low-frequency components, whereas the
selector bits and sigma deal with high frequencies.

While generally worse (quality wise) than ADPCM, small-block schemes
have tended to have difficulties in this area, so the threshold here
isn't so much "good" as more "doesn't sound like broken garbage", and
"cheap enough to potentially be decoded in hardware in 1 or 2 clock cycles".

I considered this mostly for doing Module/Tracker style music playback,
as the PCM data for the files tends to be somewhat larger than the L2
cache and are prone to suffer from a relatively high L2 miss rate when
mixing audio samples.

Never mind that whether this is actually needed is a little unclear, as
the theoretical bandwidth estimates for Mod playback are well within the
DRAM bandwidth limits of my CPU core, so dunno.

>>
>> My initial thoughts at how to scale these back to "something I could
>> actually implement" resulted in something not too far from where BJX2
>> was at the time, so it seemed like I could just adapt the CPU core to
>> also handle GPU tasks.
> <
> One channel of a Texture lookup and blending per cycle. Where there are up
> to 4 channels.

Note the "scaling things back" part:
I can't exactly do something with massive parallel texture fetch and 512
bit vectors and similar on this thing...

But, I could do 64 and 128 bit SIMD vectors...

I can't afford a dedicated texture fetch and interpolate unit, but I
could afford helper instructions to extract texel values from compressed
texture blocks, perform Morton shuffle, ...

>>
>>
>> Luckily, this part isn't too horrible at this at least.
>>>>
>>>>
>>>> Previously, I had an extension specifically for 128-bit SIMD ops, which
>>>> extended the register space. However, the only way to access these
>>>> registers was via 128-bit SIMD ops, MOV.X, or via a few other special
>>>> instructions (eg: MOV), severely limiting their utility.
>>> <
>>> Yes, the separate register file gets clunky at certain boundaries.
> <
>> R0..R31 were shared with GPR ops, but R32..R63 were in a sort of
>> "technically exist but can't be addressed" limbo.
>>
>> Given my ISA doesn't have a redundant set of operations for SIMD ops,
>> and the 64-bit SIMD ops were still limited to R0..R31, this "kinda sucked".
>>
>> The newer 'XGPR' is an attempt to remedy this situation.
>> Current leaning though is that XGPR encodings of 128-bit SIMD ops will
>> not be allowed.
> <
> The field to be investigated is deep and long.
>>>>
>
>>>>
>>> Stealing from the REX prefix in x86-64.
>> Stealing from REX, twice...
> <
> it is a good trick, steal away!
>>
>> The E field:
>> Eq: 'Quad' bit (rarely actually used this way);
>> En: Bit 4 of Rn
>> Em: Bit 4 of Rm
>> Eo: Bit 4 of Ro (or Bit 8 for Disp9 encodings).
>>
>> So, the m/n/o fields are 4-bit hex nybbles.
> <snip>
>>>> Well, or potentially, I use the existing XGPR encoding for unconditional
>>>> ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
>>>> the predicated ops may not be bundled (grr...).
>>>>
>>>> And, all this is a bit more messy than I would prefer.
>>> <
>>> Of the 3 dimensions of expansion are concerned (separate files, bigger
>>> register specifiers and the encoding mess, or dedicated HW) I prefer
>>> the dedicated HW approach.
>> OK.
>>>>
>>>>
>>>> ...
>>>>
>>>>
>>>> Any thoughts?...
>>> <
>>> You found all the options, you get to grind through the paths towards
>>> reasonable solutions.
>>>
>> It is at the moment of whether to try to expand this out and try to make
>> it semi-orthogonal in some way, or leave a lot of this more as a "break
>> glass in case of emergency" thing and otherwise mostly ignore that it
>> exists.
>>
>> XGPR, as it exists, is probably sufficient for what it is.
>> The lack of orthogonality here would more matter for the codegen than it
>> does for ASM coding.
>>
>> ...

Stephen Fuld wrote:
> On 6/6/2021 2:10 PM, BGB wrote:
>> Any thoughts?...
>
> One thought that I have mentioned earlier, but actually works better
> when going from 32 to 64 than from 16 to 32 is the following:
>
> I am assuming you currently have three register specifiers in the
> instruction, two sources and one destination, each requiring 5 bits, so
> you are using a total of 15 bits. You could change this such that the
> two sources are 6 bits each, allowing 64 GPRs, but the destination
> specifier is now 3 bits, but instead of directly specifying a register,
> is is a value added to the value of the first source register specifier
> to give the destination specifier.
>
> For example, an add instruction might be encoded as Add 3,R17,R40. The
> value 3 would be added to the value 17 to get the destination register,
> namely R20.
>
> This does mean taking a little longer to "decode" the address of the
> destination register, but you have the time, as it isn't needed until
> the operation is complete. You also loose some flexibility, in register
> allocation, and it means modifying the compiler's register allocator,
> but you still have the freedom of up to 8 potential destination
> registers for each instruction, and if the operation is commutative, the
> compiler can swap the two source operand specifiers to gain potentially
> another 8 destination possibilities.
>
> It isn't pretty, but it might be better than the alternatives.

I would use the same approach, but avoid the adder in the target
register specifier:

Grab the two top bits from the first source reg and merge with the
bottom 3 bits given as the destination. I.e. source and dest must be in
the same 8-reg bank, which saves that 6-bit adder.

Alternatively, 5/6/4 bits per reg, so only the first source (or maybe
the destination?) has the full 6-bit specifier, another reg must be from
the same 32-reg bank and the final one from the same 16-reg bank.
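
Both variants are just bit merges, roughly as below. This is a sketch with made-up function names; note that with a 6-bit specifier, an 8-register bank is selected by the upper three bits of the full specifier:

```c
#include <stdint.h>

/* Variant 1: destination shares the 8-register bank of the first source. */
static uint8_t dest_same_bank8(uint8_t rs1, uint8_t d3)
{
    return (uint8_t)((rs1 & 0x38u) | (d3 & 0x07u));
}

/* Variant 2 (5/6/4): one register has the full 6-bit specifier; a second
 * takes its top bit from it (same 32-reg bank), and a third takes its
 * top two bits from it (same 16-reg bank). */
static uint8_t reg_same_bank32(uint8_t full, uint8_t r5)
{
    return (uint8_t)((full & 0x20u) | (r5 & 0x1Fu));
}

static uint8_t reg_same_bank16(uint8_t full, uint8_t r4)
{
    return (uint8_t)((full & 0x30u) | (r4 & 0x0Fu));
}
```

No adder is needed in either case; the high bits are simply wired across from the full specifier.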

This would have been no trouble at all for my asm code, but possibly a
bit of bother for a compiler.

I already have this kind of issue with x64 where the same instruction
can have at least 3 different lengths depending upon which exact
registers are being used.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

ANY1 has had 64 GPR’s from the start. 64 GPR’s were chosen because of the number
of different data types that could be processed (integer, float, decimal float, posit, and
string). I could not see having a separate register file for each type of data. That would
consume resources that would not be in use all the time.
GPR’s are used to store return addresses, although the JAL instruction can only specify
x0 to x3.
There are only two source registers allowed per instruction. To get past this and allow
more source registers an instruction modifier may be used which adds two more source
registers. Some instructions like mux or bit-field ops always use a modifier, others use
them optionally.
To reduce register pressure on a machine with 32 GPR’s, independent registers could be
used for subroutine link and comparison results. Special registers could also be used for
the stack / frame pointer.
I think it will be very difficult to change the number of registers in the register set after
the processor has been developed to a certain point. Especially difficult if backwards
binary compatibility is desired with a machine that only supports 32 regs. It is critical for
the register fields to be in fixed positions and of fixed sizes for performance reasons.

On 6/7/2021 1:29 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 6/6/2021 2:10 PM, BGB wrote:
>>> Any thoughts?...
>>
>> One thought that I have mentioned earlier, but actually works better
>> when going from 32 to 64 than from 16 to 32 is the following:
>>
>> I am assuming you currently have three register specifiers in the
>> instruction, two sources and one destination, each requiring 5 bits,
>> so you are using a total of 15 bits. You could change this such that
>> the two sources are 6 bits each, allowing 64 GPRs, but the destination
>> specifier is now 3 bits, but instead of directly specifying a
>> register, is is a value added to the value of the first source
>> register specifier to give the destination specifier.
>>
>> For example, an add instruction might be encoded as Add 3,R17,R40.
>> The value 3 would be added to the value 17 to get the destination
>> register, namely R20.
>>
>> This does mean taking a little longer to "decode" the address of the
>> destination register, but you have the time, as it isn't needed until
>> the operation is complete. You also loose some flexibility, in
>> register allocation, and it means modifying the compiler's register
>> allocator, but you still have the freedom of up to 8 potential
>> destination registers for each instruction, and if the operation is
>> commutative, the compiler can swap the two source operand specifiers
>> to gain potentially another 8 destination possibilities.
>>
>> It isn't pretty, but it might be better than the alternatives.
>
> I would use the same approach, but avoid the adder in the target
> register specifier:
>
> Grab the two top bits from the first source reg and merge with the
> bottom 3 bits given as the destination. I.e. source and dest must be in
> the same 8-reg bank, which saves that 6-bit adder.
>
> Alternatively, 5/6/4 bits per reg, so only the first source (or maybe
> the destination?) has the full 6-bit specifier, another reg must be from
> the same 32-reg bank and the final one from the same 16-reg bank.
>
> This would have been no trouble at all for my asm code, but possibly a
> bit of bother for a compiler.
>

This sort of approach would *suck* for a compiler...

My issue though wasn't that I couldn't come up with an encoding that can
do all 3 registers with 6-bit IDs, but rather that there is seemingly no
good way to have 3x 6-bit IDs *and* predication at the same time, eg:

ADD R17, R45, R59 //Fine

ADD?T R17, R31, R22 //Fine
ADD?T R17, R45, R59 //Non-Encodable

This is a potentially serious issue if I want to be able to use the XGPR
registers in the compiler (as-is), since it creates a potential "gotcha"
that there is no good way to resolve if it pops up during code
generation (it effectively makes the implicit conversion of "if()" to
predicated ops, and the use of XGPRs in a function, mutually exclusive).
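
The check the codegen would effectively need looks something like this (a sketch of the constraint only, with hypothetical names; it is not how the compiler is actually structured):

```c
#include <stdbool.h>

typedef enum { PRED_NONE, PRED_T, PRED_F } pred_mode;

/* True if the op fits a 32-bit encoding under the constraint described
 * above: R32..R63 are only reachable from unpredicated ops. */
static bool encodable_op32(pred_mode p, unsigned rm, unsigned ro, unsigned rn)
{
    if (rm > 63u || ro > 63u || rn > 63u)
        return false;                       /* not a valid register at all */
    bool xgpr = rm >= 32u || ro >= 32u || rn >= 32u;
    return !(xgpr && p != PRED_NONE);
}
```

The catch is that the predicate-or-branch decision is made before register assignment, so by the time this could return false it is too late to pick a branch instead.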

Granted, I do have the technology to use 48 or 64 bit encodings, but
these would (at present) effectively preclude using them in bundles.

It is less of an issue for ASM though, since a human can generally avoid
walking into and getting stuck in a corner case.

I am also thinking they will probably be ignored by the C ABI, so the C
ABI will still only pass 8 items in registers, even though XGPR could
allow passing 16 items in registers (if the pattern were extended).

This would mostly be so that XGPR and non-XGPR functions could still
call each other.

> I already have this kind of issue with x64 where the same instruction
> can have at least 3 different length depending upon which exact
> registers are being used.
>

Yes, as noted, using a longer encoding is a possible option.
This would at least allow the compiler to not get stuck, but isn't
really ideal either from a performance POV (since it would only allow
one instruction to execute at a time in these cases).

Granted, another possible (pure compiler) workaround would be inferring
high-register-pressure functions early, and then disabling predication
in these cases.

> Terje
>

On 2021-06-06, MitchAlsup wrote:
> On Sunday, June 6, 2021 at 4:11:56 PM UTC-5, BGB wrote:
>> For most of its existence, the BJX2 ISA had 32 GPRs (R0..R31).
>>
>> Then recently I started running into a few cases where having more GPRs
>> could be useful (mostly involving big complicated loops working with a
>> lot of parameters at the same time, such as tend to occur in my OpenGL
>> rasterizer).
> <
> As I stated in the other thread:: I consider a rasterizer and interpolator to be
> the kind of "function" that deserves dedicated HW. you feed it triangles and
> it feeds you vectors of pixels and coordinates. The GPUs build these things
> to produce 32 pixels (RGBA) and 32 sets of coordinates (XYZW) per cycle.
> This would be easy to scale down to 1 set per cycle.
> <
> Texture deserves its own dedicated HW, also.

....yet there are similar use cases that are better done with a CPU than
a GPU, either because you need really low latencies (i.e. you can't wait
for the CPU->GPU->CPU round trip), you need to access huge amounts of
data (e.g. thousands of gigabyte textures) or because your code relies
on CPU-friendly/GPU-hostile constructs (e.g. lots of conditional
branches and lookups and large code coverage).

For instance, physically based 3D rendering can gain a lot from
intelligent culling and sampling which can give a CPU an edge over a
GPU, especially if the scene/texture data needs to be accessed from a
huge database on an on-demand basis.

So far SIMD(ish) instruction set extensions to general purpose CPUs
have been the go-to solution for high bandwidth / low latency compute.

The holy grail would be a completely fused CPU+GPU, but I have trouble
envisioning how such a beast would be designed (and obviously the
entire CPU/GPU industry has too, since it hasn't happened yet).

>>
>>
>> Previously, I had an extension specifically for 128-bit SIMD ops, which
>> extended the register space. However, the only way to access these
>> registers was via 128-bit SIMD ops, MOV.X, or via a few other special
>> instructions (eg: MOV), severely limiting their utility.
> <
> Yes, the separate register file gets clunky at certain boundaries.
>>
>> I had more recently started floating ideas for a more general set of
>> extensions, which would allow a larger part of the ISA to use them.
>>
>>
>> I have implemented an expanded subset, which adds support for R0..R63
>> for normal instructions.
>>
>> As currently implemented:
>> 7wnm-ZeoZ (3R / 2R)
>> 9wnm-Zeii (3RI / 2RI, Ld/St Disp9)
>>
>>
>> Which reuses the space I had previously used for 24-bit instructions,
>> since 24 bit instructions are "basically useless" on a
>> performance-oriented core, and a microcontroller is rather unlikely to
>> need XGPR (actually, even Op24 is debatable, given its effect on
>> code-density is fairly underwhelming for how much hair it adds).
>>
>> I am still not particularly happy with this encoding:
>> It breaks the established pattern for instruction lengths;
>> Can only encode a subset of the ISA;
>> It can't encode predicated instructions.
>>
>> However:
>> It retains binary compatibility with existing code;
>> Can be used in WEX bundles in the same way as Ezzz/Fzzz encodings.
>>
>> Where 'W':
>> Ww: WEX bit;
>> Wn: Bit 5 of Rn
>> Wm: Bit 5 of Rm
>> Wo: Bit 5 of Ro (or an F2/F1 selector)
>>
> Stealing from the REX prefix in x86-64.
>>
>>
>> Another considered alternative could be using Op48 space, however:
>> If I wanted to support WEX with them, the bundle mechanism would
>> effectively need to be redesigned (and probably a lot more expensive).
>>
>> Both Op48 and Op64 encodings, as-is, would preclude using them in WEX
>> bundles.
>>
>>
>>
>> Yet another possibility, I could add a mode flag and then effectively
>> eat the entire 16-bit encoding space for 32-bit ops, where:
>> 0zzz..Dzzz: Expanded space with predication and more GPRs;
>> Ezzz/ Fzzz: Existing 32-bit encoding space.
>>
>> Or:
>> pwnm-ZeoZ
>> pwnm-Zeii
>>
>> Say, where 'p' partly encodes both the predication mode and encoding
>> block, say:
>> 0000: F0, PredT (SR.T)
>> 0001: F1/F2, PredT (SR.T)
>> 0010: F0, PredT (SR.S)
>> 0011: F1/F2, PredT (SR.S)
>> 0100: F0, PredF (SR.T)
>> 0101: F1/F2, PredF (SR.T)
>> 0110: F0, PredF (SR.S)
>> 0111: F1/F2, PredF (SR.S)
>> 1000: F0, Always
>> 1001: F1/F2, Always
>> 1010: F8, Always
>> 1011: F3, Always
>> 1100: -
>> 1101: -
>> 1110: Ezzz, Same as Before
>> 1111: Fzzz, Same as Before
>>
>>
>> However, this would also suck:
>> Requires a mode flag for fetch and decode to do its thing;
>> Would require explicit mode-change instructions;
>> Would likely be implemented as a special-purpose branch op;
>> Doesn't allow disassembly without knowing the mode;
>> ...
>>
>> However, this would only have a fairly minor impact on the existing
>> pipeline (as most of the significant changes had already been done as
>> part of the existing extensions).
>>
>> An addition here would be to be able to use SR.S as a predicate (eg: to
>> allow overlapping two different if branches or similar).
>>
>>
>>
>> Similarly, it seems that there is relatively little cost difference
>> between having R0..R31 and R0..R63, so they can be good if they can
>> potentially offer a performance advantage.
>>
>>
>> I am left feeling unsure if such an expanded GPR space is even really a
>> good idea though, since arguably only a minority of code will have
>> enough register pressure to benefit much from an expanded GPR space.
>> Likewise, these sorts of things are in the "once you add them and put
>> them into general use, you are basically stuck with them" category.
>>
>> One other tradeoff is that they do mean saving a lot more registers for
>> setjmp/longjmp, interrupts, or for context switches (roughly 512B worth
>> of registers), but this doesn't seem too terrible.
>>
>>
>> For now, either way, it is likely that the codegen will mostly ignore
>> the existence of the extended GPRs (until at such a point it is
>> "actually competent" at doing stuff more effectively). Instead, they
>> would mostly be left (for now) mostly in the domain of ASM code.
>>
>>
>> As-is, the compiler needs to use heuristics to decide between whether or
>> not to use 32 GPRs for a given function or limit itself to 16, since for
>> small functions using a larger GPR space can hurt more than it helps (if
>> the cost of saving/restoring registers to/from the stack frame exceeds
>> the cost of the register spills).
>>
>>
>> One could argue though, why not keep everything in scratch registers and
>> thus avoid saving things to the stack? This could be done, in theory,
>> but would require a bit more work on the compiler (such as classifying
>> whether or not a function is a pure leaf function, and effectively
>> replacing the current register allocator).
>>
>> However, "pure leaf" functions could potentially benefit from a larger
>> GPR space (a lot more scratch registers to assign values into), but
>> XGPR's complicate things if they can't be encoded with predicates (the
>> determination over whether to use predication or a branch is made before
>> it can be known which registers are being used).
>>
>> This would likely mean that whether or not such a "pure leaf" approach
>> could be used, and whether or not to enable XGPRs, would depend on
>> whether or not particular branches had been predicated, and which
>> registers the variables had been assigned to. All of this could require
>> using multiple passes over a function to disambiguate (well, or figure
>> out some other way for XGPR ops to be predicated).
>>
>>
>> Well, or potentially, I use the existing XGPR encoding for unconditional
>> ops, and use a jumbo-encoding for predicated XGPR ops, at the cost that
>> the predicated ops may not be bundled (grr...).
>>
>> And, all this is a bit more messy than I would prefer.
> <
> Of the 3 dimensions of expansion are concerned (separate files, bigger
> register specifiers and the encoding mess, or dedicated HW) I prefer
> the dedicated HW approach.
>>
>>
>> ...
>>
>>
>> Any thoughts?...
> <
> You found all the options, you get to grind through the paths towards
> reasonable solutions.
>

On Sunday, June 6, 2021 at 9:29:51 PM UTC-5, Stephen Fuld wrote:
> On 6/6/2021 2:10 PM, BGB wrote:

> > Any thoughts?...
> One thought that I have mentioned earlier, but actually works better
> when going from 32 to 64 than from 16 to 32 is the following:
>
> I am assuming you currently have three register specifiers in the
> instruction, two sources and one destination, each requiring 5 bits, so
> you are using a total of 15 bits. You could change this such that the
> two sources are 6 bits each, allowing 64 GPRs, but the destination
> specifier is now 3 bits, but instead of directly specifying a register,
> is is a value added to the value of the first source register specifier
> to give the destination specifier.
<
I should point out that with 3 register specifiers, you have 11 bits of
OpCode (2048 OpCodes, which should be plenty). Change to 6-bit
register specifiers, and you only have 8 bits of OpCode (256 OpCodes,
which is "pushing it" for a first generation of your architecture).

On 6/7/2021 4:02 AM, BGB wrote:
> On 6/7/2021 1:29 AM, Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>> On 6/6/2021 2:10 PM, BGB wrote:
>>>> Any thoughts?...
>>>
>>> One thought that I have mentioned earlier, but actually works better
>>> when going from 32 to 64 than from 16 to 32 is the following:
>>>
>>> I am assuming you currently have three register specifiers in the
>>> instruction, two sources and one destination, each requiring 5 bits,
>>> so you are using a total of 15 bits. You could change this such that
>>> the two sources are 6 bits each, allowing 64 GPRs, but the
>>> destination specifier is now 3 bits, but instead of directly
>>> specifying a register, is is a value added to the value of the first
>>> source register specifier to give the destination specifier.
>>>
>>> For example, an add instruction might be encoded as Add 3,R17,R40.
>>> The value 3 would be added to the value 17 to get the destination
>>> register, namely R20.
>>>
>>> This does mean taking a little longer to "decode" the address of the
>>> destination register, but you have the time, as it isn't needed until
>>> the operation is complete. You also loose some flexibility, in
>>> register allocation, and it means modifying the compiler's register
>>> allocator, but you still have the freedom of up to 8 potential
>>> destination registers for each instruction, and if the operation is
>>> commutative, the compiler can swap the two source operand specifiers
>>> to gain potentially another 8 destination possibilities.
>>>
>>> It isn't pretty, but it might be better than the alternatives.
>>
>> I would use the same approach, but avoid the adder in the target
>> register specifier:
>>
>> Grab the two top bits from the first source reg and merge with the
>> bottom 3 bits given as the destination. I.e. source and dest must be
>> in the same 8-reg bank, which saves that 6-bit adder.
>>
>> Alternatively, 5/6/4 bits per reg, so only the first source (or maybe
>> the destination?) has the full 6-bit specifier, another reg must be
>> from the same 32-reg bank and the final one from the same 16-reg bank.
>>
>> This would have been no trouble at all for my asm code, but possibly a
>> bit of bother for a compiler.
>>
>
> This sort of approach would *suck* for a compiler...
>
> My issue though wasn't that I couldn't come up with an encoding that can
> do all 3 registers with 6-bit IDs, but rather that there is seemingly no
> good way to have 3x 6-bit IDs *and* predication at the same time, eg:
>
> ADD R17, R45, R59 //Fine
>
> ADD?T R17, R31, R22 //Fine
> ADD?T R17, R45, R59 //Non-Encodable
>
> This is a potentially serious issue if I want to be able to use the XGPR
> registers in the compiler (as-is), since it creates a potential "gotcha"
> that there is no good way to resolve if it pops up during code
> generation (it effectively makes the implicit conversion of "if()" to
> predicated ops, and the use of XGPRs in a function, mutually exclusive).
>
> Granted, I do have the technology to use 48 or 64 bit encodings, but
> these would (at present) effectively preclude using them in bundles.
>

Partial update (mentioned elsewhere already):
I have now gone and added some 64-bit encodings, so these cases are now
able to be encoded.

* FFw0_00zz_F0nm_ZeoZ OP Rm, Ro, Rn
* FFw0_00ii_F1nm_Zeii OP (Rm, Disp17s), Rn
* FFw0_00ii_F2nm_Zeii OP Rm, Imm17, Rn
* FFw0_00ii_F2nZ_Zeii OP Imm18, Rn

* FFw0_0pzz_E0nm_ZeoZ OP?p Rm, Ro, Rn
* FFw0_0pii_E1nm_Zeii OP?p (Rm, Disp17s), Rn
* FFw0_0pii_E2nm_Zeii OP?p Rm, Imm17, Rn

* FF00_0ddd_F0dd_Cddd BRA (PC, Disp32s) //semi-revived encoding
* FF00_0ddd_F0dd_Dddd BSR (PC, Disp32s) //semi-revived encoding

Where the 0 values here mostly indicate points which will be defined as
leading to alternate parts of the encoding space.

The field used for immediate extension is Reserved / Must Be Zero for
instructions which don't use an immediate. In the Verilog code, what
would have otherwise been Imm5 fields are extended to 14 bits.

For most expanded Imm5 or Imm9 ops, the extended immediate will retain
the same zero or one extension as the base instruction (though,
potentially, some may become sign-extended instead).

Imm9 ops are extended to 17 bits, and Imm10 ops are extended to 18 bits.

The zero extended forms which upgrade to sign-extended forms will use Wi
as a sign-extension bit, so effectively Imm9u -> Imm18s, or Imm10u to
Imm19s. Other cases Wi will be MBZ.
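
For the Imm9u -> Imm18s case, the composition would look roughly like this in C. The bit placement (base Imm9 in the low bits, the 8 extension bits above it, Wi on top as the sign) is an assumption made for illustration, as are the function names:

```c
#include <stdint.h>

/* Sign-extend the low 'bits' bits of v (1 <= bits <= 31). */
static int32_t sext(uint32_t v, unsigned bits)
{
    uint32_t m = 1u << (bits - 1u);
    v &= (m << 1) - 1u;
    return (int32_t)(v ^ m) - (int32_t)m;
}

/* Assumed layout: bit 17 = Wi, bits 16..9 = extension, bits 8..0 = Imm9. */
static int32_t imm18s_from_parts(unsigned wi, unsigned ext8, unsigned imm9)
{
    uint32_t raw = ((uint32_t)(wi & 1u) << 17)
                 | ((uint32_t)(ext8 & 0xFFu) << 9)
                 | (uint32_t)(imm9 & 0x1FFu);
    return sext(raw, 18);
}
```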

The p field may allow use of alternate predicate bits, potentially:
* 0: SR.T
* 1: SR.S
* 2: -
* 3: -
* 4: SR.P
* 5: SR.Q
* 6: SR.R
* 7: SR.O
* 8..F: Reserved

For the near term though, it is likely only T and S will be supported as
targets for CMPxx and friends (and T|S predication would be cheaper).

Though, I am still not entirely settled on the specifics of alternate
predicate bits, or whether or not they are "actually necessary".

Lowest cost option being to not use alternate predicates though.
Or, could instead consider an S|T scheme.

It is likely instruction notation would need to be updated with the
predicate bit, eg:
?T //SR.T is Set
?F //SR.T is Clear
?ST //SR.S is Set
?SF //SR.S is Clear
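
In terms of behavior, the four notations (plus the unconditional case) would reduce to something like the following. This is a sketch only; the enum and struct are invented for the example and say nothing about how SR is laid out:

```c
#include <stdbool.h>

typedef enum { PRED_ALWAYS, PRED_T, PRED_F, PRED_ST, PRED_SF } pred_kind;

typedef struct {
    bool t;  /* SR.T */
    bool s;  /* SR.S */
} sr_bits;

/* True if an op with predicate 'p' should execute given the SR bits. */
static bool pred_passes(pred_kind p, sr_bits sr)
{
    switch (p) {
    case PRED_T:  return sr.t;    /* ?T  */
    case PRED_F:  return !sr.t;   /* ?F  */
    case PRED_ST: return sr.s;    /* ?ST */
    case PRED_SF: return !sr.s;   /* ?SF */
    default:      return true;    /* unconditional */
    }
}
```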

Still working out details on this part.

Luckily at least, the XGPR stuff hasn't really added much in terms of
resource cost (but have seen resource cost increases due to other features).

Had also recently added a Packed Search for Byte instruction, which did
result in a visible increase in LUT cost. I suspect it is partly that
synthesis seems to be freaking out some (and doing a lot of mass
duplication) on paths involving the SR.T bit (likely due to large fanout
due to driving the predication logic for the EX1 stage and similar).

But, not really much obvious workaround apart from adding another clock
cycle for CMPxx and similar (such as by using an interlock on compare
operations followed by a predicate op or conditional branch and
capturing the SR bits in ID2 rather than in EX1).

>
> It is less of an issue for ASM though, since a human can generally avoid
> walking into and getting stuck in a corner case.
>
>
> I am also thinking they will probably be ignored by the C ABI, so the C
> ABI will still only pass 8 items in registers, even though XGPR could
> allow passing 16 items in registers (if the pattern were extended).
>
> This would mostly be so that XGPR and non-XGPR functions could still
> call each other.
>

This could affect a subset of functions:
At 8 args in registers, a minority of functions end up needing to pass
arguments on the stack;
At 16 args in registers, hardly any of them would need to go on the stack.

Though, in the cases where they differ, the choice would affect the
ability for a function on one side to call a function on the other if
there were a mismatch here.

Extended args registers would look something like:
R4..R7, R20..R23, R36..R39, R52..R55
With even pairs being usable for passing 128-bit vectors.
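
As a sketch of the mapping, assuming each group of four sits 16 registers above the previous one (so the last group is taken as R52..R55, giving 16 argument registers in total). The function name is made up:

```c
/* Argument index (0..15) to register number under the extended pattern;
 * returns -1 for arguments that would go on the stack. */
static int arg_reg(unsigned i)
{
    if (i >= 16u)
        return -1;
    return (int)(4u + 16u * (i / 4u) + (i % 4u));
}
```

The first 8 entries are the existing argument registers (R4..R7, R20..R23), so a non-XGPR caller passing 8 or fewer arguments would see no difference.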

>
>> I already have this kind of issue with x64 where the same instruction
>> can have at least 3 different length depending upon which exact
>> registers are being used.
>>
>
> Yes, as noted, using a longer encoding is a possible option.
> This would at least allow the compiler to not get stuck, but isn't
> really ideal either from a performance POV (since it would only allow
> one instruction to execute at a time in these cases).
>
>
> Granted, another possible (pure compiler) workaround would be inferring
> high-register-pressure functions early, and then disabling predication
> in these cases.
>

With the XGPR Op64 encodings, luckily, this would no longer be necessary.

>
>> Terje
>>
>
