novaBBS - comp.arch - Misc idea: Dual ISA, BJX2 + RISC-V support...

So, I was poking at a few seemingly random ideas:
* Take the existing RISC-V ISA, and add a WEX-like sub-mode.
* Goal would be to remain backwards compatible with both the base ISA
and with RVC (though, they may be mutually-exclusive in a temporal sense);
* Gluing RISC-V decoder support onto my BJX2 core as an optional
extension feature.

So, as can be noted, standard RISC-V ops use the two LSBs to select the
encoding:
00/01/10: RVC Ops;
11: Base 32-bit ops.

So, say, in this sub-mode (possible):
00/01: Reserved
10: Wide-Execute
11: Scalar / End-of-Bundle

The 10 block is nearly identical to the 11 block, just with parts of the
encoding space chopped off and repurposed (such as for an equivalent of
a Jumbo prefix). The 11 block will be mostly unchanged.
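
As a rough sketch of how the low 2 bits would delimit bundles in this
sub-mode (Python used purely for illustration; split_bundles is a
made-up name, and the actual fetch/decode logic would live in hardware):

```python
def split_bundles(words):
    """Group 32-bit instruction words into bundles using the low 2 bits.

    10 = wide-execute (runs in parallel with the following word),
    11 = scalar / end-of-bundle, 00/01 = reserved in this sub-mode.
    """
    bundles, current = [], []
    for w in words:
        tag = w & 0b11
        if tag in (0b00, 0b01):
            raise ValueError(f"reserved encoding in WEX sub-mode: {w:#010x}")
        current.append(w)
        if tag == 0b11:
            # This word closes the bundle (or is a lone scalar op).
            bundles.append(current)
            current = []
    if current:
        raise ValueError("instruction stream ends inside a bundle")
    return bundles
```

A stream tagged 11, 10, 11 would come out as two bundles, one of a
single op and one of two ops.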

Things like Pred or PrWEX encodings would not map over well.

So, I am looking at supporting a core which does both BJX2 and RISC-V
(in particular RV64IC) as sub-modes.

SR(27:26):
00: BJX2, WEX Disabled (Sequential Execution)
10: BJX2, WEX Enabled
01: RISC-V, Scalar Mode (supports RVC)
11: RISC-V, Wide-Execute Mode (Possible)
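
Read this way, bit 26 picks the ISA and bit 27 enables WEX. A minimal
sketch of that decode, assuming the two digits in the table are written
as bit 27 then bit 26 (decode_isa_mode is a hypothetical helper name):

```python
def decode_isa_mode(sr):
    """Split SR(27:26) into (isa_name, wex_enabled)."""
    riscv = (sr >> 26) & 1   # low bit of the field: 0 = BJX2, 1 = RISC-V
    wex = (sr >> 27) & 1     # high bit of the field: WEX / wide-execute
    return ("RISC-V" if riscv else "BJX2", bool(wex))
```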

The RISC-V mode would mostly be handled with some tweaks in the L1 I$,
and an alternate set of decoders.

Otherwise, it looks like a lot of stuff is close enough that it may not
be too much of a stretch to support both ISAs on a single core.

Some of the basic extension sets are being left out for now:
* M: The BJX2 core lacks both full-width integer multiply and divide.
* F: Doesn't map well to BJX2's FPU
* D: Doesn't map well to BJX2's FPU
* A: Doesn't match up
* V: Doesn't match up
* ...

Went and wrote up some code for some of this already. At least, length
determination in RISC-V is fairly easy (within the 16/32-bit subset).
It appears the ISA supports some longer encoding forms, but I will
ignore them for now.
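
For reference, the length rule for the 16/32 subset fits in a few lines.
This is a sketch in Python rather than the actual decoder logic, and
rv_insn_length is a made-up name:

```python
def rv_insn_length(first_halfword):
    """Return the instruction length in bytes from its first 16 bits."""
    if first_halfword & 0b11 != 0b11:
        return 2                      # 00/01/10: RVC (16-bit) op
    if first_halfword & 0b11100 != 0b11100:
        return 4                      # 11, with bits 4:2 != 111: 32-bit op
    # Bits 4:0 all set mark the longer encoding forms, ignored here.
    raise NotImplementedError("encodings longer than 32 bits not handled")
```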

I may support WEX as an ISA extension in the RISC-V case, but I am
unlikely to support predicated instructions in this mode (partly because
there is a non-trivial point of divergence between the ISAs).

Also, unlike in BJX2, WEX mode in RISC-V would require the use of 32-bit
alignment.

Some other things are not strictly 1:1, so still dunno if all this is
"sane"...

The usage of the WEX sub-mode would likely be first generating RV32I
code, and then (after the fact) running something akin to a Wexifier,
which would try to shuffle and bundle the instructions. Would need a way
to figure out every branch target though, such that it doesn't try to
Wexify across basic-block boundaries (could require either hints or
metadata).

Then again, if the compiler output is already optimized for in-order
superscalar, the shuffling is easier, and the Wexifier would only need
to flag instructions that could be executed in parallel with the
following instruction (and it is no longer necessary to care as much
about basic-block boundaries).
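
A very simplified sketch of that second (flag-only) approach, in Python.
Everything here is hypothetical: the Insn record, the 3-wide limit, and
the rule of at most one memory op per bundle are assumptions for
illustration, and a real Wexifier would also need to respect whatever
lane restrictions the pipeline has:

```python
from collections import namedtuple

# Hypothetical instruction record: dest is a register name or None
# (pass None for stores, branches, and writes to x0), srcs is a tuple
# of register names, kind is "alu", "mem" or "branch".
Insn = namedtuple("Insn", "text dest srcs kind")


def _independent(earlier, later):
    """True if 'later' neither reads nor overwrites the result of 'earlier'."""
    if earlier.dest is None:
        return True
    return earlier.dest not in later.srcs and earlier.dest != later.dest


def flag_parallel(insns, max_width=3):
    """Return one flag per instruction: True = may issue with the next one."""
    flags = []
    bundle = []  # members of the bundle currently being built
    for i, cur in enumerate(insns):
        bundle.append(cur)
        nxt = insns[i + 1] if i + 1 < len(insns) else None
        ok = (
            nxt is not None
            and len(bundle) < max_width
            and cur.kind != "branch"  # a branch always closes its bundle
            and sum(m.kind == "mem" for m in bundle + [nxt]) <= 1
            and all(_independent(m, nxt) for m in bundle)
        )
        flags.append(ok)
        if not ok:
            bundle = []
    return flags
```

A True flag would correspond to emitting the op in the 10 block, and a
False flag to leaving it in the 11 block.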

Current leaning is that jumps between BJX2 and RISC-V mode would be
pulled off by branching to an address with the LSB set (the LSB would
effectively function as a mode-change toggle).
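
In other words, something along these lines (a Python sketch of the
intended behavior; branch_with_mode is a made-up name, and the mode is
modeled as a single boolean):

```python
def branch_with_mode(target, riscv_mode):
    """Return (new_pc, new_riscv_mode) for a branch to 'target'.

    A set LSB toggles between BJX2 and RISC-V mode; the bit itself
    is stripped from the address that ends up in the PC.
    """
    if target & 1:
        riscv_mode = not riscv_mode
    return target & ~1, riscv_mode
```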

I would likely leave out the M extension, as RISC-V specifies a
full-width Multiply and Divide, which are still a bit too expensive to
do in hardware as described. Though, emulating these could be possible
in theory.

The F and D extensions also don't map exactly to BJX2's FPU.

These cases would either require using traps or implementing some sort
of microcode.

Namely, RISC-V specifies an FPU with:
Dedicated FPU Registers
FADD/FSUB/FMUL/FDIV/FSQRT
Meanwhile, BJX2's FPU:
Uses GPRs
FADD/FSUB/FMUL

It is possible I could do a variation where I map F0..F31 to R32..R63,
but have not done so yet, and it still would not support FDIV or FSQRT.

Most likely options are:
* Leave this stuff as full software emulation, which is slow.
* Add special instructions that map closer to what the underlying
hardware supports, but this partly defeats the point of supporting a
more generic ISA.

Some register remapping is used:
X0 -> ZZR
X1 -> LR
X2 -> SP (R15 in BJX2)
X3 -> GBR
X4 -> TBR
X5 -> DHR (R1 in BJX2)
X6 -> R6
X7 -> R7
X8..X14 -> R8..R14
X15 -> R15U
X16..X31 -> R16..R31

F0..F31: R32..R63 ( Possible FPU Support )
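
The same table written as a lookup, as a Python sketch (map_xreg and
map_freg are made-up names; the F mapping is only the "possible" one
noted above, not something that exists yet):

```python
# X registers that land on something other than the same-numbered Rn.
_SPECIAL = {0: "ZZR", 1: "LR", 2: "SP", 3: "GBR", 4: "TBR",
            5: "DHR", 15: "R15U"}


def map_xreg(n):
    """Map RISC-V Xn to the BJX2 register name used in this scheme."""
    if not 0 <= n <= 31:
        raise ValueError("X register number out of range")
    return _SPECIAL.get(n, f"R{n}")


def map_freg(n):
    """Map RISC-V Fn to R32..R63 (the possible FPU mapping)."""
    if not 0 <= n <= 31:
        raise ValueError("F register number out of range")
    return f"R{32 + n}"
```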

This would not be cross-ISA ABI compatible, but trying to make cross-ISA
ABI calls work does not seem worthwhile. The ABIs also differ in terms
of the relative split between preserve and scratch registers, ...

As such, a call between RISC-V and BJX2 code would likely require the
use of an ABI thunk in addition to jumping between ISA modes.

Note that R2..R5 would be inaccessible in RISC-V mode as they are
effectively shadowed.

Likewise, X1..X5 are not GPRs in this implementation, so it will be
assumed/required that code actually follow the ABI in this area.

X15 goes the opposite direction: it is a GPR in RISC-V, whereas R15 is
an SPR in BJX2. This leads to needing to define C15 as R15U with this
extension, which maps to this register. The purpose of R15U is mostly to
allow a scheduler written in BJX2 code to be able to task-switch code
running in RISC-V Mode.

The reverse is not true, however, in that a task scheduler written in
RISC-V mode would not be able to task-switch processes running in BJX2 mode.

It looks like the base (unprivileged) RISC-V ISA does not cover things
like MMU, interrupt handling, ... (these are in the separate privileged
spec), so for user-mode code I could probably get away with just leaving
these parts mostly as-is.

Similar for things like address space, ...
So, the core would retain BJX2's existing address space and memory map.

Some aspects of the RISC-V ISA design leave me with a bit of a mystery
though, namely how does it "not suck"?...

A lot of fairly common stuff that can be expressed in a single
instruction in BJX2 looks like it would require much longer and more
convoluted instruction sequences to express in RISC-V. Some instructions
in RISC-V were "kinda weird" but not otherwise too much of a stretch for
how to express them within the existing pipeline (eg: JAL and JALR do
not have a direct functional equivalent in BJX2, but could still be
mapped to the existing branch mechanisms).

The main expensive thing I had to add was hardware support for
Compare-and-Branch instructions. These were a "mostly unimplemented"
feature in BJX2, but RISC-V uses these as the primary branch type.

There is a cost increase and timing penalty for the RISC-V support,
which I suspect is most likely primarily due to the Compare-and-Branch
operations (the main alternative being to add a mechanism to decompose
them into a 2-op sequence in the decoder).
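
The 2-op alternative could look roughly like the following Python
sketch. The op names (CMPEQ, CMPGT, CMPHI, BT, BF) and their operand
order are placeholders for illustration, not the actual BJX2 encodings;
here "CMPGT a, b" is taken to set the T bit when a > b (signed), CMPHI
is the unsigned version, and BT/BF branch on T set/clear:

```python
# RISC-V branch -> (compare op, swap operands?, branch-on-T op)
DECOMPOSE = {
    "BEQ":  ("CMPEQ", False, "BT"),
    "BNE":  ("CMPEQ", False, "BF"),
    "BLT":  ("CMPGT", True,  "BT"),   # rs1 < rs2   ==  rs2 > rs1
    "BGE":  ("CMPGT", True,  "BF"),   # rs1 >= rs2  ==  not (rs2 > rs1)
    "BLTU": ("CMPHI", True,  "BT"),
    "BGEU": ("CMPHI", True,  "BF"),
}

MASK = (1 << 64) - 1


def decompose(op, rs1, rs2, label):
    """Split a compare-and-branch into a compare plus a branch on T."""
    cmp_op, swap, br = DECOMPOSE[op]
    a, b = (rs2, rs1) if swap else (rs1, rs2)
    return [(cmp_op, a, b), (br, label)]


def branch_taken(op, regs, rs1, rs2):
    """Evaluate the 2-op sequence on a dict of register values."""
    (cmp_op, a, b), (br, _label) = decompose(op, rs1, rs2, "L")
    x, y = regs[a], regs[b]
    if cmp_op == "CMPEQ":
        t = x == y
    elif cmp_op == "CMPGT":
        t = x > y
    else:  # CMPHI: compare as unsigned 64-bit values
        t = (x & MASK) > (y & MASK)
    return t if br == "BT" else not t
```

This only needs one "equal" and two "greater" compares to cover all six
RISC-V branch conditions, at the cost of an extra op per branch.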

Although, I had started this out with the idea of being able to support
a Wide-Execute variation of RISC-V, I am starting to debate this, as:
* It would require a compiler to be aware of its existence to be able to
use it;
* RISC-V appears to kinda suck, so it is debatable if this would be
particularly worthwhile.

RISC-V does have theoretically larger load/store displacements (12 bits
vs 9 bits). However, given that they are unscaled and sign-extended,
their usable range will actually be less for QWORD operations than the
9-bit displacements in BJX2 (and, the XGPR encodings effectively add a
10-bit signed load/store encoding).
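
To put numbers on this, a small Python sketch. It assumes the BJX2 9-bit
displacement is zero-extended and scaled by the access size, which is
what the comparison implies but does not spell out; the function names
are made up:

```python
def signed_unscaled_range(bits):
    """Byte range of a sign-extended, unscaled displacement."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_scaled_range(bits, size):
    """Byte range of a zero-extended displacement scaled by 'size'."""
    return 0, ((1 << bits) - 1) * size


rv_lo, rv_hi = signed_unscaled_range(12)       # RISC-V load/store
bjx_lo, bjx_hi = unsigned_scaled_range(9, 8)   # assumed BJX2 QWORD form
```

Under those assumptions the RISC-V form reaches -2048..2047 bytes, while
the scaled 9-bit form reaches 0..4088 bytes for QWORD accesses, so the
forward reach is roughly double despite having 3 fewer bits.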

Granted:
ADDI Rn, Rm, Imm12s
Is slightly bigger than:
ADD Rm, Imm9s, Rn

One drawback, IMO, is that RISC-V's immediate values are comparably much
worse in the "bit-confetti" sense than BJX2's immediate values. In BJX2,
at least, contiguous bits in the encoding generally also represent
contiguous bits in the value, whereas RISC-V was apparently happy
leaving them a chewed-up mess...
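
As an example of the confetti, this is roughly what it takes to pull the
branch offset back out of a RISC-V B-type instruction word (Python
sketch; decode_b_imm and sign_extend are made-up helper names, the field
positions are from the base ISA encoding):

```python
def sign_extend(value, bits):
    """Sign-extend a 'bits'-wide unsigned value to a Python int."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_b_imm(word):
    """Reassemble the 13-bit branch offset of a B-type instruction."""
    imm = (
        ((word >> 31) & 0x1) << 12     # imm[12]   from bit 31
        | ((word >> 7) & 0x1) << 11    # imm[11]   from bit 7
        | ((word >> 25) & 0x3F) << 5   # imm[10:5] from bits 30:25
        | ((word >> 8) & 0xF) << 1     # imm[4:1]  from bits 11:8
    )
    return sign_extend(imm, 13)
```

So a single offset is gathered from four separate fields, with bit 11 in
particular parked down in bit 7 of the instruction word.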

As for which would win out in practice in terms of performance, it is
less clear. It is possible that RISC-V could have the advantage of more
mature compiler technology.

RISC-V seems to lack anything equivalent to the Conv family of
instructions:
MOV, EXTU/EXTS/...
I am having a concerned feeling that these may need to be faked using
masks and shifts, say:
EXTS.B R8, R9
Maps to, say:
SLLI X6, X8, 56
SRAI X9, X6, 56

But, how does one express unsigned int16 extension, is it something like:
LUI X6, 16
ADDI X6, X6, -1
AND X9, X8, X6

If so, this kinda sucks...
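
For what it is worth, the unsigned 16-bit case can also be done with a
shift pair (SLLI by 48, then SRLI by 48), which is one op shorter than
building the mask. A Python sketch modeling the RV64 shifts, to check
that the sequences agree (all function names here are made up):

```python
MASK64 = (1 << 64) - 1


def slli(x, n):
    """64-bit logical shift left."""
    return (x << n) & MASK64


def srli(x, n):
    """64-bit logical shift right."""
    return (x & MASK64) >> n


def srai(x, n):
    """64-bit arithmetic shift right, result as an unsigned 64-bit value."""
    x &= MASK64
    if x >> 63:
        x -= 1 << 64
    return (x >> n) & MASK64


def sext8(x):
    """SLLI 56 then SRAI 56: sign-extend the low byte."""
    return srai(slli(x, 56), 56)


def zext16_shift(x):
    """SLLI 48 then SRLI 48: zero-extend the low 16 bits."""
    return srli(slli(x, 48), 48)


def zext16_mask(x):
    """LUI 16; ADDI -1; AND: same result via a 0xFFFF mask."""
    mask = ((16 << 12) - 1) & MASK64
    return x & mask
```

Either way it is 2 or 3 ops for something that is a single op in BJX2.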

There is also seemingly no way to load or store a global variable within
a single instruction, ...

Something like:
j=arr[i];

Looks like it will generally require at least 3 instructions (this is
typically a single instruction in BJX2).

I have doubts; on the surface, it appears that all this "kinda sucks".

The main advantage that RISC-V has is that it could potentially allow
for a reasonably small / cheap CPU core, but this is less true of faking
it with a decoder hack on top of the BJX2 core (though, granted, it is
much less absurd than 32-bit ARM or x86 would have been, in that at
least the core ISA mechanics map over reasonably well, avoiding needing
some big/nasty emulation layer).

Well, unless one wants RV64GC or similar, which would likely require a
certain amount of emulation to pull off (or making a "native" RISC-V
core would likely require a full fork).

Or, is all this probably just a garbage feature that makes everything
likely worse-off overall ?...

Any thoughts?...

Subject	Replies	Author
Misc idea: Dual ISA, BJX2 + RISC-V support... By: BGB on Fri, 1 Oct 2021	23	BGB
