Message-ID:

"The only way for a reporter to look at a politician is down." -- H. L. Mencken

devel / comp.arch / Re: Misc idea: Dual ISA, BJX2 + RISC-V support...

So, I was poking at a few seemingly random ideas:
* Take the existing RISC-V ISA, and add a WEX like sub-mode.
* Goal would be to remain backwards compatible with both the base ISA
and with RVC (though, they may be mutually-exclusive in a temporal sense);
* Gluing RISC-V decoder support onto my BJX2 core as an optional
extension feature.

So, as can be noted, standard RISC-V ops have the two LSB bits:
00/01/10: RVC Ops;
11: Base 32-bit ops.

So, say, in this sub-mode (possible):
00/01: Reserved
10: Wide-Execute
11: Scalar / End-of-Bundle

The 10 block is nearly identical to the 11 block, just with parts of the
encoding space chopped off and repurposed (such as for an equivalent of
a Jumbo prefix). The 11 block will be mostly unchanged.

Things like Pred or PrWEX encodings would not map over well.

So, I am looking at supporting a core which does both BJX2 and RISC-V
(in particular RV64IC) as sub-modes.

SR(27:26):
00: BJX2, WEX Disabled (Sequential Execution)
10: BJX2, WEX Enabled
01: RISC-V, Scalar Mode (supports RVC)
11: RISC-V, Wide-Execute Mode (Possible)

The RISC-V mode would mostly be handled with some tweaks in the L1 I$,
and an alternate set of decoders.

Otherwise, it looks like a lot of stuff is close enough that it may not
be too much of a stretch to support both ISAs on a single core.

Some of the basic extension sets are being left out for now:
* M: The BJX2 core lacks both full-width integer multiply, and divide.
* F: Doesn't map well to BJX2's FPU
* D: Doesn't map well to BJX2's FPU
* A: Doesn't match up
* V: Doesn't match up
* ...

Went and wrote up some code for some of this already, at least length
determination in RISC-V is fairly easy (within the 16/32 subset). It
appears the ISA supports some longer encoding forms, but I will ignore
them for now.

I may support WEX as an ISA extension in the RISC-V case, but I am
unlikely to support predicated instructions in this mode (partly because
there is a non-trivial point of divergence between the ISAs).

Also, unlike in BJX2, WEX mode in RISC-V would require the use of 32-bit
alignment.

Some other things are not strictly 1:1, so still dunno if all this is
"sane"...

The usage of the WEX sub-mode would likely be first generating RV32I
code, and then (after the fact) running something akin to a Wexifier,
which would try to shuffle and bundle the instructions. Would need a way
to figure out every branch target though, such that it doesn't try to
Wexify across basic-block boundaries (could require either hints or
metadata).

Then again, if the compiler output is already optimized for in-order
superscalar, the shuffling is easier, and the Wexifier would only need
to flag instructions that could be executed in parallel with the
following instruction (and it is no longer necessary to care as much
about basic-block boundaries).

Current leaning is that jumps between BJX2 and RISC-V mode would be
pulled off by branching to an address with the LSB set (the LSB would
effectively function as a mode-change toggle).

I would likely leave out the M extension, as RISC-V specifies a
full-width Multiply and Divide, which are still a bit too expensive to
do in hardware as described. Though, emulating these could be possible
in theory.

The F and D extensions also don't map exactly to BJX2's FPU.

These cases would either require using traps or implementing some sort
of microcode.

Namely, RISC-V specifies an FPU with:
Dedicated FPU Registers
FADD/FSUB/FMUL/FDIV/FSQRT
Meanwhile, BJX2's FPU:
Uses GPRs
FADD/FSUB/FMUL

It is possible I could do a variation where I map F0..F31 to R32..R63,
but have not done so yet, and it still would not support FDIV or FSQRT.

Most likely options are:
Leave this stuff as full software emulation, which is slow.

Add special instructions that map closer to what the underlying hardware
supports, but this partly defeats the point of supporting a more generic
ISA.

Some register remapping is used:
X0 -> ZZR
X1 -> LR
X2 -> SP (R15 in BJX2)
X3 -> GBR
X4 -> TBR
X5 -> DHR (R1 in BJX2)
X6 -> R6
X7 -> R7
X8..X14 -> R8..R14
X15 -> R15U
X16..X31 -> R16..R31

F0..F31: R32..R63 ( Possible FPU Support )

This would not be cross-ISA ABI compatible, but trying to make cross-ISA
ABI calls work does not seem worthwhile. The ABIs also differ in terms
of the relative split between preserve and scratch registers, ...

As such, a call between RISC-V and BJX2 code would likely require the
use of a ABI thunk in addition to jumping between ISA modes.

Note that R2..R5 would be inaccessible in RISC-V mode as they are
effectively shadowed.

Likewise, X1..X5 are not GPRs in this implementation, so it will be
assumed/required that code actually follow the ABI in this area.

X15 goes the opposite direction, where it is a GPR in RISC-V rather than
an SPR in BJX2, leading to needing to define C15 as R15U with this
extension, which maps to this register. The purpose of R15U is mostly to
allow a scheduler written in BJX2 code to be able to task-switch code
running in RISC-V Mode.

The reverse is not true, however, in that a task scheduler written in
RISC-V mode would not be able to task-switch processes running in BJX2 mode.

It looks like RISC-V leaves things like MMU, interrupt handling, ... as
implementation dependent, so I could probably get away with just leaving
these parts mostly as-is.

Similar for things like address space, ...
So, the core would retain BJX2's existing address space and memory map.

Some aspects of the RISC-V ISA design leave me with a bit of a mystery
though, namely how does it "not suck"?...

A lot of fairly common stuff that can be expressed in a single
instruction in BJX2 looks like it would require a much longer and more
convoluted instruction sequences to express in RISC-V. Some instruction
in RISC-V were "kinda weird" but not otherwise too much of a stretch for
how to express them within the existing pipeline (eg: JAL and JALR do
not have direct function equivalent in BJX2, but could still be mapped
to the existing branch mechanisms).

The main expensive thing I had to add hardware support for
Compare-and-Branch instructions. These were a "mostly unimplemented"
feature in BJX2, but RISC-V uses these as the primary branch type.

There is a cost increase and timing penalty for the RISC-V support,
which I suspect is most likely primarily due to the Compare-and-Branch
operations (the main alternative being to add a mechanism to decompose
them into a 2-op sequence in the decoder).

Although, I had started this out with the idea of being able to support
a Wide-Execute variation of RISC-V, I am starting to debate this, as:
* It would require a compiler to be aware of its existence to be able to
use it;
* RISC-V appears to, kinda suck, so it is debatable if this would be
particularly worthwhile.

While RISC-V does have theoretically larger load/store displacements
(12-bits vs 9-bits).

Given that they are unscaled and sign-extended, their usable range will
actually be less for QWORD operations than the 9-bit displacements in
BJX2 (and, the XGPR encodings effectively add a 10-bit signed load/store
encoding).

Granted:
ADDI Rn, Rm, Imm12s
Is slightly bigger than:
ADD Rm, Imm9s, Rn

One draw-back IMO, is that comparably RISC-V's immediate values are much
worse in the "bit-confetti" sense than BJX2's immediate values. Where,
at least, contiguous bits in the encoding generally also represent
contiguous bits in the value, whereas RISC-V was apparently happy
leaving them a chewed up mess...

As for which would win out in practice in terms of performance, it is
less clear. It is possible that RISC-V could have the advantage of more
mature compiler technology.

RISC-V seems lack anything equivalent to the Conv family instructions:
MOV, EXTU/EXTS/...
I am having a concerned feeling that these may need to be faked using
masks and shifts, say:
EXTS.B R8, R9
Maps to, say:
SLLI X6, X8, 56
SRAI X9, X6, 56

But, how does one express unsigned int16 extension, is it something like:
LUI X6, 16
ADDI X6, X6, -1
AND X9, X8, X6

If so, this kinda sucks...

There is also seemingly no way to load or store a global variable within
a single instruction, ...

Something like:
j=arr[i];

Looks like it will generally require at least 3 instructions (this is
typically a single instruction in BJX2).

I have doubts, on the surface, it appears that all this "kinda sucks".

The main advantage that RISC-V has is that it could potentially allow
for a reasonably small / cheap CPU core, but this is less true of faking
it with a decoder hack on top of the BJX2 core (though, granted, it is
much less absurd than 32-bit ARM or x86 would have been, in that at
least the core ISA mechanics map over reasonably well, avoiding needing
some big/nasty emulation layer).

Click here to read the complete article

On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
> So, I was poking at a few seemingly random ideas:
> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
> * Goal would be to remain backwards compatible with both the base ISA
> and with RVC (though, they may be mutually-exclusive in a temporal sense);
> * Gluing RISC-V decoder support onto my BJX2 core as an optional
> extension feature.
>
> So, as can be noted, standard RISC-V ops have the two LSB bits:
> 00/01/10: RVC Ops;
> 11: Base 32-bit ops.
>
> So, say, in this sub-mode (possible):
> 00/01: Reserved
> 10: Wide-Execute
> 11: Scalar / End-of-Bundle
>
> The 10 block is nearly identical to the 11 block, just with parts of the
> encoding space chopped off and repurposed (such as for an equivalent of
> a Jumbo prefix). The 11 block will be mostly unchanged.
>
> Things like Pred or PrWEX encodings would not map over well.
>
>
>
> So, I am looking at supporting a core which does both BJX2 and RISC-V
> (in particular RV64IC) as sub-modes.
>
> SR(27:26):
> 00: BJX2, WEX Disabled (Sequential Execution)
> 10: BJX2, WEX Enabled
> 01: RISC-V, Scalar Mode (supports RVC)
> 11: RISC-V, Wide-Execute Mode (Possible)
>
>
> The RISC-V mode would mostly be handled with some tweaks in the L1 I$,
> and an alternate set of decoders.
>
> Otherwise, it looks like a lot of stuff is close enough that it may not
> be too much of a stretch to support both ISAs on a single core.
>
> Some of the basic extension sets are being left out for now:
> * M: The BJX2 core lacks both full-width integer multiply, and divide.
> * F: Doesn't map well to BJX2's FPU
> * D: Doesn't map well to BJX2's FPU
> * A: Doesn't match up
> * V: Doesn't match up
> * ...
>
>
> Went and wrote up some code for some of this already, at least length
> determination in RISC-V is fairly easy (within the 16/32 subset). It
> appears the ISA supports some longer encoding forms, but I will ignore
> them for now.
>
> I may support WEX as an ISA extension in the RISC-V case, but I am
> unlikely to support predicated instructions in this mode (partly because
> there is a non-trivial point of divergence between the ISAs).
>
> Also, unlike in BJX2, WEX mode in RISC-V would require the use of 32-bit
> alignment.
>
> Some other things are not strictly 1:1, so still dunno if all this is
> "sane"...
>
> The usage of the WEX sub-mode would likely be first generating RV32I
> code, and then (after the fact) running something akin to a Wexifier,
> which would try to shuffle and bundle the instructions. Would need a way
> to figure out every branch target though, such that it doesn't try to
> Wexify across basic-block boundaries (could require either hints or
> metadata).
>
> Then again, if the compiler output is already optimized for in-order
> superscalar, the shuffling is easier, and the Wexifier would only need
> to flag instructions that could be executed in parallel with the
> following instruction (and it is no longer necessary to care as much
> about basic-block boundaries).
>
> Current leaning is that jumps between BJX2 and RISC-V mode would be
> pulled off by branching to an address with the LSB set (the LSB would
> effectively function as a mode-change toggle).
>
>
> I would likely leave out the M extension, as RISC-V specifies a
> full-width Multiply and Divide, which are still a bit too expensive to
> do in hardware as described. Though, emulating these could be possible
> in theory.
>
> The F and D extensions also don't map exactly to BJX2's FPU.
>
> These cases would either require using traps or implementing some sort
> of microcode.
>
> Namely, RISC-V specifies an FPU with:
> Dedicated FPU Registers
> FADD/FSUB/FMUL/FDIV/FSQRT
> Meanwhile, BJX2's FPU:
> Uses GPRs
> FADD/FSUB/FMUL
>
> It is possible I could do a variation where I map F0..F31 to R32..R63,
> but have not done so yet, and it still would not support FDIV or FSQRT.
>
>
> Most likely options are:
> Leave this stuff as full software emulation, which is slow.
>
> Add special instructions that map closer to what the underlying hardware
> supports, but this partly defeats the point of supporting a more generic
> ISA.
>
>
> Some register remapping is used:
> X0 -> ZZR
> X1 -> LR
> X2 -> SP (R15 in BJX2)
> X3 -> GBR
> X4 -> TBR
> X5 -> DHR (R1 in BJX2)
> X6 -> R6
> X7 -> R7
> X8..X14 -> R8..R14
> X15 -> R15U
> X16..X31 -> R16..R31
>
> F0..F31: R32..R63 ( Possible FPU Support )
>
> This would not be cross-ISA ABI compatible, but trying to make cross-ISA
> ABI calls work does not seem worthwhile. The ABIs also differ in terms
> of the relative split between preserve and scratch registers, ...
>
> As such, a call between RISC-V and BJX2 code would likely require the
> use of a ABI thunk in addition to jumping between ISA modes.
>
> Note that R2..R5 would be inaccessible in RISC-V mode as they are
> effectively shadowed.
>
> Likewise, X1..X5 are not GPRs in this implementation, so it will be
> assumed/required that code actually follow the ABI in this area.
>
> X15 goes the opposite direction, where it is a GPR in RISC-V rather than
> an SPR in BJX2, leading to needing to define C15 as R15U with this
> extension, which maps to this register. The purpose of R15U is mostly to
> allow a scheduler written in BJX2 code to be able to task-switch code
> running in RISC-V Mode.
>
> The reverse is not true, however, in that a task scheduler written in
> RISC-V mode would not be able to task-switch processes running in BJX2 mode.
>
>
> It looks like RISC-V leaves things like MMU, interrupt handling, ... as
> implementation dependent, so I could probably get away with just leaving
> these parts mostly as-is.
>
> Similar for things like address space, ...
> So, the core would retain BJX2's existing address space and memory map.
>
>
>
> Some aspects of the RISC-V ISA design leave me with a bit of a mystery
> though, namely how does it "not suck"?...
>
> A lot of fairly common stuff that can be expressed in a single
> instruction in BJX2 looks like it would require a much longer and more
> convoluted instruction sequences to express in RISC-V. Some instruction
> in RISC-V were "kinda weird" but not otherwise too much of a stretch for
> how to express them within the existing pipeline (eg: JAL and JALR do
> not have direct function equivalent in BJX2, but could still be mapped
> to the existing branch mechanisms).
>
>
> The main expensive thing I had to add hardware support for
> Compare-and-Branch instructions. These were a "mostly unimplemented"
> feature in BJX2, but RISC-V uses these as the primary branch type.
>
> There is a cost increase and timing penalty for the RISC-V support,
> which I suspect is most likely primarily due to the Compare-and-Branch
> operations (the main alternative being to add a mechanism to decompose
> them into a 2-op sequence in the decoder).
>
>
> Although, I had started this out with the idea of being able to support
> a Wide-Execute variation of RISC-V, I am starting to debate this, as:
> * It would require a compiler to be aware of its existence to be able to
> use it;
> * RISC-V appears to, kinda suck, so it is debatable if this would be
> particularly worthwhile.
>
>
> While RISC-V does have theoretically larger load/store displacements
> (12-bits vs 9-bits).
>
> Given that they are unscaled and sign-extended, their usable range will
> actually be less for QWORD operations than the 9-bit displacements in
> BJX2 (and, the XGPR encodings effectively add a 10-bit signed load/store
> encoding).
>
> Granted:
> ADDI Rn, Rm, Imm12s
> Is slightly bigger than:
> ADD Rm, Imm9s, Rn
>
> One draw-back IMO, is that comparably RISC-V's immediate values are much
> worse in the "bit-confetti" sense than BJX2's immediate values. Where,
> at least, contiguous bits in the encoding generally also represent
> contiguous bits in the value, whereas RISC-V was apparently happy
> leaving them a chewed up mess...
>
>
> As for which would win out in practice in terms of performance, it is
> less clear. It is possible that RISC-V could have the advantage of more
> mature compiler technology.
>
>
> RISC-V seems lack anything equivalent to the Conv family instructions:
> MOV, EXTU/EXTS/...
> I am having a concerned feeling that these may need to be faked using
> masks and shifts, say:
> EXTS.B R8, R9
> Maps to, say:
> SLLI X6, X8, 56
> SRAI X9, X6, 56
>
> But, how does one express unsigned int16 extension, is it something like:
> LUI X6, 16
> ADDI X6, X6, -1
> AND X9, X8, X6
>
> If so, this kinda sucks...
>
>
> There is also seemingly no way to load or store a global variable within
> a single instruction, ...
>
> Something like:
> j=arr[i];
>
> Looks like it will generally require at least 3 instructions (this is
> typically a single instruction in BJX2).
>
>
>
> I have doubts, on the surface, it appears that all this "kinda sucks".
>
> The main advantage that RISC-V has is that it could potentially allow
> for a reasonably small / cheap CPU core, but this is less true of faking
> it with a decoder hack on top of the BJX2 core (though, granted, it is
> much less absurd than 32-bit ARM or x86 would have been, in that at
> least the core ISA mechanics map over reasonably well, avoiding needing
> some big/nasty emulation layer).
>
>
> Well, unless one wants RV64GC or similar, which would likely require a
> certain amount of emulation to pull off (or making a "native" RISC-V
> core would likely require a full fork).
>
>
> Or, is all this probably just a garbage feature that makes everything
> likely worse-off overall ?...
>
>
> Any thoughts?...
Some of the not too expensive FPGAs have room for a hundred simple cores,
all running simultaneously. Easy enough to show them on a 4K display.
A solution in search of a problem?

Click here to read the complete article

On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
> So, I was poking at a few seemingly random ideas:
> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
> * Goal would be to remain backwards compatible with both the base ISA
> and with RVC (though, they may be mutually-exclusive in a temporal sense);
> * Gluing RISC-V decoder support onto my BJX2 core as an optional
> extension feature.
>
> So, as can be noted, standard RISC-V ops have the two LSB bits:
> 00/01/10: RVC Ops;
> 11: Base 32-bit ops.
>
> So, say, in this sub-mode (possible):
> 00/01: Reserved
> 10: Wide-Execute
> 11: Scalar / End-of-Bundle
<
In my personal opinion, WEX is a crutch, Great Big machines will need the
equivalent of WEX^2 or WEX^3, while smallest possible implementations
will be burdened by WEX that they can't execute. So, basically, WEX does
not scale, and this is bad for long term architectures.
<
>
> The 10 block is nearly identical to the 11 block, just with parts of the
> encoding space chopped off and repurposed (such as for an equivalent of
> a Jumbo prefix). The 11 block will be mostly unchanged.
>
> Things like Pred or PrWEX encodings would not map over well.
>
>
>
> So, I am looking at supporting a core which does both BJX2 and RISC-V
> (in particular RV64IC) as sub-modes.
>
> SR(27:26):
> 00: BJX2, WEX Disabled (Sequential Execution)
> 10: BJX2, WEX Enabled
> 01: RISC-V, Scalar Mode (supports RVC)
> 11: RISC-V, Wide-Execute Mode (Possible)
>
>
> The RISC-V mode would mostly be handled with some tweaks in the L1 I$,
> and an alternate set of decoders.
>
> Otherwise, it looks like a lot of stuff is close enough that it may not
> be too much of a stretch to support both ISAs on a single core.
>
> Some of the basic extension sets are being left out for now:
> * M: The BJX2 core lacks both full-width integer multiply, and divide.
> * F: Doesn't map well to BJX2's FPU
> * D: Doesn't map well to BJX2's FPU
> * A: Doesn't match up
> * V: Doesn't match up
> * ...
>
>
> Went and wrote up some code for some of this already, at least length
> determination in RISC-V is fairly easy (within the 16/32 subset). It
> appears the ISA supports some longer encoding forms, but I will ignore
> them for now.
>
> I may support WEX as an ISA extension in the RISC-V case, but I am
> unlikely to support predicated instructions in this mode (partly because
> there is a non-trivial point of divergence between the ISAs).
>
> Also, unlike in BJX2, WEX mode in RISC-V would require the use of 32-bit
> alignment.
>
> Some other things are not strictly 1:1, so still dunno if all this is
> "sane"...
>
> The usage of the WEX sub-mode would likely be first generating RV32I
> code, and then (after the fact) running something akin to a Wexifier,
> which would try to shuffle and bundle the instructions. Would need a way
> to figure out every branch target though, such that it doesn't try to
> Wexify across basic-block boundaries (could require either hints or
> metadata).
<
Do it at the assembly level where the labels are still present.
>
> Then again, if the compiler output is already optimized for in-order
> superscalar, the shuffling is easier, and the Wexifier would only need
> to flag instructions that could be executed in parallel with the
> following instruction (and it is no longer necessary to care as much
> about basic-block boundaries).
>
> Current leaning is that jumps between BJX2 and RISC-V mode would be
> pulled off by branching to an address with the LSB set (the LSB would
> effectively function as a mode-change toggle).
>
>
> I would likely leave out the M extension, as RISC-V specifies a
> full-width Multiply and Divide, which are still a bit too expensive to
> do in hardware as described. Though, emulating these could be possible
> in theory.
>
> The F and D extensions also don't map exactly to BJX2's FPU.
>
> These cases would either require using traps or implementing some sort
> of microcode.
>
> Namely, RISC-V specifies an FPU with:
> Dedicated FPU Registers
> FADD/FSUB/FMUL/FDIV/FSQRT
> Meanwhile, BJX2's FPU:
> Uses GPRs
> FADD/FSUB/FMUL
>
> It is possible I could do a variation where I map F0..F31 to R32..R63,
> but have not done so yet, and it still would not support FDIV or FSQRT.
>
>
> Most likely options are:
> Leave this stuff as full software emulation, which is slow.
>
> Add special instructions that map closer to what the underlying hardware
> supports, but this partly defeats the point of supporting a more generic
> ISA.
>
>
> Some register remapping is used:
> X0 -> ZZR
> X1 -> LR
> X2 -> SP (R15 in BJX2)
> X3 -> GBR
> X4 -> TBR
> X5 -> DHR (R1 in BJX2)
> X6 -> R6
> X7 -> R7
> X8..X14 -> R8..R14
> X15 -> R15U
> X16..X31 -> R16..R31
>
> F0..F31: R32..R63 ( Possible FPU Support )
>
> This would not be cross-ISA ABI compatible, but trying to make cross-ISA
> ABI calls work does not seem worthwhile. The ABIs also differ in terms
> of the relative split between preserve and scratch registers, ...
>
> As such, a call between RISC-V and BJX2 code would likely require the
> use of a ABI thunk in addition to jumping between ISA modes.
>
> Note that R2..R5 would be inaccessible in RISC-V mode as they are
> effectively shadowed.
>
> Likewise, X1..X5 are not GPRs in this implementation, so it will be
> assumed/required that code actually follow the ABI in this area.
>
> X15 goes the opposite direction, where it is a GPR in RISC-V rather than
> an SPR in BJX2, leading to needing to define C15 as R15U with this
> extension, which maps to this register. The purpose of R15U is mostly to
> allow a scheduler written in BJX2 code to be able to task-switch code
> running in RISC-V Mode.
>
> The reverse is not true, however, in that a task scheduler written in
> RISC-V mode would not be able to task-switch processes running in BJX2 mode.
<
It might be good to point out that there is NO task scheduler in My 66000
at least as seen by the instruction stream. The task scheduler is over by the
Memory/DRAM Controller where it collects "interrupts" and convert these into
prioritized context switch packets.
>
>
> It looks like RISC-V leaves things like MMU, interrupt handling, ... as
> implementation dependent, so I could probably get away with just leaving
> these parts mostly as-is.
<
I consider this a mistake on RISC-V's part.
>
> Similar for things like address space, ...
> So, the core would retain BJX2's existing address space and memory map.
>
>
>
> Some aspects of the RISC-V ISA design leave me with a bit of a mystery
> though, namely how does it "not suck"?...
>
> A lot of fairly common stuff that can be expressed in a single
> instruction in BJX2 looks like it would require a much longer and more
> convoluted instruction sequences to express in RISC-V. Some instruction
> in RISC-V were "kinda weird" but not otherwise too much of a stretch for
> how to express them within the existing pipeline (eg: JAL and JALR do
> not have direct function equivalent in BJX2, but could still be mapped
> to the existing branch mechanisms).
<
The MIPS and RISC-V pipelines are organized around the notion of
compare-and-branch.
>
>
> The main expensive thing I had to add hardware support for
> Compare-and-Branch instructions. These were a "mostly unimplemented"
> feature in BJX2, but RISC-V uses these as the primary branch type.
>
> There is a cost increase and timing penalty for the RISC-V support,
> which I suspect is most likely primarily due to the Compare-and-Branch
> operations (the main alternative being to add a mechanism to decompose
> them into a 2-op sequence in the decoder).
<
One either designs the pipeline around the compare-and-branch or one
designs the pipeline such that compare and branch are 2 different
instructions. You cannot do both. Both have merit and demerits.
>
>
> Although, I had started this out with the idea of being able to support
> a Wide-Execute variation of RISC-V, I am starting to debate this, as:
> * It would require a compiler to be aware of its existence to be able to
> use it;
> * RISC-V appears to, kinda suck, so it is debatable if this would be
> particularly worthwhile.
>
>
> While RISC-V does have theoretically larger load/store displacements
> (12-bits vs 9-bits).
<
Piddly crap. My 66000 has 16-bit, 32-bit, and 64-bit displacements.
>
> Given that they are unscaled and sign-extended, their usable range will
> actually be less for QWORD operations than the 9-bit displacements in
> BJX2 (and, the XGPR encodings effectively add a 10-bit signed load/store
> encoding).
<
If you have an Inherently misaligned memory model, this is the only
realistic choice.
>
> Granted:
> ADDI Rn, Rm, Imm12s
> Is slightly bigger than:
> ADD Rm, Imm9s, Rn
<
Try calculating how big:
<
ADD Rd,Rs1,0x123456789101112
<
would be in RISC-V !! (Total size, including both instruction and data)
>
> One draw-back IMO, is that comparably RISC-V's immediate values are much
> worse in the "bit-confetti" sense than BJX2's immediate values. Where,
> at least, contiguous bits in the encoding generally also represent
> contiguous bits in the value, whereas RISC-V was apparently happy
> leaving them a chewed up mess...
<
One of the things that happens when you start with the opcode at the wrong
end of the instruction word.
>
>
> As for which would win out in practice in terms of performance, it is
> less clear. It is possible that RISC-V could have the advantage of more
> mature compiler technology.
>
>
> RISC-V seems lack anything equivalent to the Conv family instructions:
> MOV, EXTU/EXTS/...
> I am having a concerned feeling that these may need to be faked using
> masks and shifts, say:
> EXTS.B R8, R9
> Maps to, say:
> SLLI X6, X8, 56
> SRAI X9, X6, 56
<
RISC-V has 12-bit constants. The lower 6-bits is the shift count, the upper 6-bits
are the size, with the constraint that 0 implies maximum (i.e., register) width.
<
I do this in My 66000. When used as reg-reg, bits<5:0> remain the shift count
while bits <37:32> are the field size. The HW checks that the other bits have
"no significance" and raised the OPERAND exception if the operand pattern
is not within the required domain.
>
> But, how does one express unsigned int16 extension, is it something like:
> LUI X6, 16
> ADDI X6, X6, -1
> AND X9, X8, X6
>
> If so, this kinda sucks...
>
>
> There is also seemingly no way to load or store a global variable within
> a single instruction, ...
>
> Something like:
> j=arr[i];
>
<
LDD Rj,[R0+Ri<<s+arr]
<
> Looks like it will generally require at least 3 instructions (this is
> typically a single instruction in BJX2).
>
>
>
> I have doubts, on the surface, it appears that all this "kinda sucks".
>
> The main advantage that RISC-V has is that it could potentially allow
> for a reasonably small / cheap CPU core, but this is less true of faking
> it with a decoder hack on top of the BJX2 core (though, granted, it is
> much less absurd than 32-bit ARM or x86 would have been, in that at
> least the core ISA mechanics map over reasonably well, avoiding needing
> some big/nasty emulation layer).
<
It is my personal belief, that if you wanted a small core your better starting
point would be MIPS R3000 and integrate the caches.
>
>
> Well, unless one wants RV64GC or similar, which would likely require a
> certain amount of emulation to pull off (or making a "native" RISC-V
> core would likely require a full fork).
>
>
> Or, is all this probably just a garbage feature that makes everything
> likely worse-off overall ?...
>
>
> Any thoughts?...
<
Has any dual/mixed instruction set machine EVER had a long lifetime ?

Click here to read the complete article

On 10/1/2021 4:05 PM, JimBrakefield wrote:
> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
>> So, I was poking at a few seemingly random ideas:
>> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
>> * Goal would be to remain backwards compatible with both the base ISA
>> and with RVC (though, they may be mutually-exclusive in a temporal sense);
>> * Gluing RISC-V decoder support onto my BJX2 core as an optional
>> extension feature.
>>

....

>>
>> Well, unless one wants RV64GC or similar, which would likely require a
>> certain amount of emulation to pull off (or making a "native" RISC-V
>> core would likely require a full fork).
>>
>>
>> Or, is all this probably just a garbage feature that makes everything
>> likely worse-off overall ?...
>>
>>
>> Any thoughts?...
> Some of the not too expensive FPGAs have room for a hundred simple cores,
> all running simultaneously. Easy enough to show them on a 4K display.
> A solution in search of a problem?
>

One can have lots of tiny cores, but they are not useful for all that much.

One of the bigger problems I have been facing in my projects is not so
much with doing computations, but with getting data from memory efficiently.

Making the code itself faster has the indirect drawback in that it
increases the relative amount of time spent waiting for memory (and I
still have not succeeded at making Quake all that playable at 50MHz).

Past attempts to work out the math for "what if I ran the RAM using
SERDES at a more 'appropriate' clock speed" only came up with fairly
modest gains.

Most of my recent performance gains have come more from fixing compiler
bugs than from hardware features, eg:
Fixing a bug which resulted in lots of redundant type conversions helped
a certain amount;
More recently, fixing another bug where loading a value from an unsigned
type tended to result in an additional (redundant) type-conversion op
was also helpful (~ 1% smaller binaries, also made Quake a little faster);
....

Some of these savings were more due to reducing register pressure (fewer
temporaries means fewer stack spills).

> For soft cores that support virtual memory, maybe room for ten cores?
> Don't forget 2 to 6 built-in ARM cores. Or:
>
> Actel Polarfire, now Microchip/Microsemi offers four RISC-V cores and FPGA fabric.
>

The FPGAs I am using don't have any hard-wired cores.

Mostly dealing with Spartan and Artix class hardware.

Was previously generally able to fit two BJX2 cores onto an XC7A100T,
and had also been working some on trying to implement a modified core
which could support SMT (would effectively have 6 execute lanes with two
pilelines, but partly conjoins some other hardware).

The hope was that SMT could be cheaper than full dual-core (such as by
sharing the MMU and FPU, ...).

Had looked at possibly doing a 6-wide version of the BJX2 ISA, but there
are enough roadblocks here that this does not seem particularly viable
(mostly involving trying to pull off a unified register file with 12
read ports and 6 write ports).

A purely SMT implementation avoids this issue though, as it can use
independent register files.

I could fit 4 cores into the FPGA I am using if I were running a 64-bit
RISC with only a single execute lane (running at one instruction per
cycle), with smaller L1 caches (say, 4K I$ + 8K D$).

Or around 8 cores if I were using a much more minimalist core, such as a
basic 32-bit single-issue core.

Though, one also has to use smaller/simpler L1 caches for this (such as
512B 16x2x16B), but making the L1 caches smaller or similar tends to
have a significant adverse effect on performance.

Programs like Doom and Quake seem to be most happy with ~ 16K or 32K of
L1, and as much L2 as one can throw at them. After more recently moving
to RAM backed VRAM, this means I can now afford 256K of L2 (with only a
moderate amount of clashing between the active program and
video-display; partly as the display hardware is also routed through the
L2 cache; this seems to hold up pretty OK for 320x200 at least).

However, with 256K of L2 + 16K + 32K for L1, I am basically at the limit
of the available Block-RAM.

In theory, given the "pretty much awful" ILP which I am getting from my
C compiler, I could run a single-issue core at 100MHz vs a 3-wide core
at 50MHz, and have it be faster.

However... I have to make the L1 caches smaller, which hurts the
performance a lot worse than the higher clock-speed helps (it doesn't
make the RAM any faster).

So, it ends up being faster and easier to just stick with everything
being 50 MHz, and instead trying to leverage VLIW. In theory, in-order
super-scalar can do basically the same thing, but is more complicated
and more expensive than VLIW.

Similar, 50MHz single issue is slower than 50MHz VLIW.

And, making an L1 D$ that can be 16K or 32K and also allow single-cycle
access latency, and pass timing at 75 or 100MHz, is easier said than done.

But, kinda in a local-optimum trap here, where I can make Doom run
pretty well, but Quake still tends to remain "pretty much unplayable".

It looks like to get reasonably playable framerates in Quake, I would
need to pull off all 3 of these:
Similar-sized, or larger, L1 caches (eg: 32K L1 D$, or more);
Significantly faster DRAM access;
Being able to run a 3-wide pipeline at 100MHz.

Ironically, I have managed to make a small Minecraft style 3D engine
work semi-passably on this (performance is a little bit better than
Quake; although the draw distance isn't particularly large at only 24
meters).

This isn't really going to happen on the Artix I am using though, and I
still don't have a job at present, so no real money to drop on a much
more expensive Kintex based board.

Some other guy managed to get somewhat better framerates for a Minecraft
style engine running on an ICE40, but his effort offloads most of the 3D
rendering machinery to hardware, and is basically using a small 16-bit
RISC style ISA for the actual CPU part.

In contrast, I am basically using software rendering based around a more
traditional triangle-rasterization approach. However, since
perspective-correct rendering is expensive, my renderer tends to carve
up large triangles and draws them using affine projection (more
subdivision looks nicer, but comes at a fairly steep performance cost).

Note that my 3D renderer also does rendering OpenGL style (and via
implementing the OpenGL API), which basically means linearly
interpolating colors across triangles and multiplying them with the
geometry being drawn. It is also possible to draw geometry with
alpha-blending, or other blending modes, but this is slower (the OpenGL
API implementation is basically sufficient to run Quake 3 Arena though;
but at the moment, Q3A still has relatively little hope of usable
performance).

I guess one thing I had overlooked with a Minecraft style engine based
around using ray-casts for visibility finding, was that the ray-casts
tend to hit only a small part of the world with a short draw distance,
so the cache-miss rates from the raycasts are fairly modest.

....

Not sure how this would work running RISC-V, I somehow suspect that
running the OpenGL rasterizer as RISC-V code would be pretty much awful.

Yet, going by SWeRV's numbers, they get somewhat better DMIPS/MHz than I
am getting from BJX2.

Notes that the SWeRV core is able to run single-core on the same FPGA
board I am using, albeit only passes timing at a lower clock speed than
what I am using for BJX2 (one can run it at 33 MHz I guess...).

I don't know how much is likely due to hardware magic, or better
compiler optimization, will likely need to model some of this a bit better.

But, I don't know...

On Friday, October 1, 2021 at 6:26:24 PM UTC-5, BGB wrote:
> On 10/1/2021 4:05 PM, JimBrakefield wrote:
> > On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
> >> So, I was poking at a few seemingly random ideas:
> >> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
> >> * Goal would be to remain backwards compatible with both the base ISA
> >> and with RVC (though, they may be mutually-exclusive in a temporal sense);
> >> * Gluing RISC-V decoder support onto my BJX2 core as an optional
> >> extension feature.
> >>
> ...
> >>
> >> Well, unless one wants RV64GC or similar, which would likely require a
> >> certain amount of emulation to pull off (or making a "native" RISC-V
> >> core would likely require a full fork).
> >>
> >>
> >> Or, is all this probably just a garbage feature that makes everything
> >> likely worse-off overall ?...
> >>
> >>
> >> Any thoughts?...
> > Some of the not too expensive FPGAs have room for a hundred simple cores,
> > all running simultaneously. Easy enough to show them on a 4K display.
> > A solution in search of a problem?
> >
> One can have lots of tiny cores, but they are not useful for all that much.
>
> One of the bigger problems I have been facing in my projects is not so
> much with doing computations, but with getting data from memory efficiently.
>
> Making the code itself faster has the indirect drawback in that it
> increases the relative amount of time spent waiting for memory (and I
> still have not succeeded at making Quake all that playable at 50MHz).
>
> Past attempts to work out the math for "what if I ran the RAM using
> SERDES at a more 'appropriate' clock speed" only came up with fairly
> modest gains.
>
>
> Most of my recent performance gains have come more from fixing compiler
> bugs than from hardware features, eg:
> Fixing a bug which resulted in lots of redundant type conversions helped
> a certain amount;
> More recently, fixing another bug where loading a value from an unsigned
> type tended to result in an additional (redundant) type-conversion op
> was also helpful (~ 1% smaller binaries, also made Quake a little faster);
> ...
>
> Some of these savings were more due to reducing register pressure (fewer
> temporaries means fewer stack spills).
> > For soft cores that support virtual memory, maybe room for ten cores?
> > Don't forget 2 to 6 built-in ARM cores. Or:
> >
> > Actel Polarfire, now Microchip/Microsemi offers four RISC-V cores and FPGA fabric.
> >
> The FPGAs I am using don't have any hard-wired cores.
>
> Mostly dealing with Spartan and Artix class hardware.
>
> Was previously generally able to fit two BJX2 cores onto an XC7A100T,
> and had also been working some on trying to implement a modified core
> which could support SMT (would effectively have 6 execute lanes with two
> pilelines, but partly conjoins some other hardware).
>
> The hope was that SMT could be cheaper than full dual-core (such as by
> sharing the MMU and FPU, ...).
>
>
> Had looked at possibly doing a 6-wide version of the BJX2 ISA, but there
> are enough roadblocks here that this does not seem particularly viable
> (mostly involving trying to pull off a unified register file with 12
> read ports and 6 write ports).
>
> A purely SMT implementation avoids this issue though, as it can use
> independent register files.
>
>
>
> I could fit 4 cores into the FPGA I am using if I were running a 64-bit
> RISC with only a single execute lane (running at one instruction per
> cycle), with smaller L1 caches (say, 4K I$ + 8K D$).
>
>
> Or around 8 cores if I were using a much more minimalist core, such as a
> basic 32-bit single-issue core.
>
> Though, one also has to use smaller/simpler L1 caches for this (such as
> 512B 16x2x16B), but making the L1 caches smaller or similar tends to
> have a significant adverse effect on performance.
>
>
> Programs like Doom and Quake seem to be most happy with ~ 16K or 32K of
> L1, and as much L2 as one can throw at them. After more recently moving
> to RAM backed VRAM, this means I can now afford 256K of L2 (with only a
> moderate amount of clashing between the active program and
> video-display; partly as the display hardware is also routed through the
> L2 cache; this seems to hold up pretty OK for 320x200 at least).
>
> However, with 256K of L2 + 16K + 32K for L1, I am basically at the limit
> of the available Block-RAM.
>
>
> In theory, given the "pretty much awful" ILP which I am getting from my
> C compiler, I could run a single-issue core at 100MHz vs a 3-wide core
> at 50MHz, and have it be faster.
>
> However... I have to make the L1 caches smaller, which hurts the
> performance a lot worse than the higher clock-speed helps (it doesn't
> make the RAM any faster).
>
> So, it ends up being faster and easier to just stick with everything
> being 50 MHz, and instead trying to leverage VLIW. In theory, in-order
> super-scalar can do basically the same thing, but is more complicated
> and more expensive than VLIW.
>
> Similar, 50MHz single issue is slower than 50MHz VLIW.
>
>
> And, making an L1 D$ that can be 16K or 32K and also allow single-cycle
> access latency, and pass timing at 75 or 100MHz, is easier said than done.
>
>
>
> But, kinda in a local-optimum trap here, where I can make Doom run
> pretty well, but Quake still tends to remain "pretty much unplayable".
>
> It looks like to get reasonably playable framerates in Quake, I would
> need to pull off all 3 of these:
> Similar-sized, or larger, L1 caches (eg: 32K L1 D$, or more);
> Significantly faster DRAM access;
> Being able to run a 3-wide pipeline at 100MHz.
>
> Ironically, I have managed to make a small Minecraft style 3D engine
> work semi-passably on this (performance is a little bit better than
> Quake; although the draw distance isn't particularly large at only 24
> meters).
>
>
>
> This isn't really going to happen on the Artix I am using though, and I
> still don't have a job at present, so no real money to drop on a much
> more expensive Kintex based board.
>
> Some other guy managed to get somewhat better framerates for a Minecraft
> style engine running on an ICE40, but his effort offloads most of the 3D
> rendering machinery to hardware, and is basically using a small 16-bit
> RISC style ISA for the actual CPU part.
>
>
> In contrast, I am basically using software rendering based around a more
> traditional triangle-rasterization approach. However, since
> perspective-correct rendering is expensive, my renderer tends to carve
> up large triangles and draws them using affine projection (more
> subdivision looks nicer, but comes at a fairly steep performance cost).
>
> Note that my 3D renderer also does rendering OpenGL style (and via
> implementing the OpenGL API), which basically means linearly
> interpolating colors across triangles and multiplying them with the
> geometry being drawn. It is also possible to draw geometry with
> alpha-blending, or other blending modes, but this is slower (the OpenGL
> API implementation is basically sufficient to run Quake 3 Arena though;
> but at the moment, Q3A still has relatively little hope of usable
> performance).
>
>
> I guess one thing I had overlooked with a Minecraft style engine based
> around using ray-casts for visibility finding, was that the ray-casts
> tend to hit only a small part of the world with a short draw distance,
> so the cache-miss rates from the raycasts are fairly modest.
>
> ...
>
>
> Not sure how this would work running RISC-V, I somehow suspect that
> running the OpenGL rasterizer as RISC-V code would be pretty much awful.
>
> Yet, going by SWeRV's numbers, they get somewhat better DMIPS/MHz than I
> am getting from BJX2.
>
>
> Notes that the SWeRV core is able to run single-core on the same FPGA
> board I am using, albeit only passes timing at a lower clock speed than
> what I am using for BJX2 (one can run it at 33 MHz I guess...).
>
> I don't know how much is likely due to hardware magic, or better
> compiler optimization, will likely need to model some of this a bit better.
>
>
> But, I don't know...

Click here to read the complete article

On 10/1/2021 5:37 PM, MitchAlsup wrote:
> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
>> So, I was poking at a few seemingly random ideas:
>> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
>> * Goal would be to remain backwards compatible with both the base ISA
>> and with RVC (though, they may be mutually-exclusive in a temporal sense);
>> * Gluing RISC-V decoder support onto my BJX2 core as an optional
>> extension feature.
>>
>> So, as can be noted, standard RISC-V ops have the two LSB bits:
>> 00/01/10: RVC Ops;
>> 11: Base 32-bit ops.
>>
>> So, say, in this sub-mode (possible):
>> 00/01: Reserved
>> 10: Wide-Execute
>> 11: Scalar / End-of-Bundle
> <
> In my personal opinion, WEX is a crutch, Great Big machines will need the
> equivalent of WEX^2 or WEX^3, while smallest possible implementations
> will be burdened by WEX that they can't execute. So, basically, WEX does
> not scale, and this is bad for long term architectures.
> <

Small cores can ignore that it exists and don't have to pay anything.
I am not precluding the use of super-scalar.

Or, if you mean that it costs encoding space bits, granted. It would in
the hacked RISC-V case mean a hit to code-density to be able to use it.

>>
>> The 10 block is nearly identical to the 11 block, just with parts of the
>> encoding space chopped off and repurposed (such as for an equivalent of
>> a Jumbo prefix). The 11 block will be mostly unchanged.
>>
>> Things like Pred or PrWEX encodings would not map over well.
>>
>>
>>
>> So, I am looking at supporting a core which does both BJX2 and RISC-V
>> (in particular RV64IC) as sub-modes.
>>
>> SR(27:26):
>> 00: BJX2, WEX Disabled (Sequential Execution)
>> 10: BJX2, WEX Enabled
>> 01: RISC-V, Scalar Mode (supports RVC)
>> 11: RISC-V, Wide-Execute Mode (Possible)
>>
>>
>> The RISC-V mode would mostly be handled with some tweaks in the L1 I$,
>> and an alternate set of decoders.
>>
>> Otherwise, it looks like a lot of stuff is close enough that it may not
>> be too much of a stretch to support both ISAs on a single core.
>>
>> Some of the basic extension sets are being left out for now:
>> * M: The BJX2 core lacks both full-width integer multiply, and divide.
>> * F: Doesn't map well to BJX2's FPU
>> * D: Doesn't map well to BJX2's FPU
>> * A: Doesn't match up
>> * V: Doesn't match up
>> * ...
>>
>>
>> Went and wrote up some code for some of this already, at least length
>> determination in RISC-V is fairly easy (within the 16/32 subset). It
>> appears the ISA supports some longer encoding forms, but I will ignore
>> them for now.
>>
>> I may support WEX as an ISA extension in the RISC-V case, but I am
>> unlikely to support predicated instructions in this mode (partly because
>> there is a non-trivial point of divergence between the ISAs).
>>
>> Also, unlike in BJX2, WEX mode in RISC-V would require the use of 32-bit
>> alignment.
>>
>> Some other things are not strictly 1:1, so still dunno if all this is
>> "sane"...
>>
>> The usage of the WEX sub-mode would likely be first generating RV32I
>> code, and then (after the fact) running something akin to a Wexifier,
>> which would try to shuffle and bundle the instructions. Would need a way
>> to figure out every branch target though, such that it doesn't try to
>> Wexify across basic-block boundaries (could require either hints or
>> metadata).
> <
> Do it at the assembly level where the labels are still present.

I was imagining doing it at the level of hacking it onto an ELF loader
or similar...

So, say, GCC or similar doesn't need to care that WEX exists, it just
needs to be told to compile for RV64I.

In BJX2, as-is it is sort of hacked on as a post-process after
code-generation has already been performed. The big obvious drawback
with this approach is that enabling it causes any code which uses the
feature to need to be encoded as a fixed-length 32-bit subset of the ISA.

>>
>> Then again, if the compiler output is already optimized for in-order
>> superscalar, the shuffling is easier, and the Wexifier would only need
>> to flag instructions that could be executed in parallel with the
>> following instruction (and it is no longer necessary to care as much
>> about basic-block boundaries).
>>
>> Current leaning is that jumps between BJX2 and RISC-V mode would be
>> pulled off by branching to an address with the LSB set (the LSB would
>> effectively function as a mode-change toggle).
>>
>>
>> I would likely leave out the M extension, as RISC-V specifies a
>> full-width Multiply and Divide, which are still a bit too expensive to
>> do in hardware as described. Though, emulating these could be possible
>> in theory.
>>
>> The F and D extensions also don't map exactly to BJX2's FPU.
>>
>> These cases would either require using traps or implementing some sort
>> of microcode.
>>
>> Namely, RISC-V specifies an FPU with:
>> Dedicated FPU Registers
>> FADD/FSUB/FMUL/FDIV/FSQRT
>> Meanwhile, BJX2's FPU:
>> Uses GPRs
>> FADD/FSUB/FMUL
>>
>> It is possible I could do a variation where I map F0..F31 to R32..R63,
>> but have not done so yet, and it still would not support FDIV or FSQRT.
>>
>>
>> Most likely options are:
>> Leave this stuff as full software emulation, which is slow.
>>
>> Add special instructions that map closer to what the underlying hardware
>> supports, but this partly defeats the point of supporting a more generic
>> ISA.
>>
>>
>> Some register remapping is used:
>> X0 -> ZZR
>> X1 -> LR
>> X2 -> SP (R15 in BJX2)
>> X3 -> GBR
>> X4 -> TBR
>> X5 -> DHR (R1 in BJX2)
>> X6 -> R6
>> X7 -> R7
>> X8..X14 -> R8..R14
>> X15 -> R15U
>> X16..X31 -> R16..R31
>>
>> F0..F31: R32..R63 ( Possible FPU Support )
>>
>> This would not be cross-ISA ABI compatible, but trying to make cross-ISA
>> ABI calls work does not seem worthwhile. The ABIs also differ in terms
>> of the relative split between preserve and scratch registers, ...
>>
>> As such, a call between RISC-V and BJX2 code would likely require the
>> use of a ABI thunk in addition to jumping between ISA modes.
>>
>> Note that R2..R5 would be inaccessible in RISC-V mode as they are
>> effectively shadowed.
>>
>> Likewise, X1..X5 are not GPRs in this implementation, so it will be
>> assumed/required that code actually follow the ABI in this area.
>>
>> X15 goes the opposite direction, where it is a GPR in RISC-V rather than
>> an SPR in BJX2, leading to needing to define C15 as R15U with this
>> extension, which maps to this register. The purpose of R15U is mostly to
>> allow a scheduler written in BJX2 code to be able to task-switch code
>> running in RISC-V Mode.
>>
>> The reverse is not true, however, in that a task scheduler written in
>> RISC-V mode would not be able to task-switch processes running in BJX2 mode.
> <
> It might be good to point out that there is NO task scheduler in My 66000
> at least as seen by the instruction stream. The task scheduler is over by the
> Memory/DRAM Controller where it collects "interrupts" and convert these into
> prioritized context switch packets.

There is in my case.

It needs to save and restore all the registers into memory buffers.
Technically, I am using a hacked/repurposed version of longjmp, except
with the minor difference that a few registers need to be swapped out
for context-switching which are privileged only and thus inaccessible to
a normal user-mode longjmp.

>>
>>
>> It looks like RISC-V leaves things like MMU, interrupt handling, ... as
>> implementation dependent, so I could probably get away with just leaving
>> these parts mostly as-is.
> <
> I consider this a mistake on RISC-V's part.

Possibly...

Does one follow SiFive's example, SWeRV's example, ...

Convenient for me though, as I can just leave all of this stuff as-is.
Would make an epic mess if trying to do a Linux-kernel port though, as
trying to port the Linux kernel to a target where it needs to deal with
two separate ISAs seems to be asking for pain.

Click here to read the complete article

On Friday, October 1, 2021 at 8:42:11 PM UTC-5, BGB wrote:
> On 10/1/2021 5:37 PM, MitchAlsup wrote:
> > On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
> >> So, I was poking at a few seemingly random ideas:
> >> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
> >> * Goal would be to remain backwards compatible with both the base ISA
> >> and with RVC (though, they may be mutually-exclusive in a temporal sense);
> >> * Gluing RISC-V decoder support onto my BJX2 core as an optional
> >> extension feature.
> >>
> >> So, as can be noted, standard RISC-V ops have the two LSB bits:
> >> 00/01/10: RVC Ops;
> >> 11: Base 32-bit ops.
> >>
> >> So, say, in this sub-mode (possible):
> >> 00/01: Reserved
> >> 10: Wide-Execute
> >> 11: Scalar / End-of-Bundle
> > <
> > In my personal opinion, WEX is a crutch, Great Big machines will need the
> > equivalent of WEX^2 or WEX^3, while smallest possible implementations
> > will be burdened by WEX that they can't execute. So, basically, WEX does
> > not scale, and this is bad for long term architectures.
> > <
> Small cores can ignore that it exists and don't have to pay anything.
> I am not precluding the use of super-scalar.
>
> Or, if you mean that it costs encoding space bits, granted. It would in
> the hacked RISC-V case mean a hit to code-density to be able to use it.
> >>
> >> The 10 block is nearly identical to the 11 block, just with parts of the
> >> encoding space chopped off and repurposed (such as for an equivalent of
> >> a Jumbo prefix). The 11 block will be mostly unchanged.
> >>
> >> Things like Pred or PrWEX encodings would not map over well.
> >>
> >>
> >>
> >> So, I am looking at supporting a core which does both BJX2 and RISC-V
> >> (in particular RV64IC) as sub-modes.
> >>
> >> SR(27:26):
> >> 00: BJX2, WEX Disabled (Sequential Execution)
> >> 10: BJX2, WEX Enabled
> >> 01: RISC-V, Scalar Mode (supports RVC)
> >> 11: RISC-V, Wide-Execute Mode (Possible)
> >>
> >>
> >> The RISC-V mode would mostly be handled with some tweaks in the L1 I$,
> >> and an alternate set of decoders.
> >>
> >> Otherwise, it looks like a lot of stuff is close enough that it may not
> >> be too much of a stretch to support both ISAs on a single core.
> >>
> >> Some of the basic extension sets are being left out for now:
> >> * M: The BJX2 core lacks both full-width integer multiply, and divide.
> >> * F: Doesn't map well to BJX2's FPU
> >> * D: Doesn't map well to BJX2's FPU
> >> * A: Doesn't match up
> >> * V: Doesn't match up
> >> * ...
> >>
> >>
> >> Went and wrote up some code for some of this already, at least length
> >> determination in RISC-V is fairly easy (within the 16/32 subset). It
> >> appears the ISA supports some longer encoding forms, but I will ignore
> >> them for now.
> >>
> >> I may support WEX as an ISA extension in the RISC-V case, but I am
> >> unlikely to support predicated instructions in this mode (partly because
> >> there is a non-trivial point of divergence between the ISAs).
> >>
> >> Also, unlike in BJX2, WEX mode in RISC-V would require the use of 32-bit
> >> alignment.
> >>
> >> Some other things are not strictly 1:1, so still dunno if all this is
> >> "sane"...
> >>
> >> The usage of the WEX sub-mode would likely be first generating RV32I
> >> code, and then (after the fact) running something akin to a Wexifier,
> >> which would try to shuffle and bundle the instructions. Would need a way
> >> to figure out every branch target though, such that it doesn't try to
> >> Wexify across basic-block boundaries (could require either hints or
> >> metadata).
> > <
> > Do it at the assembly level where the labels are still present.
> I was imagining doing it at the level of hacking it onto an ELF loader
> or similar...
>
> So, say, GCC or similar doesn't need to care that WEX exists, it just
> needs to be told to compile for RV64I.
>
>
> In BJX2, as-is it is sort of hacked on as a post-process after
> code-generation has already been performed. The big obvious drawback
> with this approach is that enabling it causes any code which uses the
> feature to need to be encoded as a fixed-length 32-bit subset of the ISA.
> >>
> >> Then again, if the compiler output is already optimized for in-order
> >> superscalar, the shuffling is easier, and the Wexifier would only need
> >> to flag instructions that could be executed in parallel with the
> >> following instruction (and it is no longer necessary to care as much
> >> about basic-block boundaries).
> >>
> >> Current leaning is that jumps between BJX2 and RISC-V mode would be
> >> pulled off by branching to an address with the LSB set (the LSB would
> >> effectively function as a mode-change toggle).
> >>
> >>
> >> I would likely leave out the M extension, as RISC-V specifies a
> >> full-width Multiply and Divide, which are still a bit too expensive to
> >> do in hardware as described. Though, emulating these could be possible
> >> in theory.
> >>
> >> The F and D extensions also don't map exactly to BJX2's FPU.
> >>
> >> These cases would either require using traps or implementing some sort
> >> of microcode.
> >>
> >> Namely, RISC-V specifies an FPU with:
> >> Dedicated FPU Registers
> >> FADD/FSUB/FMUL/FDIV/FSQRT
> >> Meanwhile, BJX2's FPU:
> >> Uses GPRs
> >> FADD/FSUB/FMUL
> >>
> >> It is possible I could do a variation where I map F0..F31 to R32..R63,
> >> but have not done so yet, and it still would not support FDIV or FSQRT.
> >>
> >>
> >> Most likely options are:
> >> Leave this stuff as full software emulation, which is slow.
> >>
> >> Add special instructions that map closer to what the underlying hardware
> >> supports, but this partly defeats the point of supporting a more generic
> >> ISA.
> >>
> >>
> >> Some register remapping is used:
> >> X0 -> ZZR
> >> X1 -> LR
> >> X2 -> SP (R15 in BJX2)
> >> X3 -> GBR
> >> X4 -> TBR
> >> X5 -> DHR (R1 in BJX2)
> >> X6 -> R6
> >> X7 -> R7
> >> X8..X14 -> R8..R14
> >> X15 -> R15U
> >> X16..X31 -> R16..R31
> >>
> >> F0..F31: R32..R63 ( Possible FPU Support )
> >>
> >> This would not be cross-ISA ABI compatible, but trying to make cross-ISA
> >> ABI calls work does not seem worthwhile. The ABIs also differ in terms
> >> of the relative split between preserve and scratch registers, ...
> >>
> >> As such, a call between RISC-V and BJX2 code would likely require the
> >> use of a ABI thunk in addition to jumping between ISA modes.
> >>
> >> Note that R2..R5 would be inaccessible in RISC-V mode as they are
> >> effectively shadowed.
> >>
> >> Likewise, X1..X5 are not GPRs in this implementation, so it will be
> >> assumed/required that code actually follow the ABI in this area.
> >>
> >> X15 goes the opposite direction, where it is a GPR in RISC-V rather than
> >> an SPR in BJX2, leading to needing to define C15 as R15U with this
> >> extension, which maps to this register. The purpose of R15U is mostly to
> >> allow a scheduler written in BJX2 code to be able to task-switch code
> >> running in RISC-V Mode.
> >>
> >> The reverse is not true, however, in that a task scheduler written in
> >> RISC-V mode would not be able to task-switch processes running in BJX2 mode.
> > <
> > It might be good to point out that there is NO task scheduler in My 66000
> > at least as seen by the instruction stream. The task scheduler is over by the
> > Memory/DRAM Controller where it collects "interrupts" and convert these into
> > prioritized context switch packets.
> There is in my case.
>
> It needs to save and restore all the registers into memory buffers.
> Technically, I am using a hacked/repurposed version of longjmp, except
> with the minor difference that a few registers need to be swapped out
> for context-switching which are privileged only and thus inaccessible to
> a normal user-mode longjmp.
> >>
> >>
> >> It looks like RISC-V leaves things like MMU, interrupt handling, ... as
> >> implementation dependent, so I could probably get away with just leaving
> >> these parts mostly as-is.
> > <
> > I consider this a mistake on RISC-V's part.
> Possibly...
>
> Does one follow SiFive's example, SWeRV's example, ...
>
> Convenient for me though, as I can just leave all of this stuff as-is.
> Would make an epic mess if trying to do a Linux-kernel port though, as
> trying to port the Linux kernel to a target where it needs to deal with
> two separate ISAs seems to be asking for pain.
> >>
> >> Similar for things like address space, ...
> >> So, the core would retain BJX2's existing address space and memory map.
> >>
> >>
> >>
> >> Some aspects of the RISC-V ISA design leave me with a bit of a mystery
> >> though, namely how does it "not suck"?...
> >>
> >> A lot of fairly common stuff that can be expressed in a single
> >> instruction in BJX2 looks like it would require a much longer and more
> >> convoluted instruction sequences to express in RISC-V. Some instruction
> >> in RISC-V were "kinda weird" but not otherwise too much of a stretch for
> >> how to express them within the existing pipeline (eg: JAL and JALR do
> >> not have direct function equivalent in BJX2, but could still be mapped
> >> to the existing branch mechanisms).
> > <
> > The MIPS and RISC-V pipelines are organized around the notion of
> > compare-and-branch.
> I was more thinking about this:
> JAL, does a branch while saving the next PC into an explicit register,
> or R0/ZR for a plain branch;
> JALR does a branch to a given register plus a displacement, possibly
> also saving the next PC into a register.
>
> BJX2 differs here:
> BRA does an unconditional branch;
> BSR does a branch saving the next PC to LR;
> JMP does a branch to a register.
> JSR does a branch to a register, saving LR.
>
> You only get LR. The main change I needed to make was that if you do a
> BSR or JSR internally, and Rn!=ZZR, then the LR value is saved to Rn,
> else a JAL or JALR with ZR as a target is decoded as BRA or JMP.
>
>
> Normally, BJX2 did compare-and-branch via, sat:
> CMPEQ Rm, Rn
> BT .lbl
>
> Whereas RISC-V does:
> BEQ Rm, Rm, .lbl
>
> I added the logic for this (by routing the compare output signals over
> to the EX1 unit directly), but this costs around 1300 LUTs, and timing
> isn't super happy about it.
> >>
> >>
> >> The main expensive thing I had to add hardware support for
> >> Compare-and-Branch instructions. These were a "mostly unimplemented"
> >> feature in BJX2, but RISC-V uses these as the primary branch type.
> >>
> >> There is a cost increase and timing penalty for the RISC-V support,
> >> which I suspect is most likely primarily due to the Compare-and-Branch
> >> operations (the main alternative being to add a mechanism to decompose
> >> them into a 2-op sequence in the decoder).
> > <
> > One either designs the pipeline around the compare-and-branch or one
> > designs the pipeline such that compare and branch are 2 different
> > instructions. You cannot do both. Both have merit and demerits.
> I had done them as two separate ops in BJX2, RISC-V wants them as one
> op. Hence the issue here...
> >>
> >>
> >> Although, I had started this out with the idea of being able to support
> >> a Wide-Execute variation of RISC-V, I am starting to debate this, as:
> >> * It would require a compiler to be aware of its existence to be able to
> >> use it;
> >> * RISC-V appears to, kinda suck, so it is debatable if this would be
> >> particularly worthwhile.
> >>
> >>
> >> While RISC-V does have theoretically larger load/store displacements
> >> (12-bits vs 9-bits).
> > <
> > Piddly crap. My 66000 has 16-bit, 32-bit, and 64-bit displacements.
> BJX2 can encode a 33-bis displacement in a 64-bit Jumbo encoding...
>
> RISC-V can't encode it at all...
>
> One would have to do something like like:
> LUI X6, disp_hi
> ADD X6, X6, X8
> LD X9, X6, disp_lo
>
> This sucks...
>
> Granted:
> MOV.Q (R8, Disp33s), R9
> Still costs 8 bytes to encode, but it can be executed in a single cycle.
> >>
> >> Given that they are unscaled and sign-extended, their usable range will
> >> actually be less for QWORD operations than the 9-bit displacements in
> >> BJX2 (and, the XGPR encodings effectively add a 10-bit signed load/store
> >> encoding).
> > <
> > If you have an Inherently misaligned memory model, this is the only
> > realistic choice.
> BJX2 does misaligned access, but assumes that struct fields are aligned
> by default. This saves several bits per load in most cases.
>
> Granted, it does penalize use of packed structures, say:
> MOV disp, R0
> MOV.Q (R8, R0), R9
>
> But, this is still less cycles than in RISC-V.
> >>
> >> Granted:
> >> ADDI Rn, Rm, Imm12s
> >> Is slightly bigger than:
> >> ADD Rm, Imm9s, Rn
> > <
> > Try calculating how big:
> > <
> > ADD Rd,Rs1,0x123456789101112
> > <
> > would be in RISC-V !! (Total size, including both instruction and data)
> Yeah...
>
> Not good in any case.
>
> In BJX2, this will cost 16 bytes, and 2 clock cycles.
>
> Meanwhile:
> ADD 0x123456789101112, R10
>
> Is 12 bytes and a single clock cycle.
> >>
> >> One draw-back IMO, is that comparably RISC-V's immediate values are much
> >> worse in the "bit-confetti" sense than BJX2's immediate values. Where,
> >> at least, contiguous bits in the encoding generally also represent
> >> contiguous bits in the value, whereas RISC-V was apparently happy
> >> leaving them a chewed up mess...
> > <
> > One of the things that happens when you start with the opcode at the wrong
> > end of the instruction word.
> I lack a good explanation for why they did it like this...
>
> In BJX2, there is a little bit of a mess, but, at least, if one lays out
> the words in left-to-right order, then the bits also combine in
> left-to-right order.
>
> RISC-V has instructions with 20 bits contiguous, but is the field
> contiguous? No, it has the bits scrambled all over the place. One bit
> here, another bit over there, ... Me: "WTF?".
>
> I will be the compiler logic for implementing relocations on this is
> really clean and elegant... (or not).
>
>
> Granted, registers are not contiguous in BJX2, but this was mostly because:
> I was organizing everything in terms of nybbles;
> BJX2 was derived from BJX1 and BSR1, both of which were natively 16 GPR
> designs.
> >>
> >>
> >> As for which would win out in practice in terms of performance, it is
> >> less clear. It is possible that RISC-V could have the advantage of more
> >> mature compiler technology.
> >>
> >>
> >> RISC-V seems lack anything equivalent to the Conv family instructions:
> >> MOV, EXTU/EXTS/...
> >> I am having a concerned feeling that these may need to be faked using
> >> masks and shifts, say:
> >> EXTS.B R8, R9
> >> Maps to, say:
> >> SLLI X6, X8, 56
> >> SRAI X9, X6, 56
> > <
> > RISC-V has 12-bit constants. The lower 6-bits is the shift count, the upper 6-bits
> > are the size, with the constraint that 0 implies maximum (i.e., register) width.
> > <
> > I do this in My 66000. When used as reg-reg, bits<5:0> remain the shift count
> > while bits <37:32> are the field size. The HW checks that the other bits have
> > "no significance" and raised the OPERAND exception if the operand pattern
> > is not within the required domain.
> I didn't see any mention of things working this way in the spec.
>
> It appears that everything is using linear sign-extended values,
> nevermind that some of them are confetti.
>
>
> Granted, there might be something like this in the bit-manipulation
> extension, I haven't looked at this part yet.
>
> If there is some way to do (Rm+Ri*Sc) addressing, that would be useful,
> as this is basically the model my core is based around internally.
> >>
> >> But, how does one express unsigned int16 extension, is it something like:
> >> LUI X6, 16
> >> ADDI X6, X6, -1
> >> AND X9, X8, X6
> >>
> >> If so, this kinda sucks...
> >>
> >>
> >> There is also seemingly no way to load or store a global variable within
> >> a single instruction, ...
> >>
> >> Something like:
> >> j=arr[i];
> >>
> > <
> > LDD Rj,[R0+Ri<<s+arr]
> > <
> Nothing like this was seen in the RISC-V spec I am looking at (2.2),
> hence my annoyance...
> >> Looks like it will generally require at least 3 instructions (this is
> >> typically a single instruction in BJX2).
> >>
> >>
> >>
> >> I have doubts, on the surface, it appears that all this "kinda sucks".
> >>
> >> The main advantage that RISC-V has is that it could potentially allow
> >> for a reasonably small / cheap CPU core, but this is less true of faking
> >> it with a decoder hack on top of the BJX2 core (though, granted, it is
> >> much less absurd than 32-bit ARM or x86 would have been, in that at
> >> least the core ISA mechanics map over reasonably well, avoiding needing
> >> some big/nasty emulation layer).
> > <
> > It is my personal belief, that if you wanted a small core your better starting
> > point would be MIPS R3000 and integrate the caches.
> Could be.
>
> My past attempts at small cores were:
> B32V, basically a stripped-down SH-2 derived ISA;
> BSR1, similar feature-set to B32V, but failed to be smaller.
>
> RV32E could be promising, but represents a very different design
> philosophy from B32V or BSR1.
>
> Compared with B32V or BSR1, my BJX2 core is gigantic...
>
>
> Though, these cores implicitly have the drawback that, as-is, neither
> could easily be plugged into my ring-bus.
>
> The general design for B32V was along the lines of:
> 16-bit ops, 2R/1W register file;
> 16 registers, 32-bits;
> Only does (Rm) and (Rm,R0) addressing, no auto-increment;
> Aligned-only memory access;
> No variable-shift unit (used shift slides, like in MSP430);
> ...
>
> IIRC, it was ~ 4000 LUTs.
>
> Never ended up having a strong use-case for it (had sort of intended it
> as an IO controller for BJX1). Partly this was because I had originally
> imagined audio and display hardware as using a "racing the beam" style
> similar to the Atari 2600, with a dedicated processor mostly for reading
> stuff from memory and shoving it into the hardware buffers.
>
> I ended up not going that direction.
>
>
> However, to be small, the memory caches need to remain extremely
> limited, ...
>
>
> I don't mind so much the superficial complexity of BJX2's ISA design,
> because best I can tell the FPGA doesn't really care that much either.
>
> Things like register file, pipeline, ALU, L1 caches, ... seem to matter
> a lot more for the LUT budget than the decoder.
>
> Or, at least, when everything is reasonably consistent, and the decoder
> is mostly a table lookup which encodes which execute-units to use, and
> some magic numbers encoding how to unpack the instruction parameters
> into the formats expected by the pipeline.
>
>
> BJX2 was sort of going the opposite directly, trying to be fast, and
> doing whatever I could that could potentially make it faster and still
> fit in the FPGA.
>
>
> Hence, why it has VLIW and so on.
> A tiny ISA core would not have VLIW.
> >>
> >>
> >> Well, unless one wants RV64GC or similar, which would likely require a
> >> certain amount of emulation to pull off (or making a "native" RISC-V
> >> core would likely require a full fork).
> >>
> >>
> >> Or, is all this probably just a garbage feature that makes everything
> >> likely worse-off overall ?...
> >>
> >>
> >> Any thoughts?...
> > <
> > Has any dual/mixed instruction set machine EVER had a long lifetime ?
> >
> There was 32-bit ARM:
> Original ARM ISA;
> Thumb;
> Thumb-2;
> Thumb-EE;
> Jazelle DBX;
> ...
>
>
> Though, I am left wondering how exactly SweRV is getting its claimed
> performance numbers, when the RISC-V ISA seems to suck so badly...
>
> So, you have a core that is dual-issue, running an ISA that seems to
> suck real hard, running at 2/3 the clock speed, and getting ~ 165K in
> Dhrystone.
>
>
> Meanwhile, all I can seem to get at the moment is a much less impressive
> seeming ~ 61K.
>
> Same FPGA...
>
>
>
> Though, looking at the Verilog, it appears they may not directly stall
> the pipeline on memory operations, but instead put memory requests into
> a queue and then let the memory requests be completed asynchronously. It
> appears instructions may be pulled into execute from another queue.
> ...
>
> Hrmm...
>
> I am getting a sense this may not be quite the same class of CPU core.

Click here to read the complete article

On 10/1/2021 9:32 PM, JimBrakefield wrote:
> On Friday, October 1, 2021 at 8:42:11 PM UTC-5, BGB wrote:
>> On 10/1/2021 5:37 PM, MitchAlsup wrote:
>>> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
>>>> So, I was poking at a few seemingly random ideas:
>>>> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
>>>> * Goal would be to remain backwards compatible with both the base ISA
>>>> and with RVC (though, they may be mutually-exclusive in a temporal sense);
>>>> * Gluing RISC-V decoder support onto my BJX2 core as an optional
>>>> extension feature.
>>>>
>>>> So, as can be noted, standard RISC-V ops have the two LSB bits:
>>>> 00/01/10: RVC Ops;
>>>> 11: Base 32-bit ops.
>>>>
>>>> So, say, in this sub-mode (possible):
>>>> 00/01: Reserved
>>>> 10: Wide-Execute
>>>> 11: Scalar / End-of-Bundle
>>> <
>>> In my personal opinion, WEX is a crutch, Great Big machines will need the
>>> equivalent of WEX^2 or WEX^3, while smallest possible implementations
>>> will be burdened by WEX that they can't execute. So, basically, WEX does
>>> not scale, and this is bad for long term architectures.
>>> <
>> Small cores can ignore that it exists and don't have to pay anything.
>> I am not precluding the use of super-scalar.
>>
>> Or, if you mean that it costs encoding space bits, granted. It would in
>> the hacked RISC-V case mean a hit to code-density to be able to use it.

....

>>>>
>>>>
>>>> Well, unless one wants RV64GC or similar, which would likely require a
>>>> certain amount of emulation to pull off (or making a "native" RISC-V
>>>> core would likely require a full fork).
>>>>
>>>>
>>>> Or, is all this probably just a garbage feature that makes everything
>>>> likely worse-off overall ?...
>>>>
>>>>
>>>> Any thoughts?...
>>> <
>>> Has any dual/mixed instruction set machine EVER had a long lifetime ?
>>>
>> There was 32-bit ARM:
>> Original ARM ISA;
>> Thumb;
>> Thumb-2;
>> Thumb-EE;
>> Jazelle DBX;
>> ...
>>
>>
>> Though, I am left wondering how exactly SweRV is getting its claimed
>> performance numbers, when the RISC-V ISA seems to suck so badly...
>>
>> So, you have a core that is dual-issue, running an ISA that seems to
>> suck real hard, running at 2/3 the clock speed, and getting ~ 165K in
>> Dhrystone.
>>
>>
>> Meanwhile, all I can seem to get at the moment is a much less impressive
>> seeming ~ 61K.
>>
>> Same FPGA...
>>
>>
>>
>> Though, looking at the Verilog, it appears they may not directly stall
>> the pipeline on memory operations, but instead put memory requests into
>> a queue and then let the memory requests be completed asynchronously. It
>> appears instructions may be pulled into execute from another queue.
>> ...
>>
>> Hrmm...
>>
>> I am getting a sense this may not be quite the same class of CPU core.
>
> There is an apples to apples 2019 comparison of nine soft-core RISC-V FPGA implementations at
> "A Catalog and In-Hardware Evaluation of Open-Source Drop-In Compatible RISC-V Softcore Processors"
> https://www.esa.informatik.tu-darmstadt.de
> With an order of magnitude variation in LUT counts!
>

Not sure what I am looking for there...

> Also, looks like https://us.artechhouse.com/A-Hands-On-Guide-to-Designing-Embedded-Systems-P2204.aspx
> will be required reading for myself (yeah, probably know most of it, but the rest is worth the book cost).
>

Didn't find that much in quick searches, but did find something else,
namely someone apparently noticing a fairly drastic speed difference
between GCC and Clang when it came to Dhrystone performance on RISC-V...

I guess it is also possible that GCC may be able to "cheat" on the
benchmark in ways that BGBCC is not.

So, it is possible that GCC compiling for RISC-V vs BGBCC compiling for
BJX2 may not be entirely apples-to-apples either.

More so when a lot of my recent performance improvements to BGBCC
haven't been so much from fancy new optimizations, so much as noting
problems in the generated compiler output and fixing issues which were
resulting in the crappy compiler output.

Then there is also the issue that the compiler's Wexifier stage tends to
miss many cases where instructions could have been swapped around to
execute things in parallel.

Partly it is limited in that the compiler will not swap memory loads and
stores, but I had considered a possible rule tweak, say:
Both operations involve the same base register, and,
Both operations represent offsets which don't overlap.
Then it will allow swaps effectively if the compiler can verify that the
swap will not change the end result.

This could potentially allow things like reorganizing expressions around
stack fills and spills and similar.

....

On 10/1/2021 6:54 PM, JimBrakefield wrote:
> On Friday, October 1, 2021 at 6:26:24 PM UTC-5, BGB wrote:
>> On 10/1/2021 4:05 PM, JimBrakefield wrote:
>>> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
>>>> So, I was poking at a few seemingly random ideas:
>>>> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
>>>> * Goal would be to remain backwards compatible with both the base ISA
>>>> and with RVC (though, they may be mutually-exclusive in a temporal sense);
>>>> * Gluing RISC-V decoder support onto my BJX2 core as an optional
>>>> extension feature.
>>>>
>> ...
>>>>
>>>> Well, unless one wants RV64GC or similar, which would likely require a
>>>> certain amount of emulation to pull off (or making a "native" RISC-V
>>>> core would likely require a full fork).
>>>>
>>>>
>>>> Or, is all this probably just a garbage feature that makes everything
>>>> likely worse-off overall ?...
>>>>
>>>>
>>>> Any thoughts?...
>>> Some of the not too expensive FPGAs have room for a hundred simple cores,
>>> all running simultaneously. Easy enough to show them on a 4K display.
>>> A solution in search of a problem?
>>>
>> One can have lots of tiny cores, but they are not useful for all that much.
>>
>> One of the bigger problems I have been facing in my projects is not so
>> much with doing computations, but with getting data from memory efficiently.
>>
>> Making the code itself faster has the indirect drawback in that it
>> increases the relative amount of time spent waiting for memory (and I
>> still have not succeeded at making Quake all that playable at 50MHz).
>>
>> Past attempts to work out the math for "what if I ran the RAM using
>> SERDES at a more 'appropriate' clock speed" only came up with fairly
>> modest gains.
>>
>>
>> Most of my recent performance gains have come more from fixing compiler
>> bugs than from hardware features, eg:
>> Fixing a bug which resulted in lots of redundant type conversions helped
>> a certain amount;
>> More recently, fixing another bug where loading a value from an unsigned
>> type tended to result in an additional (redundant) type-conversion op
>> was also helpful (~ 1% smaller binaries, also made Quake a little faster);
>> ...
>>
>> Some of these savings were more due to reducing register pressure (fewer
>> temporaries means fewer stack spills).
>>> For soft cores that support virtual memory, maybe room for ten cores?
>>> Don't forget 2 to 6 built-in ARM cores. Or:
>>>
>>> Actel Polarfire, now Microchip/Microsemi offers four RISC-V cores and FPGA fabric.
>>>
>> The FPGAs I am using don't have any hard-wired cores.
>>
>> Mostly dealing with Spartan and Artix class hardware.
>>
>> Was previously generally able to fit two BJX2 cores onto an XC7A100T,
>> and had also been working some on trying to implement a modified core
>> which could support SMT (would effectively have 6 execute lanes with two
>> pilelines, but partly conjoins some other hardware).
>>
>> The hope was that SMT could be cheaper than full dual-core (such as by
>> sharing the MMU and FPU, ...).
>>
>>
>> Had looked at possibly doing a 6-wide version of the BJX2 ISA, but there
>> are enough roadblocks here that this does not seem particularly viable
>> (mostly involving trying to pull off a unified register file with 12
>> read ports and 6 write ports).
>>
>> A purely SMT implementation avoids this issue though, as it can use
>> independent register files.
>>
>>
>>
>> I could fit 4 cores into the FPGA I am using if I were running a 64-bit
>> RISC with only a single execute lane (running at one instruction per
>> cycle), with smaller L1 caches (say, 4K I$ + 8K D$).
>>
>>
>> Or around 8 cores if I were using a much more minimalist core, such as a
>> basic 32-bit single-issue core.
>>
>> Though, one also has to use smaller/simpler L1 caches for this (such as
>> 512B 16x2x16B), but making the L1 caches smaller or similar tends to
>> have a significant adverse effect on performance.
>>
>>
>> Programs like Doom and Quake seem to be most happy with ~ 16K or 32K of
>> L1, and as much L2 as one can throw at them. After more recently moving
>> to RAM backed VRAM, this means I can now afford 256K of L2 (with only a
>> moderate amount of clashing between the active program and
>> video-display; partly as the display hardware is also routed through the
>> L2 cache; this seems to hold up pretty OK for 320x200 at least).
>>
>> However, with 256K of L2 + 16K + 32K for L1, I am basically at the limit
>> of the available Block-RAM.
>>
>>
>> In theory, given the "pretty much awful" ILP which I am getting from my
>> C compiler, I could run a single-issue core at 100MHz vs a 3-wide core
>> at 50MHz, and have it be faster.
>>
>> However... I have to make the L1 caches smaller, which hurts the
>> performance a lot worse than the higher clock-speed helps (it doesn't
>> make the RAM any faster).
>>
>> So, it ends up being faster and easier to just stick with everything
>> being 50 MHz, and instead trying to leverage VLIW. In theory, in-order
>> super-scalar can do basically the same thing, but is more complicated
>> and more expensive than VLIW.
>>
>> Similar, 50MHz single issue is slower than 50MHz VLIW.
>>
>>
>> And, making an L1 D$ that can be 16K or 32K and also allow single-cycle
>> access latency, and pass timing at 75 or 100MHz, is easier said than done.
>>
>>
>>
>> But, kinda in a local-optimum trap here, where I can make Doom run
>> pretty well, but Quake still tends to remain "pretty much unplayable".
>>
>> It looks like to get reasonably playable framerates in Quake, I would
>> need to pull off all 3 of these:
>> Similar-sized, or larger, L1 caches (eg: 32K L1 D$, or more);
>> Significantly faster DRAM access;
>> Being able to run a 3-wide pipeline at 100MHz.
>>
>> Ironically, I have managed to make a small Minecraft style 3D engine
>> work semi-passably on this (performance is a little bit better than
>> Quake; although the draw distance isn't particularly large at only 24
>> meters).
>>
>>
>>
>> This isn't really going to happen on the Artix I am using though, and I
>> still don't have a job at present, so no real money to drop on a much
>> more expensive Kintex based board.
>>
>> Some other guy managed to get somewhat better framerates for a Minecraft
>> style engine running on an ICE40, but his effort offloads most of the 3D
>> rendering machinery to hardware, and is basically using a small 16-bit
>> RISC style ISA for the actual CPU part.
>>
>>
>> In contrast, I am basically using software rendering based around a more
>> traditional triangle-rasterization approach. However, since
>> perspective-correct rendering is expensive, my renderer tends to carve
>> up large triangles and draws them using affine projection (more
>> subdivision looks nicer, but comes at a fairly steep performance cost).
>>
>> Note that my 3D renderer also does rendering OpenGL style (and via
>> implementing the OpenGL API), which basically means linearly
>> interpolating colors across triangles and multiplying them with the
>> geometry being drawn. It is also possible to draw geometry with
>> alpha-blending, or other blending modes, but this is slower (the OpenGL
>> API implementation is basically sufficient to run Quake 3 Arena though;
>> but at the moment, Q3A still has relatively little hope of usable
>> performance).
>>
>>
>> I guess one thing I had overlooked with a Minecraft style engine based
>> around using ray-casts for visibility finding, was that the ray-casts
>> tend to hit only a small part of the world with a short draw distance,
>> so the cache-miss rates from the raycasts are fairly modest.
>>
>> ...
>>
>>
>> Not sure how this would work running RISC-V, I somehow suspect that
>> running the OpenGL rasterizer as RISC-V code would be pretty much awful.
>>
>> Yet, going by SWeRV's numbers, they get somewhat better DMIPS/MHz than I
>> am getting from BJX2.
>>
>>
>> Notes that the SWeRV core is able to run single-core on the same FPGA
>> board I am using, albeit only passes timing at a lower clock speed than
>> what I am using for BJX2 (one can run it at 33 MHz I guess...).
>>
>> I don't know how much is likely due to hardware magic, or better
>> compiler optimization, will likely need to model some of this a bit better.
>>
>>
>> But, I don't know...
>
> I'm surprised you can only get to 50MHz on the Artix-7?
> Vivado offers register retiming - ugh, no personal experience with retiming
>

Click here to read the complete article

On Saturday, October 2, 2021 at 3:31:25 AM UTC-5, BGB wrote:
> On 10/1/2021 6:54 PM, JimBrakefield wrote:
> > On Friday, October 1, 2021 at 6:26:24 PM UTC-5, BGB wrote:
> >> On 10/1/2021 4:05 PM, JimBrakefield wrote:
> >>> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
> >>>> So, I was poking at a few seemingly random ideas:
> >>>> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
> >>>> * Goal would be to remain backwards compatible with both the base ISA
> >>>> and with RVC (though, they may be mutually-exclusive in a temporal sense);
> >>>> * Gluing RISC-V decoder support onto my BJX2 core as an optional
> >>>> extension feature.
> >>>>
> >> ...
> >>>>
> >>>> Well, unless one wants RV64GC or similar, which would likely require a
> >>>> certain amount of emulation to pull off (or making a "native" RISC-V
> >>>> core would likely require a full fork).
> >>>>
> >>>>
> >>>> Or, is all this probably just a garbage feature that makes everything
> >>>> likely worse-off overall ?...
> >>>>
> >>>>
> >>>> Any thoughts?...
> >>> Some of the not too expensive FPGAs have room for a hundred simple cores,
> >>> all running simultaneously. Easy enough to show them on a 4K display.
> >>> A solution in search of a problem?
> >>>
> >> One can have lots of tiny cores, but they are not useful for all that much.
> >>
> >> One of the bigger problems I have been facing in my projects is not so
> >> much with doing computations, but with getting data from memory efficiently.
> >>
> >> Making the code itself faster has the indirect drawback in that it
> >> increases the relative amount of time spent waiting for memory (and I
> >> still have not succeeded at making Quake all that playable at 50MHz).
> >>
> >> Past attempts to work out the math for "what if I ran the RAM using
> >> SERDES at a more 'appropriate' clock speed" only came up with fairly
> >> modest gains.
> >>
> >>
> >> Most of my recent performance gains have come more from fixing compiler
> >> bugs than from hardware features, eg:
> >> Fixing a bug which resulted in lots of redundant type conversions helped
> >> a certain amount;
> >> More recently, fixing another bug where loading a value from an unsigned
> >> type tended to result in an additional (redundant) type-conversion op
> >> was also helpful (~ 1% smaller binaries, also made Quake a little faster);
> >> ...
> >>
> >> Some of these savings were more due to reducing register pressure (fewer
> >> temporaries means fewer stack spills).
> >>> For soft cores that support virtual memory, maybe room for ten cores?
> >>> Don't forget 2 to 6 built-in ARM cores. Or:
> >>>
> >>> Actel Polarfire, now Microchip/Microsemi offers four RISC-V cores and FPGA fabric.
> >>>
> >> The FPGAs I am using don't have any hard-wired cores.
> >>
> >> Mostly dealing with Spartan and Artix class hardware.
> >>
> >> Was previously generally able to fit two BJX2 cores onto an XC7A100T,
> >> and had also been working some on trying to implement a modified core
> >> which could support SMT (would effectively have 6 execute lanes with two
> >> pilelines, but partly conjoins some other hardware).
> >>
> >> The hope was that SMT could be cheaper than full dual-core (such as by
> >> sharing the MMU and FPU, ...).
> >>
> >>
> >> Had looked at possibly doing a 6-wide version of the BJX2 ISA, but there
> >> are enough roadblocks here that this does not seem particularly viable
> >> (mostly involving trying to pull off a unified register file with 12
> >> read ports and 6 write ports).
> >>
> >> A purely SMT implementation avoids this issue though, as it can use
> >> independent register files.
> >>
> >>
> >>
> >> I could fit 4 cores into the FPGA I am using if I were running a 64-bit
> >> RISC with only a single execute lane (running at one instruction per
> >> cycle), with smaller L1 caches (say, 4K I$ + 8K D$).
> >>
> >>
> >> Or around 8 cores if I were using a much more minimalist core, such as a
> >> basic 32-bit single-issue core.
> >>
> >> Though, one also has to use smaller/simpler L1 caches for this (such as
> >> 512B 16x2x16B), but making the L1 caches smaller or similar tends to
> >> have a significant adverse effect on performance.
> >>
> >>
> >> Programs like Doom and Quake seem to be most happy with ~ 16K or 32K of
> >> L1, and as much L2 as one can throw at them. After more recently moving
> >> to RAM backed VRAM, this means I can now afford 256K of L2 (with only a
> >> moderate amount of clashing between the active program and
> >> video-display; partly as the display hardware is also routed through the
> >> L2 cache; this seems to hold up pretty OK for 320x200 at least).
> >>
> >> However, with 256K of L2 + 16K + 32K for L1, I am basically at the limit
> >> of the available Block-RAM.
> >>
> >>
> >> In theory, given the "pretty much awful" ILP which I am getting from my
> >> C compiler, I could run a single-issue core at 100MHz vs a 3-wide core
> >> at 50MHz, and have it be faster.
> >>
> >> However... I have to make the L1 caches smaller, which hurts the
> >> performance a lot worse than the higher clock-speed helps (it doesn't
> >> make the RAM any faster).
> >>
> >> So, it ends up being faster and easier to just stick with everything
> >> being 50 MHz, and instead trying to leverage VLIW. In theory, in-order
> >> super-scalar can do basically the same thing, but is more complicated
> >> and more expensive than VLIW.
> >>
> >> Similar, 50MHz single issue is slower than 50MHz VLIW.
> >>
> >>
> >> And, making an L1 D$ that can be 16K or 32K and also allow single-cycle
> >> access latency, and pass timing at 75 or 100MHz, is easier said than done.
> >>
> >>
> >>
> >> But, kinda in a local-optimum trap here, where I can make Doom run
> >> pretty well, but Quake still tends to remain "pretty much unplayable".
> >>
> >> It looks like to get reasonably playable framerates in Quake, I would
> >> need to pull off all 3 of these:
> >> Similar-sized, or larger, L1 caches (eg: 32K L1 D$, or more);
> >> Significantly faster DRAM access;
> >> Being able to run a 3-wide pipeline at 100MHz.
> >>
> >> Ironically, I have managed to make a small Minecraft style 3D engine
> >> work semi-passably on this (performance is a little bit better than
> >> Quake; although the draw distance isn't particularly large at only 24
> >> meters).
> >>
> >>
> >>
> >> This isn't really going to happen on the Artix I am using though, and I
> >> still don't have a job at present, so no real money to drop on a much
> >> more expensive Kintex based board.
> >>
> >> Some other guy managed to get somewhat better framerates for a Minecraft
> >> style engine running on an ICE40, but his effort offloads most of the 3D
> >> rendering machinery to hardware, and is basically using a small 16-bit
> >> RISC style ISA for the actual CPU part.
> >>
> >>
> >> In contrast, I am basically using software rendering based around a more
> >> traditional triangle-rasterization approach. However, since
> >> perspective-correct rendering is expensive, my renderer tends to carve
> >> up large triangles and draws them using affine projection (more
> >> subdivision looks nicer, but comes at a fairly steep performance cost).
> >>
> >> Note that my 3D renderer also does rendering OpenGL style (and via
> >> implementing the OpenGL API), which basically means linearly
> >> interpolating colors across triangles and multiplying them with the
> >> geometry being drawn. It is also possible to draw geometry with
> >> alpha-blending, or other blending modes, but this is slower (the OpenGL
> >> API implementation is basically sufficient to run Quake 3 Arena though;
> >> but at the moment, Q3A still has relatively little hope of usable
> >> performance).
> >>
> >>
> >> I guess one thing I had overlooked with a Minecraft style engine based
> >> around using ray-casts for visibility finding, was that the ray-casts
> >> tend to hit only a small part of the world with a short draw distance,
> >> so the cache-miss rates from the raycasts are fairly modest.
> >>
> >> ...
> >>
> >>
> >> Not sure how this would work running RISC-V, I somehow suspect that
> >> running the OpenGL rasterizer as RISC-V code would be pretty much awful.
> >>
> >> Yet, going by SWeRV's numbers, they get somewhat better DMIPS/MHz than I
> >> am getting from BJX2.
> >>
> >>
> >> Notes that the SWeRV core is able to run single-core on the same FPGA
> >> board I am using, albeit only passes timing at a lower clock speed than
> >> what I am using for BJX2 (one can run it at 33 MHz I guess...).
> >>
> >> I don't know how much is likely due to hardware magic, or better
> >> compiler optimization, will likely need to model some of this a bit better.
> >>
> >>
> >> But, I don't know...
> >
> > I'm surprised you can only get to 50MHz on the Artix-7?
> > Vivado offers register retiming - ugh, no personal experience with retiming
> >
> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
> and 16K + 32K L1 caches.
>
> It is possible to get 100MHz, but:
> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>
> But, as noted, 1 lane and 2kB L1's are not good for performance...
>
>
>
> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be going
> into cache misses, but this is still an improvement from ~ 85% on some
> earlier versions of my core.
>
> From what I can gather, a partial factor is that each level of the
> cache hierchy roughly doubles the number of requests that go to the next
> level down on a miss:
> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
> > If you are at liberty to get another FPGA board, consider the Kria K26 SOM
> > ~same price, more than twice the LUTs, 3MB of block RAM, 16nm, ugh export controlled
> >
> Possible, I guess it could be a possible, though looks like it would
> need some sort of IO interface or something (no normal IO connections or
> PMOD connectors; unlike the Arty or Nexys).
>
> Also not sure if it would be usable via the freeware version of Vivado, ...
>
>
>
> I am also using a CMod-S7 (XC7S25), and have managed to fit a smaller
> version of the BJX2 core on this. Though, this device doesn't have any
> external RAM.

Click here to read the complete article

On 10/2/2021 1:31 AM, BGB wrote:

snip

> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
> and 16K + 32K L1 caches.
>
> It is possible to get 100MHz, but:
> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>
> But, as noted, 1 lane and 2kB L1's are not good for performance...
>
>
>
> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be going
> into cache misses, but this is still an improvement from ~ 85% on some
> earlier versions of my core.
>
> From what I can gather, a partial factor is that each level of the
> cache hierchy roughly doubles the number of requests that go to the next
> level down on a miss:
> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.

So each level of cache is getting only a 33% hit rate???? That truly is
terrible.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

On 10/2/2021 10:27 AM, Stephen Fuld wrote:
> On 10/2/2021 1:31 AM, BGB wrote:
>
> snip
>
>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
>> and 16K + 32K L1 caches.
>>
>> It is possible to get 100MHz, but:
>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>
>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>
>>
>>
>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be
>> going into cache misses, but this is still an improvement from ~ 85%
>> on some earlier versions of my core.
>>
>> From what I can gather, a partial factor is that each level of the
>> cache hierchy roughly doubles the number of requests that go to the
>> next level down on a miss:
>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>
> So each level of cache is getting only a 33% hit rate???? That truly is
> terrible.
>

No. If it were, I would consider the caches broken.

I am typically getting around a 98% hit rate for L1, and around 60% to
80% for L2.

But, I had modeled what happens assuming each time a cache level misses,
and it nearly doubles each time; likely from evicting an old cache line
and fetching a new one.

None the less, that small percentage of L1 misses tends to amount to
around 60% of the total clock cycles (in Doom/Quake/...).

This is with a DDR controller which takes 59 clock cycles to do a 64B
cache/line swap operation with RAM (or 38 load + 35 store, but the 59
cycle swap operation tends to come out ahead).

This is with a 16-bit wide DDR2 interface being run at 50MHz
(non-standard operating mode, DLL disabled). It is sorta possible to run
it at 75MHz, but stability/reliability kinda goes in the toilet.

Apparent "standard" option would be to run the chip at 400-600 MHz via
SERDES, but alas...

When one accounts for RAM timing latency and similar, the theoretical
gains are likely to be relatively modest, so hasn't seemed worth the
effort (and it appears most other people are just using Vivado MIG +
AXI, rather than using a custom written DDR controller).

So, max speeds for DRAM:
54 MB/s swap;
84 MB/s load;
91 MB/s store.

Some memory performance stats:
L1 memcpy: 275 MB/s
L2 memcpy: 67 MB/s
DRAM memcpy: 18 MB/s

L1 memset: 305 MB/s
L2 memset: 130 MB/s
DRAM memset: 44 MB/s

L1 memload: 448 MB/s
L2 memload: 175 MB/s
DRAM memload: 73 MB/s

And, when accounting for the multi-level scaling factors, it sends to
map up reasonably OK.

Note that, in this case, this is with a direct-mapped L1 cache, and a
2-way set-associative L2 cache.

When ~ 85% of the cycles were in mem-stall, the memcpy numbers were more
like:
L2 memcpy: 49 MB/s
DRAM memcpy: 8 MB/s

Maximum theoretical limits (100% hit rate and 0% loop overhead, MOV.X):
Memcpy: 400 MB/s
Memset: 800 MB/s
Memload: 800 MB/s
And, MOV.Q:
Memcpy: 200 MB/s
Memset: 400 MB/s
Memload: 400 MB/s

However, I have noted that past tests aren't entirely immune to loop
overhead effecting the measurements.

I am not sure what the stats were with my older bus, which gave, roughly
(from memory):
L2 memcpy: ~ 12 MB/s
DRAM memcpy: ~ 6 MB/s

But, given the core tended to run with the memory-stall related LEDs lit
pretty much constantly, I am pretty sure it was bad...

I also get generally somewhat better performance with the newer bus.

Note that the older bus was a "one request at a time" bus:
Wait for bus status to be READY;
Put request onto bus (address, data, command);
Wait for bus status to become OK;
Put IDLE on the bus command;
Wait for bus to become READY;
...

Which would then forward across multiple levels (analogous to a
circuit-switched telephone, just using the address like a phone number).

The newer bus, at least at the L1/L2 interface is using a ring bus:
Requests/responses move along the ring at one message per cycle;
Each node handles requests/responses that are addressed to them;
All the units are basically connected together into a big ring.

The ring also seems to deal a lot better with multiple parties using it,
for example, I was able to move the video display over to operating on
the ring bus for accessing RAM-backed VRAM with no real visible effect
on the performance of the CPU core (actually, it is improved, because
now it can have bigger L1 and L2 caches due to not needing to eat 128K
of Block-RAM for the older dedicated VRAM; albeit with the tradeoff that
L2 misses on VRAM can now result in graphical glitches in the display).

However, the L2<->DRAM interface still uses a variation on the older
bus, mostly because it's logic works across clock-domain crossings.
Though, I had developed a "sequence number" trick which is able to
reduce latency somewhat in this case.

There was an older version of the DDR controller which used 16B cache
lines (in the L2<->DRAM interface), which while Doom doesn't see much
difference from the bigger cache lines, many other things seem to benefit.

The bigger lines increase total bandwidth at the expense of higher L2
miss latency.

The RAM backed VRAM only seems to only work acceptably with the 64B
cache lines, as trying to use it with 16B L2 cache lines results in the
DRAM interface clogging up and everything going to crap (big performance
hit with rather broken looking display output).

....

On 2021-10-02, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
> On 10/2/2021 1:31 AM, BGB wrote:
>
> snip
>
>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
>> and 16K + 32K L1 caches.
>>
>> It is possible to get 100MHz, but:
>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>
>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>
>>
>>
>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be going
>> into cache misses, but this is still an improvement from ~ 85% on some
>> earlier versions of my core.
>>
>> From what I can gather, a partial factor is that each level of the
>> cache hierchy roughly doubles the number of requests that go to the next
>> level down on a miss:
>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>
> So each level of cache is getting only a 33% hit rate???? That truly is
> terrible.
>
Depends on how cache goes for each level...

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp

On 10/2/2021 8:28 AM, JimBrakefield wrote:
> On Saturday, October 2, 2021 at 3:31:25 AM UTC-5, BGB wrote:
>> On 10/1/2021 6:54 PM, JimBrakefield wrote:
>>> On Friday, October 1, 2021 at 6:26:24 PM UTC-5, BGB wrote:
>>>> On 10/1/2021 4:05 PM, JimBrakefield wrote:
>>>>> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:
>>>>>> So, I was poking at a few seemingly random ideas:
>>>>>> * Take the existing RISC-V ISA, and add a WEX like sub-mode.
>>>>>> * Goal would be to remain backwards compatible with both the base ISA
>>>>>> and with RVC (though, they may be mutually-exclusive in a temporal sense);
>>>>>> * Gluing RISC-V decoder support onto my BJX2 core as an optional
>>>>>> extension feature.
>>>>>>
>>>> ...
>>>>>>
>>>>>> Well, unless one wants RV64GC or similar, which would likely require a
>>>>>> certain amount of emulation to pull off (or making a "native" RISC-V
>>>>>> core would likely require a full fork).
>>>>>>
>>>>>>
>>>>>> Or, is all this probably just a garbage feature that makes everything
>>>>>> likely worse-off overall ?...
>>>>>>
>>>>>>
>>>>>> Any thoughts?...
>>>>> Some of the not too expensive FPGAs have room for a hundred simple cores,
>>>>> all running simultaneously. Easy enough to show them on a 4K display.
>>>>> A solution in search of a problem?
>>>>>
>>>> One can have lots of tiny cores, but they are not useful for all that much.
>>>>
>>>> One of the bigger problems I have been facing in my projects is not so
>>>> much with doing computations, but with getting data from memory efficiently.
>>>>
>>>> Making the code itself faster has the indirect drawback in that it
>>>> increases the relative amount of time spent waiting for memory (and I
>>>> still have not succeeded at making Quake all that playable at 50MHz).
>>>>
>>>> Past attempts to work out the math for "what if I ran the RAM using
>>>> SERDES at a more 'appropriate' clock speed" only came up with fairly
>>>> modest gains.
>>>>
>>>>
>>>> Most of my recent performance gains have come more from fixing compiler
>>>> bugs than from hardware features, eg:
>>>> Fixing a bug which resulted in lots of redundant type conversions helped
>>>> a certain amount;
>>>> More recently, fixing another bug where loading a value from an unsigned
>>>> type tended to result in an additional (redundant) type-conversion op
>>>> was also helpful (~ 1% smaller binaries, also made Quake a little faster);
>>>> ...
>>>>
>>>> Some of these savings were more due to reducing register pressure (fewer
>>>> temporaries means fewer stack spills).
>>>>> For soft cores that support virtual memory, maybe room for ten cores?
>>>>> Don't forget 2 to 6 built-in ARM cores. Or:
>>>>>
>>>>> Actel Polarfire, now Microchip/Microsemi offers four RISC-V cores and FPGA fabric.
>>>>>
>>>> The FPGAs I am using don't have any hard-wired cores.
>>>>
>>>> Mostly dealing with Spartan and Artix class hardware.
>>>>
>>>> Was previously generally able to fit two BJX2 cores onto an XC7A100T,
>>>> and had also been working some on trying to implement a modified core
>>>> which could support SMT (would effectively have 6 execute lanes with two
>>>> pilelines, but partly conjoins some other hardware).
>>>>
>>>> The hope was that SMT could be cheaper than full dual-core (such as by
>>>> sharing the MMU and FPU, ...).
>>>>
>>>>
>>>> Had looked at possibly doing a 6-wide version of the BJX2 ISA, but there
>>>> are enough roadblocks here that this does not seem particularly viable
>>>> (mostly involving trying to pull off a unified register file with 12
>>>> read ports and 6 write ports).
>>>>
>>>> A purely SMT implementation avoids this issue though, as it can use
>>>> independent register files.
>>>>
>>>>
>>>>
>>>> I could fit 4 cores into the FPGA I am using if I were running a 64-bit
>>>> RISC with only a single execute lane (running at one instruction per
>>>> cycle), with smaller L1 caches (say, 4K I$ + 8K D$).
>>>>
>>>>
>>>> Or around 8 cores if I were using a much more minimalist core, such as a
>>>> basic 32-bit single-issue core.
>>>>
>>>> Though, one also has to use smaller/simpler L1 caches for this (such as
>>>> 512B 16x2x16B), but making the L1 caches smaller or similar tends to
>>>> have a significant adverse effect on performance.
>>>>
>>>>
>>>> Programs like Doom and Quake seem to be most happy with ~ 16K or 32K of
>>>> L1, and as much L2 as one can throw at them. After more recently moving
>>>> to RAM backed VRAM, this means I can now afford 256K of L2 (with only a
>>>> moderate amount of clashing between the active program and
>>>> video-display; partly as the display hardware is also routed through the
>>>> L2 cache; this seems to hold up pretty OK for 320x200 at least).
>>>>
>>>> However, with 256K of L2 + 16K + 32K for L1, I am basically at the limit
>>>> of the available Block-RAM.
>>>>
>>>>
>>>> In theory, given the "pretty much awful" ILP which I am getting from my
>>>> C compiler, I could run a single-issue core at 100MHz vs a 3-wide core
>>>> at 50MHz, and have it be faster.
>>>>
>>>> However... I have to make the L1 caches smaller, which hurts the
>>>> performance a lot worse than the higher clock-speed helps (it doesn't
>>>> make the RAM any faster).
>>>>
>>>> So, it ends up being faster and easier to just stick with everything
>>>> being 50 MHz, and instead trying to leverage VLIW. In theory, in-order
>>>> super-scalar can do basically the same thing, but is more complicated
>>>> and more expensive than VLIW.
>>>>
>>>> Similar, 50MHz single issue is slower than 50MHz VLIW.
>>>>
>>>>
>>>> And, making an L1 D$ that can be 16K or 32K and also allow single-cycle
>>>> access latency, and pass timing at 75 or 100MHz, is easier said than done.
>>>>
>>>>
>>>>
>>>> But, kinda in a local-optimum trap here, where I can make Doom run
>>>> pretty well, but Quake still tends to remain "pretty much unplayable".
>>>>
>>>> It looks like to get reasonably playable framerates in Quake, I would
>>>> need to pull off all 3 of these:
>>>> Similar-sized, or larger, L1 caches (eg: 32K L1 D$, or more);
>>>> Significantly faster DRAM access;
>>>> Being able to run a 3-wide pipeline at 100MHz.
>>>>
>>>> Ironically, I have managed to make a small Minecraft style 3D engine
>>>> work semi-passably on this (performance is a little bit better than
>>>> Quake; although the draw distance isn't particularly large at only 24
>>>> meters).
>>>>
>>>>
>>>>
>>>> This isn't really going to happen on the Artix I am using though, and I
>>>> still don't have a job at present, so no real money to drop on a much
>>>> more expensive Kintex based board.
>>>>
>>>> Some other guy managed to get somewhat better framerates for a Minecraft
>>>> style engine running on an ICE40, but his effort offloads most of the 3D
>>>> rendering machinery to hardware, and is basically using a small 16-bit
>>>> RISC style ISA for the actual CPU part.
>>>>
>>>>
>>>> In contrast, I am basically using software rendering based around a more
>>>> traditional triangle-rasterization approach. However, since
>>>> perspective-correct rendering is expensive, my renderer tends to carve
>>>> up large triangles and draws them using affine projection (more
>>>> subdivision looks nicer, but comes at a fairly steep performance cost).
>>>>
>>>> Note that my 3D renderer also does rendering OpenGL style (and via
>>>> implementing the OpenGL API), which basically means linearly
>>>> interpolating colors across triangles and multiplying them with the
>>>> geometry being drawn. It is also possible to draw geometry with
>>>> alpha-blending, or other blending modes, but this is slower (the OpenGL
>>>> API implementation is basically sufficient to run Quake 3 Arena though;
>>>> but at the moment, Q3A still has relatively little hope of usable
>>>> performance).
>>>>
>>>>
>>>> I guess one thing I had overlooked with a Minecraft style engine based
>>>> around using ray-casts for visibility finding, was that the ray-casts
>>>> tend to hit only a small part of the world with a short draw distance,
>>>> so the cache-miss rates from the raycasts are fairly modest.
>>>>
>>>> ...
>>>>
>>>>
>>>> Not sure how this would work running RISC-V, I somehow suspect that
>>>> running the OpenGL rasterizer as RISC-V code would be pretty much awful.
>>>>
>>>> Yet, going by SWeRV's numbers, they get somewhat better DMIPS/MHz than I
>>>> am getting from BJX2.
>>>>
>>>>
>>>> Notes that the SWeRV core is able to run single-core on the same FPGA
>>>> board I am using, albeit only passes timing at a lower clock speed than
>>>> what I am using for BJX2 (one can run it at 33 MHz I guess...).
>>>>
>>>> I don't know how much is likely due to hardware magic, or better
>>>> compiler optimization, will likely need to model some of this a bit better.
>>>>
>>>>
>>>> But, I don't know...
>>>
>>> I'm surprised you can only get to 50MHz on the Artix-7?
>>> Vivado offers register retiming - ugh, no personal experience with retiming
>>>
>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
>> and 16K + 32K L1 caches.
>>
>> It is possible to get 100MHz, but:
>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>
>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>
>>
>>
>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be going
>> into cache misses, but this is still an improvement from ~ 85% on some
>> earlier versions of my core.
>>
>> From what I can gather, a partial factor is that each level of the
>> cache hierchy roughly doubles the number of requests that go to the next
>> level down on a miss:
>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>>> If you are at liberty to get another FPGA board, consider the Kria K26 SOM
>>> ~same price, more than twice the LUTs, 3MB of block RAM, 16nm, ugh export controlled
>>>
>> Possible, I guess it could be a possible, though looks like it would
>> need some sort of IO interface or something (no normal IO connections or
>> PMOD connectors; unlike the Arty or Nexys).
>>
>> Also not sure if it would be usable via the freeware version of Vivado, ...
>>
>>
>>
>> I am also using a CMod-S7 (XC7S25), and have managed to fit a smaller
>> version of the BJX2 core on this. Though, this device doesn't have any
>> external RAM.
>
> |> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
> |> and 16K + 32K L1 caches.
> |> It is possible to get 100MHz, but:
> |> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>
> Haven't ventured into the world of memory mapping and caches, so
> don't know what the expectations are for Fmax in that situation.
>

Click here to read the complete article

On 10/2/2021 1:10 PM, Branimir Maksimovic wrote:
> On 2021-10-02, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
>> On 10/2/2021 1:31 AM, BGB wrote:
>>
>> snip
>>
>>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
>>> and 16K + 32K L1 caches.
>>>
>>> It is possible to get 100MHz, but:
>>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>>
>>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>>
>>>
>>>
>>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be going
>>> into cache misses, but this is still an improvement from ~ 85% on some
>>> earlier versions of my core.
>>>
>>> From what I can gather, a partial factor is that each level of the
>>> cache hierchy roughly doubles the number of requests that go to the next
>>> level down on a miss:
>>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>>
>> So each level of cache is getting only a 33% hit rate???? That truly is
>> terrible.
>>
> Depends on how cache goes for each level...
>

Pretty much...

L1 hits 98% of the time
But, if it misses, it throws ~ 2 requests at the L2 cache.
L2 cache hits 60-80% of the time.
But, if it misses, it throws ~ 2 requests at DRAM.

With the caches being NINE + Write-Back in this case.

Cache consistency semantics are currently "better hope you remembered to
flush". Though, there is a "volatile flag" trick, and it looks like I
may have reason to add memory-ops with an explicit volatile flag (with
the cache line being flushed automatically following the operation in
question).

A majority of the misses tend to result in a store+load, with a minority
resulting in a load by itself (eg: if by some chance, the target cache
line wasn't marked as Dirty).

Though, I had added a "swap request" for DRAM which basically sends a
cache line in both directions at the same time. This mostly just reduces
bus signaling overhead though, since one still effectively needs to do
two sets of burst transfers at the level of the DRAM.

So, the swap request effectively does both DRAM requests as a single
operation which takes ~ 80% as many clock cycles as doing a store+load
end-to-end.

....

On 2021-10-02, BGB <cr88192@gmail.com> wrote:
>
>
> So, max speeds for DRAM:
> 54 MB/s swap;
> 84 MB/s load;
> 91 MB/s store.
>
> Some memory performance stats:
> L1 memcpy: 275 MB/s
> L2 memcpy: 67 MB/s
> DRAM memcpy: 18 MB/s
>
> L1 memset: 305 MB/s
> L2 memset: 130 MB/s
> DRAM memset: 44 MB/s
>
> L1 memload: 448 MB/s
> L2 memload: 175 MB/s
> DRAM memload: 73 MB/s
>
>
/So you are working on acient hardware and make conclusions :P

> ...

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp

On 2021-10-02, BGB <cr88192@gmail.com> wrote:
> On 10/2/2021 1:10 PM, Branimir Maksimovic wrote:
>> On 2021-10-02, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
>>> On 10/2/2021 1:31 AM, BGB wrote:
>>>
>>> snip
>>>
>>>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register file,
>>>> and 16K + 32K L1 caches.
>>>>
>>>> It is possible to get 100MHz, but:
>>>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>>>
>>>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>>>
>>>>
>>>>
>>>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be going
>>>> into cache misses, but this is still an improvement from ~ 85% on some
>>>> earlier versions of my core.
>>>>
>>>> From what I can gather, a partial factor is that each level of the
>>>> cache hierchy roughly doubles the number of requests that go to the next
>>>> level down on a miss:
>>>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>>>
>>> So each level of cache is getting only a 33% hit rate???? That truly is
>>> terrible.
>>>
>> Depends on how cache goes for each level...
>>
>
> Pretty much...
>

I|ntel switched to cache victim schema and got owfull performance
of L3, while adding *AVX512* :P

> ...
>

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp

On 10/2/2021 3:01 PM, Branimir Maksimovic wrote:
> On 2021-10-02, BGB <cr88192@gmail.com> wrote:
>>
>>
>> So, max speeds for DRAM:
>> 54 MB/s swap;
>> 84 MB/s load;
>> 91 MB/s store.
>>
>> Some memory performance stats:
>> L1 memcpy: 275 MB/s
>> L2 memcpy: 67 MB/s
>> DRAM memcpy: 18 MB/s
>>
>> L1 memset: 305 MB/s
>> L2 memset: 130 MB/s
>> DRAM memset: 44 MB/s
>>
>> L1 memload: 448 MB/s
>> L2 memload: 175 MB/s
>> DRAM memload: 73 MB/s
>>
>>
> /So you are working on acient hardware and make conclusions :P
>

The XC7A100T isn't exactly all that old by FPGA standards...

Granted, looking into it, I guess the Artix-7 line was launched in 2010,
so it has been a little while.

Likewise, this line basically optimized more to be low-power than fast,
and generally it seems that the BJX2 runs in the mW range, ...

Not as sure when the Nexys A7 was released, apparently it was designed
mostly to be used by students in ECE classes. Granted, "nerd who wanted
to design their own CPUs" is possibly also a valid category.

But, realistically, probably can't do *that* much better for a core
running at 50MHz on an FPGA, and there isn't that much alternative which
isn't either more expensive, or "generally worse", ...

Granted, the jury is still out on whether I should go find some
other/faster DDR controller to yank, write a faster DDR controller, or
just give in and use MIG and AXI.

But, I guess the other thing is the whole "keep using my own ISA, or
start looking at transitioning stuff over to RISC-V".

But, based on what I can see, it seems like BJX2 should be able to be
functionally superior to RISC-V on the performance front, if given
similar hardware capabilities and similar levels of compiler maturity.

Though, compared with other ISAs I could glue on, RISC-V maps reasonably
well to the existing pipeline, and is not patent encumbered. So, this is
mostly just a case of using an alternate instruction decoder...

Or, at least within the RV64IC subset. The M/A/F/D extensions don't
match up quite so well.

Could add something like a dedicated microcode ROM, but bleh...

On 2021-10-02, BGB <cr88192@gmail.com> wrote:
> On 10/2/2021 3:01 PM, Branimir Maksimovic wrote:
>> On 2021-10-02, BGB <cr88192@gmail.com> wrote:
>>>
>>>
>>> So, max speeds for DRAM:
>>> 54 MB/s swap;
>>> 84 MB/s load;
>>> 91 MB/s store.
>>>
>>> Some memory performance stats:
>>> L1 memcpy: 275 MB/s
>>> L2 memcpy: 67 MB/s
>>> DRAM memcpy: 18 MB/s
>>>
>>> L1 memset: 305 MB/s
>>> L2 memset: 130 MB/s
>>> DRAM memset: 44 MB/s
>>>
>>> L1 memload: 448 MB/s
>>> L2 memload: 175 MB/s
>>> DRAM memload: 73 MB/s
>>>
>>>
>> /So you are working on acient hardware and make conclusions :P
>>
>
> The XC7A100T isn't exactly all that old by FPGA standards...
>
> Granted, looking into it, I guess the Artix-7 line was launched in 2010,
> so it has been a little while.
>
//FPGA, isn't it for different thinking? Not classic programmer thinking?

> Or, at least within the RV64IC subset. The M/A/F/D extensions don't
> match up quite so well.
>
> Could add something like a dedicated microcode ROM, but bleh...
>
:P

7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp

BGB <cr88192@gmail.com> schrieb:

> But, I guess the other thing is the whole "keep using my own ISA, or
> start looking at transitioning stuff over to RISC-V".

You could also use the basic POWER ABI, that has also been opened up.

> But, based on what I can see, it seems like BJX2 should be able to be
> functionally superior to RISC-V on the performance front, if given
> similar hardware capabilities and similar levels of compiler maturity.

RISC-V is not a very good design, IMHO.

On 10/2/2021 10:25 AM, BGB wrote:
> On 10/2/2021 10:27 AM, Stephen Fuld wrote:
>> On 10/2/2021 1:31 AM, BGB wrote:
>>
>> snip
>>
>>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register
>>> file, and 16K + 32K L1 caches.
>>>
>>> It is possible to get 100MHz, but:
>>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>>
>>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>>
>>>
>>>
>>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be
>>> going into cache misses, but this is still an improvement from ~ 85%
>>> on some earlier versions of my core.
>>>
>>> From what I can gather, a partial factor is that each level of the
>>> cache hierchy roughly doubles the number of requests that go to the
>>> next level down on a miss:
>>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>>
>> So each level of cache is getting only a 33% hit rate???? That truly
>> is terrible.
>>
>
> No. If it were, I would consider the caches broken.
>
> I am typically getting around a 98% hit rate for L1, and around 60% to
> 80% for L2.

OK, I was confused by your description. Those numbers are certainly
much more reasonable,

> But, I had modeled what happens assuming each time a cache level misses,
> and it nearly doubles each time; likely from evicting an old cache line
> and fetching a new one.

Again, I may be misinterpreting what you are saying, and if so I
apologize, but a cache miss should, in general, not take 2X the access
time of the next level down to resolve. I don't recall whether you said
your caches were exclusive or inclusive, and write thru or not, which
effect the details, but even if you have to both write the old line out
to the next level and read the new line in, the addition of a small
buffer to hold the old line while you read the new one in allows you to
overlap the writing the old line to the buffer with the read access for
the new line. Then you write the old line from the buffer while the CPU
proceeds with its processing. This allows the vast majority of time, to
take only one access time to resolve the miss.

> None the less, that small percentage of L1 misses tends to amount to
> around 60% of the total clock cycles (in Doom/Quake/...).
>
> This is with a DDR controller which takes 59 clock cycles to do a 64B
> cache/line swap operation with RAM (or 38 load + 35 store, but the 59
> cycle swap operation tends to come out ahead).

But you are only going to DRAM .02 * .4 or .008 of the time, which is an
average of far less than one cycle.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

On 2021-10-02 03:42, BGB wrote:
> On 10/1/2021 5:37 PM, MitchAlsup wrote:
>> On Friday, October 1, 2021 at 3:28:47 PM UTC-5, BGB wrote:

[snip]

>>> RISC-V seems lack anything equivalent to the Conv family instructions:
>>> MOV, EXTU/EXTS/...
>>> I am having a concerned feeling that these may need to be faked using
>>> masks and shifts, say:
>>> EXTS.B R8, R9
>>> Maps to, say:
>>> SLLI X6, X8, 56
>>> SRAI X9, X6, 56
>> <
>> RISC-V has 12-bit constants. The lower 6-bits is the shift count, the
>> upper 6-bits
>> are the size, with the constraint that 0 implies maximum (i.e.,
>> register) width.
>> <
>> I do this in My 66000. When used as reg-reg, bits<5:0> remain the
>> shift count
>> while bits <37:32> are the field size. The HW checks that the other
>> bits have
>> "no significance" and raised the OPERAND exception if the operand pattern
>> is not within the required domain.
>
> I didn't see any mention of things working this way in the spec.

It may be that most of these things are hidden in the not-yet-finished
bitmanip extension.

/Marcus

On 10/3/2021 6:40 AM, Stephen Fuld wrote:
> On 10/2/2021 10:25 AM, BGB wrote:
>> On 10/2/2021 10:27 AM, Stephen Fuld wrote:
>>> On 10/2/2021 1:31 AM, BGB wrote:
>>>
>>> snip
>>>
>>>> I achieve 50MHz, with 3 execute lanes, 6 read / 3 write register
>>>> file, and 16K + 32K L1 caches.
>>>>
>>>> It is possible to get 100MHz, but:
>>>> 1 lane, 3R + 1W regfile, and 2K L1 caches.
>>>>
>>>> But, as noted, 1 lane and 2kB L1's are not good for performance...
>>>>
>>>>
>>>>
>>>> As-is, at 50 MHz, around 60% or so of the clock cycles seem to be
>>>> going into cache misses, but this is still an improvement from ~ 85%
>>>> on some earlier versions of my core.
>>>>
>>>> From what I can gather, a partial factor is that each level of the
>>>> cache hierchy roughly doubles the number of requests that go to the
>>>> next level down on a miss:
>>>> 1 L1 access -> ~2 L2 requests -> ~4 DRAM requests.
>>>
>>> So each level of cache is getting only a 33% hit rate???? That truly
>>> is terrible.
>>>
>>
>> No. If it were, I would consider the caches broken.
>>
>> I am typically getting around a 98% hit rate for L1, and around 60% to
>> 80% for L2.
>
> OK, I was confused by your description. Those numbers are certainly
> much more reasonable,
>

OK.

In cases where the L1 breaks in some major way, performance is *glacial*...

So, in this case, one would not be so much dealing with 12-16 fps in
Doom, so much as 1 fps in Doom...

Meanwhile, if I disable L2 miss modeling in the partial simulation (the
test-bench behaving as if the L2 always hits), Doom tends to mostly run
at the speed of the 32-fps limiter.

The partial simulation does generally include logic to mimic the
behavior of L2 misses such that the numbers are not too far off from
those of the full core.

Well, and also the emulator includes models which try to mimic the
behavior of the cache subsystem (mostly to calculate penalties, ...). I
have on/off considered trying to add logic for "staleness detection"
(eg, where an access happens which might result in reading stale
results), but haven't done so yet.

>
>> But, I had modeled what happens assuming each time a cache level
>> misses, and it nearly doubles each time; likely from evicting an old
>> cache line and fetching a new one.
>
> Again, I may be misinterpreting what you are saying, and if so I
> apologize, but a cache miss should, in general, not take 2X the access
> time of the next level down to resolve. I don't recall whether you said
> your caches were exclusive or inclusive, and write thru or not, which
> effect the details, but even if you have to both write the old line out
> to the next level and read the new line in, the addition of a small
> buffer to hold the old line while you read the new one in allows you to
> overlap the writing the old line to the buffer with the read access for
> the new line. Then you write the old line from the buffer while the CPU
> proceeds with its processing. This allows the vast majority of time, to
> take only one access time to resolve the miss.
>

It is not quite 2.0, but more like 1.8 and 1.75, so a combined scale of
around 3.15.

Though, rounding it up makes it easier to reason about.

As noted elsewhere, the cache is Write-Back, and I later identified it
as using a NINE policy (this is the pattern that emerges from the way I
had implemented it, rather than any particular engineering decision).

>
>> None the less, that small percentage of L1 misses tends to amount to
>> around 60% of the total clock cycles (in Doom/Quake/...).
>>
>> This is with a DDR controller which takes 59 clock cycles to do a 64B
>> cache/line swap operation with RAM (or 38 load + 35 store, but the 59
>> cycle swap operation tends to come out ahead).
>
> But you are only going to DRAM .02 * .4 or .008 of the time, which is an
> average of far less than one cycle.
>

Round trip seems to cost ~ 12 cycles (ringbus/...) + ~ 2x 59 cycles
(DRAM) + ~ 2x 10 cycles (L2 machinery).

So, say, around 150 cycles for a missed memory access that goes to DRAM.

So, assuming 98% L1 hit, 60% L2 hit:
So, ~0.8% of the time, or ~ 1.2 cycle penalty...

So, now say we have memory access ops which (in an ideal case, take 1
cycle, and add ~ 1.2 cycle of memory access penalty).

Then, 1.2% of the time, where the cost is ~ 12 cycles, so, another 0.144
cycle.

Total cost is: 1 + 1.2 + 0.144, or around 2.344 cycles per memory access.

For a lot of code, memory accesses tend to dominate over pretty much
everything else (in terms of usage frequency, *1).

And, also because a fair percentage of the non-memory ALU ops get
executed in parallel with the memory ops (so, from a clock-cycle
perspective, it is almost like one has code consisting largely of solid
unbroken blocks of memory loads and stores).

So, yeah, I think this is where the 60% overhead is coming from...

There is possibly still a bit of room for improvement here.

But, it is kinda lame when one realizes that a lot of code isn't doing
so much computation as it is just sort of shuffling values from one
place in memory to another place in memory with occasional bits of
arithmetic thrown in.

For example, there are some "semi-dense" drawing-related loops in Doom
which manage to spend nearly 90% of their clock cycles in cache-miss
penalties (according to the models).

*1: Ironically, they would likely be less dominant for RISC-V for the
main reason that the ISA would need to be spending a lot more cycles on
address-calculation tasks (more so, in the absence of instruction bundles).

....

Subject	Author
Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	JimBrakefield
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	JimBrakefield
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	JimBrakefield
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Stephen Fuld
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Branimir Maksimovic
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Branimir Maksimovic
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Thomas Koenig
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Stephen Fuld
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Branimir Maksimovic
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Branimir Maksimovic
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	MitchAlsup
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	JimBrakefield
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	BGB
Re: Misc idea: Dual ISA, BJX2 + RISC-V support...	Marcus