novaBBS - comp.arch - Re: RISC-V vs. Aarch64

On 12/24/2021 9:38 AM, Anton Ertl wrote:
> I have recently (despite me) watched
> <https://www.youtube.com/watch?v=Ii_pEXKKYUg>, where Chris Celio
> compares some aspects of RV64G, RV64GC, ARM A32, ARM A64, IA-32 and
> AMD64. There is also a tech report <https://arxiv.org/abs/1607.02318>
> <https://arxiv.org/pdf/1607.02318.pdf>
>
> In this posting I look at the different approaches taken in A64 (aka
> Aarch64) and RV64GC:
>
> * RISC-V has simple instructions, e.g., with only one addressing mode
> (reg+offset), and the base instruction set encodes each instruction
> in 32 bits, but with the C (compressed) extension adds a 16-bit
> encoding for the most frequent instructions.
>

A 16/32 encoding can save ~ 40-60% IME vs fixed 32-bit instructions.

For example: I was recently experimenting again with a "Fix32" variant
of my ISA, but building the Boot ROM in Fix32 mode basically blows out
the ROM size (had to both omit some stuff and also expand the ROM size
to 48K in this test; vs the 16/32 ISA where it still fits in 32K).

I had experimented in the past, noting that 16/24/32 doesn't save enough
over 16/32 to justify the cost.
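
A rough sanity check on numbers like these, as a minimal sketch. It
assumes (a simplification, not measured data) that the instruction count
stays the same and only the encoding width changes; `size_ratio` and
`saving` are made-up helper names.

```python
def size_ratio(frac16: float) -> float:
    """Code size of a 16/32 encoding relative to fixed 32-bit,
    assuming the same instruction count and that a fraction
    frac16 of the instructions fit the 16-bit form."""
    return (2 * frac16 + 4 * (1 - frac16)) / 4


def saving(small: int, large: int) -> float:
    """Fractional size saving of the smaller image vs the larger one."""
    return 1 - small / large


# Boot ROM figures from above: 32K (16/32) vs 48K (Fix32).
# The Fix32 build also omitted some stuff, so this is a lower bound.
rom_saving = saving(32 * 1024, 48 * 1024)  # about one third
```

Under this model, half of the instructions fitting the 16-bit form gives
a size ratio of 0.75, i.e. a 25% saving.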

For the main ROM, to be able to boot a Fix32 secondary core (and stay
within the 32K limit), I had to put some of the initial Boot ASM in
Fix32 mode. It detects that it is on a secondary core and then spins in
a loop, waiting for a special inter-processor interrupt (which is how
any secondary cores get "woken up" post-boot).

I suspect a reasonable lower limit for addressing modes is (Reg,Disp)
and (Reg,Index). Fewer than this leads to unnecessary inefficiency, as
both modes are "very common" in practice.
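
A minimal sketch of the two modes as effective-address calculations. The
scale factor on the index and the pseudo-op mnemonics are assumptions
made for illustration, not the encoding of any particular ISA.

```python
MASK64 = (1 << 64) - 1  # model 64-bit address wraparound


def ea_reg_disp(base: int, disp: int) -> int:
    """(Reg,Disp): base register plus a signed constant displacement."""
    return (base + disp) & MASK64


def ea_reg_index(base: int, index: int, scale: int = 1) -> int:
    """(Reg,Index): base register plus an index register.
    Scaling the index by the element size is an assumption here."""
    return (base + index * scale) & MASK64


def array_load_ops(has_index_mode: bool) -> list:
    """Pseudo-op sequence for loading a[i] with 8-byte elements
    (mnemonics are illustrative only)."""
    if has_index_mode:
        return ["LOAD Rd, (Rb, Ri*8)"]
    # With only (Reg,Disp), the address must be formed by hand first.
    return ["SHL  Rt, Ri, 3", "ADD  Rt, Rb, Rt", "LOAD Rd, (Rt, 0)"]
```

With only (Reg,Disp), every indexed array access in this model costs
three operations instead of one, which is the inefficiency in question.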

More elaborate modes quickly run into diminishing returns territory; so,
things like auto-increment and similar are better off left out IMO.

> * A64 has fixed-length 32-bit instructions, but they can be more
> complex: A64 has more addressing modes and additional instructions
> like load pair and store pair; in particular, the A64 architects
> seem to have had few concerns about the register read and write
> ports needed per instruction; E.g., a store-pair instruction can
> need four read ports, and a load pair instruction can need three
> write ports (AFAIK).
>

These are less of an issue if one assumes a minimum width for the core.
If the core is always at least 3-wide, this isn't an issue.

For a 1-wide core, there may need to be a compromise. For example, my
Fix32 core mentioned previously is only 1-wide, so it needed to omit the
MOV.X instruction. I did end up retaining Jumbo encodings, but they use
a slightly different mechanism from the 3-wide core.

I have noted that I can pair a 3-wide core with a 1-wide core and fit
them (barely) in the FPGA. Part of the issue is that I am paying a
fairly steep cost for the 64B burst transfers in the L2/DDR stage, but I
kinda need these to get sufficient memory bandwidth for a RAM-backed
framebuffer to work (otherwise the screen is broken garbage).

Ironically though, the 1-wide core isn't *that* much cheaper than the
3-wide core, despite being significantly less capable.

I could maybe make the core a bit cheaper by leaving out RV64I support,
which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
Arbitrary register as link-register; ...). Or, conversely, a core which
*only* does RV64I.

This still won't save much relative to the amount of LUTs eaten by the
L1D$ and TLB though (just these two are around 1/3 of the total LUT cost
of the 1-wide core).

Meanwhile, things like the instruction decoder mostly seem to
"disappear into the noise" in this case.

> Celio argues that the instruction density of RV64GC is competitive,
> and that the C extension is cheap to implement for a small
> implementation.
>
> It's pretty obvious that a small implementation of RV64G is smaller
> than a small implementation of A64, and adding the C extension to a
> small implementation of RV64G (to turn it to RV64GC) is reported in
> the talk IIRC (it's on the order of 700 transistors, so still cheap),
> so you can get a small RV64GC cheaper than a small A64 implementation
> and have similar code density.
>

Yeah, by the time you have all the other stuff in 'G', the cost of the
'C' extension is probably negligible.

> Celio also argues that instruction fusion can overcome the problem of
>
> But I wonder how things turn out for larger implementations: Now
> RV64GC needs twice as many decoders to decode, say, 32 bytes of
> instructions, and then some additional hardware that selects which of
> these decodes are valid (and of course all this extra hardware costs
> energy).
>
> And with instruction fusion the end result of the decoding step
> becomes even more complex.
>

Instruction fusion seems like a needless complexity IMO. It would be
better if the ISA can be "dense" enough to make the fusion mostly
unnecessary. Though, "where to draw the line" is probably a subject of
debate.

Granted, it could probably be justified if one has already paid the
complexity cost needed to support superscalar (since it should be able
to use the same mechanism to deal with both).
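
On the decoder-count point in the quoted text, a minimal sketch of why a
16/32 fetch window needs a selection step. In RISC-V, a 16-bit parcel
whose low two bits are 0b11 starts a 32-bit instruction (this sketch
ignores the longer-than-32-bit encodings); anything else is a 16-bit
instruction. The function names are made up, and the serial walk below
only models the result, not how hardware would do it in parallel.

```python
def is_32bit(parcel: int) -> bool:
    """True if this 16-bit parcel begins a 32-bit instruction."""
    return (parcel & 0b11) == 0b11


def instruction_starts(parcels: list) -> list:
    """Indices of the parcels where an instruction really begins,
    assuming parcel 0 is a valid instruction start."""
    starts = []
    i = 0
    while i < len(parcels):
        starts.append(i)
        i += 2 if is_32bit(parcels[i]) else 1
    return starts


# c.nop (0x0001), then addi x0,x0,0 (0x00000013, low parcel first),
# then another c.nop.
window = [0x0001, 0x0013, 0x0000, 0x0001]
starts = instruction_starts(window)  # [0, 1, 3]
```

Parcel 2 (0x0000) is the upper half of the ADDI, but taken on its own it
looks like the start of a 16-bit instruction. A wide decoder that decodes
at every 2-byte offset has to discard such candidates, so 32 bytes means
16 candidate positions rather than 8.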

I have come to the realization though that, because my cores lack many
of the tricks that fancier RISC-V cores use, my cores running in RISC-V
mode will at best have somewhat worse performance than something like
SweRV.

If I were designing a "similar" ISA to RISC-V (with no status flags), I
would probably leave out full Compare-and-Branch instructions, and
instead have a few "simpler" conditional branches, say:
BEQZ reg, label //Branch if reg==0
BNEZ reg, label //Branch if reg!=0
BGEZ reg, label //Branch if reg>=0
BLTZ reg, label //Branch if reg< 0

While conceptually, this doesn't save much, it would be cheaper to
implement in hardware. Relative compares could then use compare
instructions:
CMPx Rs, Rt, Rd
Where:
(Rs==Rt) => 0;
(Rs> Rt) => 1;
(Rs< Rt) => -1.

One might be tempted to use a plain SUB in place of CMPx and branch on
the sign of the result. The issue is that this would not work correctly
for comparing integer values the same width as the machine registers (if
the difference is too large, the intermediate value will overflow).
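
A minimal sketch of the difference, modelling 64-bit registers in Python.
`cmp3` stands in for the CMPx result described above and `sub64` for a
wrapping SUB; all the names here are made up for illustration.

```python
MASK64 = (1 << 64) - 1
INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)


def to_signed64(x: int) -> int:
    """Interpret the low 64 bits of x as a two's-complement value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def sub64(rs: int, rt: int) -> int:
    """Wrapping 64-bit subtract, as a plain SUB would produce."""
    return to_signed64(rs - rt)


def cmp3(rs: int, rt: int) -> int:
    """Three-way compare: 0 if equal, 1 if rs > rt, -1 if rs < rt."""
    return (rs > rt) - (rs < rt)


def bltz(reg: int) -> bool:
    """Branch condition: taken if reg < 0."""
    return reg < 0


# Small operands: SUB and CMPx agree about "less than".
agree = bltz(sub64(3, 5)) == bltz(cmp3(3, 5))

# Large difference: INT64_MAX - (-1) wraps around to INT64_MIN,
# so BLTZ on the SUB result wrongly reports INT64_MAX < -1.
wrong = bltz(sub64(INT64_MAX, -1))
right = bltz(cmp3(INT64_MAX, -1))
```

Here `wrong` is True while `right` is False, which is the overflow case
that a dedicated compare avoids.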

> OTOH, a high-performance A64 implementation probably wants some kind
> of register port allocator (possibly with port requirements across
> several cycles), so it has its own source of complexity. I wonder if
> that's the reason why some high-performance ARMs have microcode
> caches; I would normally have thought that microcode caches are
> unnecessary for a fixed-length instruction format.
>

I wouldn't think this necessary. If I were implementing it, I could do
it more like how I deal with 128-bit SIMD and similar in my existing
core, and map a single instruction to multiple lanes when needed.

> What do people more familiar with hardware design think?
>
> A classic question is the classification of the architecture: Celio
> claims that ARM is CISC (using an A32 Load-multiple instruction as
> example). These questions are not decided by touchy-feely arguments,
> but rather by instruction set properties. All RISCs are load/store
> general-purpose instruction sets; even the A32 load/store-multiple
> instructions can be considered a variant of that; one might argue that
> accessing two pages in one instruction is not RISC, but by that
> criterion RISC-V is not RISC (it allows page-crossing unaligned
> accesses). One other common trait in RISCs is fixed-length 32-bit
> instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
> exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
> the C extension.
>

IMO, Load/Store is the big issue...

Load/Store allows decent performance from a simplistic pipeline.

As soon as one deviates from Load/Store, one has effectively thrown a
hand grenade into the mix (one is doomed to either a much more complex
decoder, OoO, or a significant increase in clock cycles per
instruction).

In comparison, variable-length instructions and misaligned memory access
are not nearly as destructive. Granted, they are not free either. I
suspect the L1 D$ in my case would be somewhat cheaper if it did not
need to deal with misaligned access.

However, besides "cheap core", it is also nice to be able to have "fast
LZ77 decoding" and similar, which is an area where misaligned memory
access pays off.
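
A minimal sketch of where the misaligned accesses come from in an LZ77
match copy. This is a generic illustration, not the code of any
particular decoder; `lz_copy_match` and the 8-byte chunk size are
assumptions. The source and destination offsets are arbitrary byte
positions, so each 8-byte chunk is generally misaligned.

```python
def lz_copy_match(out: bytearray, dist: int, length: int,
                  chunk: int = 8) -> None:
    """Append `length` bytes to `out`, copied from `dist` bytes back.
    Requires 1 <= dist <= len(out)."""
    src = len(out) - dist
    if dist >= chunk:
        # Source and destination chunks never overlap, so the copy
        # can proceed one wide (typically misaligned) chunk at a time.
        done = 0
        while done < length:
            n = min(chunk, length - done)
            out += out[src + done:src + done + n]
            done += n
    else:
        # Short distances overlap within one chunk (a repeating
        # pattern), so this sketch falls back to byte-at-a-time.
        for i in range(length):
            out.append(out[src + i])


buf = bytearray(b"abcdefgh12")
lz_copy_match(buf, 10, 10)      # wide-chunk path

rep = bytearray(b"ab")
lz_copy_match(rep, 2, 5)        # overlapping path, repeats "ab"
```

Without misaligned access, the wide path would need shift-and-merge
logic or would degrade to the byte loop for nearly every match.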

Page-crossing doesn't seem to be too big of an issue, since it is rare,
and can be handled with two TLB misses in a row if needed (only the
first TLB miss gets served; when the interrupt returns and the
instruction tries again, it causes another TLB miss).
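
A minimal sketch of that retry behaviour. It assumes 4K pages and a
software-filled TLB modelled as a set of page numbers; the class and
function names are hypothetical.

```python
PAGE_SHIFT = 12  # 4K pages (assumed)


class TLBMiss(Exception):
    """Raised for the first page of an access that is not in the TLB."""

    def __init__(self, page: int):
        super().__init__(page)
        self.page = page


def pages_touched(addr: int, size: int) -> list:
    """Page numbers covered by an access of `size` bytes at `addr`."""
    first = addr >> PAGE_SHIFT
    last = (addr + size - 1) >> PAGE_SHIFT
    return list(range(first, last + 1))


def access(tlb: set, addr: int, size: int) -> None:
    """Attempt the access; only the first missing page is reported."""
    for page in pages_touched(addr, size):
        if page not in tlb:
            raise TLBMiss(page)


def run_with_retries(tlb: set, addr: int, size: int) -> int:
    """Retry the access until it succeeds; return the miss count."""
    misses = 0
    while True:
        try:
            access(tlb, addr, size)
            return misses
        except TLBMiss as miss:
            tlb.add(miss.page)  # the handler fills one entry, then returns
            misses += 1


crossing = run_with_retries(set(), 0x1FFC, 8)  # spans pages 1 and 2
aligned = run_with_retries(set(), 0x1000, 8)   # stays within page 1
```

With an empty TLB, the page-crossing access takes two misses in a row
and the non-crossing one takes a single miss; no special-case logic is
needed in the handler.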

> An interesting point in the talk is that zero-extension is a common
> idiom in RISC-V; the talk does not explain this completely, but
> apparently the 32-bit variants of instructions sign-extend (like Alpha
> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
> values around in sign-extended form (including 32-bit unsigned
> values).
>

It is a tradeoff.

In BJX2, I went with signed values being kept in sign-extended form, and
unsigned values kept in zero-extended form (and casts requiring explicit
sign or zero extension of the results).

This is a general rule even if there are only a minority of cases where
it should matter.
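
A minimal sketch of the two conventions for holding a 32-bit value in a
64-bit register. The helper names are made up; the register is modelled
as an unsigned 64-bit integer.

```python
MASK32 = 0xFFFFFFFF
HIGH32 = 0xFFFFFFFF00000000


def zext32(x: int) -> int:
    """Zero-extend the low 32 bits of x to 64 bits."""
    return x & MASK32


def sext32(x: int) -> int:
    """Sign-extend the low 32 bits of x to 64 bits (as a 64-bit pattern)."""
    x &= MASK32
    if x & 0x80000000:
        return x | HIGH32
    return x


u = 0xFFFFFFFF  # 32-bit unsigned value 4294967295

# Convention from the quoted text: all 32-bit values, including
# unsigned ones, are held in sign-extended form.
held_sign_extended = sext32(u)

# Convention described above: unsigned values are held zero-extended.
held_zero_extended = zext32(u)

# Using the sign-extended form as a 64-bit array index needs an
# explicit zero-extension first.
index = zext32(held_sign_extended)
```

Here `held_sign_extended` is 0xFFFFFFFFFFFFFFFF while
`held_zero_extended` is 0x00000000FFFFFFFF, so the extra zero-extension
is only needed under the first convention. Values with bit 31 clear look
the same under both.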

