Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: RISC-V vs. Aarch64
Date: Sat, 25 Dec 2021 15:07:35 -0600
Organization: A noiseless patient Spider
Lines: 435
Message-ID: <sq816p$9ak$1@dont-email.me>
References: <2021Dec24.163843@mips.complang.tuwien.ac.at>
<sq5dj1$1q9$1@dont-email.me>
<59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 25 Dec 2021 21:07:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ecc0660471340972c1c701cec1d2adc3";
logging-data="9556"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Q0uri8EWWFFVSF4LEnVGl"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.4.1
Cancel-Lock: sha1:DnfalMq/evfi4ic67J7P8G53Fu4=
In-Reply-To: <59376149-c3d3-489e-8b41-f21bdd0ce5a9n@googlegroups.com>
Content-Language: en-US

On 12/24/2021 4:06 PM, MitchAlsup wrote:
> On Friday, December 24, 2021 at 3:20:36 PM UTC-6, BGB wrote:
>> On 12/24/2021 9:38 AM, Anton Ertl wrote:
> <snip>
>>
>> I suspect a lower reasonable limit for addressing modes is (Reg,Disp)
>> and (Reg,Index). Fewer than this leads to unnecessary inefficiency, and
>> both modes are "very common" in practice.
> <
> I added [Rbase+Rindex<<scale+disp] even if it is only 2%-3% because
> it is expressive (more to FORTRAN needs than C needs) and it saves
> instruction space. I provided ONLY the above two (2) in M88K and
> later when doing x86-64 found out the error of my ways.
> <
> The least one can add is [Rbase+Rindex<<scale]

There is an implicit scale applied to the index, but in my case it is
element-sized in the 32-bit encoding.

In BJX2, there is a [Rbase + Rindex<<Sc + Disp] mode, but:
It is optional;
It requires an Op64 encoding;
My compiler isn't smart enough to utilize it effectively.

It is a little cheaper to support:
[Rbase + Rindex<<Sc]
Encoded as:
[Rbase + Rindex<<Sc + 0]

This doesn't require any additional machinery, but I then need to
distinguish the two cases (both use the same encoding; in the latter
case the displacement is zero). It is partly a matter of definition,
but it also has a minor impact on how I define the CPUID feature flags
and similar.
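
As a concrete (and hypothetical) illustration of where the
scaled-index-plus-displacement mode pays off, consider accessing a
field of an array-of-structs element; the names here are made up:

  #include <stdint.h>

  typedef struct { int32_t key; int32_t val; } Pair;  /* 8-byte elements */

  int32_t get_val(Pair *p, long i) {
      /* p[i].val is a single load of the form [Rbase + Rindex<<3 + 4]
         when the mode exists; otherwise the shift/add (or a
         strength-reduced pointer) has to be materialized first. */
      return p[i].val;
  }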

>>
>> More elaborate modes quickly run into diminishing returns territory; so,
>> things like auto-increment and similar are better off left out IMO.
> <
> Agreed except for [Rbase+Rindex<<scale+disp]
> <

Granted, it is in a gray area.

Though, lacking this case is not nearly as adverse as lacking
(Reg,Index) or (Reg,Index*Sc) ...

> <snip again>
>> Ironically though, the 1-wide core isn't *that* much cheaper than the
>> 3-wide core, despite being significantly less capable.
>>
>> I could maybe make the core a bit cheaper by leaving out RV64I support,
>> which avoids a few fairly "costly" mechanisms (Compare-And-Branch;
>> Arbitrary register as link-register; ...). Or, conversely, a core which
>> *only* does RV64I.
> <
> When Compare and Branch is correctly specified it can be performed
> in 1 pipeline cycle. When specified without looking "at the gates",
> it seldom is!

In the present form, the determination is made in EX1, and the branch is
initiated in EX2 (invalidating whatever is in EX1 at the time). Pushing
it later would likely require using an interlock or similar (so as to
not allow any instructions into the pipeline until the branch direction
can be determined).

The RISC-V style compare-and-branch instructions at present come with a
fairly steep cost in terms of LUTs and timing, since the compare result
has to be produced early enough to feed the branch in the same cycle,
rather than one cycle later.

>>
>> Still won't save much for the amount of LUTs eaten by the L1D$ and TLB
>> though (just these two are around 1/3 of the total LUT cost of the
>> 1-wide core).
>>
> <snip>
>>>
>>> And with instruction fusion the end result of the decoding step
>>> becomes even more complex.
>>>
>> Instruction fusion seems like a needless complexity IMO. It would be
>> better if the ISA can be "dense" enough to make the fusion mostly
>> unnecessary. Though, "where to draw the line" is probably a subject of
>> debate.
> <
> Instruction fusing is worthwhile on a 1-wide machine with certain pipeline
> organization to execute 2 instructions per cycle and a few other tricks.
> <
> In my 1-wide My 66130 core, I can CoIssue several instruction pairs:
> a) any instruction and an unconditional branch
> b) any result delivering instruction and a conditional branch or predicate
> consuming the result.
> c) Store instruction followed by a calculation instruction such that the
> total number of register reads is <= 3.
> d) any non-branch instruction followed by an unconditional branch
> instruction allow the branch to be taken at execution cost of ZERO
> cycles.
> <
> An Excel spreadsheet shows up to 30% performance advantage over a
> strict 1-wide machine. {This is within spitting distance of the gain of
> In Order 2-wide over In Order 1-wide with less than 20% of the cost}

OK. My existing core does not support (or perform) any sort of
instruction fusion.

Things like instruction fetch / PC-step are determined in the 'IF'
stage, but this is done in terms of instruction bits and mode flags,
rather than recognizing any instructions in particular.

It seems like one could fetch a block of instructions and then include
an "offset" based on how many instructions were executed from the prior
cycle; however this seems more complicated than the current approach,
and for a machine with 96 bit bundles would likely require fetching in
terms of 256 bit blocks rather than the 96 bits corresponding to the
current bundle (since one likely wouldn't know the exact position of the
bundle, or how far to adjust PC, until ID1 or similar).

>>
>> Granted, it could probably be justified if one has already paid the
>> complexity cost needed to support superscalar (since it should be able
>> to use the same mechanism to deal with both).
> <
> I justified it not because of SuperScalar infrastructure (which I do not have)
> But because My 66000 ISA <practically> requires an instruction fetch buffer;
> So, the vast majority of the time the instructions are present and simply
> waiting for the pipeline to get around to them.

OK. I am still using an approach more like:
Send PC to the I$ during PF;
Get a 96 bit bundle and PC_step during IF;
Advance PC and feed back into I$ (PF for next cycle).

This works, but requires things to be known exactly during IF, and does
not deal well with any uncertainty here.

Similarly, trying to stick either fusion or superscalar logic into the
IF stage seems problematic (cost, and if there are any false positives,
the pipeline state is basically hosed).

>>
>> I have come to the realization though that, because my cores lack many
>> of the tricks that fancier RISC-V cores use, my cores running in RISC-V
>> mode will at-best have somewhat worse performance than something like SweRV.
> <
> Yes, not being like the others has certain disadvantages, too.

Yeah. I can make the BJX2 core run RISC-V, but not "well". To do better
would require doing a core specifically for running RISC-V, but then
there is little purpose to do so apart from interop with my existing
cores on a common bus.

My recent "minicore" also aims to support RISC-V, but is at least
"slightly less absurd" in that it is closer to the common featureset of
BJX2 and RISC-V (it is just sort of a crappy subset of both ISAs). Its
performance isn't likely to be all that impressive though in either case.

>>
>>
>>
>> If I were designing a "similar" ISA to RISC-V (with no status flags), I
> <
> Before you head in this direction, I find RISC-V ISA rather dreadful and
> only barely serviceable.
> <

Possibly, there are things I am not a fan of.

Main reasons I am bothering with it at all:
It is much more popular;
It is supported by GCC;
It (sorta) maps onto my existing pipeline...
Or at least RV64I and RV64IC.
Pretty much everything beyond this is not really 1:1.

However, there are issues:
  Pure RISC-V code isn't really able to operate the BJX2 core;
BGBCC doesn't yet have a RISC-V target (or ELF object files);
...

At present, this basically limits cross-ISA interaction to ASM blobs.

I have yet to work out specifics for how I will do cross ISA interfacing
to a sufficient degree to actually load up RISC-V binaries on the BJX2
core (it can already be done in theory, but they can't do a whole lot as
of yet).

The harder case would be bare-metal, which is basically what I would
need to have any real hope of a Linux port or similar; but this would
require the ability to link object files between compilers and across
ISA boundaries.

The simpler route would just sorta be to load RISC-V ELF binaries
into TestKern.

Trying to port the Linux kernel to build on BGBCC is likely a no-go.

Though, even if I did so, most existing Linux on RISC-V ports assume
RV64G or RV64GC, which is a bit out-of-reach at present.

>> would probably leave out full Compare-and-Branch instructions, and
>> instead have a few "simpler" conditional branches, say:
>> BEQZ reg, label //Branch if reg==0
>> BNEZ reg, label //Branch if reg!=0
>> BGEZ reg, label //Branch if reg>=0
>> BLTZ reg, label //Branch if reg< 0
>>
>> While conceptually, this doesn't save much, it would be cheaper to
>> implement in hardware.
> <
> Having done both, I can warn you that your assumption is filled with badly
> formed misconceptions. From a purity standpoint you do have a point;
> from a gate count perspective and a cycle time perspective you do not.
> <

It would on average cost more clock cycles, but it seems like:
Detect 0 on input (already have this to detect Inf);
Look at sign bit;
Initiate branch signal.

Should be cheaper (in terms of latency) than:
Subtract values;
Use carry, zero, and sign bits of result to determine branch;
Initiate branch signal.

Granted, the actual branch mechanism doesn't initiate until EX2 in this
case, which is probably the only reason it works at all. This basically
overrides the next cycle's PC with the branch target generated in EX1,
and then initiates a flush of the rest of the pipeline stages (unless
the branch predictor already handled it).

Granted, doing this would break compatibility with RISC-V, defeating the
whole point. And, otherwise, I tend to leave these out in favor of a
2-op sequence using the SR.T bit.

>> Relative compares could then use compare
>> instructions:
>> CMPx Rs, Rt, Rd
>> Where:
>> (Rs==Rt) => 0;
>> (Rs> Rt) => 1;
>> (Rs< Rt) => -1.
>>
>> Though, one issue with a plain SUB is that it would not work correctly
>> for comparing integer values the same width as the machine registers (if
>> the difference is too large, the intermediate value will overflow).
> <
> Which is why one needs CMP instructions and not to rely on SUB to do 98%
> of the work.
> <

Granted. Traditionally many would use SUB here, but as noted SUB only
works if:
The values are smaller than the register width;
The values are properly sign or zero extended.
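
A minimal C illustration of the failure mode (the wraparound case is
the whole point; the helper names are made up):

  #include <stdint.h>

  /* Naive "compare via subtract": test the sign of a - b.
     Wrong for full-width operands, since the subtraction can wrap. */
  int less_via_sub(int64_t a, int64_t b) {
      return (int64_t)((uint64_t)a - (uint64_t)b) < 0;
  }

  /* Correct compare, as a CMP instruction would compute it. */
  int less_cmp(int64_t a, int64_t b) { return a < b; }

  /* With a = INT64_MIN and b = 1, a - b wraps around to INT64_MAX
     (positive), so less_via_sub answers "not less", while a < b
     is clearly true. */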

>>> OTOH, a high-performance A64 implementation probably wants some kind
>>> of register port allocator (possibly with port requirements across
>>> several cycles), so it has its own source of complexity. I wonder if
>>> that's the reason why some high-performance ARMs have microcode
>>> caches; I would normally have thought that microcode caches are
>>> unnecesary for a fixed-length instruction format.
>>>
>> I wouldn't think this necessary, if I were implementing it I could do it
>> more like how I deal with 128-bit SIMD and similar in my existing core,
>> and map a single instruction to multiple lanes when needed.
>>> What do people more familiar with hardware design think?
>>>
>>> A classic question is the classification of the architecture: Celio
>>> claims that ARM is CISC (using an A32 Load-multiple instruction as
>>> example). These questions are not decided by touchy-feely arguments,
>>> but rather by instruction set properties. All RISCs are load/store
>>> general-purpose instruction sets; even the A32 load/store-multiple
>>> instructions can be considered a variant of that; one might argue that
>>> accessing two pages in one instruction is not RISC, but by that
>>> criterion RISC-V is not RISC (it allows page-crossing unaligned
>>> accesses). One other common trait in RISCs is fixed-length 32-bit
>>> instructions (A32, MIPS, SPARC, 29K, 88K, Power, Alpha), but there are
>>> exceptions: IBM ROMP, ARM Thumb, (IIRC) MIPS-X, and now RISC-V with
>>> the C extension.
>>>
>> IMO, Load/Store is the big issue...
>>
>> Load/Store allows decent performance from a simplistic pipeline.
> <
> Ahem: 4-wide machines do not have simplistic pipelines, so the burden is
> only on the lesser implementations and nothing on the higher performance
> ones.

I am thinking mostly of lower-end implementations.

In theory, one could split the complex addressing modes into multiple
uOps or similar, and then execute them end-to-end. But, otherwise, one
has a problem.

>>
>> As soon as one deviates from Load/Store, one has effectively thrown a
>> hand grenade into the mix (one is doomed to either needing a much more
>> complex decoder, OoO, or paying a cost in terms of a significant
>> increase in clock-cycle counts per instruction).
>>
> There is the multi-fire reservation station trick used on Athlon and Opteron,
> which pretty much makes the problem vanish.

OK.

>>
>> In comparison, variable-length instructions and misaligned memory access
>> are not nearly as destructive. Granted, they are not free either. I
>> suspect the L1 D$ in my case would be somewhat cheaper if it did not
>> need to deal with misaligned access.
> <
> I would assert that they are better than FREE. They add more performance
> than the added gates to do these things cost.

Granted.

I will note that BJX2 has both variable-length instructions and
misaligned load/store. They are useful at least, but have some LUT cost.

Having binaries be 40-60% smaller, and the ability to get a significant
speedup from things like LZ77 and Huffman/Rice decoders, is a worthwhile
tradeoff.

>>
>> However, besides "cheap core", it is also nice to be able to have "fast
>> LZ77 decoding" and similar, which is an area where misaligned memory
>> access pays off.
> <
> ¿dynamic bitfields?

Yes, it is also useful for things like efficient Huffman and Rice
decoding, but I didn't mention this.
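
A sketch of why misaligned access helps here: a bit reader can grab a
full 64-bit window at an arbitrary byte offset with a single load,
rather than assembling it from two aligned loads. This assumes a
little-endian target with misaligned loads, and uses memcpy as the
portable spelling of the misaligned load:

  #include <stdint.h>
  #include <string.h>

  /* Peek up to 57 bits starting at an arbitrary bit position.
     One (possibly misaligned) 64-bit load replaces the
     shift-and-merge of two aligned loads. */
  static inline uint64_t peek_bits(const uint8_t *buf,
                                   uint64_t bitpos, int n) {
      uint64_t w;
      memcpy(&w, buf + (bitpos >> 3), 8);   /* misaligned 64-bit load */
      return (w >> (bitpos & 7)) & ((1ull << n) - 1);
  }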

>>
>> Page-crossing doesn't seem to be too big of an issue, since it is rare,
>> and can be handled with two TLB misses in a row if needed (only the
>> first TLB miss gets served; when the interrupt returns and the
>> instruction tries again, it causes another TLB miss).
> <
> It is only time, and 98%-99% of the time these crossings don't happen.
> {98% with 4KB pages, 99% with 8KB pages}

I am mostly using 16K pages as my testing implied this to be the local
optimum.

Smaller, and TLB miss rate increases significantly.
Larger, and memory overhead increases without much reduction in miss rate.

> <
>>> An interesting point in the talk is that zero-extension is a common
>>> idiom in RISC-V; the talk does not explain this completely, but
>>> apparently the 32-bit variants of instructions sign-extend (like Alpha
>>> does, while AMD64 and A64 zero-extend), and the ABI passes 32-bit
>>> values around in sign-extended form (including 32-bit unsigned
>>> values).
>>>
>> It is a tradeoff.
>>
>> In BJX2, I went with signed values being kept in sign-extended form, and
>> unsigned values kept in zero-extended form (and casts requiring explicit
>> sign or zero extension of the results).
> <
> The tradeoff is even harder where the linker fills in the "upper" bits of a
> register. If the underlying premise is sign extension, one COULD need 2
> registers to hold the "same value" for 2 different large address constants.
> <
> If the underlying premise is zero extension, certain bit pasting ways using
> + need to be changed to use |.
> <
> Another reason not to make SW, of any sort, have to paste bits together to
> make <larger> constants.

OK.

As noted though, bit pasting isn't needed with Jumbo encodings.
The encoding is kinda confetti, but hardware mostly deals with this.

It does sorta lead to an annoyance for a compiler, though, in that one
needs a bunch of different reloc types to deal with all this (more so
when the same reloc-fixup code is dealing with effectively 4 different
ISAs' worth of reloc types, all of which use bit-confetti encodings).
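
As a concrete example of this sort of bit-confetti fixup, here is
roughly what a reloc handler does for RISC-V's (public) B-type branch
encoding; the BJX2 reloc formats aren't shown:

  #include <stdint.h>

  /* Scatter a branch displacement into a RISC-V B-type instruction:
       imm[12]   -> inst[31]
       imm[10:5] -> inst[30:25]
       imm[4:1]  -> inst[11:8]
       imm[11]   -> inst[7]
     A reloc handler ends up with one such routine per immediate layout. */
  static uint32_t patch_btype(uint32_t inst, int32_t disp) {
      uint32_t imm = (uint32_t)disp;
      inst &= ~0xFE000F80u;              /* clear the immediate fields */
      inst |= ((imm >> 12) & 0x01u) << 31;
      inst |= ((imm >>  5) & 0x3Fu) << 25;
      inst |= ((imm >>  1) & 0x0Fu) <<  8;
      inst |= ((imm >> 11) & 0x01u) <<  7;
      return inst;
  }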

>>
>> This is a general rule even if there are only a minority of cases where
>> it should matter.
> <
> More than you allude.

Possibly.

In general, code needs to keep values in the correct form because there
are cases where it matters:
Load/Store ops ended up using 33 bit displacements;
Some operations always operate on full 64-bit inputs;
...

Going the x86-64 / A64 route effectively requires doubling up nearly
every operation with both 32-bit and 64-bit forms; rather than leaving
most of the ISA as 64-bit except in cases where 64b is too costly and/or
32-bit semantics are required (such as to make C code work as expected).

One thing I have observed is that one can get wonky results from C code
if 'int' values are allowed to go outside of the expected range, more so
when load/store gets involved (the compiler can save and reload a value
and then have the value differ).

So, I ended up with operations like "ADDS.L" and "ADDU.L", whose sole
purpose is basically to make ADD and similar produce results that wrap
in the expected ways on integer overflow.
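
In C terms, the behavior these ops pin down is roughly the following
(a sketch; the function names are made up, with ADDS.L/ADDU.L-like
semantics):

  #include <stdint.h>

  /* Registers are 64-bit, but 'int' math must wrap at 32 bits and
     stay in canonical sign-extended form, or a saved/reloaded value
     can disagree with the live one. */
  static int64_t adds_l(int64_t a, int64_t b) {      /* like ADDS.L */
      return (int64_t)(int32_t)((uint32_t)a + (uint32_t)b);
  }
  static int64_t addu_l(int64_t a, int64_t b) {      /* like ADDU.L */
      return (uint32_t)((uint32_t)a + (uint32_t)b);
  }

  /* e.g. adds_l(0x7FFFFFFF, 1) == (int64_t)INT32_MIN, which is what
     a 32-bit 'int' overflow is expected to look like. */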

....
