Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Non-Stalling Pipeline, 96-bit XMOV, thoughts and direction
Date: Tue, 26 Oct 2021 18:30:54 -0500
Organization: A noiseless patient Spider
Lines: 256
Message-ID: <sla33k$o2m$1@dont-email.me>
References: <sl8a9j$umn$1@dont-email.me>
<d1693b61-67eb-4a07-baa2-72fe2036ad39n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 26 Oct 2021 23:31:00 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="26a6f06ce18941483ebd305629f0d99d";
logging-data="24662"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/uZyjYKsPmJchj7mwGt/i6"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.2.1
Cancel-Lock: sha1:z8X5Vn0J0tiiVNe6tudMFfP4c48=
In-Reply-To: <d1693b61-67eb-4a07-baa2-72fe2036ad39n@googlegroups.com>
Content-Language: en-US

On 10/26/2021 9:27 AM, MitchAlsup wrote:
> On Tuesday, October 26, 2021 at 2:21:26 AM UTC-5, BGB wrote:
>> Thus far, the pipeline in BJX2 has worked in a certain way:
>> L1 D$ misses, the whole pipeline stalls;
>> L1 I$ misses, the whole pipeline stalls.
>>
>> However, I recently realized:
>> L1 D$ can potentially be made to instead store-back the result
>> asynchronously, with an interlock being used in place of a stall
>> (instructions/bundles will only move into the EX stages once their
>> dependencies are available);
>> L1 I$ can also be made to inject NOPs and set PcStep to 0 rather than stall.
>>
>> The latter of these mechanisms had recently been added as a special case
>> to deal with a bug related to TLB miss handling with the L1 I$. Partly,
>> this was because of a TLB miss bug where the pipeline could possibly end
>> up executing "stale garbage". The pipeline needed to be able to move
>> forwards (so that it could initiate the ISR) while preferably not being
>> able to execute garbage instructions.
>>
>> This case has experimentally been extended to deal with general L1 I$
>> misses as well, which can also make timing easier (*). In this case, the
>> logic for handling an L1 I$ miss will no longer be driving the whole rest
>> of the pipeline.
>>
>> *: In Vivado, timing does not improve much, but the LUT cost went down
>> and the power-usage estimate also drops, which seems like an improvement.
>>
>>
>> It also seems to slightly improve performance in cases where branches
>> trigger an L1 miss, since the two delays are no longer additive (the L1
>> miss-handling part of the branch and the pipeline-flushing part can now
>> happen in parallel with each other).
>>
>> Apart from the pipeline now executing NOPs on an I$ miss, there isn't a
>> huge functional difference. Note that PC1 <= PC+Step, so if Step==0,
>> then PC1 <= PC.
>>
>>
>> Doing similar for the L1 D$ is also a possibility, and could potentially
>> reduce performance cost in the case of L1 D$ misses, since a miss would
>> no longer need to stall the execution of bundles which don't immediately
>> depend on the result.
>>
>>
>> It doesn't seem like too far of a stretch to imagine a pipeline where
>> much of the logic is effectively non-stalling, and instead uses
>> buffering and interlocks to move data between stages.
>>
>> Though, for the time being, some level of stalling may be unavoidable.
>>
>> But, say:
>> (PF IF), Self-Stall (Spews NOPs on L1 Miss)
>> (ID1 ID2), Interlock Stall (Triggered by ID2)
>> (EX1 EX2 EX3 WB), No Stall
>>
>> It is likely that in this case, Load and Store operations would need to
>> be either partially (or entirely) moved into a sort of FIFO queue structure.
>>
>>
>> Though, even with a queue, there may still need to be a stall if the
>> queue is finite size and "gets full" (say, something like memset comes
>> along and does stores at a faster rate than they can be handled).
>>
>> Another possibility that comes up with this is whether loads and stores
>> could be semi-independent, and it would be possible to handle a hit on a
>> load while at the same time waiting for a cache miss on a store (or vice
>> versa), though possibly with a "clash heuristic" (not allowing a load to
>> proceed if there is a pending store to the same cache line).
>>
>> It seems like if something like this could reduce the number of cycles
>> spent on dealing with L1 misses, it could be worthwhile.
>>
>>
>> Though, some recent compiler optimizations / bug-fixes do appear to have
>> somewhat reduced the "cycles spent on cache miss" heuristic. One of
>> these bugs was resulting in pretty much every "obj->obj2" operation,
>> where obj2 was a structure type, doing an extra unnecessary address
>> calculation and memory store (mostly due to old/stale logic in the
>> compiler).
>>
>> Not sure how many compiler bugs of this sort remain, though I can at
>> least say that the generated machine code does seem to have a visible
>> reduction in "entirely pointless" operations (though, in some cases,
>> reducing the "visual fluff" from one set of pointless operations makes
>> another set of pointless operations more evident).
>>
>> The relative number of cycles spent on inter-instruction interlocks
>> seems to be going up as well. For this case, I would likely need to
>> figure out a good way to get the compiler to load values into registers
>> several cycles before they are needed in an expression.
>>
>> There still seem to be some unresolved bugs which lead to crashes in
>> some cases when using virtual memory on the CPU core.
>>
>>
>>
>> Recently, have also gone and implemented the 96-bit XMOV extensions, at
>> present:
>> When used, the 96-bit addressing mode reduces TLB associativity by half;
>> It was either this or double the Block-RAM cost of the TLB.
>> Some trickery was used to limit the resource cost impact on the L1 caches;
>> Several registers gain "high-register" halves (PC, GBR, VBR, ...).
>> PCH:PC, GBH:GBR, VBH:VBR, ...
>>
>>
>> The impact on resource cost and timing was a bit smaller than originally
>> expected. I was originally expecting it to be "very expensive", but this
>> isn't really the case.
>>
>>
>> Some specifics are still being fine tuned.
>> For example:
>> TBR was going to be extended, but I decided against this;
>> SP and LR were extended, but I have also dropped this.
>> Extending these would add unnecessary complexity to the C ABI;
>> The added complexity was also not good for timing.
>> SP is relative to GBH;
>> LR is relative to PCH.
>>
>>
>>
>> At present:
>> Load/Store addressing with a single register is handled as GBH:Rn;
>> Branches are handled as PCH:PC or PCH:Rn;
>> Any narrow jumps are local;
>> XMOV, Load/Store ops can use a register pair as an address;
>> JMPX, Can be used to do a long jump to a register pair.
>>
>> So, in terms of ISA mechanics, it functions sort of like 8086 style
>> segmented addressing, just with the results appended into a 96-bit
>> linear address and fed through the TLB.
>>
>> It is possible that the TLB could be used to mimic a segmented
>> addressing mode.
>>
>> Similarly, the extended address space may also be treated like multiple
>> smaller 32 or 48 bit spaces.
>>
>> I have generally been leaning towards calling this addressing scheme
>> "quadrant" addressing, for various reasons. I feel it is "sufficiently
>> different" from traditional segmented addressing schemes to likely
>> justify its own term (and its mechanism is very different from segmented
>> addressing as understood on the 286 or 386).
>>
>>
>> Pointers are still generally kept as 64-bits with a 48-bit address by
>> default.
>>
>> However, '__far' or '__huge' may be used to get the larger pointers.
>> Though, the '__near', '__far', and '__huge' keywords will depend on the
>> current pointer size:
>> sizeof(void *)==4:
>>   __near: 32-bit
>>   __far:  64-bit
>>   __huge: 128-bit
>> sizeof(void *)==8:
>>   __near: 64-bit
>>   __far:  128-bit
>>   __huge: 128-bit
>>
>> But, still working out some of the details here...
>>
>>
>> Any thoughts?...
> <
> Skid buffers (how one makes a stalling pipeline into a non-stalling pipeline)
> consume (create) lots of flip-flops (area), but do not consume much power.
> <
> Used properly, a skid buffer on the D$ pipe could allow for hit under miss
> cache access.
>

Yeah, still need to look more into it.

If I could eliminate the stalling from the L1 D$, it could potentially
help some with both performance and timing.

I still consider the non-stalling L1 I$ to be experimental:
It appears to change behaviors in some cases (*1);
If the I$ deadlocks, the pipeline's deadlock-detect doesn't trigger.

*1: Annoyingly, it seems I still don't have things entirely stable with
virtual memory (the remaining bugs seem fairly esoteric and difficult
to isolate, generally take multiple hours of simulation time to
recreate, and poking at nearly anything in the Verilog causes them to
move around; possibly a timing issue or race condition somewhere).
Though, virtual memory seems to be working "for the most part".

....

Still not sure though why increasing the ring-bus address width, TLB
width, L1 cache address width, ... was fairly cheap, but...

Adding more paths from which the high-order address bits can come, or
supporting an expanded link register, is fairly bad.

Say, for XMOV:
  Op is Wide Address: High-bits come from Other Register.
  Default: High-bits come from GBH.
Is OK, but:
  Op is Wide Address: High-bits come from Other Register.
  Base is PC: High-bits come from PCH.
  Base is TBR: High-bits come from TBH.
  Base is SP: High-bits come from SPH.
  Default: High-bits come from GBH.
Is much worse.
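
Roughly, as a C-style model (illustrative only, not the actual Verilog;
the function and argument names are made up for the example), the first
case is a single 2:1 select on the high bits, while the second grows into
a wider priority mux sitting in front of the address-generation path:

#include <stdint.h>

/* Case 1: high bits come either from the "other register" of the pair
   (wide-address ops) or default to GBH.  A single 2:1 select. */
uint64_t addr_hi_simple(int is_wide, uint64_t other_hi, uint64_t gbh)
{
    return is_wide ? other_hi : gbh;
}

/* Case 2: several base registers each get their own high half, so the
   select becomes a priority mux with more legs feeding the address path. */
uint64_t addr_hi_expanded(int is_wide, int base_is_pc, int base_is_tbr,
                          int base_is_sp, uint64_t other_hi, uint64_t pch,
                          uint64_t tbh, uint64_t sph, uint64_t gbh)
{
    if (is_wide)     return other_hi;
    if (base_is_pc)  return pch;
    if (base_is_tbr) return tbh;
    if (base_is_sp)  return sph;
    return gbh;
}

Each extra leg is another input muxed into the address calculation, which
is presumably where the timing and LUT cost hit comes from.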

Likewise, being able to capture and restore the high-order bits of the
Link Register also basically ruined timing.

Though, this could be an adverse interaction with the RISC-V mode, which
ends up somewhat increasing the logic for the "save captured
Link-Register state" case (since RISC-V mode allows directing the
captured PC to an arbitrary register; vs just to LR).

This does favor a "simpler mode" where generally everything is assumed
to be within a single quadrant by default (with the main exception being
to have GBH/PCH and similar separate, mostly for the sake of things like
system-calls and interrupt handling).

For related funkiness, it is likely that things like kernel-related
address ranges would need to be mapped into multiple quadrants (as
opposed to being able to treat them as part of a single 96-bit linear
address space in this sense).

Though, it is possible that a sub-mode could allow full 64-bit linear
addressing (64k quadrants, or "64 kiloquad" ...). Such a mode could more
directly mimic something like x86-64 or ARM64 addressing modes.

Eg, essentially:
  VA = (Quadrant[47:0]<<48) +
       ZExt(Base[63:0]) +
       (SExt(Index[33:0])<<Sc);

Though, timing is pretty tight on this mode...

It also would not work for executable code (as-is), nor would it be
compatible with my existing runtime library, ...
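
As a sanity check on the formula above, a minimal C model (using the
GCC/Clang 128-bit integer extension just to hold the 96-bit result; the
function and parameter names are only for illustration, not from the ISA
spec):

#include <stdint.h>

typedef unsigned __int128 u128;   /* wide enough to hold a 96-bit VA */

/* Hypothetical model of the sub-mode address calculation, not the RTL. */
u128 linear_mode_va(uint64_t quadrant, uint64_t base, uint64_t index, int sc)
{
    /* SExt(Index[33:0]): sign-extend bit 33 up to 64 bits. */
    int64_t ix = ((int64_t)(index << 30)) >> 30;

    u128 va = ((u128)(quadrant & 0xFFFFFFFFFFFFull) << 48)  /* Quadrant<<48     */
            + (u128)base                                     /* ZExt(Base[63:0]) */
            + ((u128)(__int128)ix << sc);                    /* scaled index     */

    return va & (((u128)1 << 96) - 1);                       /* keep 96 bits     */
}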

I have yet to decide on the page-table scheme though; it is either some
sort of split-level scheme, or just adding a lot more page-table levels.

But, dealing with these huge address spaces via a 9-level page table
(with 16K pages) or similar seems a bit absurd.
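
For a rough sense of where a level count like that comes from (with
hypothetical numbers: 16-byte PTEs so a 96-bit address plus flags fits,
meaning each 16K table page indexes 1024 entries, i.e. 10 bits per level):

#include <stdio.h>

/* levels = ceil((va_bits - page_offset_bits) / bits_per_level) */
static int levels(int va_bits, int page_bits, int bits_per_level)
{
    return (va_bits - page_bits + bits_per_level - 1) / bits_per_level;
}

int main(void)
{
    printf("96-bit VA, 16K pages, 10 bits/level: %d levels\n",
           levels(96, 14, 10));   /* 9 */
    printf("48-bit VA, 16K pages, 11 bits/level: %d levels\n",
           levels(48, 14, 11));   /* 4 */
    return 0;
}

Hence the appeal of some sort of split scheme over a straight radix walk
that deep.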

There is also now a flag which controls whether user-mode code is
allowed to use XMOV instructions. This could allow for restricting a
program to a 32-bit or 48-bit userland within a larger 96-bit address space.
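
Conceptually (hypothetical flag and bit names here, not the actual
control-register layout), the check amounts to something like:

#include <stdint.h>

#define SR_USER_MODE    (1u << 0)   /* hypothetical status-register bits */
#define SR_XMOV_ENABLE  (1u << 1)

/* Model of the check: an XMOV-class op in user mode with the enable flag
   clear would fault, keeping that program inside its smaller quadrant. */
int xmov_permitted(uint32_t sr, int op_is_xmov)
{
    if (!op_is_xmov)
        return 1;
    if ((sr & SR_USER_MODE) && !(sr & SR_XMOV_ENABLE))
        return 0;   /* e.g., raise an illegal-instruction style fault */
    return 1;
}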
