From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Branch prediction hints
Date: Mon, 24 May 2021 15:41:21 -0500
Organization: A noiseless patient Spider
Lines: 365
Message-ID: <s8h31l$vv$1@dont-email.me>
References: <s8c0j2$q5d$1@newsreader4.netcologne.de>
<s8cmv1$1e7$1@dont-email.me> <s8csfm$172$1@dont-email.me>
<s8d40h$73f$1@dont-email.me> <s8dakc$27r$1@dont-email.me>
<s8e60c$ca6$1@dont-email.me> <s8eabb$ig$1@dont-email.me>
<13fa2553-eaf2-43df-a87a-3559a45d88a0n@googlegroups.com>
<s8ev0d$t2t$1@dont-email.me>
<23b797e3-2809-4cd4-a5b4-2085a35f98cen@googlegroups.com>
<s8f9bo$dup$1@dont-email.me>
<1430abb7-231c-4f12-985a-44b623c2fcafn@googlegroups.com>
In-Reply-To: <1430abb7-231c-4f12-985a-44b623c2fcafn@googlegroups.com>
Content-Language: en-US

On 5/24/2021 12:52 PM, MitchAlsup wrote:
> On Sunday, May 23, 2021 at 11:16:58 PM UTC-5, BGB wrote:
>> On 5/23/2021 8:32 PM, MitchAlsup wrote:
>>> On Sunday, May 23, 2021 at 8:20:16 PM UTC-5, BGB wrote:
>>>> On 5/23/2021 3:29 PM, MitchAlsup wrote:
>>>>> On Sunday, May 23, 2021 at 2:27:41 PM UTC-5, Ivan Godard wrote:
>>>>>> On 5/23/2021 11:13 AM, BGB wrote:
>>>>>>> On 5/23/2021 5:26 AM, Ivan Godard wrote:
>>>>>>>> On 5/23/2021 1:33 AM, BGB wrote:
>>>>>>>
>>>>>>> Other ops would have 3 registers (18 bits), or 2 registers (12 bits).
>>>>>>> Compare ops could have a 2-bit predicate-destination field.
>>>>>>>
>>>>>>> It is possible that 01:00 (Never Execute) could be used to encode a
>>>>>>> Jumbo Prefix or similar (or, maybe a few unconditional large-immed
>>>>>>> instructions or similar).
>>>>> <
>>>>>> Having predication for most ops is just entropy clutter and a waste of
>>>>>> power: it costs more to *not* do an ADD than to do it, so always do it,
>>>>>> and junk the predicates. You need predicates for ops that might do a
>>>>>> hard throw, or that change persistent state like store and control flow;
>>>>>> nowhere else.
>>>>> <
>>>>> If the standard word size was 36-bits I would disagree, here, but since it is
>>>>> 32-bits, I have to agree.
>>>>>
>>>> I did start writing up some ideas for an ISA spec (as an idea, I called
>>>> it BSR4W-A).
>>>>
>>>> Relative to BJX2, it would gain 3 encoding bits due to not having any
>>>> 16-bit ops, but then lose 5 encoding bits (due to 6-bit register IDs and
>>>> a 2-bit predicate-register field), meaning a net loss of 2 bits.
>>>>
>>>>
>>>> Compared to BJX2, this would mean somewhat less usable encoding space
>>>> for opcodes, meaning it is likely either:
>>>> I have fewer, or smaller, ops with immediate and displacement fields;
>>>> Parts of the core ISA would need to be encoded using jumbo-encodings.
>>>>
>>>>
>>>> Ideally I would like to keep the Disp9+Jumbo24 -> Disp33s pattern for
>>>> Loads/Stores, so I would likely need to "squeeze things" somewhere
>>>> else to make room.
>>>>
>>>> My current estimate is that if I populated the encoding space, it would
>>>> start out basically "already full".
>>> <
>>> Would I be out of line to state that this sounds like a poor starting point?
>> Probably.
>>
>> It is more akin to designing for a 16-bit ISA, where it doesn't take
>> much to eat through pretty much all of it.

Clarification:
I had meant, "it probably was a poor starting point" rather than "it was
probably out of line"...

>>> <
>>> My 66000 has 1/3rd of its Major OpCode space unallocated,
>>> a bit less than 1/2 of its memory reference OpCode Space allocated,
>>> a bit less than 1/2 of its 2-operand OpCode Space allocated,
>>> a bit less than 1/128 of its 1-operand OpCode Space allocated,
>>> and 1/4 of its 3-operand OpCode Space unallocated.
>> Started looking at it a little more, and realized that encoding space
>> may be a more serious problem than I had thought initially...
>>
>>
>> I can't really map BJX2 to this new space, it just doesn't fit...
>>
>>
>> Then again, maybe it might win more points with the "RISC means small
>> ISA listing" crowd... Because one runs out of encoding bits before they
>> can fit all that much into it...
>>
>>
>> "Well, Imma define some Disp9 Load/Store Ops...",
>> "Oh-Noes, that was 1/4 of the encoding space!",
>> "How about some 3R Load/Store ops and 3R ALU ops and 2R space",
>> "Now it's at 1/2 of the opcode space!"
> <
> To be fair, I made a lot of these mistakes in Mc 88K, and corrected the
> vast majority of them in My 66000.
>>
>> Then one has to struggle to fit some useful 3RI ALU ops, 2RI ops, and
>> Branch ops, before realizing they are already basically out of encoding
>> space...
> <
> The important thing to remember is that the most precious resource is
> the Major OpCode space--and the reason is that this gives you access
> to the other spaces.
> <
> In My 66000, the Major OpCode space consists of all 16-bit immediates,
> the branches with IP-relative offsets, and the extension OpCodes, of
> which there are 6 {Predication, Shifts, 2R+Disp memory refs, 2-Operand,
> 3-Operand, and 1-Operand}.
> <
> For all of the extended instructions, My 66000 has 3-bits to control the
> signs of the operands and access to long immediates, and access to
> 5-bit immediates in Src1. This supports things like 1<<k in a single instruction.
> <
> The second most important resource is the 3-operand space because
> there are only 8 available entries and we need FMAC (single and double),
> CMOV, and INSert.
> <
> The other spaces are so sparsely populated that one has a pretty free
> rein.

OK.

In my initial layout, I was starting from a 6-bit major space, with a
4-bit minor space for 3R ops, and an additional 6 bits for 2R ops.

Doing a Disp9 or Imm9 op would only have the 6-bit major opcode, which
doesn't really go all that far.

Meanwhile, in this top-level space, in BJX2 it was 3+4+1 (8) bits, with
3R ops adding 4-bits, and 2R ops adding 8 bits.

The new design was seriously choked in the top-level space, but could
have more space for 2R ops.

The result would be to mostly drop Imm9 and use Imm6 instead, but:
There is a lot more that doesn't fit into 6 bits that would have fit into 9;
It basically precludes being able to encode an arbitrary 32 or 33 bit
value in a 64-bit pair (since the Jumbo prefix also needs its chunk of
encoding space, and a 27-bit jumbo prefix is basically no-go).
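For concreteness, the encoding-bit arithmetic can be written out (the counts come straight from the numbers above; nothing here is a final field layout):

```python
# Back-of-the-envelope encoding arithmetic (illustrative only).

# Relative to BJX2: dropping 16-bit ops frees ~3 bits of encoding space,
# but 6-bit register IDs (1 extra bit x 3 registers) plus a 2-bit
# predicate field cost 5, for a net loss of 2 bits:
net_bits = 3 - (3 * 1 + 2)              # -> -2

# Jumbo pairing: a Disp9 base op plus a 24-bit jumbo payload gives Disp33.
disp_with_jumbo24 = 9 + 24              # -> 33

# With only Disp6 base ops, reaching 33 bits would need a 27-bit payload,
# leaving just 32 - 27 = 5 bits for the prefix to identify itself --
# which is why a 27-bit jumbo prefix is "basically no-go".
payload_needed = 33 - 6                 # -> 27
prefix_bits_left = 32 - payload_needed  # -> 5

print(net_bits, disp_with_jumbo24, payload_needed, prefix_bits_left)
```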

For my existing ISA, I did go and make a tweak:
Some of the flag bits from SR are saved in the high-order bits of LR
during function calls (and restored on function return);
This means predication should now work across function calls, and also
resolves a few potential ISA semantics issues involving WEX (the WEX
Enable state and similar is also now preserved across function calls).

Though, it does result in LUT cost increasing by a few %, which isn't
ideal. I can't really tell how much of this is due to an actual/significant
cost increase vs random fluctuation.
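The save/restore of SR bits via LR can be sketched roughly as follows (the flag width, shift position, and function names are my own assumptions for illustration, not the actual BJX2 layout):

```python
# Sketch of stashing SR flag bits in the high-order bits of LR across a
# function call. FLAG_SHIFT/FLAG_MASK are assumed values, not the real ones.

FLAG_SHIFT = 48                 # assumed: flags sit above usable address bits
FLAG_MASK = 0xFF                # assumed: 8 saved bits (predicate state, WEX enable, ...)
ADDR_MASK = (1 << FLAG_SHIFT) - 1

def on_call(return_addr, sr_flags):
    # The call packs selected SR bits into LR alongside the return address.
    return (return_addr & ADDR_MASK) | ((sr_flags & FLAG_MASK) << FLAG_SHIFT)

def on_return(lr):
    # The return unpacks both, so predication/WEX state survives the call.
    return lr & ADDR_MASK, (lr >> FLAG_SHIFT) & FLAG_MASK

lr = on_call(0x00008000, 0b01010011)
addr, flags = on_return(lr)
assert (addr, flags) == (0x00008000, 0b01010011)
```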

WNS hasn't really changed either way, and the WNS value usually
indicates if one has poked at something serious. Likewise, the slow
paths still seem to be mostly stuff within the memory subsystem.

>>
>>
>> Yeah, a shortfall of several bits seems to make a pretty big difference...
>>
>>
>> It goes a little further if one does Load/Store and 3RI ops using
>> Disp6/Imm6 instead of Disp9/Imm9.
>>
>> Not enough bits to encode an Imm33/Disp33 in a 64-bit pair, and not
>> enough bits to encode Imm64 in 96-bits, ...
>>
>>
>> Yeah, "poor starting point" is starting to seem fairly evident...
>>>>
>>>>
>>>> Still not fully settled on instruction layouts yet, and don't feel
>>>> particularly inclined at the moment to pursue this, since the main way
>>>> to "actually take advantage of it" would likely require use of modulo loop
>>>> scheduling or clever function inlining or similar (or, basically, one of
>>>> the same issues which Itanium had to deal with).
>>>>
>>>> Some possible debate is whether code would benefit from a move from 32
>>>> to 64 GPRs. Short of some tasks which come up in an OpenGL rasterizer
>>>> (namely parallel edge walking over a bunch of parameters or similar), I
>>>> have doubts.
>>>>
>>>>
>>>> It is more likely to pay off for a wider core, but this would assume
>>>> having a compiler which is effective enough to use the additional width
>>>> (whereas, as-is, my compiler can't even really manage 3-wide effectively).
>>>>
>>> I have lived under the assumption that the wider cores have the HW resources
>>> to do many of these things for themselves, so that code written, compiled, and
>>> scheduled for the 1-wide cores run within spitting distance of the best compiled
>>> code one could target at the GBOoO core. I developed this assumption from the
>>> Mc 88120 effort where we even achieved 2.0 IPC running SPEC 89 XLISP ! and
>>> 5.99 IPC running MATRIX300.
> <
>> I am assuming a lack of any OoO or GBOoO capabilities, and instead a
>> strictly in-order bundle-at-a-time core more like the existing BJX2
>> pipeline, just possibly widened from 3 to 5 or similar.
> <
> Yes, you are targeting a particular chip to hold your design, while I am
> designing from the very small (1-wide In Order) to the moderately large
> (8-wide Out of Order)

Granted.

For me, significantly smaller FPGAs fall into the "not generally sold
on FPGA dev boards on Amazon or similar" territory (1).

And, bigger FPGAs fall into the "they are too expensive and I don't
have money to afford them" territory.

Similarly, custom ASICs are likely well outside anything I can afford
(and the ability to "actually do things" is a limiting factor).

1: I can't tell if this is due to obscurity/supply reasons or similar,
or due to the FPGAs being too small to really be worth doing all that
much with.

>>
>> But, the amount of heavy lifting the compiler would need to do to make
>> this worthwhile is a problem.
> <
> Having designed a 6-wide GBOoO and understanding your medium of
> expression, I can understand why you are not.

Yeah.

At an earlier stage, I wasn't even expecting as much as what I got...

Granted, at an earlier stage, I also expected to do a "lot of small
cores" rather than "one or two big cores", but didn't really realize
that the cost difference between a "small core" and a "big core" is
still fairly modest at these scales.

>>
>> Some stuff I have read does seem to imply that GCC may have some of the
>> needed optimizations to be able to make this workable though.
>>
>>
>>
>> Likewise, going the other direction, a 1-wide RISC-like core (vs 3-wide):
>> I can clock it at 75 or 100MHz;
>> I still need to reduce the L1 cache sizes at these speeds.
> <
> What would the trade off be if you added a pipe stage to LD so you
> could run at the higher frequency and have the larger cache ?

Not sure...

Adding an EX4 stage would not be good (interlock and forwarding, ...),
so I don't really want to go this route.

Adding in a pipeline stall for every Ld/St would also be pretty bad (the
performance impact seems to be "pretty much devastating"), and somewhat
worse than the impact from the smaller L1. (Despite my efforts,
Load/Store ops are still in top-place for the most-frequently-used
instructions...)

I had considered the possibility of an intermediate "L1.5" cache, but
haven't implemented it yet. In this case, the L1.5 would sit on the L1
ring and handle some requests which would have otherwise gone to the L2,
but potentially slightly faster.

I had also considered the possibility of a 2-level L1:
L0: 1K, LUTRAM (single-cycle access)
L1: 16K BRAM, may have multi-cycle access latency.

This could also make sense, but falls into "haven't implemented yet"
category; trade-off is mostly the added complexity.

This option is likely to give better performance than the L1.5 option,
but is likely to be more complicated.
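The two-level L0/L1 idea can be mocked up as a toy timing model (the 3-cycle L1 probe, L2 cost, and refill-on-miss policy here are my assumptions; the sizes follow the post):

```python
# Toy model of a 2-level L1: a tiny single-cycle L0 backed by a larger,
# multi-cycle L1, both direct-mapped on 32-byte lines.

LINE = 32
L0_SETS = 1024 // LINE      # 1K L0 (LUTRAM, single-cycle)
L1_SETS = 16384 // LINE     # 16K L1 (BRAM, multi-cycle)
L1_CYCLES = 3               # assumed multi-cycle BRAM access latency

l0, l1 = {}, {}

def access(addr, l2_cycles):
    line = addr // LINE
    if l0.get(line % L0_SETS) == line:
        return 1                        # L0 hit: single cycle
    cycles = L1_CYCLES                  # probe the slower L1
    if l1.get(line % L1_SETS) != line:
        cycles += l2_cycles             # missed both: fetch from L2
        l1[line % L1_SETS] = line
    l0[line % L0_SETS] = line           # refill L0 on the way back
    return cycles

first = access(0x0000, l2_cycles=20)    # cold miss: L1 probe + L2 fetch
second = access(0x0000, l2_cycles=20)   # L0 hit: single cycle
assert (first, second) == (23, 1)
```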

Both options "make sense" at 75MHz, but seem not so useful at 50.
But, 75MHz makes timing for pretty much everything a lot more difficult.

>>
>>
>> With smaller L1's, I can have a 100MHz core that runs slower than a
>> 3-wide core running at 50MHz...
>> Which is kinda how I ended up in the current boat to begin with.
> <
> What about adding a pipe stage to LD and making the cache 75%
> of a cycle longer ?
> <
> I don't remember the sizes of your L1s, but this sounds like the classical
> cache pipe-stage dilemma: 8K 2-cycle versus 64K 3-cycle. For high-miss-rate
> applications the larger, slower cache is better, especially if the core
> frequency improves.

L1 sizes (what I could get away with):
  100MHz:  1K ( 32x 32B)
   75MHz:  2K ( 64x 32B)
   50MHz: 16K (512x 32B)

The L1 tries to access the cache arrays along a single clock-edge, which
doesn't really work if using Block-RAM at 75MHz, but does work in 2 cycles.

The 1K and 2K cases can be implemented using LUTRAM, and work well.

Trying to use LUTRAM for a 4K or 8K cache doesn't work out, though.

In a single-core config at 50MHz, I have also experimented with 32K L1
caches. These show some (albeit more modest) improvement over the 16K L1's.

The issue seems to be more that 1K and 2K caches have a fairly high miss
rate.
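That miss-rate gap is easy to reproduce with a toy direct-mapped model (synthetic strided traffic, not a claim about actual BJX2 workloads):

```python
# Toy direct-mapped cache model showing why 1K/2K caches miss far more
# often than 16K ones on a working set that exceeds their capacity.

def miss_rate(cache_bytes, trace, line=32):
    sets = cache_bytes // line
    tags, misses = {}, 0
    for addr in trace:
        tag = addr // line
        if tags.get(tag % sets) != tag:   # conflict or cold miss
            misses += 1
            tags[tag % sets] = tag
    return misses / len(trace)

# A 4K working set swept 8 times: fits easily in 16K, thrashes 1K.
trace = [a for _ in range(8) for a in range(0, 4096, 32)]
print(miss_rate(1024, trace))    # 1K: every access misses here
print(miss_rate(16384, trace))   # 16K: only the first sweep misses
```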

>>
>> And, a 3-wide core at 75MHz is faster than a 1-wide core at 100MHz.
> <
> Back when we had the 1-wides up and running and were designing the
> 2-wides we simulated a bunch of the design space and found:: in general:
> 1-wide could get 0.7 IPC, 2-wide 0.95 IPC, 3-wide 1.1 IPC, 4-wide 1.2 IPC.
> {Note these were NOT the OoO machines.}
> <
> So, based on those numbers your 3-wide should be 20% faster.
> <
> How fast is it relative to the 1-wide?

Depending on whether bundling is enabled in the compiler, there also
seems to be a roughly 20% difference (if both cases are at 50MHz).

But, the primary thing which seems to determine performance is the
cache sizes.

If I could run at 75MHz with the same L1 cache sizes as 50MHz, the
performance numbers would be pretty solid.
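The pattern described earlier (a 3-wide core at 50MHz with large caches beating a 1-wide core at 100MHz with tiny ones) can be put into rough numbers, using the IPC figures quoted above; the miss rates and penalty below are illustrative guesses:

```python
# Rough throughput model: clock * achieved IPC, derated by miss stalls.
# IPC figures (0.7 for 1-wide, 1.1 for 3-wide) are the ones quoted in
# the thread; miss rates and the 10-cycle penalty are my assumptions.

def perf(mhz, ipc, miss_rate, miss_penalty):
    cpi = 1.0 / ipc + miss_rate * miss_penalty   # cycles per instruction
    return mhz / cpi                             # ~millions of instrs/sec

wide_50    = perf(50, 1.1, 0.02, 10)    # 3-wide @ 50MHz, big 16K L1
narrow_100 = perf(100, 0.7, 0.15, 10)   # 1-wide @ 100MHz, tiny 1K L1

# Under these assumptions, the bigger cache more than pays for the
# halved clock:
assert wide_50 > narrow_100
```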

Similarly, with the "just let it fail timing and run it anyway" approach,
it is rather crash-prone.

In this territory, one can potentially see ~ 4-8 fps in GLQuake... At
least until it crashes...

>>
>> And the 50MHz core rolls in with its ability to have massively larger L1
>> caches and similar, and "owns it".
>>
>>
>>
>>
>> Decided to leave out going off onto a tangent about my ongoing battles
>> with DRAM bandwidth... (ATM, it appears to be mostly a "death by 1000
>> paper cuts" situation, though now mostly confined to the L2 cache and L2
>> Cache <-> DDR Controller interface and similar).
>>
>>
>> If there were some good way to predict cache misses before they
>> happened, this could be useful...
>>
>> May also consider a "sweep L2 and evict old dirty cache lines"
>> mechanism, ...
