Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: First testing with new bus (BJX2 core)
Date: Tue, 27 Apr 2021 22:35:35 -0500
Organization: A noiseless patient Spider
Lines: 355
Message-ID: <s6al6a$9an$1@dont-email.me>
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me>
<9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me>
<33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
<s689rr$ap$1@dont-email.me>
<31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Apr 2021 03:35:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="04523a298814a4068dbc81969c4f824d";
logging-data="9559"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+cssfk+bhLDscStqQZx5Q1"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:ofHZEwx9LWocr9R3TJrkkXE5Xck=
In-Reply-To: <31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>
Content-Language: en-US

On 4/27/2021 10:53 AM, MitchAlsup wrote:
> On Tuesday, April 27, 2021 at 1:10:05 AM UTC-5, BGB wrote:
>> On 4/24/2021 10:26 PM, MitchAlsup wrote:
>>> On Saturday, April 24, 2021 at 8:08:15 PM UTC-5, BGB wrote:
>>>> On 4/24/2021 4:31 PM, MitchAlsup wrote:
>>>>> On Saturday, April 24, 2021 at 2:58:08 PM UTC-5, BGB wrote:
>>>>>> ( Got distracted... )
>>>>>> On 4/10/2021 2:20 PM, EricP wrote:
>>>>>>> BGB wrote:
>>>>>>>> So, as probably people are already aware, I have been working for a
>>>>>>>> little while on implementing a new bus (on/off, for several months
>>>>>>>> thus far), which has ended up requiring a near complete rewrite of the
>>>>>>>> memory subsystem...
>>>>>>>>
>> ...
>>>>>>
>>>>>> The L2 is currently not quite as fast as expected.
>>>>>>
>>>>>> Stats I am currently seeing look like:
>>>>>> Memcpy (L2): ~ 25 MB/s
>>>>>> Memset (L2): ~ 70 MB/s
>>>>>>
>>>>>> Memcpy (DRAM): ~ 9 MB/s
>>>>>> Memset (DRAM): ~ 11 MB/s
>>>>>>
>>>>>> For reference (L1 local):
>>>>>> Memcpy (L1): ~ 250 MB/s
>>>>>> Memset (L1): ~ 277 MB/s
>>>>>>
>>>>>> With both the CPU core and ring-bus running at 50 MHz (with DDR2 at
>>>>>> 75MHz; DLL Disabled).
>>>>> <
>>>>> Yeah, seems somewhat slow, should be closer to L1/4 (60 MB/s)
>>>>>
>>>>> Remind me of the associativity of the L1 and L2 ??
>>>>> <
>>>> L1 and L2 are both 1-way / direct-mapped in this case.
>>>>
>>>>
>>>> Though, traditional descriptions imply that direct-mapped is strictly
>>>> Mod-N, and set-associative is basically Mod-N but with 2 or 4 possible
>>>> cache lines per slot.
>>>>
>>>>
>>>> I am using a mapping which basically does an XOR hash of the address
>>>> bits to generate an index into the cache, which then operates on a
>>>> similar premise to a hash table.
>>> <
>>> You should look up André Seznec's paper on skewed associative caches.
>>
>> I had considered it, though as I understand it, a skewed-associative
>> hash would require a multi-cycle state-machine.
>>
> The Skewed Associative Cache uses a different hash function on each of the sets
> in the cache.
>
> In a normal set-associative Cache, one uses the same address to each set.
>
> The extra path length is no more than a single XOR gate of delay--but you do not
> store multiple tags under a single index like you can in a normal SA cache.
>
> But as you are using DM caches, it is on a different path.

I meant it in the sense that a skew-associative cache using a single
cache-line array (as in a DM cache) would likely require several cycles
to probe the array, e.g.:
Stage 1: Probe Spot A
Stage 2: Probe Spot B
Stage 3: Try Stomp A (for Store)
Stage 4: Initiate Miss Handler

Though, I guess it is possible that this state could be encoded in the
request message (rather than using a dedicated state machine).

It looks like one otherwise needs to first implement a 2-way
set-associative cache to have the needed mechanisms for the
skew-associative cache.

But it could be useful if it shows an improvement over a more
conventional 2-way cache.
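
Roughly the sort of multi-step probe I have in mind, as a sketch (the
hash mixes, widths, and miss handling here are placeholders rather than
what the core actually does):

module SkewProbeSketch(clock, probeReq, probeAddr, probeHit);
input          clock;
input          probeReq;
input  [27:0]  probeAddr;    // memAddr[31:4]
output         probeHit;

reg [27:0]  arrTag[8191:0];  // single DM tag array, 13-bit index
reg [12:0]  tIxA;            // index for probe Spot A
reg [12:0]  tIxB;            // index for probe Spot B
reg [1:0]   tState;          // 0=Probe A, 1=Probe B, 2=Stomp/Miss
reg         tHit;

assign probeHit = tHit;

initial begin tState = 0; tHit = 0; end

always @*
begin
    // Each probe spot hashes the address differently.
    tIxA = probeAddr[12:0] ^ probeAddr[24:12];
    tIxB = probeAddr[12:0] ^ probeAddr[25:13];
end

always @(posedge clock)
begin
    tHit <= 0;
    case(tState)
        2'b00: if(probeReq)            // Stage 1: Probe Spot A
        begin
            if(arrTag[tIxA] == probeAddr)
                tHit   <= 1;
            else
                tState <= 2'b01;
        end
        2'b01: begin                   // Stage 2: Probe Spot B
            if(arrTag[tIxB] == probeAddr)
            begin
                tHit   <= 1;
                tState <= 2'b00;
            end
            else
                tState <= 2'b10;       // Stage 3/4: Stomp A / Miss
        end
        default: tState <= 2'b00;      // stomp / miss handling omitted
    endcase
end

endmodule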

>>
>>
>>
>>
>> Sadly, finding and fixing a bug that was resulting in memory corruption
>> also came at the cost of a pretty big hit to performance...
>>
>>
>> The cache would issue multiple requests at a time, but wouldn't
>> necessarily check that the responses came back for memory stores before
>> continuing (it would get the response for a Load, then assume everything
>> was good).
>>
>> This worked in the partial simulation test-bench, which always returned
>> results in-order, but was failing in some cases if the store took
>> longer. There were cases where the program would start doing stuff and
>> issuing more requests before the first store request was handled,
>> resulting in loads occasionally returning stale data.
>>
>>
>> As a result, went and added logic to the partial simulation to mimic the
>> behavior of the L2 cache, which also recreated the bugs I was seeing in
>> the full simulation.
>>
>>
>> I added some logic to the L1 cache to make it wait for *both* the Load and
>> Store responses to get back before continuing, and tightening up
>> behavior in a few other related areas, which (sadly) seems to have put a
>> dent on performance.
>>
>>
>> At the moment:
>> Memcpy (DRAM): ~ 8.4 MB/s
>> Memset (DRAM): ~ 14.7 MB/s
>> Memcpy (L2): ~ 12.6 MB/s
>> Memset (L2): ~ 49.1 MB/s
>> Memcpy (L1): ~ 250 MB/s
>> Memset (L1): ~ 275 MB/s
>>
>>
>> L2 speeds "could" be a bit higher, but it seems that the L2 is having a
>> fairly high miss rate during the "L2" test.
>>
>> The test logic in the partial-simulation test-bench also implies that
>> the high miss rate is due to the design of the L2 cache, rather than due
>> to a bug in the L2's Verilog.
>>
>>
>> L2 hit rate seems to be in the area of ~ 45% to 60%.
>>
> For an L2 4×-to-8× the size of L1 those local miss rates are "about right".

Yeah.

I did go and model a 2-way associative cache (in the partial-sim
test-bench), and this was able to boost the L2 stats back up a
reasonable amount (up to ~ 30 MB/s memcpy and 70 MB/s memset).

DRAM case showed relatively little impact.

Attempting to add 2-way support to the existing L2 cache was much less
effective, though I suspect this was more due to bugs and it being
"barely functional".

Technically, it was using a strategy similar to what I had previously
tried with the L1 caches (1.5-way?), namely dividing the cache lines up
into A and B sets (a rough sketch follows below):
  Set A: May contain cache lines in either a clean or dirty state;
  Set B: May only contain cache lines in a "clean" state.

Load may fetch from either A or B:
  Miss, A is Not Dirty: Load the fetched line into A;
  Miss, A is Dirty: Load the fetched line into B.

Store will only store to A, but behaves differently based on state:
  A is Dirty (Hit or Cache Miss, handled as in the 1-way case):
    Write A back to DRAM;
    Load "null line" into A.
  A is Not Dirty:
    Hit: Replace A;
    Miss: Copy A to B, Store into A.

There is no real need to preserve the state of Set B.

Miss, Either A or B is set to Flush:
  Write A back to DRAM if Dirty;
  Load into A;
  Stomp whatever is in B.
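
As a rough sketch of the store-side behavior (tag and data handling
only; the DRAM writeback/load requests, the "null line", and the flush
case are left out):

module L2StoreABSketch(clock, storeReq, storeAddr, storeData);
input          clock;
input          storeReq;
input  [27:0]  storeAddr;      // memAddr[31:4]
input  [127:0] storeData;      // one cache line

reg [27:0]  arrTagA  [4095:0]; // Set A: may hold clean or dirty lines
reg [27:0]  arrTagB  [4095:0]; // Set B: clean lines only
reg [127:0] arrDataA [4095:0];
reg [127:0] arrDataB [4095:0];
reg         arrDirtyA[4095:0];

reg [11:0]  tIx;               // placeholder index hash
reg         tHitA;
reg         tDirtyA;

always @*
begin
    tIx     = storeAddr[11:0] ^ storeAddr[23:12];
    tHitA   = (arrTagA[tIx] == storeAddr);
    tDirtyA = arrDirtyA[tIx];
end

always @(posedge clock)
    if(storeReq)
    begin
        if(tDirtyA && !tHitA)
        begin
            // Miss, A is Dirty: as in the 1-way case, A would first be
            // written back to DRAM (bus request not shown), then replaced.
            arrTagA  [tIx] <= storeAddr;
            arrDataA [tIx] <= storeData;
            arrDirtyA[tIx] <= 1;
        end
        else if(!tDirtyA && !tHitA)
        begin
            // Miss, A is Not Dirty: copy A into B (B only ever holds
            // clean lines), then store the new line into A.
            arrTagB  [tIx] <= arrTagA [tIx];
            arrDataB [tIx] <= arrDataA[tIx];
            arrTagA  [tIx] <= storeAddr;
            arrDataA [tIx] <= storeData;
            arrDirtyA[tIx] <= 1;
        end
        else
        begin
            // Hit in A: update A in place; loads (not shown) may
            // return data from either A or B.
            arrDataA [tIx] <= storeData;
            arrDirtyA[tIx] <= 1;
        end
    end

endmodule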

My initial attempt isn't really seeing much speedup over the normal
1-way cache, and is giving very different results from the mock-up of
similar logic in the test-bench. I suspect it isn't working correctly
as of yet.

Though, even if it does work as expected, it requires making the cache
half as long, and this approach in effect makes "memcpy" style patterns
a little faster at the expense of making "memset" patterns slower (but
it can work out reasonably well if the working set for loads is larger
than the working set for stores).

Granted, a cache where both A and B may contain dirty cache-lines would
be better, but this is more complicated.

In any case, I still need to work on it some more...

>>
>>
>> It seems there is a fairly frequent pattern of endlessly evicting one
>> cache line to load another, then evicting that cache line to load the
>> first line again, ...
>>
> There is this thing called a victim buffer.

It is possible.
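
Could be worth a try; a single-entry victim buffer along these lines
would be fairly cheap (a minimal sketch, with the hookup into the miss
path not shown):

module VictimBufSketch(clock, evictReq, evictAddr, evictData,
    missAddr, vbHit, vbData);
input          clock;
input          evictReq;
input  [27:0]  evictAddr;      // address of the line being evicted
input  [127:0] evictData;
input  [27:0]  missAddr;       // address the cache just missed on
output         vbHit;
output [127:0] vbData;

reg [27:0]  tVbAddr;
reg [127:0] tVbData;
reg         tVbValid;

initial tVbValid = 0;

// A hit here means the line was only just evicted, so the ping-pong
// "evict one line to load the other" case can be short-circuited.
assign vbHit  = tVbValid && (tVbAddr == missAddr);
assign vbData = tVbData;

always @(posedge clock)
    if(evictReq)
    begin
        // Capture the most recently evicted line.
        tVbAddr  <= evictAddr;
        tVbData  <= evictData;
        tVbValid <= 1;
    end

endmodule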

>>
>> May need to come up with some sort of workaround, in any case.
>>>>
>>>> In any case, results for the hashed indexing are still better than with
>>>> Mod-N direct-mapped caches.
>>> <
>>> To be expected
>> I have since gone and modeled various hash functions directly, results I
>> am seeing (for a 13-bit index, based on the boot sequence):
>> (Addr[12:0] ): ~ 46% (Mod-N)
>> (Addr[12:0]^Addr[24:12] ): ~ 58% (A)
>> (Addr[12:0]^Addr[25:13] ): ~ 54% (B)
>> (Addr[12:0]^Addr[24:12]^Addr[27:15]): ~ 56% (C)
>> (Addr[12:0]^Addr[25:13]^Addr[19: 7]): ~ 59% (D)
>> (Addr[12:1]^Addr[24:13]^Addr[27:16], Addr[0]): ~ 55% (E)
>>
>> Where Addr is 28 bits, selected from memAddr[31:4].
>>
>>
>> Or, basically, the hashed addresses somewhat beat out the use of a naive
>> modular mapping.
>>
>>
>> Some of the other schemes tried (such as transposing the high-order
>> bits, or passing the low-order bits unmodified), were in-general "not
>> particularly effective" according to this test.
>>
>>
>> That said, leaving the low-order bits intact does significantly reduce
>> the frequency of "pogo misses".
>>
>> So, in this case, the option which does best according to hit rate (D),
>> also happens to have a drawback of a higher pogo-miss rate than some of
>> the other options.
>>
>> Likewise, C vs E differ mostly in that E has a lot fewer pogo misses.
>>
>> Option A seems to do reasonably well and also has a relatively low
>> pogo-miss rate relative to the others.
>>
>> Mod-N has the lowest hit rate but also apparently the lowest pogo-miss
>> rate (in that they don't seem to occur with this function in these tests).
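
In Verilog terms, these index hashes amount to things like the
following (Addr being the 28 bits taken from memAddr[31:4]):

module IndexHashSketch(memAddr, ixModN, ixA, ixD);
input  [31:0] memAddr;
output [12:0] ixModN;
output [12:0] ixA;
output [12:0] ixD;

wire [27:0] addr;
assign addr = memAddr[31:4];

assign ixModN = addr[12:0];                              // Mod-N,    ~46%
assign ixA    = addr[12:0] ^ addr[24:12];                // Option A, ~58%
assign ixD    = addr[12:0] ^ addr[25:13] ^ addr[19:7];   // Option D, ~59%

endmodule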
>>
>>
>>
>> ...
>>>>>>
>>>>>>
>>>>>> Another "promising" type of pattern appears to be to do Mod-N for the
>>>>>> lower address bits, but then XOR'ing the higher-order bits, say:
>>>>>> { Addr[16:10] ^ Addr[22:16], Addr[9:4] }
>>>>>> Or:
>>>>>> { Addr[16:8] ^ Addr[24:16], Addr[7:4] }
>>>>> OR:
>>>>> {Addr[8:16] ^ Addr[24:16], Addr[7:4]}
>>>>> OR:
>>>>> {Addr[16:8] ^ Addr[16:24], Addr[7:4]}
>>>> The tools I am using lose their crap if one reverses the numbers here...
>>>>
>>>> The general notion seems to be that "addr[msb:lsb]" is the only valid
>>>> way to write bit-range selection.
>>> <
>>> At the language level:: "its only wires" and the tool loses its *&^% !!
>>>
>> I am not sure the specifics as to why...
>> But, I do seem to see a lot of arbitrary limitations here.
>>
>> Also Verilog's preprocessor kinda sucks in that the tools can't seem to
>> entirely agree on how to resolve "`include" directives, "readmemh"
>> behavior, how exactly macro expansion behaves, ..
>>
>>
>> Then I write some stuff, Vivado warns about it in Verilog 97 mode, and
>> Quartus refuses to accept it unless it is told to use System Verilog.
>>
>>
>> But, I ended up doing it that way because the alternatives kinda suck
>> (and if one tries to use a macro expansion in a "case" in Verilator, its
>> parser freaks out).
>>
>> Well, and also if one writes:
>> 8'h40, 8'h60: ...
>> Or similar in a "case", Verilator's parser also freaks out, ...
>> But, this does work if one uses constants declared via "parameter".
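
E.g., the parameter-based form that Verilator does accept (the names
and opcode values here are just placeholders):

module CaseParamSketch(opWord, isMemOp);
input  [7:0] opWord;
output       isMemOp;

// Declaring the constants via "parameter" keeps Verilator's parser happy.
parameter OPC_A = 8'h40;    // placeholder opcode values
parameter OPC_B = 8'h60;

reg tIsMemOp;
assign isMemOp = tIsMemOp;

always @*
begin
    case(opWord)
        OPC_A, OPC_B: tIsMemOp = 1;
        default:      tIsMemOp = 0;
    endcase
end

endmodule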
>>
>>
>>
>> Well, also if one passes a variable from a combinatorial block into a
>> module parameter, and the module works via combinatorial logic, and then
>> tries to use the output of said module via another combinatorial
>> block... Verilator often freaks out thinking one has written something
>> based on circular logic.
>>
> I have seen this effect, too.
>
> Not knowing you aren't creating a circular path (like a flip-flop), it
> guesses it can't see far enough and warns you anyway. Most of the time
> we build blocks big enough that they were flopped on the inputs and
> outputs.

Possible...

I ended up also fixing up these cases (and accepting the existence of
this warning), because it also happens to be a reasonable indicator of
paths which are likely to fail timing.
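
E.g., flopping the value at the module boundary tends to make both the
warning and the worst combinational paths go away (a generic sketch,
with placeholder logic):

module FlopOutSketch(clock, inVal, outVal);
input         clock;
input  [15:0] inVal;
output [15:0] outVal;

reg [15:0] tNextVal;
reg [15:0] tOutVal;

assign outVal = tOutVal;

always @*
begin
    // Combinational work happens here (placeholder logic).
    tNextVal = inVal ^ { inVal[7:0], inVal[15:8] };
end

// Downstream combinational blocks see a flopped value, rather than a
// long combinational path reaching back through this module.
always @(posedge clock)
    tOutVal <= tNextVal;

endmodule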

>>
>> Though, this issue can be partly sidestepped by using "assign" rather
>> than "always @*" blocks in these cases...
>>
>>
>> Well, and then also in Vivado, one can't drive outputs via "always"
>> blocks, but needs to do something like:
>>
>> output[15:0] outVal;
>> reg[15:0] tOutVal;
>> assign outVal = tOutVal;
>>
>> ...
>>> Well you could do it like this::
>>>
>>> {Addr[16] ^ Addr[24], Addr[15] ^ Addr[17], Addr[14]^Addr[23]. .....
>>>
>>> Do the bit rearrangement by hand--this works only as long as you only use it in one
>>> place or embed it in a instance.
>> Manual bit rearrangement is the usual practice...
>>>>
>>>>
>>>> Verilator seems to first give a message complaining about the bit range
>>>> being reversed, then gives a message about 4294967287 bits being greater
>>>> than 9, and then promptly crashes with a segmentation fault message...
>>>>
>>>> Vivado gives an error message about bit-reversed selection being invalid.
>>>>
>>>> Some information online claims that Quartus supports reverse bit-range
>>>> selection in this case (and doing so will reverse the bits).
>>>>
>>>> The usual strategy is to write something like: "{ addr[0], addr[1],
>>>> addr[2], ... }", but this kinda sucks...
>>> <
>>> What sucks worse is a bad hash !!!
>> Granted...
>>
>> I had been trying various options.
