devel / comp.arch / Re: Misc: First testing with new bus (BJX2 core)

Subject / Author
* Re: Misc: First testing with new bus (BJX2 core) - BGB
+* Re: Misc: First testing with new bus (BJX2 core) - Stephen Fuld
|`- Re: Misc: First testing with new bus (BJX2 core) - BGB
`* Re: Misc: First testing with new bus (BJX2 core) - MitchAlsup
 `* Re: Misc: First testing with new bus (BJX2 core) - BGB
  `- Re: Misc: First testing with new bus (BJX2 core) - BGB

Re: Misc: First testing with new bus (BJX2 core)

<s689rr$ap$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16396&group=comp.arch#16396

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: First testing with new bus (BJX2 core)
Date: Tue, 27 Apr 2021 01:10:00 -0500
Organization: A noiseless patient Spider
Lines: 248
Message-ID: <s689rr$ap$1@dont-email.me>
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me>
<9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me>
<33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 27 Apr 2021 06:10:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7898f01f42fbcab32c39dc3619a01ae8";
logging-data="345"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19AnvMpm6YLCBkfHQAltA7F"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:1RtE71AjejmfNH0B3tIXm/auCUo=
In-Reply-To: <33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 27 Apr 2021 06:10 UTC

On 4/24/2021 10:26 PM, MitchAlsup wrote:
> On Saturday, April 24, 2021 at 8:08:15 PM UTC-5, BGB wrote:
>> On 4/24/2021 4:31 PM, MitchAlsup wrote:
>>> On Saturday, April 24, 2021 at 2:58:08 PM UTC-5, BGB wrote:
>>>> ( Got distracted... )
>>>> On 4/10/2021 2:20 PM, EricP wrote:
>>>>> BGB wrote:
>>>>>> So, as probably people are already aware, I have been working for a
>>>>>> little while on implementing a new bus (on/off, for several months
>>>>>> thus far), which has ended up requiring a near complete rewrite of the
>>>>>> memory subsystem...
>>>>>>

....

>>>>
>>>> The L2 is currently not quite as fast as expected.
>>>>
>>>> Stats I am currently seeing look like:
>>>> Memcpy (L2): ~ 25 MB/s
>>>> Memset (L2): ~ 70 MB/s
>>>>
>>>> Memcpy (DRAM): ~ 9 MB/s
>>>> Memset (DRAM): ~ 11 MB/s
>>>>
>>>> For reference (L1 local):
>>>> Memcpy (L1): ~ 250 MB/s
>>>> Memset (L1): ~ 277 MB/s
>>>>
>>>> With both the CPU core and ring-bus running at 50 MHz (with DDR2 at
>>>> 75MHz; DLL Disabled).
>>> <
>>> Yeah, seems somewhat slow, should be closer to L1/4 (60 MB/s)
>>>
>>> Remind me of the associativity of the L1 and L2 ??
>>> <
>> L1 and L2 are both 1-way / direct-mapped in this case.
>>
>>
>> Though, traditional descriptions imply that direct-mapped is strictly
>> Mod-N, and set-associative is basically Mod-N but with 2 or 4 possible
>> cache lines per slot.
>>
>>
>> I am using a mapping which basically does an XOR hash of the address
>> bits to generate an index into the cache, which then operates on a
>> similar premise to a hash table.
> <
> You should look up André Seznec's paper on skewed associative caches.

I had considered it, though as I understand it, a skewed-associative
hash would require a multi-cycle state-machine.

Sadly, finding and fixing a bug that was resulting in memory corruption
also came at the cost of a pretty big hit to performance...

The cache would issue multiple requests at a time, but wouldn't
necessarily check that the responses came back for memory stores before
continuing (it would get the response for a Load, then assume everything
was good).

This worked in the partial simulation test-bench, which always returned
results in-order, but was failing in some cases if the store took
longer. There were cases where the program would start doing stuff and
issuing more requests before the first store request was handled,
resulting in loads occasionally returning stale data.

As a result, I went and added logic to the partial simulation to mimic
the behavior of the L2 cache, which also recreated the bugs I was seeing
in the full simulation.

I added some logic to make the L1 cache wait for *both* the Load and
Store responses to get back before continuing, and tightened up
behavior in a few other related areas, which (sadly) seems to have put a
dent in performance.

At the moment:
Memcpy (DRAM): ~ 8.4 MB/s
Memset (DRAM): ~ 14.7 MB/s
Memcpy (L2): ~ 12.6 MB/s
Memset (L2): ~ 49.1 MB/s
Memcpy (L1): ~ 250 MB/s
Memset (L1): ~ 275 MB/s

L2 speeds "could" be a bit higher, but it seems that the L2 is having a
fairly high miss rate during the "L2" test.

The test logic in the partial-simulation test-bench also implies that
the high miss rate is due to the design of the L2 cache, rather than due
to a bug in the L2's Verilog.

L2 hit rate seems to be in the area of ~ 45% to 60%.

It seems there is a fairly frequent pattern of endlessly evicting one
cache line to load another, then evicting that cache line to load the
first line again, ...

May need to come up with some sort of workaround, in any case.

>>
>> In any case, results for the hashed indexing are still better than with
>> Mod-N direct-mapped caches.
> <
> To be expected

I have since gone and modeled various hash functions directly; the
results I am seeing (for a 13-bit index, based on the boot sequence):
(Addr[12:0] ): ~ 46% (Mod-N)
(Addr[12:0]^Addr[24:12] ): ~ 58% (A)
(Addr[12:0]^Addr[25:13] ): ~ 54% (B)
(Addr[12:0]^Addr[24:12]^Addr[27:15]): ~ 56% (C)
(Addr[12:0]^Addr[25:13]^Addr[19: 7]): ~ 59% (D)
(Addr[12:1]^Addr[24:13]^Addr[27:16], Addr[0]): ~ 55% (E)

Where Addr is 28 bits, selected from memAddr[31:4].
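
Expressed as Verilog, the shape of these index functions is roughly as
follows (a sketch only; the module and signal names here are made up,
not the actual cache code):

module l2_index_hash(
    input  [31:0] memAddr,
    output [12:0] idxModN,
    output [12:0] idxHashA,
    output [12:0] idxHashD
);
    // Addr: the 28 line-granular address bits, memAddr[31:4].
    wire [27:0] a = memAddr[31:4];

    assign idxModN  = a[12:0];                         // Mod-N
    assign idxHashA = a[12:0] ^ a[24:12];              // option A
    assign idxHashD = a[12:0] ^ a[25:13] ^ a[19:7];    // option D
endmodule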

Or, basically, the hashed addresses somewhat beat out the use of a naive
modular mapping.

Some of the other schemes tried (such as transposing the high-order
bits, or passing the low-order bits unmodified) were, in general, "not
particularly effective" according to this test.

That said, leaving the low-order bits intact does significantly reduce
the frequency of "pogo misses".

So, in this case, the option which does best according to hit rate (D)
also happens to have the drawback of a higher pogo-miss rate than some
of the other options.

Likewise, C vs E differ mostly in that E has a lot fewer pogo misses.

Option A seems to do reasonably well and also has a relatively low
pogo-miss rate compared to the others.

Mod-N has the lowest hit rate but also apparently the lowest pogo-miss
rate (in that they don't seem to occur with this function in these tests).

....

>>>>
>>>>
>>>> Another "promising" type of pattern appears to be to do Mod-N for the
>>>> lower address bits, but then XOR'ing the higher-order bits, say:
>>>> { Addr[16:10] ^ Addr[22:16], Addr[9:4] }
>>>> Or:
>>>> { Addr[16:8] ^ Addr[24:16], Addr[7:4] }
>>> OR:
>>> {Addr[8:16] ^ Addr[24:16], Addr[7:4]}
>>> OR:
>>> {Addr[16:8] ^ Addr[16:24], Addr[7:4]}
>> The tools I am using lose their crap if one reverses the numbers here...
>>
>> The general notion seems to be that "addr[msb:lsb]" is the only valid
>> way to write bit-range selection.
> <
> At the language level:: "its only wires" and the tool loses its *&^% !!
>

I am not sure of the specifics as to why...
But, I do seem to see a lot of arbitrary limitations here.

Also, Verilog's preprocessor kinda sucks in that the tools can't seem to
entirely agree on how to resolve "`include" directives, "$readmemh"
behavior, how exactly macro expansion behaves, ...

Then I write some stuff, Vivado warns about it in Verilog 97 mode, and
Quartus refuses to accept it unless it is told to use System Verilog.

But, I ended up doing it that way because the alternatives kinda suck
(and if one tries to use a macro expansion in a "case" in Verilator, its
parser freaks out).

Well, and also if one writes:
8'h40, 8'h60: ...
Or similar in a "case", Verilator's parser also freaks out, ...
But, this does work if one uses constants declared via "parameter".
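
For example, something along these lines works (the opcode values here
are made up, purely to show the workaround):

module case_demo(
    input  [7:0] op,
    output reg   isBranch
);
    // Named constants via "parameter"; comma-separated parameter labels
    // in a case go over better than raw literals did.
    parameter [7:0] OP_BRA = 8'h40;
    parameter [7:0] OP_BSR = 8'h60;

    always @*
        case(op)
            OP_BRA, OP_BSR: isBranch = 1'b1;
            default:        isBranch = 1'b0;
        endcase
endmodule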

Well, also if one passes a variable from a combinatorial block into a
module parameter, and the module works via combinatorial logic, and then
tries to use the output of said module via another combinatorial
block... Verilator often freaks out thinking one has written something
based on circular logic.

Though, this issue can be partly sidestepped by using "assign" rather
than "always @*" blocks in these cases...

Well, and then also in Vivado, one can't drive outputs via "always"
blocks, but instead needs to do something like:

output[15:0] outVal;
reg[15:0] tOutVal;
assign outVal = tOutVal;
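
As a complete (if trivial) module, the pattern comes out something like
this (names made up):

module out_demo(
    input  [15:0] inVal,
    output [15:0] outVal
);
    reg [15:0] tOutVal;
    assign outVal = tOutVal;       // output driven via continuous assign

    always @*
        tOutVal = inVal + 16'd1;   // the actual logic lives here
endmodule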

....

> Well you could do it like this::
>
> {Addr[16] ^ Addr[24], Addr[15] ^ Addr[17], Addr[14]^Addr[23]. .....
>
> Do the bit rearrangement by hand--this works only as long as you only use it in one
> place or embed it in a instance.

Manual bit rearrangement is the usual practice...

>>
>>
>> Verilator seems to first give a message complaining about the bit range
>> being reversed, then gives a message about 4294967287 bits being greater
>> than 9, and then promptly crashes with a segmentation fault message...
>>
>> Vivado gives an error message about bit-reversed selection being invalid.
>>
>> Some information online claims that Quartus supports reverse bit-range
>> selection in this case (and doing so will reverse the bits).
>>
>> The usual strategy is to write something like: "{ addr[0], addr[1],
>> addr[2], ... }", but this kinda sucks...
> <
> What sucks worse is a bad hash !!!

Granted...

I had been trying various options.

Re: Misc: First testing with new bus (BJX2 core)

<s68arg$ben$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16397&group=comp.arch#16397

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc: First testing with new bus (BJX2 core)
Date: Mon, 26 Apr 2021 23:26:56 -0700
Organization: A noiseless patient Spider
Lines: 39
Message-ID: <s68arg$ben$1@dont-email.me>
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me>
<9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me>
<33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
<s689rr$ap$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 27 Apr 2021 06:26:56 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d57cd5da5ece443bf2b5b2ea7cecf527";
logging-data="11735"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18xewMU1HK+LkA0f2pRT55CnLQPu0h0lXI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:2fBbU2cGz46dy1dz3ukcOoenems=
In-Reply-To: <s689rr$ap$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Tue, 27 Apr 2021 06:26 UTC

On 4/26/2021 11:10 PM, BGB wrote:

snip

> At the moment:
> Memcpy (DRAM): ~ 8.4 MB/s
> Memset (DRAM): ~ 14.7 MB/s
> Memcpy (L2): ~ 12.6 MB/s
> Memset (L2): ~ 49.1 MB/s
> Memcpy (L1): ~ 250 MB/s
> Memset (L1): ~ 275 MB/s
>
>
> L2 speeds "could" be a bit higher, but it seems that the L2 is having a
> fairly high miss rate during the "L2" test.
>
> The test logic in the partial-simulation test-bench also implies that
> the high miss rate is due to the design of the L2 cache, rather than due
> to a bug in the L2's Verilog.
>
>
> L2 hit rate seems to be in the area of ~ 45% to 60%.
>
>
> It seems there is a fairly frequent pattern of endlessly evicting one
> cache line to load another, then evicting that cache line to load the
> first line again, ...
>
> May need to come up with some sort of workaround, in any case.

That is exactly the scenario that a multi-way associative cache was
designed to fix.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Misc: First testing with new bus (BJX2 core)

<s6991c$656$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16402&group=comp.arch#16402

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: First testing with new bus (BJX2 core)
Date: Tue, 27 Apr 2021 10:02:03 -0500
Organization: A noiseless patient Spider
Lines: 47
Message-ID: <s6991c$656$1@dont-email.me>
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me>
<9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me>
<33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
<s689rr$ap$1@dont-email.me> <s68arg$ben$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 27 Apr 2021 15:02:04 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="7898f01f42fbcab32c39dc3619a01ae8";
logging-data="6310"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+kXyy3sPh10Vmw1nJqw5j4"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:XEfIJBj2Xsw0xbL5PAbOCtV3e+A=
In-Reply-To: <s68arg$ben$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 27 Apr 2021 15:02 UTC

On 4/27/2021 1:26 AM, Stephen Fuld wrote:
> On 4/26/2021 11:10 PM, BGB wrote:
>
> snip
>
>
>
>> At the moment:
>> Memcpy (DRAM): ~ 8.4 MB/s
>> Memset (DRAM): ~ 14.7 MB/s
>> Memcpy (L2): ~ 12.6 MB/s
>> Memset (L2): ~ 49.1 MB/s
>> Memcpy (L1): ~ 250 MB/s
>> Memset (L1): ~ 275 MB/s
>>
>>
>> L2 speeds "could" be a bit higher, but it seems that the L2 is having
>> a fairly high miss rate during the "L2" test.
>>
>> The test logic in the partial-simulation test-bench also implies that
>> the high miss rate is due to the design of the L2 cache, rather than
>> due to a bug in the L2's Verilog.
>>
>>
>> L2 hit rate seems to be in the area of ~ 45% to 60%.
>>
>>
>> It seems there is a fairly frequent pattern of endlessly evicting one
>> cache line to load another, then evicting that cache line to load the
>> first line again, ...
>>
>> May need to come up with some sort of workaround, in any case.
>
> That is exactly the scenario that a multi-way associative cache was
> designed to fix.
>
>

Yes, and a set-associative L2 cache is one likely way to address this
issue... The main alternative would be finding some other "cheaper" option.

Annoyingly, a test buffer big enough to not get a boost from the L1 is
also big enough to have a fairly high L2 miss rate with a 1-way cache
design...

So, will probably look into it.

Re: Misc: First testing with new bus (BJX2 core)

<31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=16403&group=comp.arch#16403

X-Received: by 2002:a05:620a:52d:: with SMTP id h13mr24043242qkh.472.1619538804863;
Tue, 27 Apr 2021 08:53:24 -0700 (PDT)
X-Received: by 2002:a05:6830:2084:: with SMTP id y4mr11039129otq.114.1619538804605;
Tue, 27 Apr 2021 08:53:24 -0700 (PDT)
Path: i2pn2.org!i2pn.org!aioe.org!news.uzoreto.com!news.muarf.org!nntpfeed.proxad.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 27 Apr 2021 08:53:24 -0700 (PDT)
In-Reply-To: <s689rr$ap$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:cd05:ef78:492e:ac0f;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:cd05:ef78:492e:ac0f
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me> <9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me> <33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
<s689rr$ap$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>
Subject: Re: Misc: First testing with new bus (BJX2 core)
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 27 Apr 2021 15:53:24 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Tue, 27 Apr 2021 15:53 UTC

On Tuesday, April 27, 2021 at 1:10:05 AM UTC-5, BGB wrote:
> On 4/24/2021 10:26 PM, MitchAlsup wrote:
> > On Saturday, April 24, 2021 at 8:08:15 PM UTC-5, BGB wrote:
> >> On 4/24/2021 4:31 PM, MitchAlsup wrote:
> >>> On Saturday, April 24, 2021 at 2:58:08 PM UTC-5, BGB wrote:
> >>>> ( Got distracted... )
> >>>> On 4/10/2021 2:20 PM, EricP wrote:
> >>>>> BGB wrote:
> >>>>>> So, as probably people are already aware, I have been working for a
> >>>>>> little while on implementing a new bus (on/off, for several months
> >>>>>> thus far), which has ended up requiring a near complete rewrite of the
> >>>>>> memory subsystem...
> >>>>>>
> ...
> >>>>
> >>>> The L2 is currently not quite as fast as expected.
> >>>>
> >>>> Stats I am currently seeing look like:
> >>>> Memcpy (L2): ~ 25 MB/s
> >>>> Memset (L2): ~ 70 MB/s
> >>>>
> >>>> Memcpy (DRAM): ~ 9 MB/s
> >>>> Memset (DRAM): ~ 11 MB/s
> >>>>
> >>>> For reference (L1 local):
> >>>> Memcpy (L1): ~ 250 MB/s
> >>>> Memset (L1): ~ 277 MB/s
> >>>>
> >>>> With both the CPU core and ring-bus running at 50 MHz (with DDR2 at
> >>>> 75MHz; DLL Disabled).
> >>> <
> >>> Yeah, seems somewhat slow, should be closer to L1/4 (60 MB/s)
> >>>
> >>> Remind me of the associativity of the L1 and L2 ??
> >>> <
> >> L1 and L2 are both 1-way / direct-mapped in this case.
> >>
> >>
> >> Though, traditional descriptions imply that direct-mapped is strictly
> >> Mod-N, and set-associative is basically Mod-N but with 2 or 4 possible
> >> cache lines per slot.
> >>
> >>
> >> I am using a mapping which basically does an XOR hash of the address
> >> bits to generate an index into the cache, which then operates on a
> >> similar premise to a hash table.
> > <
> > You should look up André Seznec's paper on skewed associative caches.
>
> I had considered it, though as I understand it, a skewed-associative
> hash would require a multi-cycle state-machine.
>
The Skewed Associative Cache uses a different hash function on each of the sets
in the cache.

In a normal set-associative Cache, one uses the same address to each set.

The extra path length is no more than a single XOR gate of delay--but you do not
store multiple tags under a single index like you can in a normal SA cache.

But as you are using DM caches, it is on a different path.
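
I.e., each way gets its own index function, something along these lines
(widths and hash choices made up purely for illustration):

module skewed_index(
    input  [27:0] a,           // line-granular address bits
    output [12:0] idxWay0,     // index into way 0's array
    output [12:0] idxWay1      // index into way 1's array
);
    // A different XOR hash per way: two lines that collide in way 0
    // will usually land at different indices in way 1.
    assign idxWay0 = a[12:0] ^ a[24:12];
    assign idxWay1 = a[12:0] ^ a[25:13];
endmodule
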
>
>
>
>
> Sadly, finding and fixing a bug that was resulting in memory corruption
> also came at the cost of a pretty big hit to performance...
>
>
> The cache would issue multiple requests at a time, but wouldn't
> necessarily check that the responses came back for memory stores before
> continuing (it would get the response for a Load, then assume everything
> was good).
>
> This worked in the partial simulation test-bench, which always returned
> results in-order, but was failing in some cases if the store took
> longer. There were cases where the program would start doing stuff and
> issuing more requests before the first store request was handled,
> resulting in loads occasionally returning stale data.
>
>
> As a result, went and added logic to the partial simulation to mimic the
> behavior of the L2 cache, which also recreated the bugs I was seeing in
> the full simulation.
>
>
> I added some logic to the L1 cache make it wait for *both* the Load and
> Store responses to get back before continuing, and tightening up
> behavior in a few other related areas, which (sadly) seems to have put a
> dent on performance.
>
>
> At the moment:
> Memcpy (DRAM): ~ 8.4 MB/s
> Memset (DRAM): ~ 14.7 MB/s
> Memcpy (L2): ~ 12.6 MB/s
> Memset (L2): ~ 49.1 MB/s
> Memcpy (L1): ~ 250 MB/s
> Memset (L1): ~ 275 MB/s
>
>
> L2 speeds "could" be a bit higher, but it seems that the L2 is having a
> fairly high miss rate during the "L2" test.
>
> The test logic in the partial-simulation test-bench also implies that
> the high miss rate is due to the design of the L2 cache, rather than due
> to a bug in the L2's Verilog.
>
>
> L2 hit rate seems to be in the area of ~ 45% to 60%.
>
For an L2 4×-to-8× the size of L1 those local miss rates are "about right".
>
>
> It seems there is a fairly frequent pattern of endlessly evicting one
> cache line to load another, then evicting that cache line to load the
> first line again, ...
>
There is this thing called a victim buffer.
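
E.g., a single-entry one, roughly (a sketch only; the names are made up,
and write-back of a dirty victim is not shown):

module victim_buffer(
    input          clk,
    input          doEvict,     // a line is being evicted from the array
    input  [27:0]  evictTag,
    input  [127:0] evictData,
    input  [27:0]  reqTag,      // tag of the current lookup
    output         vbHit,       // the just-evicted line is still here
    output [127:0] vbData
);
    reg          vbValid = 0;
    reg [27:0]   vbTagR;
    reg [127:0]  vbDataR;

    assign vbHit  = vbValid && (vbTagR == reqTag);
    assign vbData = vbDataR;

    always @(posedge clk)
        if (doEvict) begin
            vbValid <= 1'b1;
            vbTagR  <= evictTag;
            vbDataR <= evictData;
        end
endmodule
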
>
> May need to come up with some sort of workaround, in any case.
> >>
> >> In any case, results for the hashed indexing are still better than with
> >> Mod-N direct-mapped caches.
> > <
> > To be expected
> I have since gone and modeled various hash functions directly, results I
> am seeing (for a 13-bit index, based on the boot sequence):
> (Addr[12:0] ): ~ 46% (Mod-N)
> (Addr[12:0]^Addr[24:12] ): ~ 58% (A)
> (Addr[12:0]^Addr[25:13] ): ~ 54% (B)
> (Addr[12:0]^Addr[24:12]^Addr[27:15]): ~ 56% (C)
> (Addr[12:0]^Addr[25:13]^Addr[19: 7]): ~ 59% (D)
> (Addr[12:1]^Addr[24:13]^Addr[27:16], Addr[0]): ~ 55% (E)
>
> Where Addr is 28 bits, selected from memAddr[31:4].
>
>
> Or, basically, the hashed addresses somewhat beat out the use of a naive
> modular mapping.
>
>
> Some of the other schemes tried (such as transposing the high-order
> bits, or passing the low-order bits unmodified), were in-general "not
> particularly effective" according to this test.
>
>
> That said, leaving the low-order bits intact does significantly reduce
> the frequency of "pogo misses".
>
> So, in this case, the option which does best according to hit rate (D),
> also happens to have a drawback of a higher pogo-miss rate than some of
> the other options.
>
> Likewise, C vs E differ mostly in that E has a lot fewer pogo misses.
>
> Option A seems to do reasonably well and also has a relatively low
> pogo-miss rate relative to the others.
>
> Mod-N has the lowest hit rate but also apparently the lowest pogo-miss
> rate (in that they don't seem to occur with this function in these tests)..
>
>
>
> ...
> >>>>
> >>>>
> >>>> Another "promising" type of pattern appears to be to do Mod-N for the
> >>>> lower address bits, but then XOR'ing the higher-order bits, say:
> >>>> { Addr[16:10] ^ Addr[22:16], Addr[9:4] }
> >>>> Or:
> >>>> { Addr[16:8] ^ Addr[24:16], Addr[7:4] }
> >>> OR:
> >>> {Addr[8:16] ^ Addr[24:16], Addr[7:4]}
> >>> OR:
> >>> {Addr[16:8] ^ Addr[16:24], Addr[7:4]}
> >> The tools I am using lose their crap if one reverses the numbers here...
> >>
> >> The general notion seems to be that "addr[msb:lsb]" is the only valid
> >> way to write bit-range selection.
> > <
> > At the language level:: "its only wires" and the tool loses its *&^% !!
> >
> I am not sure the specifics as to why...
> But, I do seem to see a lot of arbitrary limitations here.
>
> Also Verilog's preprocessor kinda sucks in that the tools can't seem to
> entirely agree on how to resolve "`include" directives, "readmemh"
> behavior, how exactly macro expansion behaves, ..
>
>
> Then I write some stuff, Vivado warns about it in Verilog 97 mode, and
> Quartus refuses to accept it unless it is told to use System Verilog.
>
>
> But, I ended up doing it that way because the alternatives kinda suck
> (and if one tries to use a macro expansion in a "case" in Verilator, its
> parser freaks out).
>
> Well, and also if one writes:
> 8'h40, 8'h60: ...
> Or similar in a "case", Verilator's parser also freaks out, ...
> But, this does work if one uses constants declared via "parameter".
>
>
>
> Well, also if one passes a variable from a combinatorial block into a
> module parameter, and the module works via combinatorial logic, and then
> tries to use the output of said module via another combinatorial
> block... Verilator often freaks out thinking one has written something
> based on circular logic.
>
I have seen this effect, too.


Re: Misc: First testing with new bus (BJX2 core)

<s6al6a$9an$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16407&group=comp.arch#16407

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: First testing with new bus (BJX2 core)
Date: Tue, 27 Apr 2021 22:35:35 -0500
Organization: A noiseless patient Spider
Lines: 355
Message-ID: <s6al6a$9an$1@dont-email.me>
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me>
<9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me>
<33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
<s689rr$ap$1@dont-email.me>
<31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Apr 2021 03:35:38 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="04523a298814a4068dbc81969c4f824d";
logging-data="9559"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+cssfk+bhLDscStqQZx5Q1"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:ofHZEwx9LWocr9R3TJrkkXE5Xck=
In-Reply-To: <31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>
Content-Language: en-US
 by: BGB - Wed, 28 Apr 2021 03:35 UTC

On 4/27/2021 10:53 AM, MitchAlsup wrote:
> On Tuesday, April 27, 2021 at 1:10:05 AM UTC-5, BGB wrote:
>> On 4/24/2021 10:26 PM, MitchAlsup wrote:
>>> On Saturday, April 24, 2021 at 8:08:15 PM UTC-5, BGB wrote:
>>>> On 4/24/2021 4:31 PM, MitchAlsup wrote:
>>>>> On Saturday, April 24, 2021 at 2:58:08 PM UTC-5, BGB wrote:
>>>>>> ( Got distracted... )
>>>>>> On 4/10/2021 2:20 PM, EricP wrote:
>>>>>>> BGB wrote:
>>>>>>>> So, as probably people are already aware, I have been working for a
>>>>>>>> little while on implementing a new bus (on/off, for several months
>>>>>>>> thus far), which has ended up requiring a near complete rewrite of the
>>>>>>>> memory subsystem...
>>>>>>>>
>> ...
>>>>>>
>>>>>> The L2 is currently not quite as fast as expected.
>>>>>>
>>>>>> Stats I am currently seeing look like:
>>>>>> Memcpy (L2): ~ 25 MB/s
>>>>>> Memset (L2): ~ 70 MB/s
>>>>>>
>>>>>> Memcpy (DRAM): ~ 9 MB/s
>>>>>> Memset (DRAM): ~ 11 MB/s
>>>>>>
>>>>>> For reference (L1 local):
>>>>>> Memcpy (L1): ~ 250 MB/s
>>>>>> Memset (L1): ~ 277 MB/s
>>>>>>
>>>>>> With both the CPU core and ring-bus running at 50 MHz (with DDR2 at
>>>>>> 75MHz; DLL Disabled).
>>>>> <
>>>>> Yeah, seems somewhat slow, should be closer to L1/4 (60 MB/s)
>>>>>
>>>>> Remind me of the associativity of the L1 and L2 ??
>>>>> <
>>>> L1 and L2 are both 1-way / direct-mapped in this case.
>>>>
>>>>
>>>> Though, traditional descriptions imply that direct-mapped is strictly
>>>> Mod-N, and set-associative is basically Mod-N but with 2 or 4 possible
>>>> cache lines per slot.
>>>>
>>>>
>>>> I am using a mapping which basically does an XOR hash of the address
>>>> bits to generate an index into the cache, which then operates on a
>>>> similar premise to a hash table.
>>> <
>>> You should look up André Seznec's paper on skewed associative caches.
>>
>> I had considered it, though as I understand it, a skewed-associative
>> hash would require a multi-cycle state-machine.
>>
> The Skewed Associative Cache uses a different hash function on each of the sets
> in the cache.
>
> In a normal set-associative Cache, one uses the same address to each set.
>
> The extra path length is no more than a single XOR gate of delay--but you do not
> store multiple tags under a single index like you can in a normal SA cache.
>
> But as you are using DM caches, it is on a different path.

I meant in the sense that a skew associative cache, using a single
cache-line array (as in a DM cache), would likely require several cycles
to probe the array, e.g.:
  Stage 1: Probe Spot A
  Stage 2: Probe Spot B
  Stage 3: Try Stomp A (for Store)
  Stage 4: Initiate Miss Handler

Though, I guess it is possible that this state could be encoded in the
request message (rather than using a dedicated state machine).
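
As a dedicated state machine, it would be something like this (one
possible reading of the stages above; purely illustrative, not actual
BJX2 code):

module skew_probe_fsm(
    input      clk,
    input      startProbe,
    input      hitA,          // tag match at the hash-A slot
    input      hitB,          // tag match at the hash-B slot
    input      isStore,
    output reg startMiss
);
    parameter [2:0] ST_IDLE    = 3'd0;
    parameter [2:0] ST_PROBE_A = 3'd1;
    parameter [2:0] ST_PROBE_B = 3'd2;
    parameter [2:0] ST_STOMP_A = 3'd3;
    parameter [2:0] ST_MISS    = 3'd4;

    reg [2:0] st = ST_IDLE;

    always @(posedge clk)
    begin
        startMiss <= 0;
        case(st)
            ST_IDLE:    if (startProbe) st <= ST_PROBE_A;
            ST_PROBE_A: st <= hitA ? ST_IDLE : ST_PROBE_B;
            ST_PROBE_B: st <= hitB ? ST_IDLE :
                              (isStore ? ST_STOMP_A : ST_MISS);
            ST_STOMP_A: st <= ST_IDLE;               // overwrite slot A
            ST_MISS:    begin startMiss <= 1; st <= ST_IDLE; end
            default:    st <= ST_IDLE;
        endcase
    end
endmodule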

It looks like one otherwise needs to first implement a 2-way
set-associative cache to have the needed mechanisms for the
skew-associative cache.

But, could be useful if it shows an improvement over a more conventional
2-way cache.

>>
>>
>>
>>
>> Sadly, finding and fixing a bug that was resulting in memory corruption
>> also came at the cost of a pretty big hit to performance...
>>
>>
>> The cache would issue multiple requests at a time, but wouldn't
>> necessarily check that the responses came back for memory stores before
>> continuing (it would get the response for a Load, then assume everything
>> was good).
>>
>> This worked in the partial simulation test-bench, which always returned
>> results in-order, but was failing in some cases if the store took
>> longer. There were cases where the program would start doing stuff and
>> issuing more requests before the first store request was handled,
>> resulting in loads occasionally returning stale data.
>>
>>
>> As a result, went and added logic to the partial simulation to mimic the
>> behavior of the L2 cache, which also recreated the bugs I was seeing in
>> the full simulation.
>>
>>
>> I added some logic to the L1 cache make it wait for *both* the Load and
>> Store responses to get back before continuing, and tightening up
>> behavior in a few other related areas, which (sadly) seems to have put a
>> dent on performance.
>>
>>
>> At the moment:
>> Memcpy (DRAM): ~ 8.4 MB/s
>> Memset (DRAM): ~ 14.7 MB/s
>> Memcpy (L2): ~ 12.6 MB/s
>> Memset (L2): ~ 49.1 MB/s
>> Memcpy (L1): ~ 250 MB/s
>> Memset (L1): ~ 275 MB/s
>>
>>
>> L2 speeds "could" be a bit higher, but it seems that the L2 is having a
>> fairly high miss rate during the "L2" test.
>>
>> The test logic in the partial-simulation test-bench also implies that
>> the high miss rate is due to the design of the L2 cache, rather than due
>> to a bug in the L2's Verilog.
>>
>>
>> L2 hit rate seems to be in the area of ~ 45% to 60%.
>>
> For an L2 4×-to-8× the size of L1 those local miss rates are "about right".

Yeah.

I did go and model a 2-way associative cache (in the partial-sim
test-bench), and this was able to boost the L2 stats back up a
reasonable amount (up to ~ 30 MB/s memcpy and 70 MB/s memset).

DRAM case showed relatively little impact.

Attempting to implement 2-way support on the existing L2 cache was
much less effective. Though, I suspect this was more due to bugs and it
being "barely functional".

Technically, it was using a strategy similar to what I had previously
tried with the L1 caches though (1.5-way?), namely dividing the cache
lines up into A and B sets:
  Set A: May contain cache lines in either a clean or dirty state;
  Set B: May only contain cache lines in a "clean" state.

Load may fetch from either A or B.
  Miss, A is Not Dirty:
    Load fetched line into A
  Miss, A is Dirty:
    Load fetched line into B

Store will only store to A, but behave differently based on state:
  A is Dirty: Hit or Cache miss as in 1-way.
    Write A back to DRAM;
    Load "null line" into A
  A is Not Dirty:
    Hit: Replace A
    Miss: Copy A to B, Store in A

There is no real need to preserve the state of Set B.

Miss, Either A or B is set to Flush:
  Write A back to DRAM if Dirty;
  Load into A;
  Stomp whatever is in B.
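
The miss-side decision from the above, written out as combinatorial
logic, would be roughly (a sketch only; the array access, write-back,
and "null line" handling are omitted, and the names are made up):

module ab_policy(
    input      hitAB,        // the request hit in either A or B
    input      isStore,
    input      dirtyA,       // the line currently in slot A is dirty
    output reg loadIntoA,    // place the incoming line in A
    output reg loadIntoB,    // place the incoming line in B
    output reg copyAtoB,     // preserve A's clean line in B first
    output reg writeBackA    // A must be written back to DRAM
);
    always @*
    begin
        loadIntoA = 0; loadIntoB = 0; copyAtoB = 0; writeBackA = 0;

        if (!hitAB) begin
            if (isStore) begin
                if (dirtyA) begin
                    writeBackA = 1;         // as in the 1-way case
                    loadIntoA  = 1;
                end else begin
                    copyAtoB   = 1;         // keep the clean line around
                    loadIntoA  = 1;
                end
            end else begin
                if (dirtyA) loadIntoB = 1;  // don't evict the dirty line
                else        loadIntoA = 1;
            end
        end
    end
endmodule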

My initial attempt isn't really seeing much speedup over the normal
1-way cache, and is giving very different results from mocking up
similar logic in the test-bench. I suspect it isn't working correctly as
of yet.

Though, even if it does work as expected, it does require making the
cache half as long, and this approach in effect makes "memcpy" style
patterns a little faster at the expense of making "memset" patterns
slower (but it can work out reasonably well if the working set for loads
is larger than the working set for stores).

Granted, a cache where both A and B may contain dirty cache-lines would
be better, but this is more complicated.

Granted, still need to work on it some more...

>>
>>
>> It seems there is a fairly frequent pattern of endlessly evicting one
>> cache line to load another, then evicting that cache line to load the
>> first line again, ...
>>
> There is this thing called a victim buffer.

It is possible.

>>
>> May need to come up with some sort of workaround, in any case.
>>>>
>>>> In any case, results for the hashed indexing are still better than with
>>>> Mod-N direct-mapped caches.
>>> <
>>> To be expected
>> I have since gone and modeled various hash functions directly, results I
>> am seeing (for a 13-bit index, based on the boot sequence):
>> (Addr[12:0] ): ~ 46% (Mod-N)
>> (Addr[12:0]^Addr[24:12] ): ~ 58% (A)
>> (Addr[12:0]^Addr[25:13] ): ~ 54% (B)
>> (Addr[12:0]^Addr[24:12]^Addr[27:15]): ~ 56% (C)
>> (Addr[12:0]^Addr[25:13]^Addr[19: 7]): ~ 59% (D)
>> (Addr[12:1]^Addr[24:13]^Addr[27:16], Addr[0]): ~ 55% (E)
>>
>> Where Addr is 28 bits, selected from memAddr[31:4].
>>
>>
>> Or, basically, the hashed addresses somewhat beat out the use of a naive
>> modular mapping.
>>
>>
>> Some of the other schemes tried (such as transposing the high-order
>> bits, or passing the low-order bits unmodified), were in-general "not
>> particularly effective" according to this test.
>>
>>
>> That said, leaving the low-order bits intact does significantly reduce
>> the frequency of "pogo misses".
>>
>> So, in this case, the option which does best according to hit rate (D),
>> also happens to have a drawback of a higher pogo-miss rate than some of
>> the other options.
>>
>> Likewise, C vs E differ mostly in that E has a lot fewer pogo misses.
>>
>> Option A seems to do reasonably well and also has a relatively low
>> pogo-miss rate relative to the others.
>>
>> Mod-N has the lowest hit rate but also apparently the lowest pogo-miss
>> rate (in that they don't seem to occur with this function in these tests).
>>
>>
>>
>> ...
>>>>>>
>>>>>>
>>>>>> Another "promising" type of pattern appears to be to do Mod-N for the
>>>>>> lower address bits, but then XOR'ing the higher-order bits, say:
>>>>>> { Addr[16:10] ^ Addr[22:16], Addr[9:4] }
>>>>>> Or:
>>>>>> { Addr[16:8] ^ Addr[24:16], Addr[7:4] }
>>>>> OR:
>>>>> {Addr[8:16] ^ Addr[24:16], Addr[7:4]}
>>>>> OR:
>>>>> {Addr[16:8] ^ Addr[16:24], Addr[7:4]}
>>>> The tools I am using lose their crap if one reverses the numbers here...
>>>>
>>>> The general notion seems to be that "addr[msb:lsb]" is the only valid
>>>> way to write bit-range selection.
>>> <
>>> At the language level:: "its only wires" and the tool loses its *&^% !!
>>>
>> I am not sure the specifics as to why...
>> But, I do seem to see a lot of arbitrary limitations here.
>>
>> Also Verilog's preprocessor kinda sucks in that the tools can't seem to
>> entirely agree on how to resolve "`include" directives, "readmemh"
>> behavior, how exactly macro expansion behaves, ..
>>
>>
>> Then I write some stuff, Vivado warns about it in Verilog 97 mode, and
>> Quartus refuses to accept it unless it is told to use System Verilog.
>>
>>
>> But, I ended up doing it that way because the alternatives kinda suck
>> (and if one tries to use a macro expansion in a "case" in Verilator, its
>> parser freaks out).
>>
>> Well, and also if one writes:
>> 8'h40, 8'h60: ...
>> Or similar in a "case", Verilator's parser also freaks out, ...
>> But, this does work if one uses constants declared via "parameter".
>>
>>
>>
>> Well, also if one passes a variable from a combinatorial block into a
>> module parameter, and the module works via combinatorial logic, and then
>> tries to use the output of said module via another combinatorial
>> block... Verilator often freaks out thinking one has written something
>> based on circular logic.
>>
> I have seen this effect, too.
>
> Not knowing yuo aren't creating a circular path (like a flip-flop) it
> guesses I can't see far enough and warns you anyway. Most of the time
> we build blocks big enough that they were flopped on the inputs and
> outputs.


Re: Misc: First testing with new bus (BJX2 core)

<s6bvd2$268$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=16408&group=comp.arch#16408

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: First testing with new bus (BJX2 core)
Date: Wed, 28 Apr 2021 10:35:59 -0500
Organization: A noiseless patient Spider
Lines: 301
Message-ID: <s6bvd2$268$1@dont-email.me>
References: <s4o58t$u1o$1@dont-email.me> <zUmcI.1892$Hx3.1695@fx11.iad>
<s61t8e$ard$1@dont-email.me>
<9960068b-9e67-47c7-a2ce-e6e4c2cae6ben@googlegroups.com>
<s62fdt$sn1$1@dont-email.me>
<33d3618b-d88c-4eaa-94c4-94b6f2e2e823n@googlegroups.com>
<s689rr$ap$1@dont-email.me>
<31d4b522-25ae-4d67-a76a-0d5bba7ab951n@googlegroups.com>
<s6al6a$9an$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 28 Apr 2021 15:36:03 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="04523a298814a4068dbc81969c4f824d";
logging-data="2248"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/AI2McmIHOTYnVXuMs3qUO"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.10.0
Cancel-Lock: sha1:BNfKD+ZZEZ6V42jKaXjP+PQd/H8=
In-Reply-To: <s6al6a$9an$1@dont-email.me>
Content-Language: en-US
 by: BGB - Wed, 28 Apr 2021 15:35 UTC

On 4/27/2021 10:35 PM, BGB wrote:
> On 4/27/2021 10:53 AM, MitchAlsup wrote:
>> On Tuesday, April 27, 2021 at 1:10:05 AM UTC-5, BGB wrote:
>>> On 4/24/2021 10:26 PM, MitchAlsup wrote:
>>>> On Saturday, April 24, 2021 at 8:08:15 PM UTC-5, BGB wrote:
>>>>> On 4/24/2021 4:31 PM, MitchAlsup wrote:
>>>>>> On Saturday, April 24, 2021 at 2:58:08 PM UTC-5, BGB wrote:
>>>>>>> ( Got distracted... )
>>>>>>> On 4/10/2021 2:20 PM, EricP wrote:
>>>>>>>> BGB wrote:
>>>>>>>>> So, as probably people are already aware, I have been working
>>>>>>>>> for a
>>>>>>>>> little while on implementing a new bus (on/off, for several months
>>>>>>>>> thus far), which has ended up requiring a near complete rewrite
>>>>>>>>> of the
>>>>>>>>> memory subsystem...
>>>>>>>>>
>>> ...
>>>>>>>
>>>>>>> The L2 is currently not quite as fast as expected.
>>>>>>>
>>>>>>> Stats I am currently seeing look like:
>>>>>>> Memcpy (L2): ~ 25 MB/s
>>>>>>> Memset (L2): ~ 70 MB/s
>>>>>>>
>>>>>>> Memcpy (DRAM): ~ 9 MB/s
>>>>>>> Memset (DRAM): ~ 11 MB/s
>>>>>>>
>>>>>>> For reference (L1 local):
>>>>>>> Memcpy (L1): ~ 250 MB/s
>>>>>>> Memset (L1): ~ 277 MB/s
>>>>>>>
>>>>>>> With both the CPU core and ring-bus running at 50 MHz (with DDR2 at
>>>>>>> 75MHz; DLL Disabled).
>>>>>> <
>>>>>> Yeah, seems somewhat slow, should be closer to L1/4 (60 MB/s)
>>>>>>
>>>>>> Remind me of the associativity of the L1 and L2 ??
>>>>>> <
>>>>> L1 and L2 are both 1-way / direct-mapped in this case.
>>>>>
>>>>>
>>>>> Though, traditional descriptions imply that direct-mapped is strictly
>>>>> Mod-N, and set-associative is basically Mod-N but with 2 or 4 possible
>>>>> cache lines per slot.
>>>>>
>>>>>
>>>>> I am using a mapping which basically does an XOR hash of the address
>>>>> bits to generate an index into the cache, which then operates on a
>>>>> similar premise to a hash table.
>>>> <
>>>> You should look up André Seznec's paper on skewed associative caches.
>>>
>>> I had considered it, though as I understand it, a skewed-associative
>>> hash would require a multi-cycle state-machine.
>>>
>> The Skewed Associative Cache uses a different hash function on each of
>> the sets
>> in the cache.
>>
>> In a normal set-associative Cache, one uses the same address to each set.
>>
>> The extra path length is no more than a single XOR gate of delay--but
>> you do not
>> store multiple tags under a single index like you can in a normal SA
>> cache.
>>
>> But as you are using DM caches, it is on a different path.
>
>
> I meant in the sense that a skew associative cache, using a single
> cache-line array (as in a DM cache), would likely require several cycles
> to probe the array, eg:
>  Stage 1: Probe Spot A
>  Stage 2: Probe Spot B
>  Stage 3: Try Stomp A (for Store)
>  Stage 4: Initiate Miss Handler
>
> Though, I guess it is possible that this state could be encoded in the
> request message (rather than using a dedicated state machine).
>
>
> It looks like one otherwise needs to first implement a 2-way
> set-associative cache to have the needed mechanisms for the
> skew-associative cache.
>
> But, could be useful if it shows an improvement over a more conventional
> 2-way cache.
>

After implementing it, in general, the 2-way L2 cache seems to be an
improvement over the direct-mapped cache (back up to ~ 30 MB/s for L2
memcpy, after the big hit taken due to "fixing" an L1 bug).

I have observed, though, that its "effectiveness" depends a lot on the
effective size of the working set.

This could be why some tests (with the old L1 caches) did better with
2-way, and others with 1-way.

If the working set is significantly larger than the cache, then
direct-mapping has an advantage, but if the effective working set is
comparable-to or smaller than the cache, then associative caching does
better.

So, in cases where the test is mostly linear sweeping over several MB of
RAM, the direct-mapped L2 does better, but in most other cases, the
associative L2 does better, and for "generic code" the hit rate is
significantly better.

Meanwhile, with a 16K L1 cache, it was a bit more hit-and-miss (even
with the old bus), and the direct-mapped L1s have the primary advantage
of being cheaper.

It appears that the associative L2 also helps slightly with the L1
tests.

>
>>>
>>>
>>>
>>>
>>> Sadly, finding and fixing a bug that was resulting in memory corruption
>>> also came at the cost of a pretty big hit to performance...
>>>
>>>
>>> The cache would issue multiple requests at a time, but wouldn't
>>> necessarily check that the responses came back for memory stores before
>>> continuing (it would get the response for a Load, then assume everything
>>> was good).
>>>
>>> This worked in the partial simulation test-bench, which always returned
>>> results in-order, but was failing in some cases if the store took
>>> longer. There were cases where the program would start doing stuff and
>>> issuing more requests before the first store request was handled,
>>> resulting in loads occasionally returning stale data.
>>>
>>>
>>> As a result, went and added logic to the partial simulation to mimic the
>>> behavior of the L2 cache, which also recreated the bugs I was seeing in
>>> the full simulation.
>>>
>>>
>>> I added some logic to the L1 cache make it wait for *both* the Load and
>>> Store responses to get back before continuing, and tightening up
>>> behavior in a few other related areas, which (sadly) seems to have put a
>>> dent on performance.
>>>
>>>
>>> At the moment:
>>> Memcpy (DRAM): ~ 8.4 MB/s
>>> Memset (DRAM): ~ 14.7 MB/s
>>> Memcpy (L2): ~ 12.6 MB/s
>>> Memset (L2): ~ 49.1 MB/s
>>> Memcpy (L1): ~ 250 MB/s
>>> Memset (L1): ~ 275 MB/s
>>>
>>>
>>> L2 speeds "could" be a bit higher, but it seems that the L2 is having a
>>> fairly high miss rate during the "L2" test.
>>>
>>> The test logic in the partial-simulation test-bench also implies that
>>> the high miss rate is due to the design of the L2 cache, rather than due
>>> to a bug in the L2's Verilog.
>>>
>>>
>>> L2 hit rate seems to be in the area of ~ 45% to 60%.
>>>
>> For an L2 4×-to-8× the size of L1 those local miss rates are "about
>> right".
>
> Yeah.
>
>
> I did go and model a 2-way associative cache (in the partial-sim
> test-bench), and this was able to boost up the L2 stats back up a
> reasonable amount (up to ~ 30 MB/s memcpy and 70 MB/s memset).
>
> DRAM case showed relatively little impact.
>
>
>
> Attempting to implement a 2-way support onto the existing L2 cache was
> much less effective. Though, I suspect this was more due to bugs and it
> being "barely functional".
>
> Technically, it was using a strategy similar to what I had previously
> tried with the L1 caches though (1.5-way?), namely dividing the cache
> lines up into A and B sets:
>  Set A: May contain cache lines in either a clean or dirty state;
>  Set B: May only contain cache lines in a "clean" state.
>
> Load may fetch from either A or B.
>  Miss, A is Not Dirty:
>    Load fetched line into A
>  Miss, A is Dirty:
>    Load fetched line into B
>

Change:
  Miss, A is Not Dirty:
    Load fetched line into A or B depending on a hashed bit.
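
Say, something like (the bit choice here is made up, just to illustrate
the idea):

module fill_select(
    input  [12:0] idx,      // the hashed cache index
    output        fillB     // 1: put the clean fill into B, else A
);
    assign fillB = idx[0] ^ idx[5];
endmodule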

> Store will only store to A, but behave differently based on state:
>  A is Dirty: Hit or Cache miss as in 1-way.
>    Write A back to DRAM;
>    Load "null line" into A
>  A is Not Dirty:
>    Hit: Replace A
>    Miss: Copy A to B, Store in A
>

