

devel / comp.arch / Re: Misc: Cache Sizes / Observations

Subject -- Author
* Misc: Cache Sizes / Observations  -- BGB
+* Re: Misc: Cache Sizes / Observations  -- MitchAlsup
|`- Re: Misc: Cache Sizes / Observations  -- BGB
`* Re: Misc: Cache Sizes / Observations  -- Stephen Fuld
 `* Re: Misc: Cache Sizes / Observations  -- BGB
  `* Re: Misc: Cache Sizes / Observations  -- BGB
   `* Re: Misc: Cache Sizes / Observations  -- Anton Ertl
    `* Re: Misc: Cache Sizes / Observations  -- BGB
     `* Re: Misc: Cache Sizes / Observations  -- Anton Ertl
      `* Re: Misc: Cache Sizes / Observations  -- BGB
       `* Re: Misc: Cache Sizes / Observations  -- Stephen Fuld
        +* Re: Misc: Cache Sizes / Observations  -- MitchAlsup
        |+* Re: Misc: Cache Sizes / Observations  -- BGB
        ||`* Re: Misc: Cache Sizes / Observations  -- Scott Smader
        || `* Re: Misc: Cache Sizes / Observations  -- robf...@gmail.com
        ||  `- Re: Misc: Cache Sizes / Observations  -- BGB
        |`* Re: Misc: Cache Sizes / Observations  -- Terje Mathisen
        | +* Re: Misc: Cache Sizes / Observations  -- robf...@gmail.com
        | |`* Re: Misc: Cache Sizes / Observations  -- MitchAlsup
        | | `* Re: Misc: Cache Sizes / Observations  -- robf...@gmail.com
        | |  `- Re: Misc: Cache Sizes / Observations  -- MitchAlsup
        | `- Re: Misc: Cache Sizes / Observations  -- MitchAlsup
        `* Re: Misc: Cache Sizes / Observations  -- BGB
         `- Re: Misc: Cache Sizes / Observations  -- Anton Ertl

Misc: Cache Sizes / Observations

<svj5qj$19u$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23824&group=comp.arch#23824

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Misc: Cache Sizes / Observations
Date: Mon, 28 Feb 2022 12:55:12 -0600
Organization: A noiseless patient Spider
Lines: 144
Message-ID: <svj5qj$19u$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 28 Feb 2022 18:55:15 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="f827db421d493e319682b358160ec79c";
logging-data="1342"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19fFtLSAaseAOAlICLCZdyQ"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:BUbgAZsuDBITm84RyBPNFYq2wdM=
Content-Language: en-US
 by: BGB - Mon, 28 Feb 2022 18:55 UTC

Mostly just noting my own observations here, just wondering how much
they "jive" with others' experiences, prompted mostly be a conversation
that came up elsewhere.

I am not sure how general my observations are, or if they are mostly an
artifact of the sorts of software I am using as test cases.

A lot of this was seen when using models (*), rather than actually
testing every configuration in Verilog. Though, some of what I am saying
here is based on generalizations of what I have seen, rather than on
empirical measurements.

*: A lookup table in an emulator which implements the same lookups as
the cache can work as a reasonable stand-in for a cache with a similar
configuration.
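
Roughly, the direct-mapped case amounts to something like the following
(a simplified sketch in C, not the exact emulator code; the tag array is
assumed to be initialized to an invalid value):

  #include <stdint.h>
  #include <string.h>

  /* Simplified direct-mapped L1 model, 16B lines, modulo indexing. */
  #define DM_LINES 1024              /* e.g. 1024 x 16B = 16K */
  static uint64_t dm_tag[DM_LINES];
  static uint64_t dm_hit, dm_miss;

  void DmCache_Init(void) { memset(dm_tag, 0xFF, sizeof(dm_tag)); }

  void DmCache_Access(uint64_t addr)
  {
      uint64_t la = addr >> 4;                /* line address (drop 16B offset) */
      int ix = (int)(la & (DM_LINES - 1));    /* modulo index */
      if (dm_tag[ix] == la) { dm_hit++; return; }
      dm_miss++;
      dm_tag[ix] = la;                        /* miss: install the new line */
  }

Hit rate is then dm_hit/(dm_hit+dm_miss); an N-way version just keeps N
tags per index plus some replacement policy.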

These numbers are mostly assuming a 16B L1 cache line size (where
roughly half of the memory used by the cache is tag bits...).

So:
  Hit-rate for smaller L1 caches depends primarily on size;
    Associativity does not make a big difference here.
    Say, if a 2K DM cache is getting ~ 40% hit,
      A 2K 2-way will also get ~ 40% hit,
      And, a 4-way would still only get ~ 40%.
  Hit rate scales roughly with the log2 of the cache size, eg:
    1K, 25%; 2K, 40%
    4K, 56%; 8K, 74%
    16K, 90%; 32K, 95%
  There is a plateau point, typically ~ 16K or 32K for L1
    Hit rate will hit ~ 90-95% for direct-mapping, and stop there.
    Increasing the size of a DM cache further will see no real gain.
    In models, increasing associativity helps at this plateau.
      It can push the hit rate up to ~ 99%.
  For very small caches, it also levels off slightly
    Say, one still gets ~ 5% with a 32B cache.
  So, the overall shape would resemble a sigmoid curve.

Or, also:
  Under 8K, set-associativity would be mostly pointless.
  16K or 32K, 2-way can help.
  64K 2-way or 4-way could make sense
    64K DM does not really gain anything vs 32K DM

I suspect in these "larger cache" scenarios, the majority of the misses
are due to "conflict misses" or similar.

Or, say, for example, one accesses something at 0x123410, and then
accesses 0x173410, which both fall on index 1A0, causing both addresses
to be mutually exclusive (with a modulo scheme).
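
Working that through with, say, a 16K L1 arranged as 512 even/odd pairs
(32B per index, so the pair index is Addr[13:5]; sizes here are just
illustrative):
  0x123410 >> 5 = 0x91A0;  0x91A0 & 0x1FF = 0x1A0
  0x173410 >> 5 = 0xB9A0;  0xB9A0 & 0x1FF = 0x1A0
Both land on index 0x1A0, so in a direct-mapped cache they keep evicting
each other whenever the accesses alternate.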

Cache line size also seems to have an effect on hit rate, where larger
cache line size reduces the hit rate (slightly) relative to total cache
size. However, it may also cause memory bandwidth to increase, despite
the higher number of cache misses.

Actually, this is related with another observation:
  My L1 caches are organized into Even/Odd pairs of lines.
    So, Even for Addr[4]==0, Odd for Addr[4]==1
    Both are accessed at the same time
    These are used mostly to support misaligned access.
  If Even OR Odd misses, it is faster to miss on both Even AND Odd.
    As opposed to only missing on the Even OR Odd side.
    This trick seemingly gains ~ 50% on memory bandwidth.

I am not sure if others use the Even/Odd scheme, many diagrams online
only show a single row of cache lines, but don't make it obvious how
such a scheme would deal with a misaligned access which spans two cache
lines.

Only real alternative I can think of is, say, a single row of 32B cache
lines, split into Even/Odd halves, and with a redundant Tag array. Not
clear what (if any) advantage this would have over the Even/Odd scheme
though.
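
In model terms, the paired even/odd lookup for a possibly line-crossing
access is roughly like this (sizes and names illustrative, not the
actual Verilog; tags assumed initialized to an invalid value):

  #include <stdint.h>

  #define L1_PAIRS 512                       /* e.g. 512 pairs * 2 * 16B = 16K */
  static uint64_t l1_even_tag[L1_PAIRS];     /* full line address as "tag" (model) */
  static uint64_t l1_odd_tag [L1_PAIRS];

  int L1_Probe(uint64_t addr, int nbytes)    /* nbytes <= 16, so at most 2 lines */
  {
      uint64_t la0 = addr >> 4;                           /* first 16B line touched */
      uint64_t la1 = (addr + (uint64_t)nbytes - 1) >> 4;  /* last line touched      */
      if (la0 == la1) la1 = la0 ^ 1;                      /* aligned: probe the partner too */
      uint64_t la_e = (la0 & 1) ? la1 : la0;              /* even line (Addr[4]==0) */
      uint64_t la_o = (la0 & 1) ? la0 : la1;              /* odd line  (Addr[4]==1) */

      int hit_e = (l1_even_tag[(la_e >> 1) & (L1_PAIRS - 1)] == la_e);
      int hit_o = (l1_odd_tag [(la_o >> 1) & (L1_PAIRS - 1)] == la_o);
      return hit_e && hit_o;        /* if either misses, both lines get refetched */
  }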

There is a tradeoff between hashed and modulo index calculation:
L1 D$: Hashed indexing gives a slightly better hit rate than modulo.
However, it has the downside that it breaks double-mapped pages.

However, double-mapped pages are also partly broken if page size is less
than L1 size, and the double-mapping uses an alignment smaller than the
L1 size (this is a potential issue with both 4K and 16K pages with a 32K
L1).

One consideration could be that I could require mmaps to have a minimum
64K alignment, and then XOR the index with Addr[15]. This could regain
some of the advantage of hashed indices, while also allowing for
double-mapping. It is likely that this alignment restriction would only
need to be enforced for mmap'ing files with MAP_SHARED or similar though.
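
As a sketch, the index variants being compared look something like this
(9-bit index for 512 even/odd pairs; the hash itself is just an example,
not necessarily the one actually used):

  #include <stdint.h>

  /* Plain modulo: index = Addr[13:5]. */
  static inline uint32_t IxModulo(uint64_t addr)
      { return (uint32_t)((addr >> 5) & 0x1FF); }

  /* Hashed: fold some higher address bits into the index.  Slightly
     better D$ hit rate, but two mappings of the same page can now use
     different indices, which is what breaks double-mapped pages.      */
  static inline uint32_t IxHashed(uint64_t addr)
      { return (uint32_t)(((addr >> 5) ^ (addr >> 14)) & 0x1FF); }

  /* Compromise above: only XOR in Addr[15] (here, into the top index
     bit), and require shared mmap's to be 64K-aligned so that Addr[15]
     agrees between the mappings.                                      */
  static inline uint32_t IxXor15(uint64_t addr)
      { return (uint32_t)(((addr >> 5) & 0x1FF) ^ (((addr >> 15) & 1) << 8)); }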

Index hashing seems to have no real advantage for the L1 I$ though. I
suspect the gain on the D$ side is mostly a side effect of memory
alignment within heap objects or similar, whereas stuff within the I$
does not tend to follow any large-scale alignment.

The L2 cache has different properties:
  Baseline hit rate for L2 is significantly lower than L1;
    Say, 60% for 128K, 80% for 256K
  It seems to benefit more from set-associativity than the L1.
    For example, a 2-way L2 has a notably better hit rate than DM
    Actually, direct mapped L2 "kinda sucks".
    A 4-way L2 would likely be "better" here.
      However, its LUT cost is likely to be impractical.
  Index hashing in the L2 seems to be reasonably effective.
  It is also better if the L2 is significantly larger than the L1.

....

As for TLB:
I am using a 4-way TLB (at least in 48-bit address mode);
Modulo or "LowBits+HighBits" seems to work better than hashing;
In terms of hit rate, it seems to follow a "bigger is better" pattern.

However, TLB is not about absolute hit/miss rate (the actual TLB miss
rate is "very low" relative to the number of memory accesses), but more
about how many TLB misses occur per second (mostly so the ISR doesn't
eat too many cycles).

I have yet to find the "best" tradeoff here (64x4 and 256x4 both work
fairly well). Meanwhile, 16x4 and 32x4 see a significant increase in TLB
miss rate (though the practical performance impact is harder to
determine in this case).

Staying below around 100..200 TLB misses/second seems like a sane target
though, whereas 1k+ or 10k+ miss ISRs per second seems bad (so, eg,
setting the TLB to 16x4 makes more sense for debugging the MMU than as an
actual setting that would be sane to use "in practice").

The TLB also currently needs to be at least 2-way associative to avoid
getting stuck in an endless TLB Miss ISR loop.
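
For reference, a 4-way, modulo-indexed lookup of this sort is basically
just (sketch; the sizes and the 16K page size are illustrative):

  #include <stdint.h>

  #define TLB_SETS 64                        /* e.g. the 64x4 configuration */
  typedef struct { int valid; uint64_t vpn, ppn_flags; } TlbEnt;
  static TlbEnt tlb[TLB_SETS][4];

  int TLB_Lookup(uint64_t vaddr, uint64_t *ppn_flags)
  {
      uint64_t vpn = vaddr >> 14;                 /* virtual page number (16K pages) */
      uint32_t ix  = (uint32_t)(vpn % TLB_SETS);  /* modulo index */
      for (int w = 0; w < 4; w++) {
          if (tlb[ix][w].valid && tlb[ix][w].vpn == vpn) {
              *ppn_flags = tlb[ix][w].ppn_flags;
              return 1;                           /* hit */
          }
      }
      return 0;                                   /* miss: TLB Miss ISR refills a way */
  }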

....

Any thoughts / comments ?...

Re: Misc: Cache Sizes / Observations

<9434b50e-1e2b-4063-9443-07f9d57f56ean@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23825&group=comp.arch#23825

X-Received: by 2002:a5d:4d0e:0:b0:1ec:7868:b92d with SMTP id z14-20020a5d4d0e000000b001ec7868b92dmr15014842wrt.507.1646078485543;
Mon, 28 Feb 2022 12:01:25 -0800 (PST)
X-Received: by 2002:a05:6870:30e:b0:bf:9b7f:7c63 with SMTP id
m14-20020a056870030e00b000bf9b7f7c63mr1718602oaf.84.1646078484423; Mon, 28
Feb 2022 12:01:24 -0800 (PST)
Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Mon, 28 Feb 2022 12:01:24 -0800 (PST)
In-Reply-To: <svj5qj$19u$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:61c6:834d:e389:4730;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:61c6:834d:e389:4730
References: <svj5qj$19u$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9434b50e-1e2b-4063-9443-07f9d57f56ean@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Mon, 28 Feb 2022 20:01:25 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Mon, 28 Feb 2022 20:01 UTC

On Monday, February 28, 2022 at 12:55:18 PM UTC-6, BGB wrote:
> Mostly just noting my own observations here, just wondering how much
> they "jive" with others' experiences, prompted mostly be a conversation
> that came up elsewhere.
>
>
> I am not sure how general my observations are, or if they are mostly an
> artifact of the sorts of software I am using as test cases.
>
> A lot of this was seen when using models (*), rather than actually
> testing every configuration in Verilog. Though, some of what I am saying
> here is based on generalizations of what I have seen, rather than on
> empirical measurements.
>
> *: A lookup table in an emulator which implements the same lookups as
> the cache can work as a reasonable stand-in for a cache with a similar
> configuration.
>
>
> These numbers are mostly assuming a 16B L1 cache line size (where,
> roughly half of the memory used by the cache, is tag bits...).
>
> So:
> Hit-rate for smaller L1 caches depends primarily on size;
> Associativity does not make a big difference here.
> Say, if a 2K DM cache is getting ~ 40% hit,
> A 2K 2-way will also get ~ 40% hit,
> And, a 4-way would still only get ~ 40%.
2KB, 64B lines (miss rates):
              1-way   2-way   3-way   4-way
Instruction:  14.1%   11.6%   12.5%   10.9%
Data:         14.3%   11.8%   13.3%   10.8%
Unified 2K:   18.8%   14.9%   16.3%   14.1%
Unified 4K:   12.9%   11.2%   12.3%   10.2%
<
{Note the 3-way cache is 3/4 the size of the 2-way or 4-way cache which
distorts the hit rate numbers}
> Hit rate is relative to the log2 of the cache size, eg:
> 1K, 25%; 2K, 40%
> 4K, 56%; 8K, 74%
> 16K, 90%; 32K, 95%
<
4× the cache size means ½ the miss rate.
<
> There is a plateau point, typically ~ 16K or 32K for L1
<
There are 2 effects here::
a) There is the hit rate of the application
b) There is the hit rate of the application and OS considered together
<
(a) does not show a plateau with increasing size or sets
(b) does show a plateau with small sets (< 4-way)
<
> Hit rate will hit ~ 90-95% for direct-mapping, and stop there.
<
Chip design should always target the knee of the curve; 90%-95%
is the typical knee of the curve for non-system running. System
running {database, I/O throughput, network, ...} is different than
typical applications.
<
> Increasing the size of a DM cache further will see no real gain.
<
This is an interplay between cache latency and pipeline
frequency. There is generally more performance to be had
at some point going with IL1+DL1+bigL2 than bigL1s.
The choice point for typical applications versus server-stuff
is difficult: typical favors smaller L1s and bigger L2s and L3s;
server favors bigger L1s and flatter cache hierarchies.
<
Applications are typically well served with simple L2 pipeline
Servers are better served with multiple banked L2 pipelines.
<
> In models, increasing associativity helps at this plateau.
> It can push the hit rate up to ~ 99%.
<
99% is reached at::
Instruction L1: 128KB 2-way or 256KB 1-way
Data L1: 512KB 4-way or 1024KB 2-way
Unified: 512KB 2-way
<
> For very small caches, it also levels off slightly
> Say, one still gets ~ 5% with a 32B cache.
<
32KB, 64B lines (miss rates):
              1-way   2-way   3-way   4-way
Instruction:   2.7%    2.2%    2.2%    2.0%
Data:          4.9%    4.0%    4.1%    3.7%
Unified 32K:   5.2%    4.3%    4.4%    4.0%
Unified 64K:   3.8%    2.9%    3.4%    2.7%
<
> So, the overall shape would resemble a sigmoid curve.
>
>
> Or, also:
> Under 8K, set-associativity would be mostly pointless.
8KB, 64B lines (miss rates):
              1-way   2-way   3-way   4-way
Instruction:   7.3%    5.8%    6.8%    5.5%
Data:          8.6%    6.8%    7.4%    6.4%
Unified 8K:   10.5%    8.3%    9.3%    7.6%
Unified 16K:   7.2%    5.9%    6.7%    5.4%
<
> 16K or 32K, 2-way can help.
> 64K 2-way or 4-way could make sense
> 64K DM does not really gain anything vs 32K DM
>
> I suspect in these "larger cache" scenarios, the majority of the misses
> are due to "conflict misses" or similar.
>
> Or, say, for example, one accesses something at 0x123410, and then
> accesses 0x173410, which both fall on index 1A0, causing both addresses
> to be mutually exclusive (with a modulo scheme).
>
Seznec indicates you can greatly moderate this with hashing.
>
> Cache line size also seems to have an effect on hit rate, where larger
> cache line size reduces the hit rate (slightly) relative to total cache
> size. However, it may also cause memory bandwidth to increase, despite
> the higher number of cache misses.
<
Cache line size changes a lot of the underlying assumptions. In CPU
design we like to keep the bus transaction (beats) short, no smaller
than 4 beats, no larger than 8 beats. You can decide on how wide the
bus is.
>
> Actually, this is related with another observation:
> My L1 caches are organized into Even/Odd pairs of lines.
> So, Even for Addr[4]==0, Odd for Addr[4]==1
> Both are accessed at the same time
> These are used mostly to support misaligned access.
<
Perfect !
<
> If Even OR Odd misses, it is faster to miss on both Evan AND Odd.
> As opposed to only missing on the Even OR Odd side.
> This trick seemingly gains ~ 50% on memory bandwidth.
>
> I am not sure if others use the Even/Odd scheme, many diagrams online
> only show a single row of cache lines, but don't make it obvious how
> such a scheme would deal with a misaligned access which spans two cache
> lines.
<
Opteron used 2-way banking of its L1 caches, to get 1 misaligned or
2 aligned accesses per cycle.
<
A GBOoO machine should probably have 4-way banking of its DL1
in an attempt at supporting 3-memory references per cycle.
>
> Only real alternative I can think of is, say, a single row of 32B cache
> lines, split into Even/Odd halves, and with a redundant Tag array. Not
> clear what (if any) advantage this would have over the Even/Odd scheme
> though.
<
There is a lot of cheating you can do at the tag level.
>
>
> There is a tradeoff between hashed and modulo index calculation:
> L1 D$: Hashed indexing gives a slightly better hit rate than modulo.
> However, it has the downside that it breaks double-mapped pages.
>
> However, double-mapped pages are also partly broken if page size is less
> than L1 size, and the double-mapping uses an alignment smaller than the
> L1 size (this is a potential issue with both 4K and 16K pages with a 32K
> L1).
<
The opposite also comes with consequences:: if the L2s are bigger than
page sizes you get address aliasing between virtual address index and
physical address tag.
>
> One consideration could be that I could require mmaps to have a minimum
> 64K alignment, and then XOR the index with Addr[15]. This could regain
> some of the advantage of hashed indices, while also allowing for
> double-mapping. It is likely that this alignment restriction would only
> need to be enforced for mmap'ing files with MAP_SHARED or similar though.
<
We got a lot of hassle requiring stuff like that in 32-bit designs. It is
"not that hard" to solve in HW with a bit of thought (very similar to
accepting that HW will perform misaligned accesses).
>
> Index hashing seems to have no real advantage for the L1 I$ though. I
> suspect it is mostly due to a side effect of memory alignment within
> heap objects or similar. However, stuff within the I$ does not tend to
> follow any large-scale alignment.
>
>
>
> The L2 cache has different properties:
> Baseline hit rate for L2 is significantly lower than L1;
> Say, 60% for 128K, 80% for 256K
> It seems to benefit more from set-associativity than the L1.
<
Yes, the L1 cleaned up the easy stuff, the L2 is left with harder problems.
But if you relate the L2 hit/miss rates "with" the L1 traffic, you will see
that the L2 miss rates are quite reasonable.
<
> For example, a 2-way L2 has a notably better hit rate than DM
> Actually, direct mapped L2 "kinda sucks".
<
Sure--taking both IL1 and DL1 misses.
<
> A 4-way L2 would likely be "better" here.
<
Opteron was 8-way for example.
<
> However, its LUT cost is likely to be impractical.
<
Bank it..........and spew the accesses over the banks, running
multiple accesses simultaneously. Worked for mainframe
and supercomputer memory.......
<
> Index hashing in the L2 seems to be reasonably effective.
> It is also better if the L2 is significantly larger than the L1.
>
> ...
>
>
> As for TLB:
> I am using a 4-way TLB (at least in 48-bit address mode);
> Modulo or "LowBits+HighBits" seems to work better than hashing;
> In terms of hit rate, it seems to follow a "bigger is better" pattern.
<
Note HW can easily hash several fields, with some fields bit-reversed.
This is expensive in SW and brain-dead easy in HW.
>
> However, TLB is not about absolute hit/miss rate (the actual TLB miss
> rate is "very low" relative to the number of memory accesses), but more
> about how many TLB misses occur per second (mostly so the ISR doesn't
> eat too many cycles).
<
Over the July 4th weekend back in 1991 I left M88120 simulator running
MATRIX300. The first 2 Billion cycles were adequately served with a
64-entry fully associative TLB, the next 6 Billion we were taking a TLB miss
every 8 LDs, later on we were taking a TLB miss every LD. A 256 entry
DM TLB solved both issues. Modern number cruncher SW will want
larger TLBs, so L1 and L2 TLBs have come to the forefront. Data Base
will want as big as you can afford to build TLBs.
<
I, personally, played around with associating multiple PTEs with a single
tag and never found forward progress at the L1 level.
>
> I have yet to find the "best" tradeoff here (64x4 and 256x4 both work
> fairly well). Meanwhile, 16x4 and 32x4 see a significant increase in TLB
> miss rate (though the practical performance impact is harder to
> determine in this case).
>
> Staying below around 100..200 TLB misses/second seems like a sane metric
> though, whereas 1k+ or 10k+ ISR's per second seems bad (so, eg, setting
> the TLB to 16x4 makes more sense for debugging the MMU than as an actual
> setting that would be sane to use "in practice").
>
> The TLB also needs currently a minimum of being 2-way associative to
> avoid getting stuck in an endless TLB Miss ISR loop.
>
> ...
>
>
>
> Any thoughts / comments ?...
<
I have a complete set of data if you like. eXcel format.


Re: Misc: Cache Sizes / Observations

<svjak4$9kl$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23827&group=comp.arch#23827

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Mon, 28 Feb 2022 12:17:06 -0800
Organization: A noiseless patient Spider
Lines: 101
Message-ID: <svjak4$9kl$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 28 Feb 2022 20:17:08 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="d23d7fa2528549449f9c4a723bc7355c";
logging-data="9877"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/4Z3hvW/n4RPTzQwioVuK0RqIsVGhmEUI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:ULnzbzOCPGf3HrPfQbOgPmhf47w=
In-Reply-To: <svj5qj$19u$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Mon, 28 Feb 2022 20:17 UTC

On 2/28/2022 10:55 AM, BGB wrote:
> Mostly just noting my own observations here, just wondering how much
> they "jive" with others' experiences, prompted mostly be a conversation
> that came up elsewhere.
>
>
> I am not sure how general my observations are, or if they are mostly an
> artifact of the sorts of software I am using as test cases.
>
> A lot of this was seen when using models (*), rather than actually
> testing every configuration in Verilog. Though, some of what I am saying
> here is based on generalizations of what I have seen, rather than on
> empirical measurements.
>
> *: A lookup table in an emulator which implements the same lookups as
> the cache can work as a reasonable stand-in for a cache with a similar
> configuration.

snip

> Or, also:
>   Under 8K, set-associativity would be mostly pointless.
>   16K or 32K, 2-way can help.
>   64K 2-way or 4-way could make sense
>     64K DM does not really gain anything vs 32K DM
>
> I suspect in these "larger cache" scenarios, the majority of the misses
> are due to "conflict misses" or similar.

You should be able to easily add a "trace" capability to your emulator
so you could test this hypothesis, i.e. generate a trace entry for
every miss indicating whether it was a conflict or a "cold" miss. You
could also trace cache hits, but that would add substantially to the
trace volume and cost of doing the tracing. A compromise would be to
include in every miss entry a count of the number of hit entries since
the last miss. Analyzing the trace output offline can give you lots of
interesting data.
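
For example, a per-miss trace record along these lines would probably be
enough (hypothetical field names, just to illustrate the idea):

  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      uint64_t line_addr;     /* address of the missing line              */
      uint32_t hits_since;    /* number of hits since the previous miss   */
      uint8_t  is_conflict;   /* 1 = conflict miss, 0 = cold miss         */
  } MissTraceEnt;

  static void Trace_EmitMiss(FILE *f, const MissTraceEnt *e)
  {
      fprintf(f, "%llX %u %u\n", (unsigned long long)e->line_addr,
          (unsigned)e->hits_since, (unsigned)e->is_conflict);
  }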

Also, you didn't mention what algorithm you are using for replacement on
the caches other than the direct-mapped ones. There are several alternatives,
each with different costs and benefits. A trace could help you here as
well, especially if you trace hits. Even without that, you could
calculate the distribution of the time between eviction and a subsequent
miss of the same line.

> Or, say, for example, one accesses something at 0x123410, and then
> accesses 0x173410, which both fall on index 1A0, causing both addresses
> to be mutually exclusive (with a modulo scheme).
>
>
> Cache line size also seems to have an effect on hit rate, where larger
> cache line size reduces the hit rate (slightly) relative to total cache
> size. However, it may also cause memory bandwidth to increase, despite
> the higher number of cache misses.

Yup. Another tradeoff. You can get an idea from the trace of how often
the "successive address" is accessed.

> Actually, this is related with another observation:
>   My L1 caches are organized into Even/Odd pairs of lines.
>     So, Even for Addr[4]==0, Odd for Addr[4]==1
>     Both are accessed at the same time
>     These are used mostly to support misaligned access.
>   If Even OR Odd misses, it is faster to miss on both Evan AND Odd.
>     As opposed to only missing on the Even OR Odd side.
>     This trick seemingly gains ~ 50% on memory bandwidth.

I think this could make sense if the first miss is to the even address,
as this improves sequential access. If the miss is to the odd address,
I don't think there is as much likelihood of a hit to the "previous"
address.

snip

> The L2 cache has different properties:
>   Baseline hit rate for L2 is significantly lower than L1;
>     Say, 60% for 128K, 80% for 256K
>   It seems to benefit more from set-associativity than the L1.
>     For example, a 2-way L2 has a notably better hit rate than DM
>     Actually, direct mapped L2 "kinda sucks".
>     A 4-way L2 would likely be "better" here.
>       However, its LUT cost is likely to be impractical.
>   Index hashing in the L2 seems to be reasonably effective.
>   It is also better if the L2 is significantly larger than the L1.

You might want to play with exclusive versus inclusive L1s.  For
smaller L2s, this might make a difference.

> Any thoughts / comments ?...

You are venturing into a very interesting area. :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Misc: Cache Sizes / Observations

<svjtvh$mc9$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23839&group=comp.arch#23839

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Mon, 28 Feb 2022 19:47:26 -0600
Organization: A noiseless patient Spider
Lines: 501
Message-ID: <svjtvh$mc9$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me>
<9434b50e-1e2b-4063-9443-07f9d57f56ean@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 1 Mar 2022 01:47:29 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="838ca4ca89ab57ed7d76891a76dbef0d";
logging-data="22921"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1//zwPseziHctZpNwHS+IAn"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:KEN/CtGF+tLYRyUlY8dfZw6OKbU=
In-Reply-To: <9434b50e-1e2b-4063-9443-07f9d57f56ean@googlegroups.com>
Content-Language: en-US
 by: BGB - Tue, 1 Mar 2022 01:47 UTC

On 2/28/2022 2:01 PM, MitchAlsup wrote:
> On Monday, February 28, 2022 at 12:55:18 PM UTC-6, BGB wrote:
>> Mostly just noting my own observations here, just wondering how much
>> they "jive" with others' experiences, prompted mostly be a conversation
>> that came up elsewhere.
>>
>>
>> I am not sure how general my observations are, or if they are mostly an
>> artifact of the sorts of software I am using as test cases.
>>
>> A lot of this was seen when using models (*), rather than actually
>> testing every configuration in Verilog. Though, some of what I am saying
>> here is based on generalizations of what I have seen, rather than on
>> empirical measurements.
>>
>> *: A lookup table in an emulator which implements the same lookups as
>> the cache can work as a reasonable stand-in for a cache with a similar
>> configuration.
>>
>>
>> These numbers are mostly assuming a 16B L1 cache line size (where,
>> roughly half of the memory used by the cache, is tag bits...).
>>
>> So:
>> Hit-rate for smaller L1 caches depends primarily on size;
>> Associativity does not make a big difference here.
>> Say, if a 2K DM cache is getting ~ 40% hit,
>> A 2K 2-way will also get ~ 40% hit,
>> And, a 4-way would still only get ~ 40%.
> 2KB 64B lines
> 1-way 2-way 3-way 4-way
> Instruction: 14.1% 11.6% 12.5% 10.9%
> Data: 14.3% 11.8% 13.3% 10.8%
> Unified 2K: 18.8% 14.9% 16.3% 14.1%
> Unified 4K: 12.9% 11.2% 12.3% 10.2%
> <
> {Note the 3-way cache is 3/4 the size of the 2-way or 4-way cache which
> distorts the hit rate numbers}

OK, so there is a difference, just not as big as "just make the cache
bigger".

I went with slightly larger direct-mapped caches, mostly because (within
modest limits) the cost increase due to BRAM use seems a lot smaller
than the cost increase due to LUTs.

This is a little different with LUTRAM though, since if one goes over ~
32 or 64 entries in the cache, LUTRAM *sucks*.

So, 64x2x16=2K, 64x4x16=4K.

But, 512 or 1024 works nicely for BRAM, and 512x32=16K ...

>> Hit rate is relative to the log2 of the cache size, eg:
>> 1K, 25%; 2K, 40%
>> 4K, 56%; 8K, 74%
>> 16K, 90%; 32K, 95%
> <
> 4× the cache size means ½ the miss rate.
> <

There was a certain amount of "hand waving" here, so I can't claim my
numbers are particularly accurate (these were not directly measured in
this case, but examples based on "my current understanding of previously
seen behavior").

But, yeah, that metric would imply a much more rapid fall off in hit
rate than what I remember seeing.
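
(Taken as a power law, "4x the size, half the miss rate" would be
miss(C) ~= miss(C0) * sqrt(C0/C), i.e. miss rate proportional to
1/sqrt(cache size); so, e.g., going 1K -> 16K would only cut the miss
rate to about a quarter under that rule.)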

I guess if needed, it could be possible to set up some code in the
emulator to model and then log the miss rates according to various
parameters (cache sizes, associativity, line size, ...).

I guess potentially one could try to generate a graph in a spreadsheet
or similar.

>> There is a plateau point, typically ~ 16K or 32K for L1
> <
> There are 2 effects here::
> a) There is the hit rate of the application
> b) There is the hit rate of the application and OS considered together
> <
> (a) does not show a plateau with increasing size or sets
> (b) does show a plateau with small sets (< 4-way)

Most of this was from scenarios like "Doom running on 'bare metal'" and similar.

The plateau effect was most obvious with DM / 1-way.
Where it occurs does seem to vary from one program to another.

> <
>> Hit rate will hit ~ 90-95% for direct-mapping, and stop there.
> <
> Chip design should always target knee of the curve 90%-95%
> is the typical knee of the curve for non system running
> {Data base, I/O throughput, network,...} is different than
> typical applications.

OK. I was generally hitting this point at 16K or 32K for the L1.

L2 doesn't hit this. I would need several MB of L2 to pull this off, but
256K is roughly the practical size limit on the XC7A100T.

> <
>> Increasing the size of a DM cache further will see no real gain.
> <
> This is an interplay between cache latency and pipeline
> frequency. There is generally more performance to be had
> at some point going with IL1+DL1+bigL2 than bigL1s.
> The choice point for typical applications versus server-stuff
> is difficult: typical favors smaller L1s and bigger L2s and L3s
> server favors bigger L1s and flatter cache hierarchies.
> <
> Applications are typically well served with simple L2 pipeline
> Servers are better served with multiple banked L2 pipelines.
> <

I was ignoring clock speed here.

From what I was seeing in past fiddling, it appears that (at least for a
direct-mapped cache) the hit rate would reach ~ 95% and then stop
increasing.

Associative caches seemed able to break this limit (by fixed
increments), but simply making the cache bigger would not.

>> In models, increasing associativity helps at this plateau.
>> It can push the hit rate up to ~ 99%.
> <
> 99% is reached at::
> Instruction L1: 128KB 2-way or 256KB 1-way
> Data L1: 512KB 4-way or 1024KB 2-way
> Unified: 512KB 2-way
> <

I was seeing 99% in models with large 4 and 8-way caches, but these fell
outside what seemed viable.

>> For very small caches, it also levels off slightly
>> Say, one still gets ~ 5% with a 32B cache.
> <
> 32KB 64B lines
> 1-way 2-way 3-way 4-way
> Instruction: 2.7% 2.2% 2.2% 2.0%
> Data: 4.9% 4.0% 4.1% 3.7%
> Unified 32K: 5.2% 4.3% 4.4% 4.0%
> Unified 64K: 3.8% 2.9% 3.4% 2.7%
> <

By 32B cache, I meant a cache with only a single entry.
16B: Even Line
16B: Odd Line
Where, it either hits the line currently in the cache, or misses.

Some of my earlier prototypes did this, but it is probably pretty
obvious why I am no longer using this design.

>> So, the overall shape would resemble a sigmoid curve.
>>
>>
>> Or, also:
>> Under 8K, set-associativity would be mostly pointless.
> 8KB 64B lines
> 1-way 2-way 3-way 4-way
> Instruction: 7.3% 5.8% 6.8% 5.5%
> Data: 8.6% 6.8% 7.4% 6.4%
> Unified 8K: 10.5% 8.3% 9.3% 7.6%
> Unified 16K: 7.2% 5.9% 6.7% 5.4%
> <

OK.

As noted, I was using 16B lines for L1, but 64B for L2.
The bigger L2 lines are used mostly because they can be moved to/from
DDR RAM with less overhead.

>> 16K or 32K, 2-way can help.
>> 64K 2-way or 4-way could make sense
>> 64K DM does not really gain anything vs 32K DM
>>
>> I suspect in these "larger cache" scenarios, the majority of the misses
>> are due to "conflict misses" or similar.
>>
>> Or, say, for example, one accesses something at 0x123410, and then
>> accesses 0x173410, which both fall on index 1A0, causing both addresses
>> to be mutually exclusive (with a modulo scheme).
>>
> Seznic indicates you can greatly moderate this with hashing.

Yeah, this was partly mentioned later.

I had been using hashing, but ended up moving away due to:
Double-mapping concerns;
Doesn't really seem to gain anything for I$.

>>
>> Cache line size also seems to have an effect on hit rate, where larger
>> cache line size reduces the hit rate (slightly) relative to total cache
>> size. However, it may also cause memory bandwidth to increase, despite
>> the higher number of cache misses.
> <
> Cache line size changes a lot of the underlying assumptions. In CPU
> design we like to keep the bus transaction (beats) short, no smaller
> the 4-beats, no larger than 8-beats. You can decide on how wide the
> bus is.

In my ringbus design, the L1's lines are 128-bits, and the ringbus is
128-bits wide.

So, basically, 1 cache line in 1 clock cycle.
Using multiple cycles per transfer seemed like needless complexity.

The current bus has a partial split in the L1 and L2 rings:
L1 ring: 128b Data, 96b Addr, 16b OPM, 16b SEQ (256b)
L2 ring: 128b Data, 48b Addr, 16b OPM, 16b SEQ (208b)

OPM: Gives the Request/Response Code
SEQ: Identifies the source of the request, and request sequence number.

Moves from the L1 to L2 ring truncate the address, but the L2 ring
assumes that requests are using physical rather than virtual addresses
(the L1 ring can pass virtual addresses, which pass through the TLB
before reaching the L2 ring).

Requests move along at 1 position per clock cycle.

Performance is generally a lot better than my "old bus".
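
In rough C-struct terms, the ring messages are (field widths as listed
above; the struct packing itself is just illustrative):

  #include <stdint.h>

  typedef struct {
      uint64_t data[2];      /* 128b: one 16B cache line                */
      uint64_t addr_lo;      /*  low 64b of the 96b (virtual) address   */
      uint32_t addr_hi;      /*  high 32b of the address                */
      uint16_t opm;          /*  16b: request/response code             */
      uint16_t seq;          /*  16b: requester ID + sequence number    */
  } L1RingMsg;               /* 256 bits of payload */

  typedef struct {
      uint64_t data[2];      /* 128b cache line                         */
      uint64_t addr;         /*  48b physical address (in a 64b field)  */
      uint16_t opm;
      uint16_t seq;
  } L2RingMsg;               /* 208 bits of payload */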

>>
>> Actually, this is related with another observation:
>> My L1 caches are organized into Even/Odd pairs of lines.
>> So, Even for Addr[4]==0, Odd for Addr[4]==1
>> Both are accessed at the same time
>> These are used mostly to support misaligned access.
> <
> Perfect !
> <
>> If Even OR Odd misses, it is faster to miss on both Evan AND Odd.
>> As opposed to only missing on the Even OR Odd side.
>> This trick seemingly gains ~ 50% on memory bandwidth.
>>
>> I am not sure if others use the Even/Odd scheme, many diagrams online
>> only show a single row of cache lines, but don't make it obvious how
>> such a scheme would deal with a misaligned access which spans two cache
>> lines.
> <
> Opteron used 2-way banking of its L1 caches, to get 1 misaligned or
> 2 aligned accesses per cycle.
> <
> A GBOoO machine should probably have 4-way banking of its DL1
> in an attempt at supporting 3-memory references per cycle.


Re: Misc: Cache Sizes / Observations

<svkeq1$phm$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23843&group=comp.arch#23843

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Tue, 1 Mar 2022 00:34:39 -0600
Organization: A noiseless patient Spider
Lines: 223
Message-ID: <svkeq1$phm$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 1 Mar 2022 06:34:41 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="838ca4ca89ab57ed7d76891a76dbef0d";
logging-data="26166"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/t6uFw6iT/rIBAs82YVgs+"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:Hb3IVc4raA81SQwNm16yJfVKNdI=
In-Reply-To: <svjak4$9kl$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 1 Mar 2022 06:34 UTC

On 2/28/2022 2:17 PM, Stephen Fuld wrote:
> On 2/28/2022 10:55 AM, BGB wrote:
>> Mostly just noting my own observations here, just wondering how much
>> they "jive" with others' experiences, prompted mostly be a
>> conversation that came up elsewhere.
>>
>>
>> I am not sure how general my observations are, or if they are mostly
>> an artifact of the sorts of software I am using as test cases.
>>
>> A lot of this was seen when using models (*), rather than actually
>> testing every configuration in Verilog. Though, some of what I am
>> saying here is based on generalizations of what I have seen, rather
>> than on empirical measurements.
>>
>> *: A lookup table in an emulator which implements the same lookups as
>> the cache can work as a reasonable stand-in for a cache with a similar
>> configuration.
>
> snip
>
>> Or, also:
>>    Under 8K, set-associativity would be mostly pointless.
>>    16K or 32K, 2-way can help.
>>    64K 2-way or 4-way could make sense
>>      64K DM does not really gain anything vs 32K DM
>>
>> I suspect in these "larger cache" scenarios, the majority of the
>> misses are due to "conflict misses" or similar.
>
> You should be able to easily add a "trace" capability to your emulator
> so you could test this hypothesis.  i.e. generate a trace entry for
> every miss indicating whether it was a conflict or a "cold" miss.  You
> cold also trace cache hits, but that would add substantially to the
> trace volume and cost of doing the tracing.  A compromise would be to
> include in every miss entry, a count of the number of hit entries since
> the last miss.  Analyzing the trace output offline can give you lots of
> interesting data.
>

Yeah, I may need to add something like this.

Goes and adds some stats gathering code:
   Size      1-way      2-way
 131072      1.982%     2.023%
  65536      2.863%     2.650%
  32768      3.638%     3.397%
  16384      6.113%     6.134%
   8192      8.761%     8.299%
   4096     11.433%    11.419%
   2048     15.046%    14.950%
   1024     18.954%    18.174%

This shows miss rate relative to cache size, modeling for both 1-way and
a fairly naive 2-way scheme (FIFO).

The code for this was fairly "quick and dirty".

Oddly, this model doesn't agree with either my predictions or my other
L1 cache model, in that it doesn't show any obvious "plateau".

(Well, and it totally also wrecks emulator performance...).

However, it would appear FIFO is "kinda useless" here.

Goes and adds LRU:
 131072      1.982%     1.922%
  65536      2.852%     2.558%
  32768      3.663%     3.243%
  16384      6.051%     5.761%
   8192      8.735%     7.913%
   4096     11.545%    11.181%
   2048     15.449%    14.819%
   1024     19.522%    17.879%

This is at least slightly better...

Still only showing small gains for 2-way though.
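
For reference, the 2-way part of the model is roughly like this
(illustrative sketch, not the actual code; tags assumed initialized to
an invalid value):

  #include <stdint.h>

  #define W2_SETS 512                     /* e.g. 16K: 512 sets x 2 ways x 16B */
  static uint64_t w2_tag[W2_SETS][2];
  static uint8_t  w2_lru[W2_SETS];        /* which way is least recently used */
  static uint64_t w2_hit, w2_miss;

  void W2_Access(uint64_t addr)
  {
      uint64_t la = addr >> 4;
      uint32_t ix = (uint32_t)(la % W2_SETS);
      for (int w = 0; w < 2; w++) {
          if (w2_tag[ix][w] == la) {
              w2_hit++;
              w2_lru[ix] = (uint8_t)(w ^ 1);  /* the other way is now LRU */
              return;
          }
      }
      w2_miss++;
      int victim = w2_lru[ix];            /* LRU victim; the FIFO variant    */
      w2_tag[ix][victim] = la;            /* just alternates the victim and  */
      w2_lru[ix] = (uint8_t)(victim ^ 1); /* ignores hits                    */
  }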

> Also, you didn't mention what algorithm you are using for replacement on
> other than the direct mapped caches.  There are several alternatives,
> each with different costs and benefits.  A trace could help you here as
> well, especially if you trace hits.  Even without that, you could
> calculate the distribution of the time between eviction and a subsequent
> miss of the same line.
>

For my 2-way caches, it was typically:
  There is an 'A' and 'B' set
    Combined with Even/Odd
    So, say we call the active lines: EA,OA,EB,OB
    These were often also labeled as A,B,C,D.
  The 'A' pair may hold either Clean or Dirty lines
  The 'B' pair may only hold Clean lines
    This makes stuff cheaper...

The limitation that each 'set' may only contain a single dirty line may
potentially reduce efficiency

Load:
  A or B Matches, use A or B
  Neither Matches, A not Dirty:
    Replace whichever is Older.
  Neither Matches, A is Dirty:
    A is Older:
      Evict A
      Load into A
    B is Older:
      Load into B

Store:
  A Matches, use A
  B Matches:
    A not Dirty:
      Swap A and B
      Store to new A.
    A is Dirty:
      Evict A
      Swap A and B
      Store to new A.
  Neither Matches, A is Dirty:
    Evict A
    Load into A
    Do Store
  Neither Matches, A is not Dirty:
    Load into A
    Do Store

Age would be based on an Epoch Counter, which is basically a
coarse-grain clock cycle counter.

The Epoch Count may also auto-evict cache lines that have been in the
cache for too long. During times when the cache is idle, a rover sweeps
across the cache arrays, and may trigger things to be evicted
asynchronously.

I am not sure if there is a name for this.

>
>> Or, say, for example, one accesses something at 0x123410, and then
>> accesses 0x173410, which both fall on index 1A0, causing both
>> addresses to be mutually exclusive (with a modulo scheme).
>>
>>
>> Cache line size also seems to have an effect on hit rate, where larger
>> cache line size reduces the hit rate (slightly) relative to total
>> cache size. However, it may also cause memory bandwidth to increase,
>> despite the higher number of cache misses.
>
> Yup.  Another tradeoff. You can get an idea from the trace how often the
> "successive address" is accessed.
>

May need to 'stat' this...

>
>> Actually, this is related with another observation:
>>    My L1 caches are organized into Even/Odd pairs of lines.
>>      So, Even for Addr[4]==0, Odd for Addr[4]==1
>>      Both are accessed at the same time
>>      These are used mostly to support misaligned access.
>>    If Even OR Odd misses, it is faster to miss on both Evan AND Odd.
>>      As opposed to only missing on the Even OR Odd side.
>>      This trick seemingly gains ~ 50% on memory bandwidth.
>
> I think this could made sense if the first miss is to the even address,
> as this improves sequential access.  If the miss is to the odd address,
> I don't think there is as much likelihood of a hit to the "previous"
> address.
>

Dunno.

I had done it both ways (for both Even and Odd miss), but I guess it
might be worth testing.

> snip
>
>
>> The L2 cache has different properties:
>>    Baseline hit rate for L2 is significantly lower than L1;
>>      Say, 60% for 128K, 80% for 256K
>>    It seems to benefit more from set-associativity than the L1.
>>      For example, a 2-way L2 has a notably better hit rate than DM
>>      Actually, direct mapped L2 "kinda sucks".
>>      A 4-way L2 would likely be "better" here.
>>        However, its LUT cost is likely to be impractical.
>>    Index hashing in the L2 seems to be reasonably effective.
>>    It is also better if the L2 is significantly larger than the L1.
>
> You might want to to play with exclusive versus inclusive L1s.  For
> smaller L2s, this might make a difference.
>

I am still using a very informal caching scheme (essentially NINE).

The current schemes for cache coherence between cores are pretty weak:
Default: Epoch Timer
Volatile: Auto-evict after a few clock cycles.

It looks like either Inclusive or Exclusive caching may be needed to
implement a proper cache coherence protocol; I may need to get to this
at some point.

>
>> Any thoughts / comments ?...
>
> You are venturing into a very interesting area.  :-)
>
>
>

Re: Misc: Cache Sizes / Observations

<svm7r9$3ec$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23863&group=comp.arch#23863

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Tue, 1 Mar 2022 16:48:06 -0600
Organization: A noiseless patient Spider
Lines: 156
Message-ID: <svm7r9$3ec$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 1 Mar 2022 22:48:09 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="838ca4ca89ab57ed7d76891a76dbef0d";
logging-data="3532"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/bwU30w7R2bhriTxh7VFOz"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:+0MIqnQJuLYA2LuANlK9dOaNFFE=
In-Reply-To: <svkeq1$phm$1@dont-email.me>
Content-Language: en-US
 by: BGB - Tue, 1 Mar 2022 22:48 UTC

On 3/1/2022 12:34 AM, BGB wrote:
> On 2/28/2022 2:17 PM, Stephen Fuld wrote:
>> On 2/28/2022 10:55 AM, BGB wrote:
>>> Mostly just noting my own observations here, just wondering how much
>>> they "jive" with others' experiences, prompted mostly be a
>>> conversation that came up elsewhere.
>>>
>>>
>>> I am not sure how general my observations are, or if they are mostly
>>> an artifact of the sorts of software I am using as test cases.
>>>
>>> A lot of this was seen when using models (*), rather than actually
>>> testing every configuration in Verilog. Though, some of what I am
>>> saying here is based on generalizations of what I have seen, rather
>>> than on empirical measurements.
>>>
>>> *: A lookup table in an emulator which implements the same lookups as
>>> the cache can work as a reasonable stand-in for a cache with a
>>> similar configuration.
>>
>> snip
>>
>>> Or, also:
>>>    Under 8K, set-associativity would be mostly pointless.
>>>    16K or 32K, 2-way can help.
>>>    64K 2-way or 4-way could make sense
>>>      64K DM does not really gain anything vs 32K DM
>>>
>>> I suspect in these "larger cache" scenarios, the majority of the
>>> misses are due to "conflict misses" or similar.
>>
>> You should be able to easily add a "trace" capability to your emulator
>> so you could test this hypothesis.  i.e. generate a trace entry for
>> every miss indicating whether it was a conflict or a "cold" miss.  You
>> cold also trace cache hits, but that would add substantially to the
>> trace volume and cost of doing the tracing.  A compromise would be to
>> include in every miss entry, a count of the number of hit entries
>> since the last miss.  Analyzing the trace output offline can give you
>> lots of interesting data.
>>
>
> Yeah, I may need to add something like this.
>
> Goes and adds some stats gathering code:
>    Size      1-way      2-way
>  131072      1.982%     2.023%
>   65536      2.863%     2.650%
>   32768      3.638%     3.397%
>   16384      6.113%     6.134%
>    8192      8.761%     8.299%
>    4096     11.433%    11.419%
>    2048     15.046%    14.950%
>    1024     18.954%    18.174%
>
> This shows miss rate relative to cache size, modeling for both 1-way and
> a fairly naive 2-way scheme (FIFO).
>
> The code for this was fairly "quick and dirty".
>
> Oddly, this model doesn't agree with either my predictions or my other
> L1 cache model, in that it doesn't show any obvious "plateau".
>
> (Well, and it totally also wrecks emulator performance...).
>
>
> However, it would appear FIFO is is "kinda useless" here.
>
> Goes and adds LRU:
>  131072      1.982%     1.922%
>   65536      2.852%     2.558%
>   32768      3.663%     3.243%
>   16384      6.051%     5.761%
>    8192      8.735%     7.913%
>    4096     11.545%    11.181%
>    2048     15.449%    14.819%
>    1024     19.522%    17.879%
>
> This is at least slightly better...
>
> Still only showing small gains for 2-way though.
>

After more fiddling with the model...

(Emulator performance really does not like this one...).

Cache Size Miss% Conflict% (n-way, line size)
131072 2.004% 1.246% (1 way, 16 line)
65536 2.851% 1.437% (1 way, 16 line)
32768 3.540% 1.387% (1 way, 16 line)
16384 6.604% 3.947% (1 way, 16 line)
8192 9.112% 5.609% (1 way, 16 line)
4096 11.310% 4.514% (1 way, 16 line)
2048 14.326% 5.713% (1 way, 16 line)
1024 17.632% 6.608% (1 way, 16 line)

131072 1.966% 0.709% (2 way, 16 line)
65536 2.550% 0.446% (2 way, 16 line)
32768 3.303% 0.690% (2 way, 16 line)
16384 6.779% 3.366% (2 way, 16 line)
8192 8.484% 1.762% (2 way, 16 line)
4096 10.905% 2.454% (2 way, 16 line)
2048 13.588% 2.808% (2 way, 16 line)
1024 16.022% 2.516% (2 way, 16 line)

262144 15.748% 14.724% (1 way, 32 line)
131072 15.929% 14.348% (1 way, 32 line)
65536 16.274% 14.021% (1 way, 32 line)
32768 16.843% 14.067% (1 way, 32 line)
16384 19.819% 16.420% (1 way, 32 line)
8192 22.360% 17.992% (1 way, 32 line)
4096 24.583% 16.250% (1 way, 32 line)
2048 27.325% 16.206% (1 way, 32 line)

262144 2.093% 0.805% (2 way, 32 line)
131072 2.893% 0.785% (2 way, 32 line)
65536 3.540% 0.881% (2 way, 32 line)
32768 4.854% 1.648% (2 way, 32 line)
16384 8.788% 4.655% (2 way, 32 line)
8192 11.456% 3.432% (2 way, 32 line)
4096 14.699% 3.980% (2 way, 32 line)
2048 17.580% 3.795% (2 way, 32 line)

524288 23.561% 20.547% (1 way, 64 line)
262144 23.725% 20.382% (1 way, 64 line)
131072 23.851% 19.919% (1 way, 64 line)
65536 24.245% 19.661% (1 way, 64 line)
32768 24.903% 19.838% (1 way, 64 line)
16384 28.103% 22.219% (1 way, 64 line)
8192 30.857% 23.528% (1 way, 64 line)
4096 33.036% 20.892% (1 way, 64 line)

524288 10.146% 7.522% (2 way, 64 line)
262144 10.324% 7.084% (2 way, 64 line)
131072 10.683% 6.786% (2 way, 64 line)
65536 11.162% 6.781% (2 way, 64 line)
32768 12.508% 7.321% (2 way, 64 line)
16384 16.721% 10.109% (2 way, 64 line)
8192 19.341% 7.942% (2 way, 64 line)
4096 21.941% 7.292% (2 way, 64 line)

So, at least with this model, it seems:
16B line size, conflict misses are a minor issue.
32B line size, OK
64B line size, conflict misses really suck here...

So, it would appear that associativity may be significantly more
important with a larger cache-line size.

Also, a DM L1 cache with 64B cache lines would suck...

Also looks like a bit of an incentive to stay with 16B L1 cache lines...

Re: Misc: Cache Sizes / Observations

<2022Mar2.101217@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23865&group=comp.arch#23865

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 02 Mar 2022 09:12:17 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 78
Message-ID: <2022Mar2.101217@mips.complang.tuwien.ac.at>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me> <svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="58f5aa2b81cd968f124b82d9145f6dbd";
logging-data="8897"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19eGHZV4vxI+4ivfi5BlhWj"
Cancel-Lock: sha1:2r1EtkEGQceUw2q4pn7y+Of6xlA=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Wed, 2 Mar 2022 09:12 UTC

BGB <cr88192@gmail.com> writes:
>After more fiddling with the model...
>
>(Emulator performance really does not like this one...).
>
>
>Cache Size Miss% Conflict% (n-way, line size)
> 131072 2.004% 1.246% (1 way, 16 line)
> 65536 2.851% 1.437% (1 way, 16 line)
> 32768 3.540% 1.387% (1 way, 16 line)
> 16384 6.604% 3.947% (1 way, 16 line)
> 8192 9.112% 5.609% (1 way, 16 line)
> 4096 11.310% 4.514% (1 way, 16 line)
> 2048 14.326% 5.713% (1 way, 16 line)
> 1024 17.632% 6.608% (1 way, 16 line)
>
> 131072 1.966% 0.709% (2 way, 16 line)
> 65536 2.550% 0.446% (2 way, 16 line)
> 32768 3.303% 0.690% (2 way, 16 line)
> 16384 6.779% 3.366% (2 way, 16 line)
> 8192 8.484% 1.762% (2 way, 16 line)
> 4096 10.905% 2.454% (2 way, 16 line)
> 2048 13.588% 2.808% (2 way, 16 line)
> 1024 16.022% 2.516% (2 way, 16 line)
>
> 262144 15.748% 14.724% (1 way, 32 line)
> 131072 15.929% 14.348% (1 way, 32 line)
> 65536 16.274% 14.021% (1 way, 32 line)
> 32768 16.843% 14.067% (1 way, 32 line)
> 16384 19.819% 16.420% (1 way, 32 line)
> 8192 22.360% 17.992% (1 way, 32 line)
> 4096 24.583% 16.250% (1 way, 32 line)
> 2048 27.325% 16.206% (1 way, 32 line)
>
> 262144 2.093% 0.805% (2 way, 32 line)
> 131072 2.893% 0.785% (2 way, 32 line)
> 65536 3.540% 0.881% (2 way, 32 line)
> 32768 4.854% 1.648% (2 way, 32 line)
> 16384 8.788% 4.655% (2 way, 32 line)
> 8192 11.456% 3.432% (2 way, 32 line)
> 4096 14.699% 3.980% (2 way, 32 line)
> 2048 17.580% 3.795% (2 way, 32 line)
>
> 524288 23.561% 20.547% (1 way, 64 line)
> 262144 23.725% 20.382% (1 way, 64 line)
> 131072 23.851% 19.919% (1 way, 64 line)
> 65536 24.245% 19.661% (1 way, 64 line)
> 32768 24.903% 19.838% (1 way, 64 line)
> 16384 28.103% 22.219% (1 way, 64 line)
> 8192 30.857% 23.528% (1 way, 64 line)
> 4096 33.036% 20.892% (1 way, 64 line)
>
> 524288 10.146% 7.522% (2 way, 64 line)
> 262144 10.324% 7.084% (2 way, 64 line)
> 131072 10.683% 6.786% (2 way, 64 line)
> 65536 11.162% 6.781% (2 way, 64 line)
> 32768 12.508% 7.321% (2 way, 64 line)
> 16384 16.721% 10.109% (2 way, 64 line)
> 8192 19.341% 7.942% (2 way, 64 line)
> 4096 21.941% 7.292% (2 way, 64 line)
>
>
>So, at least with this model, it seems:
> 16B line size, conflict misses are a minor issue.
> 32B line size, OK
> 64B line size, conflict misses really suck here...

A 512KB cache with 64-byte lines has 8192 cache lines, just like a
256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
Without spatial locality, I would expect similar miss rates for all of
them for the same associativity; and given that programs have spatial
locality, I would expect the larger among these configurations to have
an advantage. Are you sure that your cache simulator has no bugs?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Cache Sizes / Observations

<svoa5q$ud$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23870&group=comp.arch#23870

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 2 Mar 2022 11:40:07 -0600
Organization: A noiseless patient Spider
Lines: 161
Message-ID: <svoa5q$ud$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 2 Mar 2022 17:40:10 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ad13c59147a05662898aa93d34e41054";
logging-data="973"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18hyljafc9KZzw+fp3us+qL"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:DE2MCw2fG1yy/eaEEKp4iZFxs+Y=
In-Reply-To: <2022Mar2.101217@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: BGB - Wed, 2 Mar 2022 17:40 UTC

On 3/2/2022 3:12 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> After more fiddling with the model...
>>
>> (Emulator performance really does not like this one...).
>>
>>
>> Cache Size Miss% Conflict% (n-way, line size)
>> 131072 2.004% 1.246% (1 way, 16 line)
>> 65536 2.851% 1.437% (1 way, 16 line)
>> 32768 3.540% 1.387% (1 way, 16 line)
>> 16384 6.604% 3.947% (1 way, 16 line)
>> 8192 9.112% 5.609% (1 way, 16 line)
>> 4096 11.310% 4.514% (1 way, 16 line)
>> 2048 14.326% 5.713% (1 way, 16 line)
>> 1024 17.632% 6.608% (1 way, 16 line)
>>
>> 131072 1.966% 0.709% (2 way, 16 line)
>> 65536 2.550% 0.446% (2 way, 16 line)
>> 32768 3.303% 0.690% (2 way, 16 line)
>> 16384 6.779% 3.366% (2 way, 16 line)
>> 8192 8.484% 1.762% (2 way, 16 line)
>> 4096 10.905% 2.454% (2 way, 16 line)
>> 2048 13.588% 2.808% (2 way, 16 line)
>> 1024 16.022% 2.516% (2 way, 16 line)
>>
>> 262144 15.748% 14.724% (1 way, 32 line)
>> 131072 15.929% 14.348% (1 way, 32 line)
>> 65536 16.274% 14.021% (1 way, 32 line)
>> 32768 16.843% 14.067% (1 way, 32 line)
>> 16384 19.819% 16.420% (1 way, 32 line)
>> 8192 22.360% 17.992% (1 way, 32 line)
>> 4096 24.583% 16.250% (1 way, 32 line)
>> 2048 27.325% 16.206% (1 way, 32 line)
>>
>> 262144 2.093% 0.805% (2 way, 32 line)
>> 131072 2.893% 0.785% (2 way, 32 line)
>> 65536 3.540% 0.881% (2 way, 32 line)
>> 32768 4.854% 1.648% (2 way, 32 line)
>> 16384 8.788% 4.655% (2 way, 32 line)
>> 8192 11.456% 3.432% (2 way, 32 line)
>> 4096 14.699% 3.980% (2 way, 32 line)
>> 2048 17.580% 3.795% (2 way, 32 line)
>>
>> 524288 23.561% 20.547% (1 way, 64 line)
>> 262144 23.725% 20.382% (1 way, 64 line)
>> 131072 23.851% 19.919% (1 way, 64 line)
>> 65536 24.245% 19.661% (1 way, 64 line)
>> 32768 24.903% 19.838% (1 way, 64 line)
>> 16384 28.103% 22.219% (1 way, 64 line)
>> 8192 30.857% 23.528% (1 way, 64 line)
>> 4096 33.036% 20.892% (1 way, 64 line)
>>
>> 524288 10.146% 7.522% (2 way, 64 line)
>> 262144 10.324% 7.084% (2 way, 64 line)
>> 131072 10.683% 6.786% (2 way, 64 line)
>> 65536 11.162% 6.781% (2 way, 64 line)
>> 32768 12.508% 7.321% (2 way, 64 line)
>> 16384 16.721% 10.109% (2 way, 64 line)
>> 8192 19.341% 7.942% (2 way, 64 line)
>> 4096 21.941% 7.292% (2 way, 64 line)
>>
>>
>> So, at least with this model, it seems:
>> 16B line size, conflict misses are a minor issue.
>> 32B line size, OK
>> 64B line size, conflict misses really suck here...
>
> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
> Without spatial locality, I would expect similar miss rates for all of
> them for the same associativity; and given that programs have spatial
> locality, I would expect the larger among these configurations to have
> an advantage. Are you sure that your cache simulator has no bugs?
>

It may actually be buggy; I will have to look into it more.
I don't have a great explanation for why the 64B cache lines seem to
suck so badly.

As can be noted, it does conflict with some of the values from my
"main" cache model, and the logic for this model was very
quick-and-dirty.

I posted a graph of this on Twitter:
https://twitter.com/cr88192/status/1498864923434180609

It can be noted that there is a hump in the conflict-miss numbers, which
implies that it is exceeding the limits of the (4-way) cache used to
implement the conflict-miss estimate (it may make sense to expand this
to an 8-way LRU cache).

Looking at it some more... There was a bug: the model was not discarding
the low-order bits of the address for larger cache line sizes.
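
For reference, roughly the address split the fix amounts to (an
illustrative sketch only, not the actual model code; names are made up,
and the point is just that the in-line offset bits have to be dropped
before forming the index and tag):

  #include <stdint.h>

  typedef struct { uint64_t tag; int valid; } ModelLine;

  /* off_bits = log2(line size in bytes), idx_bits = log2(number of lines) */
  static int dm_lookup(ModelLine *lines, int off_bits, int idx_bits,
                       uint64_t addr)
  {
      uint64_t line_addr = addr >> off_bits;     /* drop in-line offset bits */
      uint64_t idx = line_addr & ((1ULL << idx_bits) - 1);
      uint64_t tag = line_addr >> idx_bits;

      if (lines[idx].valid && lines[idx].tag == tag)
          return 1;                              /* hit */
      lines[idx].tag   = tag;                    /* miss: fill the line */
      lines[idx].valid = 1;
      return 0;
  }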

After fixing this bug (columns: cache size in bytes, miss%, conflict%, configuration):
131072 2.004% 1.318% (1 way, 16 line)
65536 2.851% 1.516% (1 way, 16 line)
32768 3.540% 1.445% (1 way, 16 line)
16384 6.604% 4.043% (1 way, 16 line)
8192 9.112% 6.052% (1 way, 16 line)
4096 11.310% 4.504% (1 way, 16 line)
2048 14.326% 5.990% (1 way, 16 line)
1024 17.632% 7.066% (1 way, 16 line)

131072 1.966% 0.821% (2 way, 16 line)
65536 2.550% 0.513% (2 way, 16 line)
32768 3.303% 0.791% (2 way, 16 line)
16384 6.779% 3.842% (2 way, 16 line)
8192 8.484% 1.766% (2 way, 16 line)
4096 10.905% 2.773% (2 way, 16 line)
2048 13.588% 3.319% (2 way, 16 line)
1024 16.022% 3.229% (2 way, 16 line)

262144 15.748% 14.750% (1 way, 32 line)
131072 15.929% 14.382% (1 way, 32 line)
65536 16.274% 14.057% (1 way, 32 line)
32768 16.843% 14.122% (1 way, 32 line)
16384 19.819% 16.567% (1 way, 32 line)
8192 22.360% 18.319% (1 way, 32 line)
4096 24.583% 16.422% (1 way, 32 line)
2048 27.325% 16.513% (1 way, 32 line)

262144 2.093% 0.872% (2 way, 32 line)
131072 2.893% 0.836% (2 way, 32 line)
65536 3.540% 0.949% (2 way, 32 line)
32768 4.854% 1.808% (2 way, 32 line)
16384 8.788% 4.997% (2 way, 32 line)
8192 11.456% 3.638% (2 way, 32 line)
4096 14.699% 4.352% (2 way, 32 line)
2048 17.580% 4.251% (2 way, 32 line)

524288 10.825% 10.370% (1 way, 64 line)
262144 11.019% 10.412% (1 way, 64 line)
131072 11.188% 10.232% (1 way, 64 line)
65536 11.652% 10.320% (1 way, 64 line)
32768 12.388% 10.699% (1 way, 64 line)
16384 15.806% 13.581% (1 way, 64 line)
8192 18.753% 15.604% (1 way, 64 line)
4096 21.158% 13.259% (1 way, 64 line)

524288 0.863% 0.439% (2 way, 64 line)
262144 1.297% 0.535% (2 way, 64 line)
131072 1.896% 0.700% (2 way, 64 line)
65536 2.505% 0.979% (2 way, 64 line)
32768 3.876% 1.923% (2 way, 64 line)
16384 8.341% 5.535% (2 way, 64 line)
8192 11.346% 3.956% (2 way, 64 line)
4096 14.262% 3.722% (2 way, 64 line)

It would appear that 1-way still does poorly with larger cache lines,
but 2-way does a lot better...

As for why 1-way 32B now appears to be doing worse than 1-way 64B: no
idea, this doesn't really make sense. There may still be additional bugs.

....

Re: Misc: Cache Sizes / Observations

<2022Mar2.192515@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=23873&group=comp.arch#23873
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 02 Mar 2022 18:25:15 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 79
Message-ID: <2022Mar2.192515@mips.complang.tuwien.ac.at>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me> <svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="58f5aa2b81cd968f124b82d9145f6dbd";
logging-data="11627"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/xBEVa8DqxhkU6yaQco4pv"
Cancel-Lock: sha1:34HQHVuXp7x8OM2Td7W4p+8W1s0=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Wed, 2 Mar 2022 18:25 UTC

BGB <cr88192@gmail.com> writes:
>On 3/2/2022 3:12 AM, Anton Ertl wrote:
>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
>> Without spatial locality, I would expect similar miss rates for all of
>> them for the same associativity; and given that programs have spatial
>> locality, I would expect the larger among these configurations to have
>> an advantage. Are you sure that your cache simulator has no bugs?
....
>After fixing this bug:
> 131072 2.004% 1.318% (1 way, 16 line)
> 65536 2.851% 1.516% (1 way, 16 line)
> 32768 3.540% 1.445% (1 way, 16 line)
> 16384 6.604% 4.043% (1 way, 16 line)
> 8192 9.112% 6.052% (1 way, 16 line)
> 4096 11.310% 4.504% (1 way, 16 line)
> 2048 14.326% 5.990% (1 way, 16 line)
> 1024 17.632% 7.066% (1 way, 16 line)
>
> 131072 1.966% 0.821% (2 way, 16 line)
> 65536 2.550% 0.513% (2 way, 16 line)
> 32768 3.303% 0.791% (2 way, 16 line)
> 16384 6.779% 3.842% (2 way, 16 line)
> 8192 8.484% 1.766% (2 way, 16 line)
> 4096 10.905% 2.773% (2 way, 16 line)
> 2048 13.588% 3.319% (2 way, 16 line)
> 1024 16.022% 3.229% (2 way, 16 line)
>
> 262144 15.748% 14.750% (1 way, 32 line)
> 131072 15.929% 14.382% (1 way, 32 line)
> 65536 16.274% 14.057% (1 way, 32 line)
> 32768 16.843% 14.122% (1 way, 32 line)
> 16384 19.819% 16.567% (1 way, 32 line)
> 8192 22.360% 18.319% (1 way, 32 line)
> 4096 24.583% 16.422% (1 way, 32 line)
> 2048 27.325% 16.513% (1 way, 32 line)
>
> 262144 2.093% 0.872% (2 way, 32 line)
> 131072 2.893% 0.836% (2 way, 32 line)
> 65536 3.540% 0.949% (2 way, 32 line)
> 32768 4.854% 1.808% (2 way, 32 line)
> 16384 8.788% 4.997% (2 way, 32 line)
> 8192 11.456% 3.638% (2 way, 32 line)
> 4096 14.699% 4.352% (2 way, 32 line)
> 2048 17.580% 4.251% (2 way, 32 line)
>
> 524288 10.825% 10.370% (1 way, 64 line)
> 262144 11.019% 10.412% (1 way, 64 line)
> 131072 11.188% 10.232% (1 way, 64 line)
> 65536 11.652% 10.320% (1 way, 64 line)
> 32768 12.388% 10.699% (1 way, 64 line)
> 16384 15.806% 13.581% (1 way, 64 line)
> 8192 18.753% 15.604% (1 way, 64 line)
> 4096 21.158% 13.259% (1 way, 64 line)
>
> 524288 0.863% 0.439% (2 way, 64 line)
> 262144 1.297% 0.535% (2 way, 64 line)
> 131072 1.896% 0.700% (2 way, 64 line)
> 65536 2.505% 0.979% (2 way, 64 line)
> 32768 3.876% 1.923% (2 way, 64 line)
> 16384 8.341% 5.535% (2 way, 64 line)
> 8192 11.346% 3.956% (2 way, 64 line)
> 4096 14.262% 3.722% (2 way, 64 line)
>
>
>It would appear that 1-way still does poorly with larger cache lines,

Probably still a bug, see above.

>As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
>idea, this doesn't really make sense.

For the same number of cache lines, that's expected, due to spatial
locality.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Cache Sizes / Observations

<svogd7$o9s$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23876&group=comp.arch#23876
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 2 Mar 2022 13:26:28 -0600
Organization: A noiseless patient Spider
Lines: 169
Message-ID: <svogd7$o9s$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
<2022Mar2.192515@mips.complang.tuwien.ac.at>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 2 Mar 2022 19:26:31 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="ad13c59147a05662898aa93d34e41054";
logging-data="24892"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/hcc+6GFRQr2Q0ojxvP2Dj"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:9mpH4lfumtXmPTFlSnJ7sDH0mmE=
In-Reply-To: <2022Mar2.192515@mips.complang.tuwien.ac.at>
Content-Language: en-US
 by: BGB - Wed, 2 Mar 2022 19:26 UTC

On 3/2/2022 12:25 PM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> On 3/2/2022 3:12 AM, Anton Ertl wrote:
>>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
>>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
>>> Without spatial locality, I would expect similar miss rates for all of
>>> them for the same associativity; and given that programs have spatial
>>> locality, I would expect the larger among these configurations to have
>>> an advantage. Are you sure that your cache simulator has no bugs?
> ...
>> After fixing this bug:
>> 131072 2.004% 1.318% (1 way, 16 line)
>> 65536 2.851% 1.516% (1 way, 16 line)
>> 32768 3.540% 1.445% (1 way, 16 line)
>> 16384 6.604% 4.043% (1 way, 16 line)
>> 8192 9.112% 6.052% (1 way, 16 line)
>> 4096 11.310% 4.504% (1 way, 16 line)
>> 2048 14.326% 5.990% (1 way, 16 line)
>> 1024 17.632% 7.066% (1 way, 16 line)
>>
>> 131072 1.966% 0.821% (2 way, 16 line)
>> 65536 2.550% 0.513% (2 way, 16 line)
>> 32768 3.303% 0.791% (2 way, 16 line)
>> 16384 6.779% 3.842% (2 way, 16 line)
>> 8192 8.484% 1.766% (2 way, 16 line)
>> 4096 10.905% 2.773% (2 way, 16 line)
>> 2048 13.588% 3.319% (2 way, 16 line)
>> 1024 16.022% 3.229% (2 way, 16 line)
>>
>> 262144 15.748% 14.750% (1 way, 32 line)
>> 131072 15.929% 14.382% (1 way, 32 line)
>> 65536 16.274% 14.057% (1 way, 32 line)
>> 32768 16.843% 14.122% (1 way, 32 line)
>> 16384 19.819% 16.567% (1 way, 32 line)
>> 8192 22.360% 18.319% (1 way, 32 line)
>> 4096 24.583% 16.422% (1 way, 32 line)
>> 2048 27.325% 16.513% (1 way, 32 line)
>>
>> 262144 2.093% 0.872% (2 way, 32 line)
>> 131072 2.893% 0.836% (2 way, 32 line)
>> 65536 3.540% 0.949% (2 way, 32 line)
>> 32768 4.854% 1.808% (2 way, 32 line)
>> 16384 8.788% 4.997% (2 way, 32 line)
>> 8192 11.456% 3.638% (2 way, 32 line)
>> 4096 14.699% 4.352% (2 way, 32 line)
>> 2048 17.580% 4.251% (2 way, 32 line)
>>
>> 524288 10.825% 10.370% (1 way, 64 line)
>> 262144 11.019% 10.412% (1 way, 64 line)
>> 131072 11.188% 10.232% (1 way, 64 line)
>> 65536 11.652% 10.320% (1 way, 64 line)
>> 32768 12.388% 10.699% (1 way, 64 line)
>> 16384 15.806% 13.581% (1 way, 64 line)
>> 8192 18.753% 15.604% (1 way, 64 line)
>> 4096 21.158% 13.259% (1 way, 64 line)
>>
>> 524288 0.863% 0.439% (2 way, 64 line)
>> 262144 1.297% 0.535% (2 way, 64 line)
>> 131072 1.896% 0.700% (2 way, 64 line)
>> 65536 2.505% 0.979% (2 way, 64 line)
>> 32768 3.876% 1.923% (2 way, 64 line)
>> 16384 8.341% 5.535% (2 way, 64 line)
>> 8192 11.346% 3.956% (2 way, 64 line)
>> 4096 14.262% 3.722% (2 way, 64 line)
>>
>>
>> It would appear that 1-way still does poorly with larger cache lines,
>
> Probably still a bug, see above.
>
>> As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
>> idea, this doesn't really make sense.
>
> For the same number of cache lines, that's expected, due to spatial
> locality.
>

After posting this, I did find another bug:
the cache index pairs were always being calculated as if the cache line
were 16B. After fixing this bug, all the curves got much closer together
(as shown on another graph posted to Twitter after the last graph).

With this fix, the overall hit/miss ratio is now more consistent across
cache-line sizes. The main difference is that larger cache-line sizes
still have a higher proportion of conflict misses.

Values following the more recent bugfix (columns: cache size in bytes, miss%, conflict%, configuration):
131072 2.004% 1.318% (1 way, 16 line)
65536 2.851% 1.516% (1 way, 16 line)
32768 3.540% 1.445% (1 way, 16 line)
16384 6.604% 4.043% (1 way, 16 line)
8192 9.112% 6.052% (1 way, 16 line)
4096 11.310% 4.504% (1 way, 16 line)
2048 14.326% 5.990% (1 way, 16 line)
1024 17.632% 7.066% (1 way, 16 line)

131072 1.966% 0.821% (2 way, 16 line)
65536 2.550% 0.513% (2 way, 16 line)
32768 3.303% 0.791% (2 way, 16 line)
16384 6.779% 3.842% (2 way, 16 line)
8192 8.484% 1.766% (2 way, 16 line)
4096 10.905% 2.773% (2 way, 16 line)
2048 13.588% 3.319% (2 way, 16 line)
1024 16.022% 3.229% (2 way, 16 line)

262144 0.905% 0.630% (1 way, 32 line)
131072 1.326% 0.912% (1 way, 32 line)
65536 1.976% 1.160% (1 way, 32 line)
32768 2.712% 1.504% (1 way, 32 line)
16384 5.964% 4.456% (1 way, 32 line)
8192 8.728% 6.671% (1 way, 32 line)
4096 11.262% 5.347% (1 way, 32 line)
2048 14.579% 6.659% (1 way, 32 line)

262144 0.671% 0.329% (2 way, 32 line)
131072 1.148% 0.436% (2 way, 32 line)
65536 1.514% 0.348% (2 way, 32 line)
32768 2.259% 0.795% (2 way, 32 line)
16384 6.041% 4.169% (2 way, 32 line)
8192 8.102% 2.335% (2 way, 32 line)
4096 10.690% 3.071% (2 way, 32 line)
2048 13.354% 3.134% (2 way, 32 line)

524288 0.442% 0.311% (1 way, 64 line)
262144 0.717% 0.561% (1 way, 64 line)
131072 1.023% 0.759% (1 way, 64 line)
65536 1.661% 1.154% (1 way, 64 line)
32768 2.538% 1.818% (1 way, 64 line)
16384 6.168% 5.201% (1 way, 64 line)
8192 9.279% 7.640% (1 way, 64 line)
4096 12.084% 6.163% (1 way, 64 line)

524288 0.219% 0.079% (2 way, 64 line)
262144 0.428% 0.215% (2 way, 64 line)
131072 0.697% 0.251% (2 way, 64 line)
65536 0.984% 0.298% (2 way, 64 line)
32768 1.924% 1.005% (2 way, 64 line)
16384 6.169% 4.797% (2 way, 64 line)
8192 8.642% 2.983% (2 way, 64 line)
4096 10.878% 2.647% (2 way, 64 line)

Though, it seems 16B cache lines are no longer a clear winner in this
case...

There is still a hump in the conflict miss estimates; I still suspect
this is likely due to the smaller caches hitting the limit of the 4-way
estimator.

Well, either that, or I am not estimating the conflict miss rate correctly...

It may make sense to switch this part to comparing against an 8-way LRU,
and to calculate the conflict miss rate as the percentage of misses that
are conflict misses, rather than as a percentage of all accesses.
Probably TODO.
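
Roughly the sort of thing I have in mind (illustrative sketch only, not
the actual model code; the reference model is the same total size as the
cache under test, just 8-way LRU):

  #define REF_WAYS 8

  typedef struct {
      uint64_t tag[REF_WAYS];   /* tag[0] = most recently used */
      int      used;
  } RefSet;

  /* Returns 1 on hit; always leaves 'tag' in the MRU position. */
  static int ref_lru_access(RefSet *s, uint64_t tag)
  {
      int i, pos = -1, hit;
      for (i = 0; i < s->used; i++)
          if (s->tag[i] == tag) { pos = i; break; }
      hit = (pos >= 0);
      if (!hit) {                        /* miss: reuse the LRU slot if full */
          if (s->used < REF_WAYS) s->used++;
          pos = s->used - 1;
      }
      for (i = pos; i > 0; i--)          /* shift down, insert at MRU */
          s->tag[i] = s->tag[i - 1];
      s->tag[0] = tag;
      return hit;
  }

  /* Per access, with line_addr = addr >> log2(line size):
   *   if (miss in the cache under test) {
   *       misses++;
   *       if (ref_lru_access(&ref[line_addr % n_ref_sets],
   *                          line_addr / n_ref_sets))
   *           conflict_misses++;
   *   }
   * The conflict-miss rate would then be conflict_misses / misses,
   * i.e. relative to misses rather than to all accesses. */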

Predictions from this model are still a little different from my
original predictions though (there is no plateau, it just sorta levels
off slightly starting at around 32K).

Eg:
https://twitter.com/cr88192/status/1499085026335600640/photo/1

> - anton

Re: Misc: Cache Sizes / Observations

<svokin$rtp$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23880&group=comp.arch#23880
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 2 Mar 2022 12:37:41 -0800
Organization: A noiseless patient Spider
Lines: 189
Message-ID: <svokin$rtp$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
<2022Mar2.192515@mips.complang.tuwien.ac.at> <svogd7$o9s$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 2 Mar 2022 20:37:44 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="618ac8c3c579c6d3ff5aca798a93b9ac";
logging-data="28601"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+h0SZQeZsWNS+Fqg6nrjRnOjPMp/jWsXY="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:dnOludjS/qakjDpTyjyZU3FJNns=
In-Reply-To: <svogd7$o9s$1@dont-email.me>
Content-Language: en-US
 by: Stephen Fuld - Wed, 2 Mar 2022 20:37 UTC

On 3/2/2022 11:26 AM, BGB wrote:
> On 3/2/2022 12:25 PM, Anton Ertl wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 3/2/2022 3:12 AM, Anton Ertl wrote:
>>>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
>>>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
>>>> Without spatial locality, I would expect similar miss rates for all of
>>>> them for the same associativity; and given that programs have spatial
>>>> locality, I would expect the larger among these configurations to have
>>>> an advantage.  Are you sure that your cache simulator has no bugs?
>> ...
>>> After fixing this bug:
>>>   131072      2.004%     1.318% (1 way, 16 line)
>>>    65536      2.851%     1.516% (1 way, 16 line)
>>>    32768      3.540%     1.445% (1 way, 16 line)
>>>    16384      6.604%     4.043% (1 way, 16 line)
>>>     8192      9.112%     6.052% (1 way, 16 line)
>>>     4096     11.310%     4.504% (1 way, 16 line)
>>>     2048     14.326%     5.990% (1 way, 16 line)
>>>     1024     17.632%     7.066% (1 way, 16 line)
>>>
>>>   131072      1.966%     0.821% (2 way, 16 line)
>>>    65536      2.550%     0.513% (2 way, 16 line)
>>>    32768      3.303%     0.791% (2 way, 16 line)
>>>    16384      6.779%     3.842% (2 way, 16 line)
>>>     8192      8.484%     1.766% (2 way, 16 line)
>>>     4096     10.905%     2.773% (2 way, 16 line)
>>>     2048     13.588%     3.319% (2 way, 16 line)
>>>     1024     16.022%     3.229% (2 way, 16 line)
>>>
>>>   262144     15.748%    14.750% (1 way, 32 line)
>>>   131072     15.929%    14.382% (1 way, 32 line)
>>>    65536     16.274%    14.057% (1 way, 32 line)
>>>    32768     16.843%    14.122% (1 way, 32 line)
>>>    16384     19.819%    16.567% (1 way, 32 line)
>>>     8192     22.360%    18.319% (1 way, 32 line)
>>>     4096     24.583%    16.422% (1 way, 32 line)
>>>     2048     27.325%    16.513% (1 way, 32 line)
>>>
>>>   262144      2.093%     0.872% (2 way, 32 line)
>>>   131072      2.893%     0.836% (2 way, 32 line)
>>>    65536      3.540%     0.949% (2 way, 32 line)
>>>    32768      4.854%     1.808% (2 way, 32 line)
>>>    16384      8.788%     4.997% (2 way, 32 line)
>>>     8192     11.456%     3.638% (2 way, 32 line)
>>>     4096     14.699%     4.352% (2 way, 32 line)
>>>     2048     17.580%     4.251% (2 way, 32 line)
>>>
>>>   524288     10.825%    10.370% (1 way, 64 line)
>>>   262144     11.019%    10.412% (1 way, 64 line)
>>>   131072     11.188%    10.232% (1 way, 64 line)
>>>    65536     11.652%    10.320% (1 way, 64 line)
>>>    32768     12.388%    10.699% (1 way, 64 line)
>>>    16384     15.806%    13.581% (1 way, 64 line)
>>>     8192     18.753%    15.604% (1 way, 64 line)
>>>     4096     21.158%    13.259% (1 way, 64 line)
>>>
>>>   524288      0.863%     0.439% (2 way, 64 line)
>>>   262144      1.297%     0.535% (2 way, 64 line)
>>>   131072      1.896%     0.700% (2 way, 64 line)
>>>    65536      2.505%     0.979% (2 way, 64 line)
>>>    32768      3.876%     1.923% (2 way, 64 line)
>>>    16384      8.341%     5.535% (2 way, 64 line)
>>>     8192     11.346%     3.956% (2 way, 64 line)
>>>     4096     14.262%     3.722% (2 way, 64 line)
>>>
>>>
>>> It would appear that 1-way still does poorly with larger cache lines,
>>
>> Probably still a bug, see above.
>>
>>> As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
>>> idea, this doesn't really make sense.
>>
>> For the same number of cache lines, that's expected, due to spatial
>> locality.
>>
>
> After posting this, I did find another bug:
> The cache index pairs were always being calculated as-if the cache line
> were 16B. After fixing this bug, all the lines got much closer together
> (as shown on another graph posted to Twitter after the last graph).
>
> In this case, now the overall hit/miss ratio is more consistent between
> cache-line sizes. The main difference is that larger cache-line sizes
> still have a higher proportion of conflict misses.
>
>
> Values following the more recent bugfix:
>  131072      2.004%     1.318% (1 way, 16 line)
>   65536      2.851%     1.516% (1 way, 16 line)
>   32768      3.540%     1.445% (1 way, 16 line)
>   16384      6.604%     4.043% (1 way, 16 line)
>    8192      9.112%     6.052% (1 way, 16 line)
>    4096     11.310%     4.504% (1 way, 16 line)
>    2048     14.326%     5.990% (1 way, 16 line)
>    1024     17.632%     7.066% (1 way, 16 line)
>
>  131072      1.966%     0.821% (2 way, 16 line)
>   65536      2.550%     0.513% (2 way, 16 line)
>   32768      3.303%     0.791% (2 way, 16 line)
>   16384      6.779%     3.842% (2 way, 16 line)
>    8192      8.484%     1.766% (2 way, 16 line)
>    4096     10.905%     2.773% (2 way, 16 line)
>    2048     13.588%     3.319% (2 way, 16 line)
>    1024     16.022%     3.229% (2 way, 16 line)
>
>  262144      0.905%     0.630% (1 way, 32 line)
>  131072      1.326%     0.912% (1 way, 32 line)
>   65536      1.976%     1.160% (1 way, 32 line)
>   32768      2.712%     1.504% (1 way, 32 line)
>   16384      5.964%     4.456% (1 way, 32 line)
>    8192      8.728%     6.671% (1 way, 32 line)
>    4096     11.262%     5.347% (1 way, 32 line)
>    2048     14.579%     6.659% (1 way, 32 line)
>
>  262144      0.671%     0.329% (2 way, 32 line)
>  131072      1.148%     0.436% (2 way, 32 line)
>   65536      1.514%     0.348% (2 way, 32 line)
>   32768      2.259%     0.795% (2 way, 32 line)
>   16384      6.041%     4.169% (2 way, 32 line)
>    8192      8.102%     2.335% (2 way, 32 line)
>    4096     10.690%     3.071% (2 way, 32 line)
>    2048     13.354%     3.134% (2 way, 32 line)
>
>  524288      0.442%     0.311% (1 way, 64 line)
>  262144      0.717%     0.561% (1 way, 64 line)
>  131072      1.023%     0.759% (1 way, 64 line)
>   65536      1.661%     1.154% (1 way, 64 line)
>   32768      2.538%     1.818% (1 way, 64 line)
>   16384      6.168%     5.201% (1 way, 64 line)
>    8192      9.279%     7.640% (1 way, 64 line)
>    4096     12.084%     6.163% (1 way, 64 line)
>
>  524288      0.219%     0.079% (2 way, 64 line)
>  262144      0.428%     0.215% (2 way, 64 line)
>  131072      0.697%     0.251% (2 way, 64 line)
>   65536      0.984%     0.298% (2 way, 64 line)
>   32768      1.924%     1.005% (2 way, 64 line)
>   16384      6.169%     4.797% (2 way, 64 line)
>    8192      8.642%     2.983% (2 way, 64 line)
>    4096     10.878%     2.647% (2 way, 64 line)
>
>
> Though, it seems 16B cache lines are no longer a clear winner in this
> case...

I am not sure how you determined this.  To do an apples-to-apples
comparison, as Anton explained, you have to compare the same amount of
total cache.  Thus, for example, a 16 byte line at a particular cache
size should be compared against a 32 byte line at twice the cache size.
Otherwise, you can't distinguish the total cache size effect from the
cache line size effect.  To me, it seems like for equivalent cache
sizes, 16B lines are a clear win.  This reflects a higher proportion of
temporal locality than spatial locality.

BTW, with a little more work, you can distinguish spatial locality hits
from temporal locality hits. You have to keep the actual address that
caused the miss, then on subsequent hits, see if they are to the same or
a different address.
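
Something along these lines (a rough sketch of the bookkeeping only):

  #include <stdint.h>

  typedef struct {
      uint64_t tag;
      uint64_t fill_addr;   /* the address whose miss brought the line in */
      int      valid;
  } LineInfo;

  static long temporal_hits, spatial_hits, misses;

  /* Classify one access against the line slot it maps to. */
  static void classify_access(LineInfo *ln, uint64_t tag, uint64_t addr)
  {
      if (ln->valid && ln->tag == tag) {
          if (addr == ln->fill_addr)
              temporal_hits++;   /* same address that caused the miss */
          else
              spatial_hits++;    /* different address within the line */
      } else {
          misses++;
          ln->tag       = tag;
          ln->valid     = 1;
          ln->fill_addr = addr;  /* remember what brought the line in */
      }
  }

(This counts a repeat touch of some word other than the fill address as
a spatial hit, so it is only an approximation, but it is cheap.)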


Click here to read the complete article
Re: Misc: Cache Sizes / Observations

<3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23886&group=comp.arch#23886
X-Received: by 2002:ac8:5dca:0:b0:2de:57d8:7a89 with SMTP id e10-20020ac85dca000000b002de57d87a89mr26106985qtx.635.1646264600255;
Wed, 02 Mar 2022 15:43:20 -0800 (PST)
X-Received: by 2002:a05:6808:3021:b0:2cf:177:968a with SMTP id
ay33-20020a056808302100b002cf0177968amr2176652oib.119.1646264600008; Wed, 02
Mar 2022 15:43:20 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 2 Mar 2022 15:43:19 -0800 (PST)
In-Reply-To: <svokin$rtp$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:446a:c769:120a:97bb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:446a:c769:120a:97bb
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Wed, 02 Mar 2022 23:43:20 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 186
 by: MitchAlsup - Wed, 2 Mar 2022 23:43 UTC

On Wednesday, March 2, 2022 at 2:37:47 PM UTC-6, Stephen Fuld wrote:
> On 3/2/2022 11:26 AM, BGB wrote:
> > On 3/2/2022 12:25 PM, Anton Ertl wrote:
> >> BGB <cr8...@gmail.com> writes:
> >>> On 3/2/2022 3:12 AM, Anton Ertl wrote:
> >>>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
> >>>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
> >>>> Without spatial locality, I would expect similar miss rates for all of
> >>>> them for the same associativity; and given that programs have spatial
> >>>> locality, I would expect the larger among these configurations to have
> >>>> an advantage. Are you sure that your cache simulator has no bugs?
> >> ...
> >>> After fixing this bug:
> >>> 131072 2.004% 1.318% (1 way, 16 line)
> >>> 65536 2.851% 1.516% (1 way, 16 line)
> >>> 32768 3.540% 1.445% (1 way, 16 line)
> >>> 16384 6.604% 4.043% (1 way, 16 line)
> >>> 8192 9.112% 6.052% (1 way, 16 line)
> >>> 4096 11.310% 4.504% (1 way, 16 line)
> >>> 2048 14.326% 5.990% (1 way, 16 line)
> >>> 1024 17.632% 7.066% (1 way, 16 line)
> >>>
> >>> 131072 1.966% 0.821% (2 way, 16 line)
> >>> 65536 2.550% 0.513% (2 way, 16 line)
> >>> 32768 3.303% 0.791% (2 way, 16 line)
> >>> 16384 6.779% 3.842% (2 way, 16 line)
> >>> 8192 8.484% 1.766% (2 way, 16 line)
> >>> 4096 10.905% 2.773% (2 way, 16 line)
> >>> 2048 13.588% 3.319% (2 way, 16 line)
> >>> 1024 16.022% 3.229% (2 way, 16 line)
> >>>
> >>> 262144 15.748% 14.750% (1 way, 32 line)
> >>> 131072 15.929% 14.382% (1 way, 32 line)
> >>> 65536 16.274% 14.057% (1 way, 32 line)
> >>> 32768 16.843% 14.122% (1 way, 32 line)
> >>> 16384 19.819% 16.567% (1 way, 32 line)
> >>> 8192 22.360% 18.319% (1 way, 32 line)
> >>> 4096 24.583% 16.422% (1 way, 32 line)
> >>> 2048 27.325% 16.513% (1 way, 32 line)
> >>>
> >>> 262144 2.093% 0.872% (2 way, 32 line)
> >>> 131072 2.893% 0.836% (2 way, 32 line)
> >>> 65536 3.540% 0.949% (2 way, 32 line)
> >>> 32768 4.854% 1.808% (2 way, 32 line)
> >>> 16384 8.788% 4.997% (2 way, 32 line)
> >>> 8192 11.456% 3.638% (2 way, 32 line)
> >>> 4096 14.699% 4.352% (2 way, 32 line)
> >>> 2048 17.580% 4.251% (2 way, 32 line)
> >>>
> >>> 524288 10.825% 10.370% (1 way, 64 line)
> >>> 262144 11.019% 10.412% (1 way, 64 line)
> >>> 131072 11.188% 10.232% (1 way, 64 line)
> >>> 65536 11.652% 10.320% (1 way, 64 line)
> >>> 32768 12.388% 10.699% (1 way, 64 line)
> >>> 16384 15.806% 13.581% (1 way, 64 line)
> >>> 8192 18.753% 15.604% (1 way, 64 line)
> >>> 4096 21.158% 13.259% (1 way, 64 line)
> >>>
> >>> 524288 0.863% 0.439% (2 way, 64 line)
> >>> 262144 1.297% 0.535% (2 way, 64 line)
> >>> 131072 1.896% 0.700% (2 way, 64 line)
> >>> 65536 2.505% 0.979% (2 way, 64 line)
> >>> 32768 3.876% 1.923% (2 way, 64 line)
> >>> 16384 8.341% 5.535% (2 way, 64 line)
> >>> 8192 11.346% 3.956% (2 way, 64 line)
> >>> 4096 14.262% 3.722% (2 way, 64 line)
> >>>
> >>>
> >>> It would appear that 1-way still does poorly with larger cache lines,
> >>
> >> Probably still a bug, see above.
> >>
> >>> As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
> >>> idea, this doesn't really make sense.
> >>
> >> For the same number of cache lines, that's expected, due to spatial
> >> locality.
> >>
> >
> > After posting this, I did find another bug:
> > The cache index pairs were always being calculated as-if the cache line
> > were 16B. After fixing this bug, all the lines got much closer together
> > (as shown on another graph posted to Twitter after the last graph).
> >
> > In this case, now the overall hit/miss ratio is more consistent between
> > cache-line sizes. The main difference is that larger cache-line sizes
> > still have a higher proportion of conflict misses.
> >
> >
> > Values following the more recent bugfix:
> > 131072 2.004% 1.318% (1 way, 16 line)
> > 65536 2.851% 1.516% (1 way, 16 line)
> > 32768 3.540% 1.445% (1 way, 16 line)
> > 16384 6.604% 4.043% (1 way, 16 line)
> > 8192 9.112% 6.052% (1 way, 16 line)
> > 4096 11.310% 4.504% (1 way, 16 line)
> > 2048 14.326% 5.990% (1 way, 16 line)
> > 1024 17.632% 7.066% (1 way, 16 line)
> >
> > 131072 1.966% 0.821% (2 way, 16 line)
> > 65536 2.550% 0.513% (2 way, 16 line)
> > 32768 3.303% 0.791% (2 way, 16 line)
> > 16384 6.779% 3.842% (2 way, 16 line)
> > 8192 8.484% 1.766% (2 way, 16 line)
> > 4096 10.905% 2.773% (2 way, 16 line)
> > 2048 13.588% 3.319% (2 way, 16 line)
> > 1024 16.022% 3.229% (2 way, 16 line)
> >
> > 262144 0.905% 0.630% (1 way, 32 line)
> > 131072 1.326% 0.912% (1 way, 32 line)
> > 65536 1.976% 1.160% (1 way, 32 line)
> > 32768 2.712% 1.504% (1 way, 32 line)
> > 16384 5.964% 4.456% (1 way, 32 line)
> > 8192 8.728% 6.671% (1 way, 32 line)
> > 4096 11.262% 5.347% (1 way, 32 line)
> > 2048 14.579% 6.659% (1 way, 32 line)
> >
> > 262144 0.671% 0.329% (2 way, 32 line)
> > 131072 1.148% 0.436% (2 way, 32 line)
> > 65536 1.514% 0.348% (2 way, 32 line)
> > 32768 2.259% 0.795% (2 way, 32 line)
> > 16384 6.041% 4.169% (2 way, 32 line)
> > 8192 8.102% 2.335% (2 way, 32 line)
> > 4096 10.690% 3.071% (2 way, 32 line)
> > 2048 13.354% 3.134% (2 way, 32 line)
> >
> > 524288 0.442% 0.311% (1 way, 64 line)
> > 262144 0.717% 0.561% (1 way, 64 line)
> > 131072 1.023% 0.759% (1 way, 64 line)
> > 65536 1.661% 1.154% (1 way, 64 line)
> > 32768 2.538% 1.818% (1 way, 64 line)
> > 16384 6.168% 5.201% (1 way, 64 line)
> > 8192 9.279% 7.640% (1 way, 64 line)
> > 4096 12.084% 6.163% (1 way, 64 line)
> >
> > 524288 0.219% 0.079% (2 way, 64 line)
> > 262144 0.428% 0.215% (2 way, 64 line)
> > 131072 0.697% 0.251% (2 way, 64 line)
> > 65536 0.984% 0.298% (2 way, 64 line)
> > 32768 1.924% 1.005% (2 way, 64 line)
> > 16384 6.169% 4.797% (2 way, 64 line)
> > 8192 8.642% 2.983% (2 way, 64 line)
> > 4096 10.878% 2.647% (2 way, 64 line)
> >
> >
> > Though, it seems 16B cache lines are no longer a clear winner in this
> > case...
> I am not sure how you determined this. To do an apples to apples
> comparison, as Anton explained, you have to compare the same amount of
> total cache. Thus, for example, comparing a 16 byte line at a
> particular cache size should be compared against a 32 byte line at twice
> the cache size. Otherwise, the you can't distinguish between the total
> cache size effect and the cache line size effect. To me, it seems like
> for equivalent cache sizes, 16B lines are a clear win. This reflects a
> higher proportion of temporal locality than spatial locality.
<
And then there is that bus occupancy thing.........
>
> BTW, with a little more work, you can distinguish spatial locality hits
> from temporal locality hits. You have to keep the actual address that
> caused the miss, then on subsequent hits, see if they are to the same or
> a different address.
<
We figured out (around 1992) that one could model all cache sizes
and association levels simultaneously.
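
For example (a sketch of the general idea, using LRU stack distances):
keep the line addresses in LRU order; the reuse depth of each access
then gives, in a single pass, every fully-associative LRU capacity that
would have hit, and one stack per set extends this to set-associative
configurations.

  #include <stdint.h>

  #define MAX_DEPTH 8192       /* largest capacity (in lines) tracked */

  static uint64_t lru_stack[MAX_DEPTH];       /* line addresses, MRU first */
  static int      stack_used;
  static long     depth_count[MAX_DEPTH + 1]; /* [d] = accesses with reuse depth d */

  /* Feed one access (addr already shifted down to a line address). */
  static void stack_access(uint64_t line_addr)
  {
      int i, pos = -1;
      for (i = 0; i < stack_used; i++)
          if (lru_stack[i] == line_addr) { pos = i; break; }
      if (pos >= 0) {
          depth_count[pos + 1]++;          /* reuse depth = pos + 1 */
      } else if (stack_used < MAX_DEPTH) {
          pos = stack_used++;              /* first touch: push a new entry */
      } else {
          pos = MAX_DEPTH - 1;             /* deeper than tracked: drops the LRU */
      }
      for (i = pos; i > 0; i--)            /* move this line to the MRU slot */
          lru_stack[i] = lru_stack[i - 1];
      lru_stack[0] = line_addr;
  }

  /* A fully-associative LRU cache of C lines hits on exactly
   * depth_count[1] + ... + depth_count[C] of the accesses, so one pass
   * gives miss rates for every capacity up to MAX_DEPTH at once. */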
>
> One more suggestion. If you are not outputting a trace, but keeping the
> statistics "on the fly" in the emulator, this is non-optimal. Yes,
> outputting the full trace (of all loads and stores) will slow the
> emulator down significantly, but you only have to do it once. Then by
> creating another program that just emulates the cache behavior, you can
> run lots of tests on the same trace data (different line sizes, cache
> sizes, LRU policies, number of ways, etc.) without rerunning the full
> simulator. This second program should be quite fast, as it doesn't have
> to emulate the whole CPU.
> > There is still a hump in the conflict miss estimates, I still suspect
> > this is likely due to the smaller caches hitting the limit of the 4-way
> > estimator.
> >
> > Well, either that, or I am not estimating conflict miss rate correctly...
> See
>
>
> https://en.wikipedia.org/wiki/Cache_performance_measurement_and_metric#Conflict_misses
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)


Click here to read the complete article
Re: Misc: Cache Sizes / Observations

<svp6sl$s7t$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23890&group=comp.arch#23890
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 2 Mar 2022 19:50:10 -0600
Organization: A noiseless patient Spider
Lines: 232
Message-ID: <svp6sl$s7t$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
<2022Mar2.192515@mips.complang.tuwien.ac.at> <svogd7$o9s$1@dont-email.me>
<svokin$rtp$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 3 Mar 2022 01:50:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c8be40ab30704a686bce3661f550954b";
logging-data="28925"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19HDvBWPbUhvGsuTUZZKR3i"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:QhFDJ5NrX2MTAHtPGVAbS3xNc4A=
In-Reply-To: <svokin$rtp$1@dont-email.me>
Content-Language: en-US
 by: BGB - Thu, 3 Mar 2022 01:50 UTC

(Response delayed; I was going to leave this until I had something
worthwhile to add, but that didn't happen...).

On 3/2/2022 2:37 PM, Stephen Fuld wrote:
> On 3/2/2022 11:26 AM, BGB wrote:
>> On 3/2/2022 12:25 PM, Anton Ertl wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> On 3/2/2022 3:12 AM, Anton Ertl wrote:
>>>>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
>>>>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
>>>>> Without spatial locality, I would expect similar miss rates for all of
>>>>> them for the same associativity; and given that programs have spatial
>>>>> locality, I would expect the larger among these configurations to have
>>>>> an advantage.  Are you sure that your cache simulator has no bugs?
>>> ...
>>>> After fixing this bug:
>>>>   131072      2.004%     1.318% (1 way, 16 line)
>>>>    65536      2.851%     1.516% (1 way, 16 line)
>>>>    32768      3.540%     1.445% (1 way, 16 line)
>>>>    16384      6.604%     4.043% (1 way, 16 line)
>>>>     8192      9.112%     6.052% (1 way, 16 line)
>>>>     4096     11.310%     4.504% (1 way, 16 line)
>>>>     2048     14.326%     5.990% (1 way, 16 line)
>>>>     1024     17.632%     7.066% (1 way, 16 line)
>>>>
>>>>   131072      1.966%     0.821% (2 way, 16 line)
>>>>    65536      2.550%     0.513% (2 way, 16 line)
>>>>    32768      3.303%     0.791% (2 way, 16 line)
>>>>    16384      6.779%     3.842% (2 way, 16 line)
>>>>     8192      8.484%     1.766% (2 way, 16 line)
>>>>     4096     10.905%     2.773% (2 way, 16 line)
>>>>     2048     13.588%     3.319% (2 way, 16 line)
>>>>     1024     16.022%     3.229% (2 way, 16 line)
>>>>
>>>>   262144     15.748%    14.750% (1 way, 32 line)
>>>>   131072     15.929%    14.382% (1 way, 32 line)
>>>>    65536     16.274%    14.057% (1 way, 32 line)
>>>>    32768     16.843%    14.122% (1 way, 32 line)
>>>>    16384     19.819%    16.567% (1 way, 32 line)
>>>>     8192     22.360%    18.319% (1 way, 32 line)
>>>>     4096     24.583%    16.422% (1 way, 32 line)
>>>>     2048     27.325%    16.513% (1 way, 32 line)
>>>>
>>>>   262144      2.093%     0.872% (2 way, 32 line)
>>>>   131072      2.893%     0.836% (2 way, 32 line)
>>>>    65536      3.540%     0.949% (2 way, 32 line)
>>>>    32768      4.854%     1.808% (2 way, 32 line)
>>>>    16384      8.788%     4.997% (2 way, 32 line)
>>>>     8192     11.456%     3.638% (2 way, 32 line)
>>>>     4096     14.699%     4.352% (2 way, 32 line)
>>>>     2048     17.580%     4.251% (2 way, 32 line)
>>>>
>>>>   524288     10.825%    10.370% (1 way, 64 line)
>>>>   262144     11.019%    10.412% (1 way, 64 line)
>>>>   131072     11.188%    10.232% (1 way, 64 line)
>>>>    65536     11.652%    10.320% (1 way, 64 line)
>>>>    32768     12.388%    10.699% (1 way, 64 line)
>>>>    16384     15.806%    13.581% (1 way, 64 line)
>>>>     8192     18.753%    15.604% (1 way, 64 line)
>>>>     4096     21.158%    13.259% (1 way, 64 line)
>>>>
>>>>   524288      0.863%     0.439% (2 way, 64 line)
>>>>   262144      1.297%     0.535% (2 way, 64 line)
>>>>   131072      1.896%     0.700% (2 way, 64 line)
>>>>    65536      2.505%     0.979% (2 way, 64 line)
>>>>    32768      3.876%     1.923% (2 way, 64 line)
>>>>    16384      8.341%     5.535% (2 way, 64 line)
>>>>     8192     11.346%     3.956% (2 way, 64 line)
>>>>     4096     14.262%     3.722% (2 way, 64 line)
>>>>
>>>>
>>>> It would appear that 1-way still does poorly with larger cache lines,
>>>
>>> Probably still a bug, see above.
>>>
>>>> As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
>>>> idea, this doesn't really make sense.
>>>
>>> For the same number of cache lines, that's expected, due to spatial
>>> locality.
>>>
>>
>> After posting this, I did find another bug:
>> The cache index pairs were always being calculated as-if the cache
>> line were 16B. After fixing this bug, all the lines got much closer
>> together (as shown on another graph posted to Twitter after the last
>> graph).
>>
>> In this case, now the overall hit/miss ratio is more consistent
>> between cache-line sizes. The main difference is that larger
>> cache-line sizes still have a higher proportion of conflict misses.
>>
>>
>> Values following the more recent bugfix:
>>   131072      2.004%     1.318% (1 way, 16 line)
>>    65536      2.851%     1.516% (1 way, 16 line)
>>    32768      3.540%     1.445% (1 way, 16 line)
>>    16384      6.604%     4.043% (1 way, 16 line)
>>     8192      9.112%     6.052% (1 way, 16 line)
>>     4096     11.310%     4.504% (1 way, 16 line)
>>     2048     14.326%     5.990% (1 way, 16 line)
>>     1024     17.632%     7.066% (1 way, 16 line)
>>
>>   131072      1.966%     0.821% (2 way, 16 line)
>>    65536      2.550%     0.513% (2 way, 16 line)
>>    32768      3.303%     0.791% (2 way, 16 line)
>>    16384      6.779%     3.842% (2 way, 16 line)
>>     8192      8.484%     1.766% (2 way, 16 line)
>>     4096     10.905%     2.773% (2 way, 16 line)
>>     2048     13.588%     3.319% (2 way, 16 line)
>>     1024     16.022%     3.229% (2 way, 16 line)
>>
>>   262144      0.905%     0.630% (1 way, 32 line)
>>   131072      1.326%     0.912% (1 way, 32 line)
>>    65536      1.976%     1.160% (1 way, 32 line)
>>    32768      2.712%     1.504% (1 way, 32 line)
>>    16384      5.964%     4.456% (1 way, 32 line)
>>     8192      8.728%     6.671% (1 way, 32 line)
>>     4096     11.262%     5.347% (1 way, 32 line)
>>     2048     14.579%     6.659% (1 way, 32 line)
>>
>>   262144      0.671%     0.329% (2 way, 32 line)
>>   131072      1.148%     0.436% (2 way, 32 line)
>>    65536      1.514%     0.348% (2 way, 32 line)
>>    32768      2.259%     0.795% (2 way, 32 line)
>>    16384      6.041%     4.169% (2 way, 32 line)
>>     8192      8.102%     2.335% (2 way, 32 line)
>>     4096     10.690%     3.071% (2 way, 32 line)
>>     2048     13.354%     3.134% (2 way, 32 line)
>>
>>   524288      0.442%     0.311% (1 way, 64 line)
>>   262144      0.717%     0.561% (1 way, 64 line)
>>   131072      1.023%     0.759% (1 way, 64 line)
>>    65536      1.661%     1.154% (1 way, 64 line)
>>    32768      2.538%     1.818% (1 way, 64 line)
>>    16384      6.168%     5.201% (1 way, 64 line)
>>     8192      9.279%     7.640% (1 way, 64 line)
>>     4096     12.084%     6.163% (1 way, 64 line)
>>
>>   524288      0.219%     0.079% (2 way, 64 line)
>>   262144      0.428%     0.215% (2 way, 64 line)
>>   131072      0.697%     0.251% (2 way, 64 line)
>>    65536      0.984%     0.298% (2 way, 64 line)
>>    32768      1.924%     1.005% (2 way, 64 line)
>>    16384      6.169%     4.797% (2 way, 64 line)
>>     8192      8.642%     2.983% (2 way, 64 line)
>>     4096     10.878%     2.647% (2 way, 64 line)
>>
>>
>> Though, it seems 16B cache lines are no longer a clear winner in this
>> case...
>
> I am not sure how you determined this.  To do an apples to apples
> comparison, as Anton explained, you have to compare the same amount of
> total cache.  Thus, for example, comparing a 16 byte line at a
> particular cache size should be compared against a 32 byte line at twice
> the cache size.  Otherwise, the you can't distinguish between the total
> cache size effect and the cache line size effect.  To me, it seems like
> for equivalent cache sizes, 16B lines are a clear win.  This reflects a
> higher proportion of temporal locality than spatial locality.
>


Click here to read the complete article
Re: Misc: Cache Sizes / Observations

<svp7va$3ga$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=23891&group=comp.arch#23891
Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Wed, 2 Mar 2022 20:08:39 -0600
Organization: A noiseless patient Spider
Lines: 212
Message-ID: <svp7va$3ga$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
<2022Mar2.192515@mips.complang.tuwien.ac.at> <svogd7$o9s$1@dont-email.me>
<svokin$rtp$1@dont-email.me>
<3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 3 Mar 2022 02:08:42 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c8be40ab30704a686bce3661f550954b";
logging-data="3594"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/dSR/JSUGgBZaaiOK+/ov1"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:ezUeFaau7LmeeL24syvu1vEY870=
In-Reply-To: <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
Content-Language: en-US
 by: BGB - Thu, 3 Mar 2022 02:08 UTC

On 3/2/2022 5:43 PM, MitchAlsup wrote:
> On Wednesday, March 2, 2022 at 2:37:47 PM UTC-6, Stephen Fuld wrote:
>> On 3/2/2022 11:26 AM, BGB wrote:
>>> On 3/2/2022 12:25 PM, Anton Ertl wrote:
>>>> BGB <cr8...@gmail.com> writes:
>>>>> On 3/2/2022 3:12 AM, Anton Ertl wrote:
>>>>>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
>>>>>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
>>>>>> Without spatial locality, I would expect similar miss rates for all of
>>>>>> them for the same associativity; and given that programs have spatial
>>>>>> locality, I would expect the larger among these configurations to have
>>>>>> an advantage. Are you sure that your cache simulator has no bugs?
>>>> ...
>>>>> After fixing this bug:
>>>>> 131072 2.004% 1.318% (1 way, 16 line)
>>>>> 65536 2.851% 1.516% (1 way, 16 line)
>>>>> 32768 3.540% 1.445% (1 way, 16 line)
>>>>> 16384 6.604% 4.043% (1 way, 16 line)
>>>>> 8192 9.112% 6.052% (1 way, 16 line)
>>>>> 4096 11.310% 4.504% (1 way, 16 line)
>>>>> 2048 14.326% 5.990% (1 way, 16 line)
>>>>> 1024 17.632% 7.066% (1 way, 16 line)
>>>>>
>>>>> 131072 1.966% 0.821% (2 way, 16 line)
>>>>> 65536 2.550% 0.513% (2 way, 16 line)
>>>>> 32768 3.303% 0.791% (2 way, 16 line)
>>>>> 16384 6.779% 3.842% (2 way, 16 line)
>>>>> 8192 8.484% 1.766% (2 way, 16 line)
>>>>> 4096 10.905% 2.773% (2 way, 16 line)
>>>>> 2048 13.588% 3.319% (2 way, 16 line)
>>>>> 1024 16.022% 3.229% (2 way, 16 line)
>>>>>
>>>>> 262144 15.748% 14.750% (1 way, 32 line)
>>>>> 131072 15.929% 14.382% (1 way, 32 line)
>>>>> 65536 16.274% 14.057% (1 way, 32 line)
>>>>> 32768 16.843% 14.122% (1 way, 32 line)
>>>>> 16384 19.819% 16.567% (1 way, 32 line)
>>>>> 8192 22.360% 18.319% (1 way, 32 line)
>>>>> 4096 24.583% 16.422% (1 way, 32 line)
>>>>> 2048 27.325% 16.513% (1 way, 32 line)
>>>>>
>>>>> 262144 2.093% 0.872% (2 way, 32 line)
>>>>> 131072 2.893% 0.836% (2 way, 32 line)
>>>>> 65536 3.540% 0.949% (2 way, 32 line)
>>>>> 32768 4.854% 1.808% (2 way, 32 line)
>>>>> 16384 8.788% 4.997% (2 way, 32 line)
>>>>> 8192 11.456% 3.638% (2 way, 32 line)
>>>>> 4096 14.699% 4.352% (2 way, 32 line)
>>>>> 2048 17.580% 4.251% (2 way, 32 line)
>>>>>
>>>>> 524288 10.825% 10.370% (1 way, 64 line)
>>>>> 262144 11.019% 10.412% (1 way, 64 line)
>>>>> 131072 11.188% 10.232% (1 way, 64 line)
>>>>> 65536 11.652% 10.320% (1 way, 64 line)
>>>>> 32768 12.388% 10.699% (1 way, 64 line)
>>>>> 16384 15.806% 13.581% (1 way, 64 line)
>>>>> 8192 18.753% 15.604% (1 way, 64 line)
>>>>> 4096 21.158% 13.259% (1 way, 64 line)
>>>>>
>>>>> 524288 0.863% 0.439% (2 way, 64 line)
>>>>> 262144 1.297% 0.535% (2 way, 64 line)
>>>>> 131072 1.896% 0.700% (2 way, 64 line)
>>>>> 65536 2.505% 0.979% (2 way, 64 line)
>>>>> 32768 3.876% 1.923% (2 way, 64 line)
>>>>> 16384 8.341% 5.535% (2 way, 64 line)
>>>>> 8192 11.346% 3.956% (2 way, 64 line)
>>>>> 4096 14.262% 3.722% (2 way, 64 line)
>>>>>
>>>>>
>>>>> It would appear that 1-way still does poorly with larger cache lines,
>>>>
>>>> Probably still a bug, see above.
>>>>
>>>>> As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
>>>>> idea, this doesn't really make sense.
>>>>
>>>> For the same number of cache lines, that's expected, due to spatial
>>>> locality.
>>>>
>>>
>>> After posting this, I did find another bug:
>>> The cache index pairs were always being calculated as-if the cache line
>>> were 16B. After fixing this bug, all the lines got much closer together
>>> (as shown on another graph posted to Twitter after the last graph).
>>>
>>> In this case, now the overall hit/miss ratio is more consistent between
>>> cache-line sizes. The main difference is that larger cache-line sizes
>>> still have a higher proportion of conflict misses.
>>>
>>>
>>> Values following the more recent bugfix:
>>> 131072 2.004% 1.318% (1 way, 16 line)
>>> 65536 2.851% 1.516% (1 way, 16 line)
>>> 32768 3.540% 1.445% (1 way, 16 line)
>>> 16384 6.604% 4.043% (1 way, 16 line)
>>> 8192 9.112% 6.052% (1 way, 16 line)
>>> 4096 11.310% 4.504% (1 way, 16 line)
>>> 2048 14.326% 5.990% (1 way, 16 line)
>>> 1024 17.632% 7.066% (1 way, 16 line)
>>>
>>> 131072 1.966% 0.821% (2 way, 16 line)
>>> 65536 2.550% 0.513% (2 way, 16 line)
>>> 32768 3.303% 0.791% (2 way, 16 line)
>>> 16384 6.779% 3.842% (2 way, 16 line)
>>> 8192 8.484% 1.766% (2 way, 16 line)
>>> 4096 10.905% 2.773% (2 way, 16 line)
>>> 2048 13.588% 3.319% (2 way, 16 line)
>>> 1024 16.022% 3.229% (2 way, 16 line)
>>>
>>> 262144 0.905% 0.630% (1 way, 32 line)
>>> 131072 1.326% 0.912% (1 way, 32 line)
>>> 65536 1.976% 1.160% (1 way, 32 line)
>>> 32768 2.712% 1.504% (1 way, 32 line)
>>> 16384 5.964% 4.456% (1 way, 32 line)
>>> 8192 8.728% 6.671% (1 way, 32 line)
>>> 4096 11.262% 5.347% (1 way, 32 line)
>>> 2048 14.579% 6.659% (1 way, 32 line)
>>>
>>> 262144 0.671% 0.329% (2 way, 32 line)
>>> 131072 1.148% 0.436% (2 way, 32 line)
>>> 65536 1.514% 0.348% (2 way, 32 line)
>>> 32768 2.259% 0.795% (2 way, 32 line)
>>> 16384 6.041% 4.169% (2 way, 32 line)
>>> 8192 8.102% 2.335% (2 way, 32 line)
>>> 4096 10.690% 3.071% (2 way, 32 line)
>>> 2048 13.354% 3.134% (2 way, 32 line)
>>>
>>> 524288 0.442% 0.311% (1 way, 64 line)
>>> 262144 0.717% 0.561% (1 way, 64 line)
>>> 131072 1.023% 0.759% (1 way, 64 line)
>>> 65536 1.661% 1.154% (1 way, 64 line)
>>> 32768 2.538% 1.818% (1 way, 64 line)
>>> 16384 6.168% 5.201% (1 way, 64 line)
>>> 8192 9.279% 7.640% (1 way, 64 line)
>>> 4096 12.084% 6.163% (1 way, 64 line)
>>>
>>> 524288 0.219% 0.079% (2 way, 64 line)
>>> 262144 0.428% 0.215% (2 way, 64 line)
>>> 131072 0.697% 0.251% (2 way, 64 line)
>>> 65536 0.984% 0.298% (2 way, 64 line)
>>> 32768 1.924% 1.005% (2 way, 64 line)
>>> 16384 6.169% 4.797% (2 way, 64 line)
>>> 8192 8.642% 2.983% (2 way, 64 line)
>>> 4096 10.878% 2.647% (2 way, 64 line)
>>>
>>>
>>> Though, it seems 16B cache lines are no longer a clear winner in this
>>> case...
>> I am not sure how you determined this. To do an apples to apples
>> comparison, as Anton explained, you have to compare the same amount of
>> total cache. Thus, for example, comparing a 16 byte line at a
>> particular cache size should be compared against a 32 byte line at twice
>> the cache size. Otherwise, the you can't distinguish between the total
>> cache size effect and the cache line size effect. To me, it seems like
>> for equivalent cache sizes, 16B lines are a clear win. This reflects a
>> higher proportion of temporal locality than spatial locality.
> <
> And then there is that bus occupancy thing.........


Click here to read the complete article
Re: Misc: Cache Sizes / Observations

<89d9f2fb-57a0-42bc-ae83-0bca231076bcn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=23892&group=comp.arch#23892
X-Received: by 2002:a05:6214:c89:b0:432:c49b:41a5 with SMTP id r9-20020a0562140c8900b00432c49b41a5mr18602893qvr.48.1646278979964;
Wed, 02 Mar 2022 19:42:59 -0800 (PST)
X-Received: by 2002:a05:6870:7391:b0:d9:ae66:b8df with SMTP id
z17-20020a056870739100b000d9ae66b8dfmr2312266oam.7.1646278979702; Wed, 02 Mar
2022 19:42:59 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 2 Mar 2022 19:42:59 -0800 (PST)
In-Reply-To: <svp7va$3ga$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=162.229.185.59; posting-account=Gm3E_woAAACkDRJFCvfChVjhgA24PTsb
NNTP-Posting-Host: 162.229.185.59
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svp7va$3ga$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <89d9f2fb-57a0-42bc-ae83-0bca231076bcn@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: yogaman...@yahoo.com (Scott Smader)
Injection-Date: Thu, 03 Mar 2022 03:42:59 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 210
 by: Scott Smader - Thu, 3 Mar 2022 03:42 UTC

On Wednesday, March 2, 2022 at 6:08:45 PM UTC-8, BGB wrote:
> On 3/2/2022 5:43 PM, MitchAlsup wrote:
> > On Wednesday, March 2, 2022 at 2:37:47 PM UTC-6, Stephen Fuld wrote:
> >> On 3/2/2022 11:26 AM, BGB wrote:
> >>> On 3/2/2022 12:25 PM, Anton Ertl wrote:
> >>>> BGB writes:
> >>>>> On 3/2/2022 3:12 AM, Anton Ertl wrote:
> >>>>>> A 512KB cache with 64-byte lines has 8192 cache lines, just like a
> >>>>>> 256KB cache with 32-byte lines and a 128KB cache with 16-byte lines.
> >>>>>> Without spatial locality, I would expect similar miss rates for all of
> >>>>>> them for the same associativity; and given that programs have spatial
> >>>>>> locality, I would expect the larger among these configurations to have
> >>>>>> an advantage. Are you sure that your cache simulator has no bugs?
> >>>> ...
> >>>>> After fixing this bug:
> >>>>> 131072 2.004% 1.318% (1 way, 16 line)
> >>>>> 65536 2.851% 1.516% (1 way, 16 line)
> >>>>> 32768 3.540% 1.445% (1 way, 16 line)
> >>>>> 16384 6.604% 4.043% (1 way, 16 line)
> >>>>> 8192 9.112% 6.052% (1 way, 16 line)
> >>>>> 4096 11.310% 4.504% (1 way, 16 line)
> >>>>> 2048 14.326% 5.990% (1 way, 16 line)
> >>>>> 1024 17.632% 7.066% (1 way, 16 line)
> >>>>>
> >>>>> 131072 1.966% 0.821% (2 way, 16 line)
> >>>>> 65536 2.550% 0.513% (2 way, 16 line)
> >>>>> 32768 3.303% 0.791% (2 way, 16 line)
> >>>>> 16384 6.779% 3.842% (2 way, 16 line)
> >>>>> 8192 8.484% 1.766% (2 way, 16 line)
> >>>>> 4096 10.905% 2.773% (2 way, 16 line)
> >>>>> 2048 13.588% 3.319% (2 way, 16 line)
> >>>>> 1024 16.022% 3.229% (2 way, 16 line)
> >>>>>
> >>>>> 262144 15.748% 14.750% (1 way, 32 line)
> >>>>> 131072 15.929% 14.382% (1 way, 32 line)
> >>>>> 65536 16.274% 14.057% (1 way, 32 line)
> >>>>> 32768 16.843% 14.122% (1 way, 32 line)
> >>>>> 16384 19.819% 16.567% (1 way, 32 line)
> >>>>> 8192 22.360% 18.319% (1 way, 32 line)
> >>>>> 4096 24.583% 16.422% (1 way, 32 line)
> >>>>> 2048 27.325% 16.513% (1 way, 32 line)
> >>>>>
> >>>>> 262144 2.093% 0.872% (2 way, 32 line)
> >>>>> 131072 2.893% 0.836% (2 way, 32 line)
> >>>>> 65536 3.540% 0.949% (2 way, 32 line)
> >>>>> 32768 4.854% 1.808% (2 way, 32 line)
> >>>>> 16384 8.788% 4.997% (2 way, 32 line)
> >>>>> 8192 11.456% 3.638% (2 way, 32 line)
> >>>>> 4096 14.699% 4.352% (2 way, 32 line)
> >>>>> 2048 17.580% 4.251% (2 way, 32 line)
> >>>>>
> >>>>> 524288 10.825% 10.370% (1 way, 64 line)
> >>>>> 262144 11.019% 10.412% (1 way, 64 line)
> >>>>> 131072 11.188% 10.232% (1 way, 64 line)
> >>>>> 65536 11.652% 10.320% (1 way, 64 line)
> >>>>> 32768 12.388% 10.699% (1 way, 64 line)
> >>>>> 16384 15.806% 13.581% (1 way, 64 line)
> >>>>> 8192 18.753% 15.604% (1 way, 64 line)
> >>>>> 4096 21.158% 13.259% (1 way, 64 line)
> >>>>>
> >>>>> 524288 0.863% 0.439% (2 way, 64 line)
> >>>>> 262144 1.297% 0.535% (2 way, 64 line)
> >>>>> 131072 1.896% 0.700% (2 way, 64 line)
> >>>>> 65536 2.505% 0.979% (2 way, 64 line)
> >>>>> 32768 3.876% 1.923% (2 way, 64 line)
> >>>>> 16384 8.341% 5.535% (2 way, 64 line)
> >>>>> 8192 11.346% 3.956% (2 way, 64 line)
> >>>>> 4096 14.262% 3.722% (2 way, 64 line)
> >>>>>
> >>>>>
> >>>>> It would appear that 1-way still does poorly with larger cache lines,
> >>>>
> >>>> Probably still a bug, see above.
> >>>>
> >>>>> As for why now 1-way 32B appears to be doing worse than 1-way 64B, no
> >>>>> idea, this doesn't really make sense.
> >>>>
> >>>> For the same number of cache lines, that's expected, due to spatial
> >>>> locality.
> >>>>
> >>>
> >>> After posting this, I did find another bug:
> >>> The cache index pairs were always being calculated as-if the cache line
> >>> were 16B. After fixing this bug, all the lines got much closer together
> >>> (as shown on another graph posted to Twitter after the last graph).
> >>>
> >>> In this case, now the overall hit/miss ratio is more consistent between
> >>> cache-line sizes. The main difference is that larger cache-line sizes
> >>> still have a higher proportion of conflict misses.
> >>>
> >>>
> >>> Values following the more recent bugfix:
> >>> 131072 2.004% 1.318% (1 way, 16 line)
> >>> 65536 2.851% 1.516% (1 way, 16 line)
> >>> 32768 3.540% 1.445% (1 way, 16 line)
> >>> 16384 6.604% 4.043% (1 way, 16 line)
> >>> 8192 9.112% 6.052% (1 way, 16 line)
> >>> 4096 11.310% 4.504% (1 way, 16 line)
> >>> 2048 14.326% 5.990% (1 way, 16 line)
> >>> 1024 17.632% 7.066% (1 way, 16 line)
> >>>
> >>> 131072 1.966% 0.821% (2 way, 16 line)
> >>> 65536 2.550% 0.513% (2 way, 16 line)
> >>> 32768 3.303% 0.791% (2 way, 16 line)
> >>> 16384 6.779% 3.842% (2 way, 16 line)
> >>> 8192 8.484% 1.766% (2 way, 16 line)
> >>> 4096 10.905% 2.773% (2 way, 16 line)
> >>> 2048 13.588% 3.319% (2 way, 16 line)
> >>> 1024 16.022% 3.229% (2 way, 16 line)
> >>>
> >>> 262144 0.905% 0.630% (1 way, 32 line)
> >>> 131072 1.326% 0.912% (1 way, 32 line)
> >>> 65536 1.976% 1.160% (1 way, 32 line)
> >>> 32768 2.712% 1.504% (1 way, 32 line)
> >>> 16384 5.964% 4.456% (1 way, 32 line)
> >>> 8192 8.728% 6.671% (1 way, 32 line)
> >>> 4096 11.262% 5.347% (1 way, 32 line)
> >>> 2048 14.579% 6.659% (1 way, 32 line)
> >>>
> >>> 262144 0.671% 0.329% (2 way, 32 line)
> >>> 131072 1.148% 0.436% (2 way, 32 line)
> >>> 65536 1.514% 0.348% (2 way, 32 line)
> >>> 32768 2.259% 0.795% (2 way, 32 line)
> >>> 16384 6.041% 4.169% (2 way, 32 line)
> >>> 8192 8.102% 2.335% (2 way, 32 line)
> >>> 4096 10.690% 3.071% (2 way, 32 line)
> >>> 2048 13.354% 3.134% (2 way, 32 line)
> >>>
> >>> 524288 0.442% 0.311% (1 way, 64 line)
> >>> 262144 0.717% 0.561% (1 way, 64 line)
> >>> 131072 1.023% 0.759% (1 way, 64 line)
> >>> 65536 1.661% 1.154% (1 way, 64 line)
> >>> 32768 2.538% 1.818% (1 way, 64 line)
> >>> 16384 6.168% 5.201% (1 way, 64 line)
> >>> 8192 9.279% 7.640% (1 way, 64 line)
> >>> 4096 12.084% 6.163% (1 way, 64 line)
> >>>
> >>> 524288 0.219% 0.079% (2 way, 64 line)
> >>> 262144 0.428% 0.215% (2 way, 64 line)
> >>> 131072 0.697% 0.251% (2 way, 64 line)
> >>> 65536 0.984% 0.298% (2 way, 64 line)
> >>> 32768 1.924% 1.005% (2 way, 64 line)
> >>> 16384 6.169% 4.797% (2 way, 64 line)
> >>> 8192 8.642% 2.983% (2 way, 64 line)
> >>> 4096 10.878% 2.647% (2 way, 64 line)
> >>>
> >>>
> >>> Though, it seems 16B cache lines are no longer a clear winner in this
> >>> case...
> >> I am not sure how you determined this. To do an apples to apples
> >> comparison, as Anton explained, you have to compare the same amount of
> >> total cache. Thus, for example, comparing a 16 byte line at a
> >> particular cache size should be compared against a 32 byte line at twice
> >> the cache size. Otherwise, you can't distinguish between the total
> >> cache size effect and the cache line size effect. To me, it seems like
> >> for equivalent cache sizes, 16B lines are a clear win. This reflects a
> >> higher proportion of temporal locality than spatial locality.
> > <
> > And then there is that bus occupancy thing.........
> I am not modelling the bus or latency in this case, only miss rates and
> similar.
> >>
> >> BTW, with a little more work, you can distinguish spatial locality hits
> >> from temporal locality hits. You have to keep the actual address that
> >> caused the miss, then on subsequent hits, see if they are to the same or
> >> a different address.
> > <
> > We figured out (around 1992) that one could model all cache sizes
> > and association levels simultaneously.
> I was running a bunch of instances of the cache in parallel and then
> periodically dumping the output statistics into a log file (once every
> 16-million memory accesses).
>
> Then was using the numbers in the final dump of the log file (before I
> closed the emulator).
>
>
> Annoyingly, I had to hand-edit them into a form where I could load them
> into a spreadsheet for a graph. I hadn't gotten around to writing an
> alternate set of logic that dumps the numbers in CSV form (as, for
> whatever reason, OpenOffice can't import tables correctly where the
> numbers are padded into columns by varying amounts of whitespace, ...).


Re: Misc: Cache Sizes / Observations

<b91ec0c2-0016-44a9-8336-c13c2d01a106n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23893&group=comp.arch#23893

X-Received: by 2002:a05:6214:3011:b0:435:2331:b43a with SMTP id ke17-20020a056214301100b004352331b43amr3300343qvb.54.1646281589134;
Wed, 02 Mar 2022 20:26:29 -0800 (PST)
X-Received: by 2002:a05:6808:1705:b0:2d4:d7a:9c25 with SMTP id
bc5-20020a056808170500b002d40d7a9c25mr2932089oib.51.1646281588905; Wed, 02
Mar 2022 20:26:28 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Wed, 2 Mar 2022 20:26:28 -0800 (PST)
In-Reply-To: <89d9f2fb-57a0-42bc-ae83-0bca231076bcn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:c41b:c5f7:7270:ff1a;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:c41b:c5f7:7270:ff1a
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svp7va$3ga$1@dont-email.me> <89d9f2fb-57a0-42bc-ae83-0bca231076bcn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b91ec0c2-0016-44a9-8336-c13c2d01a106n@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Thu, 03 Mar 2022 04:26:29 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 46
 by: robf...@gmail.com - Thu, 3 Mar 2022 04:26 UTC

>I am not sure if others use the Even/Odd scheme, many diagrams online
>only show a single row of cache lines, but don't make it obvious how
>such a scheme would deal with a misaligned access which spans two cache
>lines.

I do not use an odd/even line scheme. I make the cache line wider by the number
of bits that might span onto a second line. For example, a 512-bit wide cache line
ends up being 640 bits wide to accommodate a 128-bit value that would
otherwise span two cache lines. This allows my cache to serve any access from a
single line with a simple dual-port approach (one read port, one write port) rather
than needing two read ports. It keeps the cache simple. The drawback is that it
takes five 128-bit memory accesses to fill the cache line rather than four, so
memory access is in terms of five-beat bursts. This is not as bad as it sounds, as it
is only one extra clock cycle out of about 10 or so. There is also duplication of
some data in the cache, but it is a lot better than having two read ports, which
basically doubles the RAM requirements for the cache. The approach I am using
uses 1.25 times the memory instead of 2.0 times.
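
A minimal C sketch of the widened-line idea described above (the 512+128 bit
sizes follow the post; the byte-array layout and names are purely illustrative,
not the actual implementation):

#include <stdint.h>
#include <string.h>

#define LINE_DATA_BYTES  64   /* 512 bits of line data                 */
#define LINE_EXTRA_BYTES 16   /* 128-bit overlap with the next line    */
#define LINE_TOTAL_BYTES (LINE_DATA_BYTES + LINE_EXTRA_BYTES)  /* 640 bits */

typedef struct { uint8_t bytes[LINE_TOTAL_BYTES]; } WideLine;

/* Read a 128-bit value starting at byte offset 'off' (0..63) within the
 * line; offsets near the end spill into the 128-bit overlap region, so a
 * second line lookup is never needed.  Filling the line takes
 * LINE_TOTAL_BYTES / 16 == 5 beats of 128 bits, the five-beat burst
 * mentioned above. */
static void read128(const WideLine *line, unsigned off, uint8_t out[16])
{
    memcpy(out, &line->bytes[off], 16);   /* off + 16 <= 80 always holds */
}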

The odd/even scheme sounds interesting to me. I take it there is only a single
read port involved on each line. Data that spans cache lines would need to be
routed from two places, depending on which cache line is spanned, odd or even.

I appreciate the stats. I seem to get a lot worse cache performance, but I am
observing mostly start-ups when there’s nothing loaded in the cache, as opposed
to warm operation.

I had L1, L2 caches but decided to get rid of the L2 cache and use the block RAMs
for a larger L1 cache and a system cache. L1 is 16kB, 4-way associative, with
640-bit wide lines.

Re: Misc: Cache Sizes / Observations

<svpq72$i01$1@dont-email.me>


https://www.novabbs.com/devel/article-flat.php?id=23894&group=comp.arch#23894

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Thu, 3 Mar 2022 01:19:58 -0600
Organization: A noiseless patient Spider
Lines: 100
Message-ID: <svpq72$i01$1@dont-email.me>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
<2022Mar2.192515@mips.complang.tuwien.ac.at> <svogd7$o9s$1@dont-email.me>
<svokin$rtp$1@dont-email.me>
<3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svp7va$3ga$1@dont-email.me>
<89d9f2fb-57a0-42bc-ae83-0bca231076bcn@googlegroups.com>
<b91ec0c2-0016-44a9-8336-c13c2d01a106n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 3 Mar 2022 07:20:02 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="c8be40ab30704a686bce3661f550954b";
logging-data="18433"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Lm1OUQi+nQoPV92drJid6"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Thunderbird/91.6.1
Cancel-Lock: sha1:0kfAhOWU4JKiTMR5caPlObhAzfk=
In-Reply-To: <b91ec0c2-0016-44a9-8336-c13c2d01a106n@googlegroups.com>
Content-Language: en-US
 by: BGB - Thu, 3 Mar 2022 07:19 UTC

On 3/2/2022 10:26 PM, robf...@gmail.com wrote:
>> I am not sure if others use the Even/Odd scheme, many diagrams online
>> only show a single row of cache lines, but don't make it obvious how
>> such a scheme would deal with a misaligned access which spans two cache
>> lines.
>
> I do not use an odd/even line scheme. I make the cache line wider by the number
> of bits that might span onto a second line. For example, a 512-bit wide cache line
> ends up being 640 bits wide to accommodate a 128-bit value that would
> otherwise span two cache lines. This allows my cache to be single line with a
> simple dual port approach (one read port, one write port) rather than needing two
> read ports. It keeps the cache simple. The drawback is that it takes five 128-bit
> memory accesses to fill the cache line rather than four. So, memory access is in
> terms of five beat bursts. This is not as bad as it sounds as it is only one extra
> clock cycle out of about 10 or so. There is also duplication of some data in the
> cache, but it is a lot better than having two read ports which basically doubles
> the RAM requirements for the cache. The approach I am using uses 1.25 times
> the memory instead of 2.0 times.
>

Sounds more complicated than my approach.

My approach keeps everything power-of-two.

One downside is that it needs Request/Response logic for both the Even
and Odd sides.

Though, this is partly why some of my 2-way caches only allowed Dirty
lines in the A member of the set:
This allowed sharing the Request/Response logic between A and B:
A Loaded cache line could go to A, B, or Both
A Stored cache line could only come from A.

Though, dirty lines reduce the effectiveness of such a cache: since only the A
member can hold them, the effective capacity shrinks as dirty lines accumulate
(my recent models did not capture this behavior).

This scheme can work OK, but does effectively require an Epoch Flush
mechanism to keep the cache working efficiently.

> The odd/even scheme sounds interesting to me. I take it there is only a single
> read port involved on each line. Data that spans cache lines would need to be
> routed from two places. Depending on which cache line is spanned, odd or even.
>

Yeah. The arrays follow the "One Read, One Write" pattern that Block RAM
tends to require.

The Even Lines only ever hold Even addresses, and Odd lines only hold
Odd addresses. Any request at any alignment can be served by one Even
and one Odd line.

In terms of access, the lines are actually split in half into 64-bit chunks.

So, for Addr(4:3):
00: { EvenHi, EvenLo }
01: { OddLo, EvenHi }
10: { OddHi, OddLo }
11: { EvenLo, OddHi }

This forms a virtual 128-bit block used for loading and storing values.
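
A minimal C model of the selection table above (the array size and the index
calculation are assumptions; only the Addr(4:3) mux follows the post):

#include <stdint.h>

#define NLINES 1024                     /* lines per even/odd array (assumed) */

typedef struct { uint64_t lo, hi; } Line128;

static Line128 even_lines[NLINES];
static Line128 odd_lines[NLINES];

/* Assemble the 128-bit virtual block that starts at address addr (assumed
 * 8-byte aligned here; finer alignment would go through an extract/shift
 * stage afterwards). */
static void fetch128(uint64_t addr, uint64_t out[2])
{
    /* 32-byte pair number; if the access starts in the odd half (bit 4
     * set), the even line needed is the *next* one. */
    uint64_t pair = addr >> 5;
    Line128  ev   = even_lines[(pair + ((addr >> 4) & 1)) % NLINES];
    Line128  od   = odd_lines[pair % NLINES];

    switch ((addr >> 3) & 3) {          /* Addr(4:3) */
    case 0: out[0] = ev.lo; out[1] = ev.hi; break;  /* { EvenHi, EvenLo } */
    case 1: out[0] = ev.hi; out[1] = od.lo; break;  /* { OddLo,  EvenHi } */
    case 2: out[0] = od.lo; out[1] = od.hi; break;  /* { OddHi,  OddLo  } */
    case 3: out[0] = od.hi; out[1] = ev.lo; break;  /* { EvenLo, OddHi  } */
    }
}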

It gets a little more convoluted for the I$ because 96-bit bundles may
have a 16-bit alignment, which doesn't fit entirely in this scheme. One
needs to either perform a larger fetch, or perform the fetch at a 32-bit
alignment.

Eg, for Addr(4:3):
00: { OddLo, EvenHi, EvenLo }
01: { OddHi, OddLo, EvenHi }
10: { EvenLo, OddHi, OddLo }
11: { EvenHi, EvenLo, OddHi }

> I appreciate the stats. I seem to get a lot worse cache performance, but I am
> observing mostly start-ups when there’s nothing loaded in the cache, as opposed
> to warm operation.
>
> I had L1, L2 caches but decided to get rid of the L2 cache and use the block RAMs
> for a larger L1 cache and a system cache. L1 is 16kB, 4-way associative, with
> 640-bit wide lines.
>

OK.

I am using 32K L1 D$, Direct Mapped, with 128-bit cache lines.
It is effectively twice this size in terms of Block-RAM, as roughly half
the Block RAM is going into tag bits.

L2 is 256K, 2-way, with 512-bit cache lines.

The L1 is modulo indexed, whereas the L2 is hash indexed.

The L2 does not do Even/Odd, as the L2 has no concept of misaligned
access. It merely supports loading and storing 128-bit aligned, 128-bit
values.
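
A minimal sketch of the two index functions mentioned above, using the sizes
from this post; the actual L2 hash is not given, so the XOR folding here is
only an illustrative stand-in:

#include <stdint.h>

#define L1_SETS  (32768 / 16)          /* 32K direct-mapped, 16B lines -> 2048 sets */
#define L2_SETS  (262144 / (2 * 64))   /* 256K, 2-way, 64B lines -> 2048 sets */

static inline uint32_t l1_index(uint64_t addr)
{
    return (uint32_t)((addr >> 4) % L1_SETS);            /* plain modulo index */
}

static inline uint32_t l2_index(uint64_t addr)
{
    uint64_t line = addr >> 6;
    uint64_t hash = line ^ (line >> 11) ^ (line >> 22);  /* assumed hash */
    return (uint32_t)(hash % L2_SETS);
}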

Re: Misc: Cache Sizes / Observations

<2022Mar3.083539@mips.complang.tuwien.ac.at>


https://www.novabbs.com/devel/article-flat.php?id=23896&group=comp.arch#23896

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: ant...@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Thu, 03 Mar 2022 07:35:39 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 27
Message-ID: <2022Mar3.083539@mips.complang.tuwien.ac.at>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me> <svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at> <svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <svp6sl$s7t$1@dont-email.me>
Injection-Info: reader02.eternal-september.org; posting-host="824c5f00426ed9c8d52ea49f8e7fd3a1";
logging-data="31402"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19Dwu0FxvQ1mqIhqzmTP0WC"
Cancel-Lock: sha1:NgkN8dwL2ySRRtQM6z4f98YFcNg=
X-newsreader: xrn 10.00-beta-3
 by: Anton Ertl - Thu, 3 Mar 2022 07:35 UTC

BGB <cr88192@gmail.com> writes:
>On 3/2/2022 2:37 PM, Stephen Fuld wrote:
>> https://en.wikipedia.org/wiki/Cache_performance_measurement_and_metric#Conflict_misses
>>
>
>I wasn't using fully-associative, as this would be very steep.
>I was instead keeping a 4-way cache as a stand-in for a fully
>associative one, but 8-way could be a better stand-in.

You can use reuse distance
<https://www.cc.gatech.edu/classes/AY2014/cs7260_fall/lecture30.pdf>
to determine if a cache miss is a capacity miss: If you have a cache with N
lines, then every access with a reuse distance >N is a capacity miss.
A miss with a reuse distance <=N is a conflict miss.

An O(1) algorithm for a given N is to have a time stamp for every
cache-line-sized piece of accessible memory. If you access memory
address A, and it has not been previously accessed, it's a compulsory
miss. If timestamp(A)<current-N, count the access as a capacity miss
and increase current; otherwise do neither (and any miss in less
associative caches is a conflict miss). In any case, set the
timestamp(A)<-current.
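
A literal C transcription of the procedure described above (the hash table
mapping line addresses to timestamps, and all names, are illustrative):

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t line; uint64_t stamp; int used; } Entry;

static Entry   *table;                 /* timestamps per cache-line address */
static size_t   table_size = 1u << 20; /* assumed large enough for the trace */
static uint64_t current;               /* advanced on capacity misses, per the text */

static Entry *lookup(uint64_t line)
{
    size_t i;
    if (!table)
        table = calloc(table_size, sizeof *table);
    i = (size_t)(line * 0x9E3779B97F4A7C15ull) & (table_size - 1);
    while (table[i].used && table[i].line != line)
        i = (i + 1) & (table_size - 1);
    table[i].line = line;
    return &table[i];
}

/* Classify one access (16B lines assumed) against a cache of N lines. */
static void classify(uint64_t addr, uint64_t N,
                     uint64_t *compulsory, uint64_t *capacity)
{
    Entry *e = lookup(addr >> 4);
    if (!e->used) {                          /* never accessed: compulsory */
        (*compulsory)++;
        e->used = 1;
    } else if (current - e->stamp > N) {     /* timestamp(A) < current-N   */
        (*capacity)++;
        current++;
    }                                        /* otherwise neither; any miss in
                                                a less associative cache is
                                                then a conflict miss        */
    e->stamp = current;                      /* in any case                */
}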

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Misc: Cache Sizes / Observations

<svpukc$6pk$1@gioia.aioe.org>


https://www.novabbs.com/devel/article-flat.php?id=23897&group=comp.arch#23897

Path: i2pn2.org!i2pn.org!aioe.org!UXtAIYUgaw/fkqnS/V28xg.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: Misc: Cache Sizes / Observations
Date: Thu, 3 Mar 2022 09:35:31 +0100
Organization: Aioe.org NNTP Server
Message-ID: <svpukc$6pk$1@gioia.aioe.org>
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me>
<2022Mar2.101217@mips.complang.tuwien.ac.at> <svoa5q$ud$1@dont-email.me>
<2022Mar2.192515@mips.complang.tuwien.ac.at> <svogd7$o9s$1@dont-email.me>
<svokin$rtp$1@dont-email.me>
<3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Info: gioia.aioe.org; logging-data="6964"; posting-host="UXtAIYUgaw/fkqnS/V28xg.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
Firefox/68.0 SeaMonkey/2.53.10.2
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Thu, 3 Mar 2022 08:35 UTC

MitchAlsup wrote:
> On Wednesday, March 2, 2022 at 2:37:47 PM UTC-6, Stephen Fuld wrote:
>> BTW, with a little more work, you can distinguish spatial locality hits
>> from temporal locality hits. You have to keep the actual address that
>> caused the miss, then on subsequent hits, see if they are to the same or
>> a different address.
> <
> We figured out (around 1992) that one could model all cache sizes
> and association levels simultaneously.

Hmmm, that sounds a bit hard unless you actually keep multiple cache
models at the same time. My first idea was to associate each cached item
with the minimum size/way needed for it to be valid, but this would not
be able to handle evictions properly.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: Misc: Cache Sizes / Observations

<cfa48891-0e27-44e3-ab19-c2c597dba03cn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23903&group=comp.arch#23903

X-Received: by 2002:ad4:5bee:0:b0:432:f52d:28b6 with SMTP id k14-20020ad45bee000000b00432f52d28b6mr15523378qvc.44.1646310618184;
Thu, 03 Mar 2022 04:30:18 -0800 (PST)
X-Received: by 2002:a05:6870:420f:b0:d9:a032:a120 with SMTP id
u15-20020a056870420f00b000d9a032a120mr3719060oac.0.1646310617964; Thu, 03 Mar
2022 04:30:17 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 3 Mar 2022 04:30:17 -0800 (PST)
In-Reply-To: <svpukc$6pk$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:89da:2f65:1a52:7759;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:89da:2f65:1a52:7759
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svpukc$6pk$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <cfa48891-0e27-44e3-ab19-c2c597dba03cn@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Thu, 03 Mar 2022 12:30:18 +0000
Content-Type: text/plain; charset="UTF-8"
 by: robf...@gmail.com - Thu, 3 Mar 2022 12:30 UTC

Got the size of the I$ wrong; I am using 32kB as well, with a five-entry victim cache.
Turns out the extra 128 bits can be packed into fewer block RAMs (9) than
expected, so there is only 12.5% overhead instead of 25%.
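
One possible reading of that packing, assuming Xilinx-style block RAMs used
72 bits wide (64 data bits plus the 8 "parity" bits pressed into service as
data); the post does not say which configuration is actually used:

#include <stdio.h>

int main(void)
{
    int bram_width = 72;                                 /* assumed port width */
    int plain   = (512 + bram_width - 1) / bram_width;   /* 8 block RAMs */
    int widened = (640 + bram_width - 1) / bram_width;   /* 9 block RAMs */
    printf("%d -> %d BRAMs, %.1f%% overhead\n",
           plain, widened, 100.0 * (widened - plain) / plain);  /* 12.5% */
    return 0;
}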

Re: Misc: Cache Sizes / Observations

<c57f9893-2e53-4a61-ad17-99df9b0260cdn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23915&group=comp.arch#23915

X-Received: by 2002:a37:ac16:0:b0:60a:dac5:1656 with SMTP id e22-20020a37ac16000000b0060adac51656mr58601qkm.680.1646326304190;
Thu, 03 Mar 2022 08:51:44 -0800 (PST)
X-Received: by 2002:a05:6870:7391:b0:d9:ae66:b8df with SMTP id
z17-20020a056870739100b000d9ae66b8dfmr4402409oam.7.1646326303945; Thu, 03 Mar
2022 08:51:43 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 3 Mar 2022 08:51:43 -0800 (PST)
In-Reply-To: <svpukc$6pk$1@gioia.aioe.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3de0:ba8f:3b3:e7eb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3de0:ba8f:3b3:e7eb
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svpukc$6pk$1@gioia.aioe.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c57f9893-2e53-4a61-ad17-99df9b0260cdn@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 03 Mar 2022 16:51:44 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 21
 by: MitchAlsup - Thu, 3 Mar 2022 16:51 UTC

On Thursday, March 3, 2022 at 2:35:28 AM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Wednesday, March 2, 2022 at 2:37:47 PM UTC-6, Stephen Fuld wrote:
> >> BTW, with a little more work, you can distinguish spatial locality hits
> >> from temporal locality hits. You have to keep the actual address that
> >> caused the miss, then on subsequent hits, see if they are to the same or
> >> a different address.
> > <
> > We figured out (around 1992) that one could model all cache sizes
> > and association levels simultaneously.
> Hmmm, that sounds a bit hard unless you actually keep multiple cache
> models at the same time. My first idea was to associate each cached item
> with the minimum size/way needed for it to be valid, but this would not
> be able to handle evictions properly.
<
One of the major realizations was that if you hit in a cache of size
k, you hit in every cache size k*2^n.
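
The standard way to exploit that property is LRU stack distance (Mattson et
al.); a minimal C sketch of that technique, not necessarily the 1992 method
itself:

#include <stdint.h>
#include <string.h>

#define STACK_MAX (1u << 16)             /* distinct lines tracked (assumed) */

static uint64_t lru[STACK_MAX];          /* lru[0] is most recently used */
static size_t   depth_used;
static uint64_t dist_hist[STACK_MAX + 1];
static uint64_t cold_misses;

/* Record one access to a line address.  A fully associative LRU cache of L
 * lines hits exactly the accesses whose stack distance is <= L, so a single
 * pass gives hit counts for every size at once. */
static void touch(uint64_t line)
{
    size_t d;
    for (d = 0; d < depth_used; d++)
        if (lru[d] == line)
            break;

    if (d == depth_used) {               /* first reference: cold miss */
        if (depth_used < STACK_MAX)
            depth_used++;
        cold_misses++;
        memmove(&lru[1], &lru[0], (depth_used - 1) * sizeof lru[0]);
    } else {
        dist_hist[d + 1]++;              /* hit at stack distance d+1 */
        memmove(&lru[1], &lru[0], d * sizeof lru[0]);
    }
    lru[0] = line;
}

/* Hits for a fully associative LRU cache of L lines. */
static uint64_t hits_for_size(size_t L)
{
    uint64_t h = 0;
    for (size_t d = 1; d <= L && d <= STACK_MAX; d++)
        h += dist_hist[d];
    return h;
}

Set-associative LRU configurations can be handled the same way by keeping one
such stack per set.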
>
> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: Misc: Cache Sizes / Observations

<b6fe9087-7344-4b60-82e3-905e6fb53452n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23916&group=comp.arch#23916

X-Received: by 2002:ac8:5c45:0:b0:2dd:baf7:d4fb with SMTP id j5-20020ac85c45000000b002ddbaf7d4fbmr29003646qtj.323.1646326366585;
Thu, 03 Mar 2022 08:52:46 -0800 (PST)
X-Received: by 2002:a05:6808:2226:b0:2d5:3cf5:89ee with SMTP id
bd38-20020a056808222600b002d53cf589eemr5314235oib.99.1646326364441; Thu, 03
Mar 2022 08:52:44 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 3 Mar 2022 08:52:44 -0800 (PST)
In-Reply-To: <cfa48891-0e27-44e3-ab19-c2c597dba03cn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:3de0:ba8f:3b3:e7eb;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:3de0:ba8f:3b3:e7eb
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svpukc$6pk$1@gioia.aioe.org> <cfa48891-0e27-44e3-ab19-c2c597dba03cn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b6fe9087-7344-4b60-82e3-905e6fb53452n@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Thu, 03 Mar 2022 16:52:46 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 6
 by: MitchAlsup - Thu, 3 Mar 2022 16:52 UTC

On Thursday, March 3, 2022 at 6:30:20 AM UTC-6, robf...@gmail.com wrote:
> Got the size of the I$ wrong, I am using 32kB as well with a five entry victim cache.
> Turns out the extra 128 bits can be packed into fewer block RAMs (9) than
> expected, so there is only 12.5% overhead instead of 25%.
<
It is stuff like this that made the My 66130 cache 24KB, 3-way set-associative:: all
the tag bits fit in 1 SRAM.

Re: Misc: Cache Sizes / Observations

<2f0b4b04-c7d5-4a12-8024-40a93dadd79bn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23931&group=comp.arch#23931

X-Received: by 2002:a05:622a:494:b0:2dd:97dd:a839 with SMTP id p20-20020a05622a049400b002dd97dda839mr30130269qtx.567.1646353813312;
Thu, 03 Mar 2022 16:30:13 -0800 (PST)
X-Received: by 2002:a05:6870:2302:b0:d7:4f1f:b78b with SMTP id
w2-20020a056870230200b000d74f1fb78bmr6106307oao.37.1646353813083; Thu, 03 Mar
2022 16:30:13 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 3 Mar 2022 16:30:12 -0800 (PST)
In-Reply-To: <b6fe9087-7344-4b60-82e3-905e6fb53452n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2607:fea8:1de1:fb00:89da:2f65:1a52:7759;
posting-account=QId4bgoAAABV4s50talpu-qMcPp519Eb
NNTP-Posting-Host: 2607:fea8:1de1:fb00:89da:2f65:1a52:7759
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svpukc$6pk$1@gioia.aioe.org> <cfa48891-0e27-44e3-ab19-c2c597dba03cn@googlegroups.com>
<b6fe9087-7344-4b60-82e3-905e6fb53452n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2f0b4b04-c7d5-4a12-8024-40a93dadd79bn@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: robfi...@gmail.com (robf...@gmail.com)
Injection-Date: Fri, 04 Mar 2022 00:30:13 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 11
 by: robf...@gmail.com - Fri, 4 Mar 2022 00:30 UTC

I have wondered what the obsession with keeping the ways a power of two is about.

I made the TLB five-way associative. It uses random replacement via an
automatic inverted page table walk, except for one of the ways. That way is
excluded from automatic updates; it must be updated via software. I have the
way preloaded with translations to access the system ROM so the TLB can be
active at start-up. The TLB is huge only because it is the maximum size that fits
in the minimum number of block RAMs needed to implement it. It supports 1024
entries for each way. It could have been made with fewer entries, but it would
use just as many block RAMs. The TLB logic is clocked at twice the rate of the
CPU to reduce the number of CPU clocks the translation takes.
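
A minimal C model of the lookup described above: four ways refilled by the
hardware walker with random replacement, plus one software-managed way (the
indexing scheme and field names are assumptions, not from the post):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define TLB_ENTRIES 1024          /* entries per way, as in the post     */
#define TLB_WAYS    5             /* way 4 is the software-managed way   */

typedef struct { uint64_t vpn; uint64_t ppn; bool valid; } TlbEntry;

static TlbEntry tlb[TLB_WAYS][TLB_ENTRIES];

static bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
{
    uint32_t idx = (uint32_t)(vpn & (TLB_ENTRIES - 1));
    for (int w = 0; w < TLB_WAYS; w++) {
        if (tlb[w][idx].valid && tlb[w][idx].vpn == vpn) {
            *ppn = tlb[w][idx].ppn;
            return true;
        }
    }
    return false;                 /* miss: trigger the table walk */
}

/* Hardware refill: random replacement over ways 0..3 only; way 4 is written
 * solely by software (e.g. the preloaded boot-ROM mappings). */
static void tlb_refill(uint64_t vpn, uint64_t ppn)
{
    uint32_t idx = (uint32_t)(vpn & (TLB_ENTRIES - 1));
    int w = rand() & 3;
    tlb[w][idx] = (TlbEntry){ .vpn = vpn, .ppn = ppn, .valid = true };
}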

Re: Misc: Cache Sizes / Observations

<904e35c2-7dd6-4dd6-930b-7fcc8ae3557dn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=23935&group=comp.arch#23935

X-Received: by 2002:a05:620a:2889:b0:663:8d24:8cad with SMTP id j9-20020a05620a288900b006638d248cadmr1393678qkp.662.1646361395442;
Thu, 03 Mar 2022 18:36:35 -0800 (PST)
X-Received: by 2002:a05:6808:3021:b0:2cf:177:968a with SMTP id
ay33-20020a056808302100b002cf0177968amr7632694oib.119.1646361395144; Thu, 03
Mar 2022 18:36:35 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!border2.nntp.dca1.giganews.com!nntp.giganews.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Thu, 3 Mar 2022 18:36:34 -0800 (PST)
In-Reply-To: <2f0b4b04-c7d5-4a12-8024-40a93dadd79bn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:20e8:8981:e8d6:173d;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:20e8:8981:e8d6:173d
References: <svj5qj$19u$1@dont-email.me> <svjak4$9kl$1@dont-email.me>
<svkeq1$phm$1@dont-email.me> <svm7r9$3ec$1@dont-email.me> <2022Mar2.101217@mips.complang.tuwien.ac.at>
<svoa5q$ud$1@dont-email.me> <2022Mar2.192515@mips.complang.tuwien.ac.at>
<svogd7$o9s$1@dont-email.me> <svokin$rtp$1@dont-email.me> <3215daef-0814-4dd1-b2ae-a00cd22d0d1dn@googlegroups.com>
<svpukc$6pk$1@gioia.aioe.org> <cfa48891-0e27-44e3-ab19-c2c597dba03cn@googlegroups.com>
<b6fe9087-7344-4b60-82e3-905e6fb53452n@googlegroups.com> <2f0b4b04-c7d5-4a12-8024-40a93dadd79bn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <904e35c2-7dd6-4dd6-930b-7fcc8ae3557dn@googlegroups.com>
Subject: Re: Misc: Cache Sizes / Observations
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 04 Mar 2022 02:36:35 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 25
 by: MitchAlsup - Fri, 4 Mar 2022 02:36 UTC

On Thursday, March 3, 2022 at 6:30:15 PM UTC-6, robf...@gmail.com wrote:
> I have wondered what the obsession with keeping the ways a power of two is about.
>
> I made the TLB five-way associative, which uses random replacement via an
> automatic inverted page table walk, except for one of the ways. One way is
> effectively excluded from automatic updates, it must be updated via software. I
> have the way preloaded with translations to access the system ROM so the TLB
> can be active at start-up. The TLB is huge only because it is the maximum size
> that would fit into the minimum number of block RAMs to implement. It
> supports 1024 entries for each way. It could have been made with fewer entries,
> but it would use just as many block-RAMs. The TLB logic is clocked at twice the
> rate of the cpu to reduce the number of cpu clocks the translation takes.
<
In the distant past we could call up SRAM with as many or as few bits as we desired
(within reason), so powers of 2 were not usefully different in hardness from any
other size.
<
Now, we are essentially bound by SRAMs which come in powers of 2 (64-128-256).
You can't make SRAMs wider than 256 because of wire delay (metal gate wire)
of the word lines, and you can't put more than 64 cells on a bit line (leakage and
cell stability), so you get either 128 or 256 bits--more like 128 than 256.
<
So, if you support a 45-bit physical address space, you can put 3 tags in a
128-bit SRAM word. This makes 3-way caches a good choice, simply because
it fits in the building blocks you have at hand. Right now, 45 bits is "big enough".
It won't be for long.
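
Rough arithmetic behind the 3-tags-per-word point, with assumed parameters
(line size and per-way state bits are not given above):

#include <stdio.h>

int main(void)
{
    int phys_bits   = 45;                 /* physical address, from the post */
    int line_bytes  = 64;                 /* assumed line size               */
    int sets        = 24 * 1024 / (3 * line_bytes);   /* 24KB, 3-way -> 128  */
    int offset_bits = 6;                  /* log2(line_bytes)                */
    int index_bits  = 7;                  /* log2(sets)                      */
    int tag_bits    = phys_bits - offset_bits - index_bits;       /* 32      */
    int state_bits  = 10;                 /* assumed valid/dirty/LRU per way */

    /* 3 * (32 + 10) = 126 bits: the tags for a whole 3-way set fit in one
     * 128-bit SRAM word. */
    printf("sets=%d tag=%d bits, 3 ways need %d of 128 bits\n",
           sets, tag_bits, 3 * (tag_bits + state_bits));
    return 0;
}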
