Increasing subtlety of cache block size

From: paaroncl...@gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Subject: Increasing subtlety of cache block size
Date: Sat, 2 Apr 2022 14:46:43 -0400
Message-ID: <t2a5mj$ptq$1@dont-email.me>

Cache block size is commonly identified with false-sharing
granularity, prefetch granularity, and internal victimization
granularity.
(There may be other significant characteristics. E.g., the
number of blocks in a cache — or way — of a given size may
be a sufficiently different perspective from victimization
granularity to be considered.)

Sectored (subblock) caches complicate this perspective on
cache block size. False sharing is at the granularity of the
subblock — sector (IBM) or line (Intel) — but prefetch
granularity and internal victimization granularity are at the
larger chunk — block (IBM) or sector (Intel). Aligned-
adjacent prefetch (where a miss prefetches the adjacent block
in a higher alignment chunk) potentially provides the same
prefetching without the capacity waste and greater replacement
flexibility at the cost of tag overhead; the reduced capacity
waste also reduces the incentive to always fetch the adjacent
(sub)block(s) — predicted utility and expected cost (e.g.,
how busy a memory channel is) could motivate a decision not
to load subblocks.
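The split between coherence granularity and fill granularity above can
be sketched in a few lines. This is a toy model with invented sizes
and names (SectoredEntry, fill_subblock), not any real machine's
organization:

```python
# Hypothetical sketch of a sectored (subblocked) cache entry: one tag
# covers the whole block, but validity -- and hence false-sharing
# granularity -- is tracked per subblock. Sizes are illustrative.

BLOCK_BYTES = 128      # fetch/victimization granularity (IBM "block")
SUBBLOCK_BYTES = 32    # coherence granularity (IBM "sector")
SUBBLOCKS = BLOCK_BYTES // SUBBLOCK_BYTES

class SectoredEntry:
    def __init__(self, tag):
        self.tag = tag
        self.valid = [False] * SUBBLOCKS   # per-subblock valid bits

    def fill_subblock(self, addr):
        # A miss need not fill every subblock: predicted utility or a
        # busy memory channel could argue against fetching neighbors.
        self.valid[(addr % BLOCK_BYTES) // SUBBLOCK_BYTES] = True

    def hit(self, addr, tag):
        return (tag == self.tag and
                self.valid[(addr % BLOCK_BYTES) // SUBBLOCK_BYTES])

entry = SectoredEntry(tag=0x40)
entry.fill_subblock(0x1000 + 32)       # fill only the second subblock
assert entry.hit(0x1000 + 32, 0x40)    # hit in the filled subblock
assert not entry.hit(0x1000, 0x40)     # same tag, unfilled subblock: miss
```

One tag amortized over four subblocks is the tag-overhead saving;
the tag-matching-but-invalid case is the capacity/flexibility cost.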

While smaller blocks or subblocks would naturally provide
earlier restart, they are not the only mechanism for it. A
fill buffer that forwards words as they arrive would provide
similar functionality. The latency of a critical word need not
be determined by the block size and bandwidth.
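As a toy illustration of that last point (all latency numbers and the
beat width are invented), delivering the critical beat first keeps
restart latency independent of block size, while a strictly in-order
fill does not:

```python
# Toy latency model: restart latency of the critical word under
# critical-word-first versus in-order fill. Parameters are made up.

MEM_LATENCY = 100          # cycles until the first beat arrives
BEAT_BYTES = 16            # bus width per beat

def restart_latency(block_bytes, critical_offset, critical_word_first):
    # block_bytes is deliberately unused on the critical-word-first
    # path: restart does not depend on it, only total fill time does.
    beats_before = 0 if critical_word_first else critical_offset // BEAT_BYTES
    return MEM_LATENCY + beats_before + 1   # +1 for the critical beat

# Doubling the block size doubles total fill time but not restart time:
assert restart_latency(64, 48, True) == restart_latency(128, 112, True)
# With in-order fill, a later critical word in a bigger block waits longer:
assert restart_latency(128, 112, False) > restart_latency(64, 48, False)
```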

(Channel width and depth are also considerations. Commodity
DRAMs exploit high beat count to support bandwidth at lower
cost. The addressing overhead is similar to tag overhead.)

Mechanisms such as the V-way cache increase metadata overhead
by having tags without associated data and adding separate
index metadata. This extra metadata overhead might motivate
larger blocks, but such also presents the opportunity to use
subblocking with less overhead by providing multiple indexes —
invalid blocks would use no data storage. (While the original
proposal used global indexes to exploit a global replacement,
one could imagine small index values expanded based on the
metadata storage slot with each slot in a set using a different
collection of storage blocks. This would hurt miss rate and
NUCA optimization, but the lower tag overhead might be
attractive enough.) Even with just NUCA-targeting indirection,
subblocking could shed a significant drawback (capacity waste).

(A V-way cache may have twice as many sets as expected for
the associativity and capacity. This increases the effective
associativity by mapping half of the potential conflicts to
another set. With the index indirection, inspired by NUCA
proposals, this also facilitates broader victim choice; the
victim is somewhat decoupled from the associativity. [The
effect of skewed associativity on such a design seems likely
to be interesting, especially with cuckoo/elbow replacement
where a tag/index could be migrated to a different metadata
storage slot/index without data movement normally associated
with cuckoo caches. This kind of design also presents the
possibility of sharing data storage between two L2 caches, where
the duplication of cache tags is similar to the V-way cache's
set duplication.])
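A rough sketch of the V-way-style decoupling discussed above — more
tag entries than data blocks, with each valid tag carrying a forward
index into a shared data pool — might look like this (all sizes and
names are invented for illustration):

```python
# Sketch of tag/data decoupling in the spirit of the V-way cache:
# the tag store holds more tags than there are data blocks; a valid
# tag carries an index (forward pointer) into shared data storage.
# Invalid tags consume no data storage, which is what makes cheap
# subblocking-like behavior possible.

NUM_TAGS = 8            # tag entries (e.g., 2x the data blocks)
NUM_DATA = 4            # data blocks actually backed by storage

tags = [None] * NUM_TAGS           # tag value, or None if slot unused
fwd  = [None] * NUM_TAGS           # tag slot -> data block index
free_data = list(range(NUM_DATA))  # global pool: enables global replacement

def install(slot, tag):
    tags[slot] = tag
    fwd[slot] = free_data.pop()    # any free block may back any tag,
                                   # so victim choice is decoupled from
                                   # set associativity

install(0, 0x12)
install(5, 0x34)
assert fwd[0] != fwd[5]                    # distinct backing blocks
assert tags[1] is None and fwd[1] is None  # tag-only slot: no data cost
```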

One might somewhat safely just equate the subblock size of a
sectored cache with the block size of interest, especially if
the capacity factor is mitigated by indirection.

In-cache compression, which increases the number of tags per
unit of actual data storage (like V-way cache), further
increases the fuzziness of cache block size. The capacity
cost of a larger block depends on the compressibility. If
lossy compression is used — even something as simple as early
termination or invalid-word-bitvector — the size of a cache
block would seem to be even less clearly meaningful. Lossy
compression would allow less than the full block to be valid,
similar to subblocking; partial replacement might allow better
packing. (Lossy compression might prefer retention of pointers/
indexes and perhaps data used to resolve unpredictable branches.)
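The invalid-word-bitvector idea can be made concrete with a small
sketch (class and mask names are hypothetical): a block is cached with
some words dropped, and reading a dropped word behaves like a
subblock miss:

```python
# Illustrative lossy scheme: a keep-mask records which words of a
# block were retained; dropped words must be refetched on access,
# much as with subblocking.

WORDS_PER_BLOCK = 8

class LossyBlock:
    def __init__(self, words, keep_mask):
        assert len(words) == WORDS_PER_BLOCK
        self.valid = keep_mask
        # retain only masked words (e.g., likely pointers/indexes)
        self.words = [w if (keep_mask >> i) & 1 else None
                      for i, w in enumerate(words)]

    def read(self, i):
        if (self.valid >> i) & 1:
            return self.words[i]
        raise LookupError("word dropped by lossy compression: refetch")

blk = LossyBlock(list(range(8)), keep_mask=0b00001111)  # keep low half
assert blk.read(2) == 2
try:
    blk.read(6)          # a dropped word acts like a subblock miss
    assert False
except LookupError:
    pass
```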

(Extra tags have other uses such as supporting tag-based
inclusion and tracking reuse distance. Non-tag uses for the
tag metadata storage would also be possible.)

Tag compression could also introduce locality-influenced early
eviction. E.g., André Seznec's "don't use the page number but
a pointer to it" introduces the possibility of page-conflict
eviction when the page number 'cache' evicts an entry whose
index is still in use by the main cache.
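That page-conflict eviction can be sketched as follows. This is only
an illustration of the mechanism described above (table sizes and
helper names are invented): cache tags hold a short index into a
small table of full page numbers, so evicting a table entry forces
out every line still pointing at it:

```python
# Sketch of tag compression via a page-number table: tags store a
# short index rather than the full page number. Evicting a table
# entry that live lines still reference causes early ("page-conflict")
# eviction of those lines.

page_table = {}    # small table: index -> full page number
cache_tags = {}    # cache line -> (page_index, offset_bits)

def insert_line(line, page, offset_bits, idx):
    page_table[idx] = page
    cache_tags[line] = (idx, offset_bits)

def evict_page_entry(idx):
    # Any line whose tag points at this index must be invalidated.
    del page_table[idx]
    stale = [l for l, (i, _) in cache_tags.items() if i == idx]
    for l in stale:
        del cache_tags[l]
    return stale

insert_line(line=1, page=0xABCDE, offset_bits=0x3, idx=0)
insert_line(line=2, page=0xABCDE, offset_bits=0x7, idx=0)
assert evict_page_entry(0) == [1, 2]   # page-conflict eviction
```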

There have also been proposals for finer-grained coherence
tracking/invalidation such that false sharing can occur at a
much finer granularity than the tag store implies. (This could
be combined with lossy compression or exploit highly improbable
error syndromes to mark invalid chunks.)

In terms of indexing and associativity, skewed associativity
also messes with the simple model of conflicts. With overlaid
skewed associativity, one can also implement different
alignments (e.g., 64B blocks aligned at 32B boundaries with
different non-bank-conflicting indexings to different
alignments and each entry supporting multiple alignments
depending on which 'way' happens to use the entry). (One
could also imagine stride-based caching using overlaid skewed
associativity to reduce capacity waste.) Cache blocks larger
than their alignment would mess with the common conception of
cache block size. (This technique *might* be useful to reduce
tagging overhead or increase way prediction/memoization
effectiveness and seems likely to be more useful for
instruction caches.)
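The basic skewing effect — conflicts in one way not lining up with
conflicts in another — can be shown with two toy index functions (the
hash below is an arbitrary placeholder, not a proposed skewing
function):

```python
# Minimal sketch of skewed associativity: each way uses a different
# index function, so two blocks that conflict in one way need not
# conflict in another.

SETS = 64

def index_way0(block_addr):
    return block_addr % SETS

def index_way1(block_addr):
    # a different mix so conflicts do not line up across ways
    return (block_addr ^ (block_addr // SETS)) % SETS

a, b = 0x100, 0x140
assert index_way0(a) == index_way0(b)   # conflict in way 0...
assert index_way1(a) != index_way1(b)   # ...but not in way 1
```

This is what breaks the simple conflict model: whether two addresses
conflict is no longer a property of the address pair alone but of
which way each happens to occupy.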

[This post is not as clear nor as comprehensive as I would like,
but it does seem to provide a little signal to the newsgroup,
and I do think the observation stands that cache block size is
not necessarily as simple as sometimes presented.]
