Newsgroups: comp.arch
Date: Fri, 20 May 2022 15:39:45 -0700 (PDT)
In-Reply-To: <2022May20.205631@mips.complang.tuwien.ac.at>
Message-ID: <18bd9df5-c3d1-460b-8327-99f85b7a88d4n@googlegroups.com>
Subject: Re: Spectre and resource contention (was: branchless binary search)
From: MitchAl...@aol.com (MitchAlsup)

On Friday, May 20, 2022 at 2:40:12 PM UTC-5, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Friday, May 20, 2022 at 12:24:54 PM UTC-5, Anton Ertl wrote:
> >> However, we may significantly reduce the bandwidth of speculative
> >> accesses in order to avoid speculative side channels through resource
> >> contention. In that case, if there is only one cache line fetched
> >> speculatively, that will be useful 50% of the time.
> ><
> >No need to reduce BW, you just need somewhere other than the cache to
> >hold the data (and forward other units of data) until the LD passes the
> >completion point. Most processors have something like "Miss Buffers"
> >where all this can be done without adding any data-path HW (there is a bit
> >of added sequencing logic).
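<
A minimal sketch of that miss-buffer gating in C (every name and field here
is illustrative, not any particular design): returned data sits in the entry
and can be forwarded to dependent instructions, but L1 is only filled once
the owning load has passed the completion point, so a squashed speculative
load leaves no cache footprint.
<
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical fill routine -- stands in for the real L1 fill path. */
    void l1_fill(uint64_t line_addr, const uint8_t *data);

    /* One miss-buffer entry for an in-flight line. */
    struct miss_buffer_entry {
        uint64_t line_addr;   /* physical address of the line             */
        uint8_t  data[64];    /* line data returned from L2/L3/DRAM       */
        bool     valid;       /* data has arrived and may be forwarded    */
        bool     committed;   /* owning load passed the completion point  */
    };

    /* Called each cycle: only committed data drains into L1; a squashed
       speculative load simply invalidates the entry instead. */
    void drain(struct miss_buffer_entry *e) {
        if (e->valid && e->committed) {
            l1_fill(e->line_addr, e->data);
            e->valid = false;
        }
    }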
> That helps against attacks that then test for the presence of some
> data in a cache. But I also want to be immune to an attack like the
> following:
>
> One core contains your secret bit, and somehow the secret bit is
> accessed speculatively (there is some gadget in your code that the
> attacker can use). A load depends on the secret bit: if it is clear,
> the load accesses data in the L1 cache, if it is set, it accesses data
> in the shared L3 cache.
<
Phraseology:: an LD either hits in L1 or it does not (and so on).
I fail to see how the secret bit is present in the core being attacked
unless the core performs a Flush to L3 on a line recently in L1.
{So, you prevent this attack by not flushing.}
<
I can see how an attacker might conspire to move a line from L1 cache
to L3 cache if the implementation has a "generous" snoop policy.
But snoops, being as expensive as they are, almost invariably
use "precise" snooping. And under precise snooping, the attacker
and the attacked would have to be sharing address space for the
attacker to migrate a cache line from L1 to L3.
<
{Also note: in my current designs, L3 is a write back and prefetch
buffer for DRAM that is snoopable by cores which miss in L1 and
L2....}
<
Is the secret bit visible to the instruction stream prior to accessing L1 ?
Why is the secret bit present in the core ?
<
I also fail to see how background I/O would not ALSO be a source
of noise to the attacker. Coherence traffic is another source of
timing irregularities.
<
But let us assume that there are explanations for the above......
>
> Another core is under control of the attacker, and the attacker
> constantly accesses the L3 cache speculatively, and measures the
> bandwidth he sees. If he does not get full bandwidth, he knows
> someone else has accessed the L3 cache. Under certain circumstances,
> this allows him to know that your secret bit is set. And if he gets
> the full bandwidth in the time window when he knows you should be
> doing the access, he knows that your secret bit is clear.
>
> What can we do against this attack?
<
Build the L3 using mainframe memory banking. As long as the multiple
cores are accessing independent banks of L3, everybody sees full bandwidth.
Given 4MB of L3, one could have between 16 and 128 banks for a chip
containing 8-32 cores.
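<
As a sketch (the 64-bank figure and the names are assumptions for
illustration): the bank is selected from address bits just above the line
offset, so consecutive lines land in different banks and independent
streams spread across the whole array.
<
    #include <stdint.h>

    #define LINE_BYTES 64u
    #define N_BANKS    64u   /* assumed: within the 16..128 range above */

    /* Pick the L3 bank from the address bits just above the line offset. */
    static inline unsigned l3_bank(uint64_t paddr) {
        return (unsigned)((paddr / LINE_BYTES) % N_BANKS);
    }
Two cores whose lines map to different banks never collide, so each sees
full bandwidth.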
<
Generally, a given core has only a handful of miss buffers (say 8).
......so we have a fundamental limit of 8 outstanding references per core
L1 cache is 1 cycle away with 1 cycle throughput (cache access width)
L2 cache is 10+ cycles away with 1 access per 8 cycles (cache line)
......L2 generally has multiple banks, but this is used to save power.
L3 cache is 40+ cycles away with 1 access per 8 cycles (cache line)
......but L3 has multiple banks so multiple accesses are concurrent.
......total available L3 BW is typically bound by the generation of
......snoop traffic, not by cache delivery.
DRAM is 150+ cycles away when none of the caches hit (30ns gives
20ns for DRAM access and 10ns to route access to DRAM controller
and to route data back to requestor.) Coherence traffic is going to
be closer to 250 cycles (core access to knowing which data is
useable.)
<
Here the attacking core only gets 8/40 (1/5) of the available bandwidth
even if the attacker is the only core enabled to run code; once you factor
in coherence traffic, a given core accessing cacheable data will see even
less BW.
<
So, we have a fundamental limit of 8 outstanding references per core
and a coherence limitation of ~250 cycles to send out all the accesses
and receive back all of the resulting coherence traffic. So, any given core
can only "do" about 1 access every 32 cycles.
<
And basically, this leaves plenty of available BW in multi-banked L3
so that temporal measurements are going to be very difficult.
<
The above was written as if cache lines are transmitted in 4-8 beat bursts.
Transmitting an entire cache line in a single beat does not change the
above numbers much, maybe taking 6 cycles out of each layer in the
caches/DRAM except L1 and L2.
>
> Divide the bandwidth such that every one of N cores gets only 1/Nth of
> the bandwidth.
<
Not gonna fly.........but it happens naturally........so, don't worry about this.
>
> This seems overly restrictive for non-speculative accesses (we
> typically use software measures to avoid such problems for important
> secrets there), so my next idea is that non-speculative accesses can
> make use of the full bandwidth (crowding out speculative accesses),
> but speculative accesses can use only every 1/Nth time slot (and only
> if that time slot is not taken by a non-speculative access), with each
> time slot belonging exclusively to one core as far as speculative
> accesses are concerned. So a core can never measure if there was a
> speculative access by another core.
<
Given a multi-banked L3, one faces the exact same problem as was faced by
the CRAY 1/XMP 8-processor system. The first 6 processors could saturate
the banking of memory (3 accesses per cycle × 6 processors > banks).
The CRAY 1/YMP solved this problem by having a priority mechanism at
the bank selection, so it would take a pending request over a newly
arriving request any time the bank was available to start a new access.
<
Presto: a fairly banked large L3 where everyone gets essentially equal access
with small time variances. Fairly, here, means everybody gets rather equal
BW per bank under bank collisions.
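<
A sketch of that arbitration rule in C (QLEN, the 8-cycle bank busy time,
and all names are invented for illustration): whenever the bank can start
a new access, a pending request beats a newly arriving one.
<
    #define QLEN 8

    struct bank {
        int q[QLEN];   /* FIFO of waiting requester ids          */
        int head;      /* index of the oldest queued request     */
        int count;     /* number of queued requests              */
        int busy;      /* cycles until the bank can start again  */
    };

    /* One cycle of arbitration: returns the requester granted, or -1.
       An older pending request always wins over a new arrival, which is
       what bounds each core's wait and keeps the banks fair. */
    int bank_tick(struct bank *b, int new_req) {
        int grant = -1;
        if (b->busy > 0) {
            b->busy--;
        } else if (b->count > 0) {           /* pending beats new */
            grant = b->q[b->head];
            b->head = (b->head + 1) % QLEN;
            b->count--;
        } else if (new_req >= 0) {           /* bank idle, queue empty */
            grant = new_req;
            new_req = -1;
        }
        if (grant >= 0) b->busy = 8;         /* assumed 8-cycle bank cycle */
        if (new_req >= 0 && b->count < QLEN) {
            b->q[(b->head + b->count) % QLEN] = new_req;  /* queue arrival */
            b->count++;
        }
        return grant;
    }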
>
> For per-core caches something like that has to be done for snooping
> accesses, but with many more time slots for the core the cache belongs
> to.
<
One can deal with the snoops by replicating cache tags.
What one cannot do is shorten the latency to collect snoop information
from all layers of the cache hierarchy--or make these have guaranteed
latency (or even a short span of latencies.) This is a routing problem,
not a latency problem.
>
> The question is how much performance that costs. It seems to me that
> the bandwidth-heavy code (HPC code) performs mostly non-speculative
> accesses, and most other code tends to work well with caches, so a CPU
> with large per-core caches should not suffer much from this
> countermeasure.
<
A) it's been a long time since someone made an interconnect for HPC
access patterns that was not a GPU. Back in the CRAY days, there was
no cache, and no lack of cash to build suitable memory systems.
<
B) Your typical CPU caches see 95% code hit rates, 90% data hit rates
in L1. L2 cleans up another 4× {making this 8× causes too much added
L2 latency to "pay off" in the long run for "typical codes"}.
<
C) On the other hand, DRAM is still "lots" of latency cycles away, and is
generally accessed by EVERYONE sharing L3 storage....
<
D) The GPU I worked on used 2×1024-bit wide busses per "shader" core
(1 inbound, 1 outbound) whereas current cache lines are only 512-bits !!
<
So, this is what current implementations are faced with: Developing a
cache and routing hierarchy that provides a good illusion that every core
is essentially independent of every other core and treated fairly.
<
So, at this point, I know I have not given a solution to the issue raised,
merely outlined the current state of affairs. BUT:
<
1) it seems to me that when DRAM is doing any autonomous (or
programmatic) prefetching, this adds noise to L3 measurements
2) multi-banking makes the noise harder to measure
3) routing adds more noise
4) DRAM adds still more noise
5) L3 can be used to mitigate Rowhammer (multiple writes get absorbed
by L3 and DRAM sees but 1 single write--which prevents writing to
the same 'word' of DRAM more than "every once in a while", and DRAM
manufacturers don't have to do anything "new" about Rowhammer)
<
But this still leaves me concerned that an attack is possible.
>
> One thing that this idea does not help against is an attack from the
> same core (say, some sandboxed code under the attacker's control
> speculatively accesses data outside the sandbox). Javascript works
> against that by making measurement imprecise, but I would prefer it if
> such an attack were impossible.
<
Consider a JavaScript application which creates a native code trojan
and transfers control to the trojan in order to read the high precision clock !?!
<
It is because of strategies like this that I dislike a page of memory being
writable and executable at the same time.
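<
For illustration, the POSIX way to honor that rule (a minimal sketch;
emit_code, the fixed 4 KB size, and the len <= 4096 assumption are all
invented for the example): the page is writable while the code is produced,
then flipped to executable with write permission dropped, never both at once.
<
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Write machine code into a page that is never writable and
       executable at the same time (assumes len <= 4096). */
    void *emit_code(const unsigned char *code, size_t len) {
        void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return NULL;
        memcpy(page, code, len);          /* write while non-executable */
        if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) {
            munmap(page, 4096);           /* keep W^X even on failure   */
            return NULL;
        }
        return page;
    }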
>
> However, AFAIK security researchers have not demonstrated Spectre
> attacks that use resource contention as a side channel yet, only
> persistent microarchitectural state, which is much less ephemeral. So
<
Likely because these resources have inherent noise in the measurements
AND because there is lots of other system noise being inserted randomly.
<
> if we fix the latter kinds of attacks in the way outlined by Mitch
> Alsup and me, we would gain a lot. Then security researchers could
> put their sights on the (harder) attacks involving resource
> contention, and we will see if they succeed, and if so, can enable the
> chicken bit for the fix I outlined above:-).
<
Agreed.
<
I wonder what would happen if the high precision timer was incremented
per clock and per route. That is::
next HPT = current HPT + 1 + routes;
where routes is the number of unique routing decisions made per clock in
the interconnect fabric (maybe just to the L3 or DRAM, but aggregating
noise into the HPT still enables it to be used for the lots of stuff it is
being used for that we find acceptable).
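<
A toy model of that fuzzed timer (routes_this_clock() is an invented
stand-in for the fabric's per-clock count of unique routing decisions):
<
    #include <stdint.h>

    /* Hypothetical: unique routing decisions made in the fabric this clock. */
    unsigned routes_this_clock(void);

    static uint64_t hpt;   /* the high precision timer */

    /* Advance the HPT once per clock: the +routes term folds interconnect
       activity into the count, so the timer stays monotonic and usable for
       ordinary timing, but no longer counts pure wall-clock cycles with the
       precision a side-channel measurement needs. */
    void hpt_tick(void) {
        hpt = hpt + 1 + routes_this_clock();
    }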
<
> - anton
> --
> 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
