Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!.POSTED.10.184.180.213.static.wline.lns.sme.cust.swisscom.ch!not-for-mail
From: bqt...@softjar.se (Johnny Billquist)
Newsgroups: comp.os.vms
Subject: Re: OpenVMS async I/O, fast vs. slow
Date: Wed, 8 Nov 2023 14:50:47 +0100
Organization: MGT Consulting
Message-ID: <uig3nn$2ke$2@news.misty.com>
References: <ff7f2845-84cc-4c59-a007-1b388c82543fn@googlegroups.com>
<ui59v2$rn5$3@news.misty.com> <ui5jkn$jcp$1@reader2.panix.com>
<uidpd1$6s2$4@news.misty.com> <uieq37$3o9$1@reader2.panix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 8 Nov 2023 13:50:47 -0000 (UTC)
Injection-Info: news.misty.com; posting-host="10.184.180.213.static.wline.lns.sme.cust.swisscom.ch:213.180.184.10";
logging-data="2702"; mail-complaints-to="abuse@misty.com"
User-Agent: Mozilla Thunderbird
Content-Language: en-GB
In-Reply-To: <uieq37$3o9$1@reader2.panix.com>

On 2023-11-08 03:00, Dan Cross wrote:
> In article <uidpd1$6s2$4@news.misty.com>,
> Johnny Billquist <bqt@softjar.se> wrote:
>> On 2023-11-04 15:14, Dan Cross wrote:
>>>>> [snip]
>>>>> Consider the problem of a per-core queue with work-stealing;
>>>>> the specifics of what's in the queue don't matter so much as the
>>>>> overall structure of the problem: items in the queue might
>>>>> represent IO requests, or they may represent runnable threads,
>>>>> or whatever. Anyway, a core will usually process things on
>>>>> its own local queue, but if it runs out of things to do, it
>>>>> may choose some victim and steal some work from it.
>>>>> Conventionally, this will involve everyone taking locks on
>>>>> these queues, but that's ok: we expect that work stealing is
>>>>> pretty rare, and that usually a core is simply locking its own
>>>>> queue, which will be an uncontended write on a cacheline it
>>>>> already owns. In any event, _most of the time_ the overhead
>>>>> of using a "lock" will be minimal (assuming relatively simply
>>>>> spinlocks or MCS locks or something like that).
>>>>
>>>> I generally agree with everything you posted. But I think it's
>>>> unfortunate that you bring in cache lines into this, as it's somewhat
>>>> incorrect to talk about "owning" a cache line, and cache lines are not
>>>> really that relevant at all in this context.
>>>
>>> This is an odd thing to say. It's quite common to talk about
>>> core ownership of cache lines in both MESI and MOESI, which are
>>> the cache coherency protocols in use on modern x86 systems from
>>> both Intel and AMD, respectively. Indeed, the "O" in "MOESI" is
>>> for "owned" (https://en.wikipedia.org/wiki/MOESI_protocol).
>>
>> I think we're talking past each other.
>
> Yes. See below.

:-)

And yes, I know how the cache coherency protocols work. Another thing
that was covered already when I was studying at University.

>>> Cache lines and their ownership are massively relevant in the
>>> context of mutual exclusion, atomic operations, and working on
>>> shared memory generally.
>>
>> You are then talking about it in the sense that if a CPU writes to
>> memory, and it's in the cache, then yes, you get ownership properties
>> then/there. Primarily because you are not immediately writing back to
>> main memory, and other CPUs are allowed to then hold read-only copies. I
>> sortof covered that exact topic further down here, about either
>> invalidate or update the caches of other CPUs. And it also holds that
>> other CPUs cannot freely write or update those memory cells when another
>> CPU holds a dirty cache about it.
>
> Well, no; in the protocols in use on x86, if you have exclusive
> access to a cache line, in either the exclusive or modified
> state, then any other cache's copy of that line is invalidated.

Yes.

> MOESI augments this by allowing sharing of dirty data via the
> OWNED state, allowing for cache-to-cache updates.

Yes. But a line only transitions into the Owned state when you write to it
while other CPUs still hold copies in their caches, and the data is not
written back to main memory at that point.

>> But before the CPU writes the data, it never owns the cache line. So it
>> don't make sense to say "uncontended write on a cacheline it already
>> owns". Ownership only happens when you do the write. And once you've
>> written, you own it.
>
> This ignores the context; see what I wrote above.
>
> To recap, I was hypothesizing a multiprocessor system with
> per-CPU work queues with work stealing. In such a system, an
> implementation may be to use a spinlock for each CPU's queue: in
> the rare cases where one wants to steal work, one must lock some
> other CPU's queue, _BUT_, in the usual scenario when the per-CPU
> queue is uncontended, the local worker will lock the queue,
> remove an item from it, unlock the queue, do some work, _and
> then repeat_. Note that once it acquires the lock on its queue
> the CPU will own the corresponding cacheline; since we assume
> work stealing is rare, it is likely _it will still own it on
> subsequent iterations of this loop._ Hence, making an
> uncontended write on a cache line it already owns; here, it owns
> it from the last time it made that same write.

That would assume that the cache line has not been written back, which is
likely if we are talking about a short time after the last update of the
lock, but rather unlikely most of the time.

This boils down to patterns and timing (obviously). I believe that spin
locks are hit hard when you are trying to get a lock, but once you have
it, you will not be touching that word for quite a while, and it will
quickly fall out of the cache.
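
To make that concrete, here is a minimal C11 sketch of the kind of
test-and-test-and-set spin lock I have in mind (my own illustration; the
names are made up, it is not from VMS or anywhere else). The point is that
the waiting loop only does plain reads of its cached copy, and only the
actual acquire attempt does the atomic write that forces an ownership
transfer of the line.

#include <stdatomic.h>

/* Hypothetical spin lock word: 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* The atomic write needs the line in exclusive/modified state. */
        if (atomic_exchange_explicit(&l->locked, 1,
                                     memory_order_acquire) == 0)
            return;
        /* While the lock is held, spin on plain reads: the line can stay
           Shared in our cache, so we generate no coherency traffic until
           the holder writes 0 to release it. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
    }
}

void spin_unlock(spinlock_t *l)
{
    /* The releasing write invalidates every waiter's Shared copy. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}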

If you are a CPU trying to grab something from another CPU, with a spin
lock held by that other CPU, the spin lock word will most likely not sit
in the holder's cache. So the waiting CPU will be hitting it, reading the
value, and it gets into its cache. It might be several CPUs as well; they
will all get the value into their caches, the cached value will be flagged
as Shared, and no one will have it as Owned. The holding CPU then tries to
release the lock, at which time it writes the lock word; the other CPUs
still have it as Shared, and the releasing CPU gets it as Owned. The other
CPUs then try to get the lock, so they all start kicking the owning CPU's
cache in order to get the data committed to main memory, so they can
obtain ownership and grab the lock. One of them will succeed, the others
again end up with a Shared cache state. The one CPU that managed to grab
the lock will now be the owner, and all other CPUs will hit it for the
cached data, and eventually it will be written to memory and dropped from
the owner's cache, since the owner is not actually working on that data.
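
One common way to reduce that kicking, for what it's worth, is to back off
between attempts so the waiters are not all hammering the line at once. A
rough sketch, again my own, with arbitrary tuning numbers:

#include <stdatomic.h>

/* Test-and-test-and-set with exponential backoff on an atomic lock word
   (0 = free, 1 = held). The delay values are made up. */
void spin_lock_backoff(atomic_int *lock)
{
    unsigned delay = 1;
    for (;;) {
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0) {
            for (volatile unsigned i = 0; i < delay; i++)
                ;                      /* crude busy-wait between reads */
            if (delay < 1024)
                delay *= 2;            /* back off exponentially */
        }
    }
}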

Cache lines are typically 64 or 128 bytes, so even though locality means
there is some other data nearby, the owning CPU is unlikely to care about
anything else in that cache line, but that is speculation on my part.

The really bad thing is if you have several spin locks in the same cache
line...
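
The usual cure (a sketch assuming 64-byte lines; the real line size and
CPU count depend on the machine) is to align each lock so it sits alone in
its own cache line:

#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE 64   /* assumed line size, machine dependent */
#define NCPUS       8   /* made-up number, just for the illustration */

/* One lock word per CPU, each alone in its own line, so spinning on CPU
   A's lock cannot cause coherency traffic on the line holding CPU B's. */
struct padded_lock {
    alignas(CACHE_LINE) atomic_int locked;   /* 0 = free, 1 = held */
};

static struct padded_lock per_cpu_lock[NCPUS];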

But now I tried to become a little more technical. ;-)

But also, maybe we don't need to kick this around any more. Seems like
we're drifting. I think we started out with the question of I/O
performance, in this case specifically when using multiple threads on
VMS, and how Unix compatibility layers don't seem to get much performance,
which is no surprise to either of us, while VMS's own primitives can
deliver fairly decent performance.

Beyond that, I'm not sure if we are really arguing about anything much, or
basically nitpicking. :-)

>>>> But that's all there is to it. Now, it is indeed very costly if you have
>>>> many CPUs trying to spin-lock on the same data, because each one will be
>>>> hitting the same memory, causing a big pressure on that memory address.
>>>
>>> More accurately, you'll be pushing the associated cache line
>>> through a huge number of state transitions as different cores
>>> vye for exclusive ownership of the line in order to write the
>>> lock value. Beyond a handful of cores, the generated cache
>>> coherency traffic begins to dominate, and overall throughput
>>> _decreases_.
>>>
>>> This is of paramount importance when we start talking about
>>> things like synchronization primitives, which are based on
>>> atomic updates to shared memory; cache-inefficient algorithms
>>> simply do not scale beyond a handful of actors, and why things
>>> like MCS locks or CHM locks are used on many-core machines.
>>> See, for example:
>>> https://people.csail.mit.edu/nickolai/papers/boyd-wickizer-locks.pdf
>>
>> Yes. This is the scaling problem with spinlocks. Cache coherency starts
>> becoming costly. Having algorithms or structures that allow locking to
>> not be localized to one address is an important piece to help alleviate it.
>
> Not just address, but _cache lines_, which is why this the topic
> is so important. Multiple simultaneous writers to multiple
> memory locations in the same cache line can lead to contention
> due to false sharing.

That is true. Cache lines are really the smallest denominator here; I
should take care to put that correctly.
It doesn't help if you have some clever algorithm that spreads the lock
out over multiple addresses if they all hit the same cache line.
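
An MCS lock, which Dan mentioned earlier, is the textbook example of
getting that right: each waiter spins on a flag in its own queue node, so
in its own cache line, and the holder hands the lock to exactly one
successor. Roughly, in C11 (my own sketch, not taken from any particular
implementation):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;               /* true while this waiter must spin */
};

/* The lock itself is just a tail pointer; NULL means free. */
typedef _Atomic(struct mcs_node *) mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Swing the tail to our node; the old tail, if any, is our predecessor. */
    struct mcs_node *prev =
        atomic_exchange_explicit(lock, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                       /* lock was free, we own it */

    atomic_store_explicit(&prev->next, me, memory_order_release);
    /* Spin on our own node only: no shared cache line is hammered. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

void mcs_release(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *succ =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No known successor: try to reset the tail to NULL. */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A new waiter is in the middle of linking in; wait for it. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to exactly one successor via its private flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}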

>>>> (And then we had the PDP-11/74, which had to implement cache coherency
>>>> between CPUs without any hardware support...)
>>>
>>> Sounds rough. :-/
>>
>> It is. CPU was modified to actually always step around the cache for one
>> instruction (ASRB - used for spin locks), and then you manually turn on
>> and off cache bypass on a per-page basis, or in general of the CPU,
>> depending on what is being done, in order to not get into issues of
>> cache inconsistency.
>
> This implies that stores were in a total order, then, and
> these uncached instructions were serializing with respect to
> other CPUs?

The uncached instruction is basically there in order to be able to
implement a spin lock that works as you would expect. Once you have the
lock, you are either dealing with data that is known to be shared, in
which case you need to run with the cache disabled, or with data you know
is not shared, in which case you can let caching work as normal.

No access to shared resources is allowed without taking the lock first.
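
I don't have the actual code at hand, but the discipline is roughly this
(a C sketch of the idea only; every name below is an invented placeholder,
and on the real machine the acquire is the uncached ASRB while the cache
bypass is set per page or for the whole CPU):

/* All names here are hypothetical stand-ins, not anything real. */
struct shared_thing { volatile int lock; int data; };

/* Placeholder stubs for the machine-specific pieces; bodies omitted. */
static void acquire_lock_uncached(volatile int *lock) { (void)lock; }
static void release_lock_uncached(volatile int *lock) { (void)lock; }
static void cache_bypass_on(void)  { }
static void cache_bypass_off(void) { }

void touch_shared(struct shared_thing *t)
{
    acquire_lock_uncached(&t->lock);  /* the one cache-bypassing RMW */
    cache_bypass_on();                /* shared data is accessed uncached */
    t->data++;                        /* ... work on the shared data ... */
    cache_bypass_off();
    release_lock_uncached(&t->lock);
}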

Hey, it's an old system by today's standards. Rather primitive. I think
it's cool DEC even made it, and it works, and gives pretty acceptable
performance. But it was noted that going much above 4 CPUs really gave
diminishing returns. So the 11/74 never supported more than 4 CPUs. And
it gave/gives about 3.5 times the performance of a single 11/70.

Johnny
