devel / comp.arch / Re: compare-and-swap

Subject                              Author
* compare-and-swap                   robf...@gmail.com
+- Re: compare-and-swap              Terje Mathisen
+* Re: compare-and-swap              Scott Lurndal
|`- Re: compare-and-swap             robf...@gmail.com
`* Re: compare-and-swap              MitchAlsup
 `* Re: compare-and-swap             Scott Lurndal
  `* Re: compare-and-swap            MitchAlsup
   `* Re: compare-and-swap           Scott Lurndal
    `* Re: compare-and-swap          MitchAlsup
     +* Re: compare-and-swap         robf...@gmail.com
     |`* Re: compare-and-swap        Scott Lurndal
     | `* Re: compare-and-swap       MitchAlsup
     |  `- Re: compare-and-swap      Scott Lurndal
     `* Re: compare-and-swap         Scott Lurndal
      +* Re: compare-and-swap        MitchAlsup
      |+* Re: compare-and-swap       Anton Ertl
      ||`* Re: compare-and-swap      Terje Mathisen
      || `* Re: compare-and-swap     robf...@gmail.com
      ||  `- Re: compare-and-swap    EricP
      |+- Re: compare-and-swap       Scott Lurndal
      |`- Re: compare-and-swap       EricP
      +* Re: compare-and-swap        Anton Ertl
      |`* Re: compare-and-swap       Scott Lurndal
      | `* Re: compare-and-swap      EricP
      |  `* Re: compare-and-swap     Scott Lurndal
      |   +- Re: compare-and-swap    Michael S
      |   `- Re: compare-and-swap    EricP
      `* Re: compare-and-swap        EricP
       `* Re: compare-and-swap       Scott Lurndal
        `* Re: compare-and-swap      robf...@gmail.com
         +- Re: compare-and-swap     EricP
         `- Re: compare-and-swap     EricP

Re: compare-and-swap

 by: robf...@gmail.com - Sat, 7 Jan 2023 02:54 UTC

>>> Probably better to say an integer-data-path than a mini-CPU.
>>> It certainly is not taking interrupts or allowed to deal with exceptions;
>>> no FP, and likely not even LD, ST, or branches.
>>
>> I think I am going to try implementing a coherence point processor, CPP,
>> to see how well it works. I am sketching out a CPP with 16 opcodes and 13-bit
>> instructions. It will be able to load, store, and branch according to a counted
>> loop. It treats the instruction buffer as a shift register, so there is no program
>> counter at all. I am reminded of the additional logic that does raster ops in
>> EGA video, or something like the COPPER in the AMIGA. The CPP will have
>> a set of “programs” allowing it to do compare-and-swap,
>> exchange-and-add, and others. My main core will pass arguments and a
>> program number to execute to the CPP.

>How is this CPP different from a single serializing lock?

It could be used to implement a single serializing lock, so it is not different. I had
thought being able to issue more memory operations under the same lock
would be useful. However, once a flag is set to lock an area, ordinary instructions
could be used.
I have set the CPP aside for now as overly complex. Unless a really
complicated AMO is desired, it is easier just to implement indivisible RMW
cycles at the coherence point. Implementing the RMW cycles bumped up the
size of my memory controller a lot, about 25%: 21,000 to 26,000 LUTs. A CPP
would be worse still. It will be a challenge getting the whole project to fit.
From the core's perspective, how things are implemented at the coherence point
should be irrelevant. It just sends out an opcode to that point. I think I have got
the idea now. Also, the Wishbone bus leaves something to be desired, as it does not
support CMPXCHG directly. I must build onto it.
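
As a rough software model (the function and opcode names below are
illustrative, not from the post or from the Wishbone spec), the indivisible
RMW cycle performed at the coherence point amounts to this C sketch:

#include <stdint.h>

typedef enum { AMO_CAS, AMO_SWAP, AMO_ADD } amo_op;

/* Runs inside the memory controller while the target word is held,
 * so no other request can observe a partial update. */
uint64_t amo_execute(amo_op op, uint64_t *mem, uint64_t a, uint64_t b)
{
    uint64_t old = *mem;                           /* read phase        */
    switch (op) {
    case AMO_CAS:  if (old == a) *mem = b; break;  /* compare-and-swap  */
    case AMO_SWAP: *mem = a;               break;  /* exchange          */
    case AMO_ADD:  *mem = old + a;         break;  /* exchange-and-add  */
    }
    return old;                                    /* write back, reply */
}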

Re: compare-and-swap

 by: EricP - Sat, 7 Jan 2023 19:25 UTC

robf...@gmail.com wrote:
[...]
>> How is this CPP different from a single serializing lock?
>
> It could be used to implement a single serializing lock, so it is not different. I had
> thought being able to issue more memory operations under the same lock
> would be useful. However, once a flag is set to lock an area, ordinary instructions
> could be used.
> I have set the CPP aside for now as overly complex. Unless a really
> complicated AMO is desired, it is easier just to implement indivisible RMW
> cycles at the coherence point. Implementing the RMW cycles bumped up the
> size of my memory controller a lot, about 25%: 21,000 to 26,000 LUTs. A CPP
> would be worse still. It will be a challenge getting the whole project to fit.
> From the core's perspective, how things are implemented at the coherence point
> should be irrelevant. It just sends out an opcode to that point. I think I have got
> the idea now. Also, the Wishbone bus leaves something to be desired, as it does not
> support CMPXCHG directly. I must build onto it.

There are two uses for Read-Modify-Write (RMW) memory operations:
as non-interruptible sequences for small communications between priority
interrupt handlers and lower or non-interrupt levels within a single core,
and as atomic operations between cores in an SMP system.

Non-interruptible operations are needed so that an OS does not
have to be continually disabling and enabling interrupts, as that
can be very expensive.

Non-interruptible operations are locally coherent because a
single core always sees its own memory operations in order,
and so they do not require any memory barriers.

The set of non-interruptible instructions is the same as the atomic ones:
NFADD Non-interruptible fetch then ADD
NFAND Non-interruptible fetch then AND
NFOR Non-interruptible fetch then OR
NFXOR Non-interruptible fetch then XOR
NSWAP Non-interruptible SWAP
NCAS Non-interruptible Compare-And-Swap

There is also a use for some double-wide swap operations:
NSWAPW Non-interruptible SWAP Wide
NCASW Non-interruptible Compare-And-Swap Wide
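
As a reading of those semantics, here is a C-style sketch of NFADD and
NCAS (illustrative only; the non-interruptibility itself cannot be
expressed in portable C, so it is stated in the comments):

#include <stdint.h>
#include <stdbool.h>

/* NFADD: fetch then ADD; behaves as if interrupts were held off on the
 * local core from the load through the store. */
uint64_t nfadd(uint64_t *addr, uint64_t v)
{
    uint64_t old = *addr;   /* fetch    */
    *addr = old + v;        /* then add */
    return old;
}

/* NCAS: compare-and-swap under the same non-interruptibility rule. */
bool ncas(uint64_t *addr, uint64_t expect, uint64_t desired)
{
    if (*addr != expect)
        return false;
    *addr = desired;
    return true;
}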

Atomic operations have the additional requirement for various
memory barriers to ensure the global coherence rules are followed.
The barriers are applied locally first, such as flushing local caches,
then at the cache coherence level, such as synchronizing with the
inbound and outbound cache comms message queues,
before the memory RMW changes can be applied.

Those barriers can be explicit or implicit.
If those barriers are explicit, then to get atomic operations
they would be combined with the non-interruptible instructions:
MEMBAR_RW
NFADD
MEMBAR_RW

If those barriers are implied in the atomic instruction then there
is a second set of RMW instructions which combines the non-interruptible
operations with the implied barriers:
AFADD, AFAND, AFOR, AFXOR, ASWAP, ACAS, ASWAPW, and ACASW.

The advantage of having the barriers explicit is that a
programmer may know more about their use context and may
be able to use a weaker barrier with a smaller performance impact.
Also, multiple instructions can occur between two barriers.

With the barriers implied, atomic instructions must
assume the worst case and use the strongest barrier possible,
applied both before and after each atomic instruction.
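
In C11 atomics terms (an analogy only, not this hypothetical ISA), the
two styles look roughly like this:

#include <stdatomic.h>

void explicit_style(atomic_long *ctr)
{
    atomic_thread_fence(memory_order_seq_cst);                /* MEMBAR_RW */
    atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);  /* NFADD     */
    atomic_thread_fence(memory_order_seq_cst);                /* MEMBAR_RW */
}

void implied_style(atomic_long *ctr)
{
    /* AFADD: the strongest ordering folded into the RMW itself. */
    atomic_fetch_add_explicit(ctr, 1, memory_order_seq_cst);
}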

My inclination would be to do all of the above, and implement
both the non-interruptible and atomic RMWs in the cache controller.
Sorting out the barriers and their effects on various things
like the load-store queue, store buffers, and synchronization with
pending inbound and outbound coherence messages
looks to me like a far bigger job than the RMW operations.

Re: compare-and-swap

 by: EricP - Sat, 7 Jan 2023 21:40 UTC

robf...@gmail.com wrote:
> I have set the CPP aside for now as overly complex. Unless a really
> complicated AMO is desired, it is easier just to implement indivisible RMW
> cycles at the coherence point. Implementing the RMW cycles bumped up the
> size of my memory controller a lot, about 25%: 21,000 to 26,000 LUTs. A CPP
> would be worse still. It will be a challenge getting the whole project to fit.
> From the core's perspective, how things are implemented at the coherence point
> should be irrelevant. It just sends out an opcode to that point. I think I have got
> the idea now. Also, the Wishbone bus leaves something to be desired, as it does not
> support CMPXCHG directly. I must build onto it.

If implementation cost is a consideration then LL/SC might be
the way to go. It doesn't require putting an ALU and
sequencing logic into the cache controller; it just requires
pinning the cache line for a period of time while the lock is held.
It requires releasing the lock if an interrupt occurs,
or a timer-counter times out, or when an SC is successful.
LL/SC also gives you all the non-interruptible operations for free.
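
For illustration, a fetch-and-add and a CAS built from LL/SC, written
with hypothetical __ll()/__sc() intrinsics (real ISAs spell these
LDXR/STXR on ARMv8, or LR/SC on RISC-V):

#include <stdint.h>
#include <stdbool.h>

uint64_t __ll(volatile uint64_t *addr);             /* hypothetical: load-linked       */
bool     __sc(volatile uint64_t *addr, uint64_t v); /* hypothetical: store-conditional */

uint64_t llsc_fetch_add(volatile uint64_t *addr, uint64_t v)
{
    uint64_t old;
    do {
        old = __ll(addr);            /* pin the line, start monitoring */
    } while (!__sc(addr, old + v));  /* retry if the line was lost     */
    return old;
}

bool llsc_cas(volatile uint64_t *addr, uint64_t expect, uint64_t desired)
{
    do {
        if (__ll(addr) != expect)
            return false;            /* mismatch: the reservation lapses */
    } while (!__sc(addr, desired));
    return true;
}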

The performance cost is that the line stays pinned slightly longer than
it would with the explicit atomic RMWs, and that may require
shutting off the cache coherence comms queues,
and maybe that affects system-wide performance.

The memory barrier functions look the same either way.

Re: compare-and-swap

 by: EricP - Sun, 8 Jan 2023 17:03 UTC

Scott Lurndal wrote:
> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> scott@slp53.sl.home (Scott Lurndal) writes:
>>> No, they started with the old ARMv7 mechanism and realized that it
>>> didn't scale to large processor counts in ARMv8 around which high-core
>>> count systems were being designed.
>> Why did it not scale? The only thing that comes to my mind is that
>> it's a contended memory location that several cores want to change at
>> the same time, much of the time.
>
> A big problem was fairness, i.e. avoiding starvation, when a large
> number of cores were accessing the same object.
>
>> Yes, in that case I can see that
>> doing it at a central place and only sending the results to the
>> individual cores can result in much higher throughput (say one change
>> every few ns) than sending the cache lines involved to one of the
>> cores (20ns-80ns on current multi-core CPUs), and letting it make the
>> change there, and then sending it onwards to the next core; and that
>> assumes that the other cores asking for the cache lines do not disturb
>> the chosen one.
>
> Indeed.
>
>> But can such high contention not be avoided by redesigning the
>> software? Admittedly, solving a problem in hardware is often less
>> effort, but it's not clear to me why that should be the case here.
>
> I spent a few weeks at LLNL a bit over a decade ago. One of their
> acceptance tests includes spinning on a single spinlock from all
> cores (256 in our system) at the same time. The system we were testing was built
> from AMD Istanbul cores with a custom ASIC on HyperTransport to
> extend the processor coherency domain to other nodes over InfiniBand
> (or 10G Ethernet, but Ethernet switches had double the latency of
> IB switches). In Intel/AMD systems, when cache line contention
> causes an internal cache timer to fire because the cache couldn't
> acquire the line, the core will revert to the system-wide
> bus lock. Which absolutely killed performance.

This is a well-known problem for test-and-test-and-set (TTS) spinlocks
under high contention - high rates of cache line ping-pong,
many nodes trying to do a read_exclusive at the same instant
causing timeouts and retries at the coherence protocol level,
floods of messages on the coherence network causing transmission backlogs.

> Removing the
> cache contention by handling atomics at the cache rather than
> acquiring the cache line is a partial solution for this type of
> scaling problem

Moving the atomic operations from the core to the cache controller
should eliminate the window during which a cache line is pinned
and that may help system performance, but does not eliminate the
ping-pongs or lessen the number of messages.

> (Customers aren't generally willing to
> rewrite applications if they don't have to....)

TTS spinlocks are used because they are cheap - 1 DWORD per lock.
The problem is that the TTS spinlock method is too simple and
it should only be used when contention is known to be unlikely.
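
For contrast with the spin-queue described next, a minimal TTS spinlock
sketch in C11 atomics (illustrative only):

#include <stdatomic.h>

typedef atomic_int tts_lock;   /* the whole lock: one word */

void tts_acquire(tts_lock *lk)
{
    for (;;) {
        /* "Test": spin read-only on a shared copy of the line first... */
        while (atomic_load_explicit(lk, memory_order_relaxed))
            ;
        /* ...then "test-and-set": one exclusive RMW attempt. */
        if (!atomic_exchange_explicit(lk, 1, memory_order_acquire))
            return;
    }
}

void tts_release(tts_lock *lk)
{
    atomic_store_explicit(lk, 0, memory_order_release);
}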

The solution is a spin-queue which builds a linked list of cache lines,
with each waiter spinning on a separate cache line, so no ping-pongs,
no thundering herds fighting for exclusive access to the same DWORD.
Most major OSes and many user-mode multithreading apps now use such
spin-queues wherever there is high contention for a data structure.
The cost is one cache line per lock, plus one line per contender,
and each lock has a pointer to the list head.
Enqueuing a lock request is just an atomic exchange on the list head,
which sees little to no contention.

In terms of code changes, it's a few hours' work replacing SpinlockXxx
functions with SpinqueueXxx functions.
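
A minimal sketch of such a spin-queue, in the style of an MCS lock (one
widely published form of this idea; the names and the C11 spelling here
are illustrative, not from the post):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Alignas(64) struct mcs_node *_Atomic next;  /* one cache line per waiter */
    atomic_bool locked;
} mcs_node;

typedef struct {
    mcs_node *_Atomic tail;   /* per-lock pointer to the queue tail */
} mcs_lock;

void mcs_acquire(mcs_lock *lk, mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Enqueue with a single atomic exchange; this is the only RMW on a
     * shared word, so contention on it stays light. */
    mcs_node *prev = atomic_exchange_explicit(&lk->tail, me,
                                              memory_order_acq_rel);
    if (prev != NULL) {
        atomic_store_explicit(&prev->next, me, memory_order_release);
        /* Spin on our own cache line only: no ping-pong. */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

void mcs_release(mcs_lock *lk, mcs_node *me)
{
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* No successor visible: try to swing the tail back to empty. */
        if (atomic_compare_exchange_strong_explicit(
                &lk->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A waiter is mid-enqueue; wait for it to link itself in. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand off by flipping the next waiter's private flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}

Each waiter spins only on its own locked flag, so the only RMW on shared
data is the exchange on the tail pointer.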

Re: compare-and-swap

 by: Scott Lurndal - Sun, 8 Jan 2023 17:14 UTC

EricP <ThatWouldBeTelling@thevillage.com> writes:
[...]
>This is a well-known problem for test-and-test-and-set (TTS) spinlocks
>under high contention - high rates of cache line ping-pong,
>many nodes trying to do a read_exclusive at the same instant
>causing timeouts and retries at the coherence protocol level,
>floods of messages on the coherence network causing transmission backlogs.

It's been well known for half a century. By computer scientists.

The engineers and scientists who have written the applications, however,
not so much.

>
>> Removing the
>> cache contention by handling atomics at the cache rather than
>> acquiring the cache line is a partial solution for this type of
>> scaling problem
>
>Moving the atomic operations from the core to the cache controller
>should eliminate the window during which a cache line is pinned
>and that may help system performance, but does not eliminate the
>ping-pongs or lessen the number of messages.

It does eliminate a substantial fraction of line movement between
cores, particularly when replacing LL/SC style primitives.

>
>> (Customers aren't generally willing to
>> rewrite applications if they don't have to....)
>
>TTS spinlocks are used because they are cheap - 1 DWORD per lock.
>The problem is that the TTS spinlock method is too simple and
>it should only be used when contention is known to be unlikely.

Yes, that is also true and rather self-evident.

>
>The solution is a spin-queue which builds a linked list of cache lines,
>with each waiter spinning on a separate cache line, so no ping-pongs,
>no thundering herds fighting for exclusive access to the same DWORD.
>Most major OSes and many user-mode multithreading apps now use such
>spin-queues wherever there is high contention for a data structure.
>The cost is one cache line per lock, plus one line per contender,
>and each lock has a pointer to the list head.
>Enqueuing a lock request is just an atomic exchange on the list head,
>which sees little to no contention.
>
>In terms of code changes, it's a few hours' work replacing SpinlockXxx
>functions with SpinqueueXxx functions.

That is likely underestimating the effort of both replacing the
existing synchronization primitives and testing the resulting
codebase.

The folks responsible for those codebases would rather be doing
science and engineering, not rewriting working code.

There is a reason that LLNL tests how systems behave under such
loads.

Re: compare-and-swap

 by: Michael S - Sun, 8 Jan 2023 17:41 UTC

On Sunday, January 8, 2023 at 7:14:11 PM UTC+2, Scott Lurndal wrote:
[...]
> That is likely underestimating the effort of both replacing the
> existing synchronization primitives and testing the resulting
> codebase.
>
> The folks responsible for those codebases would rather be doing
> science and engineering, not rewriting working code.
>

So, your solution improves performance from 0.1% of what it should be
to 1% of what it should be.
Marketing can spin it as a 10x improvement, but engineers should understand
that it is an improvement of only 0.9 percentage points.

> There is a reason that LLNL tests how systems behave under such
> loads.

Sure there is a reason. But not a good reason.

Re: compare-and-swap

 by: EricP - Sun, 8 Jan 2023 18:40 UTC

Scott Lurndal wrote:
[...]
> That is likely underestimating the effort of both replacing the
> existing synchronization primitives and testing the resulting
> codebase.
>
> The folks responsible for those codebases would rather be doing
> science and engineering, not rewriting working code.
>
> There is a reason that LLNL tests how systems behave under such
> loads.

Replacing the spinlocks with spin-queues would likely completely change
the behavior of such an application, by eliminating overhead and changing
the order in which threads are released. Things happen quicker and in a
different order.

And that could expose latent bugs, maybe an argument alias that went
unnoticed. Before, the app happened to finish one calculation before the
next started; now the two overlap and the alias bites you.

So yes, it would have to be completely re-tested and re-tuned.

