devel / comp.arch / Re: Memory dependency microbenchmark

Subject (Author)
* Memory dependency microbenchmark (Anton Ertl)
+* Re: Memory dependency microbenchmark (EricP)
|`* Re: Memory dependency microbenchmark (Anton Ertl)
| `* Re: Memory dependency microbenchmark (EricP)
|  `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|   `* Re: Memory dependency microbenchmark (EricP)
|    +* Re: Memory dependency microbenchmark (MitchAlsup)
|    |`* Re: Memory dependency microbenchmark (EricP)
|    | `- Re: Memory dependency microbenchmark (MitchAlsup)
|    `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|     `* Re: Memory dependency microbenchmark (MitchAlsup)
|      `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|       `* Re: Memory dependency microbenchmark (MitchAlsup)
|        `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|         `* Re: Memory dependency microbenchmark (Kent Dickey)
|          +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          |+* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          || `* Re: Memory dependency microbenchmark (Kent Dickey)
|          ||  +* Re: Memory dependency microbenchmark (aph)
|          ||  |+- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  | `* Re: Memory dependency microbenchmark (aph)
|          ||  |  +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |  `* Re: Memory dependency microbenchmark (Kent Dickey)
|          ||  |   +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   +* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |   |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |  `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |   +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |   `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |   |    `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   |     `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   +* Re: Memory dependency microbenchmark (aph)
|          ||  |   |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |   | `* Re: Memory dependency microbenchmark (aph)
|          ||  |   |  `- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |   `* Re: Memory dependency microbenchmark (Stefan Monnier)
|          ||  |    `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |`* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |  `* Re: Memory dependency microbenchmark (aph)
|          ||  |     |   +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |   `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     +* Re: Memory dependency microbenchmark (Scott Lurndal)
|          ||  |     |`* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |  `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     |   `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |    `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     |     `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |      `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||  |     |       `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |        `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     |         `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||  |     `- Re: Memory dependency microbenchmark (Stefan Monnier)
|          ||  `* Re: Memory dependency microbenchmark (EricP)
|          ||   +* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||   |`* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||   | `* Re: Memory dependency microbenchmark (Branimir Maksimovic)
|          ||   |  `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||   `* Re: Memory dependency microbenchmark (Paul A. Clayton)
|          ||    +* Re: Memory dependency microbenchmark (Scott Lurndal)
|          ||    |+* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||    ||`* Re: Memory dependency microbenchmark (EricP)
|          ||    || `- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||    |`* Re: Memory dependency microbenchmark (Paul A. Clayton)
|          ||    | `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||    `* Re: Memory dependency microbenchmark (EricP)
|          ||     +* Re: Memory dependency microbenchmark (aph)
|          ||     |`* Re: Memory dependency microbenchmark (EricP)
|          ||     | +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     | |`- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     | `* Re: Memory dependency microbenchmark (aph)
|          ||     |  +* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||     |  |+- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  |`* Re: Memory dependency microbenchmark (EricP)
|          ||     |  | +- Re: Memory dependency microbenchmark (MitchAlsup)
|          ||     |  | +- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  |  `* Re: Memory dependency microbenchmark (MitchAlsup)
|          ||     |  |   `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |  `* Re: Memory dependency microbenchmark (EricP)
|          ||     |   `* Re: Memory dependency microbenchmark (aph)
|          ||     |    +* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    |`* Re: Memory dependency microbenchmark (aph)
|          ||     |    | +* Re: Memory dependency microbenchmark (Terje Mathisen)
|          ||     |    | |`- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    | `* Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    |  `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          ||     |    `- Re: Memory dependency microbenchmark (EricP)
|          ||     `* Re: Memory dependency microbenchmark (Paul A. Clayton)
|          ||      `- Re: Memory dependency microbenchmark (Chris M. Thomasson)
|          |`* weak consistency and the supercomputer attitude (was: Memory dependency microben (Anton Ertl)
|          | +- Re: weak consistency and the supercomputer attitude (Stefan Monnier)
|          | +- Re: weak consistency and the supercomputer attitude (MitchAlsup)
|          | `* Re: weak consistency and the supercomputer attitude (Paul A. Clayton)
|          `* Re: Memory dependency microbenchmark (MitchAlsup)
+* Re: Memory dependency microbenchmark (Chris M. Thomasson)
+- Re: Memory dependency microbenchmark (MitchAlsup)
+* Re: Memory dependency microbenchmark (Anton Ertl)
`* Alder Lake results for the memory dependency microbenchmark (Anton Ertl)

Re: Memory dependency microbenchmark

Message-ID: <uk0f5f$3drbt$2@dont-email.me>
 by: Chris M. Thomasson - Sun, 26 Nov 2023 22:00 UTC

On 11/4/2023 10:40 AM, Anton Ertl wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 11/3/2023 2:15 AM, Anton Ertl wrote:
>>> I have written a microbenchmark for measuring how memory dependencies
>>> affect the performance of various microarchitectures. You can find it
>>> along with a description and results on
>>> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>> [...]
>>
>> Is the only arch out there that does not require an explicit memory
>> barrier for data-dependent loads a DEC Alpha? I think so.
>
> I don't know any architecture that requires memory barriers for
> single-threaded programs that access just memory, not even Alpha.

I am referring to a program that uses multiple threads.

>
> You may be thinking of the memory consistency model of Alpha, which is
> even weaker than everything else I know of. This is not surprising,
> given that a prominent advocacy paper for weak consistency
> [adve&gharachorloo95] came out of DEC.
[...]

Yup. A DEC Alpha requires a memory barrier even for RCU iteration on the
read side of the algorithm. Afaict, std::memory_order_consume fits the
bill. It's a nop on every arch except DEC Alpha, unless I am missing
an arch that is as weak as or weaker than the Alpha.

https://en.cppreference.com/w/cpp/atomic/memory_order
_______________
A load operation with this memory order performs a consume operation on
the affected memory location: no reads or writes in the current thread
dependent on the value currently loaded can be reordered before this
load. Writes to data-dependent variables in other threads that release
the same atomic variable are visible in the current thread. On most
platforms, this affects compiler optimizations only (see Release-Consume
ordering below).
_______________
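
For concreteness, here is a minimal sketch of that Release-Consume pattern
on an RCU-style read side (the Node/g_head names are illustrative, not from
the post; in practice current compilers promote consume to acquire):

#include <atomic>

struct Node { int payload; };

std::atomic<Node*> g_head{nullptr};

// Writer: publish a fully initialized node.
void publish(Node* n)
{
    g_head.store(n, std::memory_order_release);
}

// Reader: the dereference is data-dependent on the loaded pointer, so
// every arch except Alpha orders it for free; consume (or the acquire
// it decays to) makes that ordering visible to the compiler as well.
int reader()
{
    Node* p = g_head.load(std::memory_order_consume);
    return p ? p->payload : -1;
}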

Re: Memory dependency microbenchmark

Message-ID: <017bb89a53dee0b43cbfb0d0ced703f1@news.novabbs.com>
 by: MitchAlsup - Sun, 26 Nov 2023 23:32 UTC

Chris M. Thomasson wrote:

> On 11/4/2023 10:40 AM, Anton Ertl wrote:
>> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>> On 11/3/2023 2:15 AM, Anton Ertl wrote:
>>>> I have written a microbenchmark for measuring how memory dependencies
>>>> affect the performance of various microarchitectures. You can find it
>>>> along with a description and results on
>>>> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>>> [...]
>>>
>>> Is the only arch out there that does not require an explicit memory
>>> barrier for data-dependent loads a DEC Alpha? I think so.
>>
>> I don't know any architecture that requires memory barriers for
>> single-threaded programs that access just memory, not even Alpha.

The CDC 6600 (via its stunt box) could calculate addresses in program order
and yet access memory in the reverse order, due to a conflict between the
timing of the stunt-box loop and the timing of the memory-bank-busy checks.
Here is a single processor that sees a memory order different from its
processor order...

> I am referring to a program that uses multiple threads.

>>
>> You may be thinking of the memory consistency model of Alpha, which is
>> even weaker than everything else I know of. This is not surprising,
>> given that a prominent advocacy paper for weak consistency
>> [adve&gharachorloo95] came out of DEC.
> [...]

> Yup. A DEC alpha requires a memory barrier even for RCU iteration on the
> read side of the algorithm. Afaict, std::memory_order_consume fits the
> bill. It's a nop on every arch, except a DEC alpha. Unless I am missing
> an arch that is as weak or weaker than a DEC alpha.

> https://en.cppreference.com/w/cpp/atomic/memory_order
> _______________
> A load operation with this memory order performs a consume operation on
> the affected memory location: no reads or writes in the current thread
> dependent on the value currently loaded can be reordered before this
> load. Writes to data-dependent variables in other threads that release
> the same atomic variable are visible in the current thread. On most
> platforms, this affects compiler optimizations only (see Release-Consume
> ordering below).
> _______________

Re: Memory dependency microbenchmark

Message-ID: <uk29cp$3q0st$1@dont-email.me>
 by: Paul A. Clayton - Mon, 27 Nov 2023 14:33 UTC

On 11/26/23 11:12 AM, Scott Lurndal wrote:
> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
[snip paths for extending LL/SC guarantees]
>> Generating complexity is easy; producing elegance, not so much.
>
> Seems like far more work and complexity than just adding a set
> of atomic instructions to the ISA (like ARM did with LSE in V8.1).

Composing primitives provides more flexibility at a higher cost
for hardware management and communication to the programmer/
compiler of which special cases are optimized by hardware (and
code density).

For a special LL — or perhaps better a short prefix to modify
ordinary load instructions — hardware's work seems not to be that
much more complex. (Note: I do not design hardware!) This is
effectively a form of instruction fusion. With a special LL, the
decoder would know early that this is a case for fusion/special
translation.

Instruction fusion is more complex than instruction cracking, but
atomic operations are already complex.

This also shifts the communication of performance guarantees from
"instruction is valid" to "processor implements profile that
provides that guarantee". However, even with atomic instructions
one is likely to have different performance aspects for different
implementations. All might guarantee local forward progress as
part of the instruction definition, but the performance of the
operation would vary especially under contention. The performance
might also vary depending on the operation (e.g., adding a small
positive value might be faster/more parallel via sharing the
higher-bit carry-out).

Atomic instructions are denser, but are subject to instruction
type expansion. Software written using a generic interface could
be more nearly best-performance for future implementations that
improve, e.g., an atomic floating-point addition operation.
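
To make the contrast concrete, a sketch in portable C++ of the two styles
(the function names are mine; whether the compiler emits an LL/SC loop, an
LSE LDADD, or an x86 LOCK ADD is an implementation choice):

#include <atomic>

std::atomic<long> counter{0};

// Composed from primitives: a compare-exchange retry loop. On ARMv8.0
// this typically becomes an LDXR/STXR (LL/SC) loop, and forward progress
// under contention is an implementation property rather than a guarantee.
void add_composed(long v)
{
    long old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + v,
                                          std::memory_order_relaxed))
        ;   // 'old' is refreshed with the current value on each failure
}

// Single read-modify-write instruction where the ISA provides one
// (e.g. ARMv8.1 LSE), which the core may even ship to the LLC or a
// device as a "far atomic".
void add_single(long v)
{
    counter.fetch_add(v, std::memory_order_relaxed);
}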

Fitting atomic instructions into a fixed-size 32-bit instruction
also means that addressing modes are highly constrained. With such
a constraint, one could conceive of a prefix that makes a
following compute operation into an atomic operation (where a
particular source register is actually an indirect operand and
external memory destination). (Non-commutative operations would
require the ability to specify any operand as being in memory.)

If one need not consider binary compatibility (e.g., a translation
layer between the distribution format and the machine format),
the encoding choices would be quite different.

> The CPU can perform the atomic immediately if it already has
> the cache line, or it can send it to LLC if it doesn't. It
> can also send the atomic to a PCI device or coprocessor. LL/SC is far
> less efficient and adding complexity to the implementation
> to fix a broken interface doesn't make much sense to me.

I do not consider LL/SC a "broken interface", except perhaps in
facilitating inappropriate implementations. More general
interfaces tend to move more of the implementation behind the
abstraction, so performance and other "abstraction leakage" must
be communicated/contracted outside of the interface itself.
Without such communication/guarantees, implementations can
choose the minimum-effort design that barely meets the interface
definition, or even get away with an implementation
that fails to meet the specification (if that failure is
sufficiently obscured by rarity and other possible failure causes).☹

More general interfaces will be less efficient when the important
or common use cases are few (over all users of the interface) and
known beforehand.

Accumulating special cases, especially without foresight, can
easily lead to inelegance. (I like the density and semantic
clarity of atomic instructions — and they are not inconsistent
with also having a more general atomic interface.)

The "provide primitives not solutions" guideline for interfaces
(specifically ISAs) is a *guideline* reacting against a bias
toward localized solutions. An opposite bias also exists, to
incorporate every possible use case into a general interface,
typically leading to excess complexity.

LL/SC itself is a "solution" to optimize atomicity, which would be
physically possible otherwise but at greater overhead.

My 66000's exotic synchronization mechanism has substantial
generality and requires both an initiating instruction (a special
load) and a terminating instruction (which could be just a control
flow instruction leaving the basic block) as well as any
computation and predication. While it is described as handling
failures with replay, "as if" can be applied, especially for
simple atomic operations. Three-instruction fusion might be
challenging (and I suspect providing hints/directives that the
operation is of that kind might be helpful). I do not know how
Mitch Alsup implemented ESM for his scalar core or how he
envisions implementing it in a "maximum-effort" core.

Re: Memory dependency microbenchmark

Message-ID: <e27cd793a43e659d12ea36e300941d51@news.novabbs.com>
 by: MitchAlsup - Mon, 27 Nov 2023 18:51 UTC

Paul A. Clayton wrote:

> On 11/26/23 11:12 AM, Scott Lurndal wrote:
>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
> [snip paths for extending LL/SC guarantees]
>>> Generating complexity is easy; producing elegance, not so much.
>>
>> Seems like far more work and complexity than just adding a set
>> of atomic instructions to the ISA (like ARM did with LSE in V8.1).

> Composing primitives provides more flexibility at a higher cost
> for hardware management and communication to the programmer/
> compiler of which special cases are optimized by hardware (and
> code density).

> For a special LL — or perhaps better a short prefix to modify
> ordinary load instructions — hardware's work seems not to be that
> much more complex. (Note: I do not design hardware!) This is
> effectively a form of instruction fusion. With a special LL, the
> decoder would know early that this is a case for fusion/special
> translation.

> Instruction fusion is more complex than instruction cracking, but
> atomic operations are already complex.

> This also shifts the communication of performance guarantees from
> "instruction is valid" to "processor implements profile that
> provides that guarantee". However, even with atomic instructions
> one is likely to have different performance aspects for different
> implementations. All might guarantee local forward progress as
> part of the instruction definition, but the performance of the
> operation would vary especially under contention. The performance
> might also vary depending on the operation (e.g., adding a small
> positive value might be faster/more parallel via sharing the
> higher bit carry-out)

> Atomic instructions are denser, but are subject to instruction
> type expansion. Software written using a generic interface could
> be more nearly best-performance for future implementations that
> improve, e.g., an atomic floating-point addition operation.

> Fitting atomic instructions into a fixed-size 32-bit instruction
> also means that addressing modes are highly constrained. With such
> a constraint, one could conceive of a prefix that makes a
> following compute operation into an atomic operation (where a
> particular source register is actually an indirect operand and
> external memory destination). (Non-commutative operations would
> require the ability to specify any operand as being in memory.)

> If one need not consider binary compatibility (e.g., a translation
> layer between the distribution format and the machine format),
> the encoding choices would be quite different.

>> The CPU can perform the atomic immediately if it already has
>> the cache line, or it can send it to LLC if it doesn't. It

There comes a time when you want and need ATOMICs with more than
one addressed location. Things like DCADS.....

>> can also send the atomic to a PCI device or coprocessor. LL/SC is far
>> less efficient and adding complexity to the implementation
>> to fix a broken interface doesn't make much sense to me.

> I do not consider LL/SC a "broken interface", except perhaps in
> facilitating inappropriate implementations. More general
> interfaces tend to move more of the implementation behind the
> abstraction, so performance and other "abstraction leakage" must
> be communicated/contracted outside of the interface itself.
> Without such communication/guarantees, implementations can
> choose the minimum-effort design that barely meets the interface
> definition or even be able to get away with an implementation
> that fails to meet the specification (if that failure is
> sufficiently obscure by rarity and other possible failure causes).☹

> More general interfaces will be less efficient when the important
> or common use cases are few (over all users of the interface) and
> known beforehand.

> Accumulating special cases, especially without foresight, can
> easily lead to inelegance. (I like the density and semantic
> clarity of atomic instructions — and they are not inconsistent
> with also having a more general atomic interface.)

> The "provide primitives not solutions" guideline for interfaces
> (specifically ISAs) is a *guideline* reacting against a bias
> toward localized solutions. An opposite bias also exists, to
> incorporate every possible use case into a general interface,
> typically leading to excess complexity.

> LL/SC itself is a "solution" to optimize atomicity, which would be
> physically possible otherwise but at greater overhead.

> My 66000's exotic synchronization mechanism has substantial
> generality and requires both a initiating instruction (a special
> load) and a terminating instruction (which could be just a control
> flow instruction leaving the basic block) as well as any
> computation and predication. While it is described as handling
> failures with replay, "as if" can be applied, especially for
> simple atomic operations.

a) the terminator must be an outbound memory reference (not a branch),
because I specifically want to perform flow control within the event.

b) When the initiator is decoded, the recovery point is the initiator.
When the interrogator is decoded, the recovery point changes to
the branch target.
> Three instruction fusion might be
> challenging (and I suspect providing hints/directives that the
> operation is of that kind might be helpful). I do not know how
> Mitch Alsup implemented ESM for his scalar core or how he
> envisions implementing it in a "maximum-effort" core.

I use the miss buffer to watch for SNOOPs to participating cache
lines, and I have a branch instruction which interrogates the
miss buffer to see if any interference has transpired. Using
something that is already present and already doing pretty
much what is desired saves complexity.

In the maximal effort design, I use the Memory Dependence Matrix
to control memory order, each memory reference has a 2-bit state
which determines its allowed memory order {no order, causal,
sequential consistency, strongly ordered} and this controls
the sequencing of the memory events.

Any interrupt, exception, or SysCall will cause the ATOMIC event
to fail to the recovery point. So, you cannot single-step through
events.

Re: Memory dependency microbenchmark

Message-ID: <0c6dnVgJXtBRIfj4nZ2dnZfqn_qdnZ2d@supernews.com>
 by: aph...@littlepinkcloud.invalid - Tue, 28 Nov 2023 10:11 UTC

Paul A. Clayton <paaronclayton@gmail.com> wrote:
> On 11/26/23 11:12?AM, Scott Lurndal wrote:
>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
> [snip paths for extending LL/SC guarantees]
>>> Generating complexity is easy; producing elegance, not so much.
>>
>> Seems like far more work and complexity than just adding a set
>> of atomic instructions to the ISA (like ARM did with LSE in V8.1).
>
> Composing primitives provides more flexibility at a higher cost
> for hardware management and communication to the programmer/
> compiler of which special cases are optimized by hardware (and
> code density).

Mmm, but the point of LSE, as the name suggests, is to facilitate
things like fire-and-forget counters on remote nodes without ping-
ponging cache lines.

Andrew.

Re: Memory dependency microbenchmark

Message-ID: <uk4f17$816b$1@dont-email.me>
 by: Chris M. Thomasson - Tue, 28 Nov 2023 10:22 UTC

On 11/28/2023 2:11 AM, aph@littlepinkcloud.invalid wrote:
> Paul A. Clayton <paaronclayton@gmail.com> wrote:
>> On 11/26/23 11:12?AM, Scott Lurndal wrote:
>>> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>> [snip paths for extending LL/SC guarantees]
>>>> Generating complexity is easy; producing elegance, not so much.
>>>
>>> Seems like far more work and complexity than just adding a set
>>> of atomic instructions to the ISA (like ARM did with LSE in V8.1).
>>
>> Composing primitives provides more flexibility at a higher cost
>> for hardware management and communication to the programmer/
>> compiler of which special cases are optimized by hardware (and
>> code density).
>
> Mmm, but the point of LSE, as the name suggests, is to facilitate
> things like fire-and-forget counters on remote nodes without ping-
> ponging cache lines.

Keep split counters in mind as well...
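
For reference, a minimal sketch of one split-counter scheme (the class name,
slot count, and thread-id hashing below are my own illustrative choices, not
anything from the thread):

#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// Each slot sits on its own cache line, so writers mostly increment a
// line that stays in their own cache instead of ping-ponging one line.
struct SplitCounter {
    static constexpr std::size_t kSlots = 16;
    struct alignas(64) Slot { std::atomic<long> v{0}; };
    Slot slots[kSlots];

    void inc() {
        std::size_t i = std::hash<std::thread::id>{}(std::this_thread::get_id())
                        % kSlots;
        slots[i].v.fetch_add(1, std::memory_order_relaxed);
    }

    long read() const {   // O(kSlots), and only a snapshot while writers run
        long sum = 0;
        for (const Slot& s : slots) sum += s.v.load(std::memory_order_relaxed);
        return sum;
    }
};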

Re: Memory dependency microbenchmark

Message-ID: <uk4gcd$86kj$1@dont-email.me>
 by: Chris M. Thomasson - Tue, 28 Nov 2023 10:45 UTC

On 11/23/2023 11:41 PM, Chris M. Thomasson wrote:
> On 11/23/2023 11:38 PM, Chris M. Thomasson wrote:
>> On 11/23/2023 8:33 AM, MitchAlsup wrote:
>>> Chris M. Thomasson wrote:
>>>
>>>> On 11/19/2023 2:23 PM, MitchAlsup wrote:
>>>>> Chris M. Thomasson wrote:
>>>>>
>>>>> So, while it does not eliminate live/dead-lock situations, it
>>>>> allows SW
>>>>> to be constructed to avoid live/dead lock situations:: Why is a value
>>>>> which is provided when an ATOMIC event fails. 0 means success,
>>>>> negative
>>>>> values are spurious (buffer overflows,...) while positives represent
>>>>> the number of competing threads, so the following case, skips elements
>>>>> on a linked list to decrease future initerference.
>>>>>
>>>>> Element* getElement( unSigned  Key )
>>>>> {
>>>>>      int count = 0;
>>>>>      for( p = structure.head; p ; p = p->next )
>>>>>      {
>>>>>           if( p->Key == Key )
>>>>>           {
>>>>>                if( count-- < 0 )
>>>>>                {
>>>>>                     esmLOCK( p );
>>>>>                     prev = p->prev;
>>>>>                     esmLOCK( prev );
>>>>>                     next = p->next;
>>>>>                     esmLOCK( next );
>>>>>                     if( !esmINTERFERENCE() )
>>>>>                     {
>>>>>                          p->prev = next;
>>>>>                          p->next = prev;
>>>>>                          p->prev = NULL;
>>>>>                 esmLOCK( p->next = NULL );
>>>>>                          return p;
>>>>>                     }
>>>>>                     else
>>>>>                     {
>>>>>                          count = esmWHY();
>>>>>                          p = structure.head;
>>>>>                     }
>>>>>                }
>>>>>           }
>>>>>      }
>>>>>      return NULL;
>>>>> }
>>>>>
>>>>> Doing ATOMIC things like this means one can take the BigO( n^3 )
>>>>> activity
>>>>> that happens when a timer goes off and n threads all want access to
>>>>> the
>>>>> work queue, down to BigO( 3 ) yes=constant, but in practice it is
>>>>> reduced
>>>>> to BigO( ln( n ) ) when requesters arrive in random order at random
>>>>> time.
>>>>>
>>>>>> I remember hearing from my friend Joe Seigh, who worked at IBM,
>>>>>> that they had some sort of logic that would prevent live lock in a
>>>>>> compare and swap wrt their free pool manipulation logic. Iirc, it
>>>>>> was somewhat related to ABA, hard to remember right now, sorry. I
>>>>>> need to find that old thread back in comp.programming.threads.
>>>>>
>>>>> Depending on system size: there can be several system function
>>>>> units that grant "order" for ATOMIC events. These are useful for
>>>>> 64+node systems
>>>>> and unnecessary for less than 8-node systems. Disjoint memory spaces
>>>>> can use independent ATOMIC arbiters and whether they are in use or
>>>>> not is
>>>>> invisible to SW.
>>>>>
>>>>>>>
>>>>>>>> ?
>>>>>>>
>>>>>>>> Check this out the old thread:
>>>>>>>
>>>>>>>> https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
>>>
>>>> Humm, your arch seems pretty neat/interesting to me. I need to learn
>>>> more about it. Can it be abused with a rogue thread that keeps
>>>> altering a cacheline(s) that are participating in the atomic block,
>>>> so to speak? Anyway, I am busy with family time. Will get back to you.
>>>
>>> While possible, it is a lot less likely than on a similar architecture
>>> without any of the bells and whistles.
>>
>> Iirc, Joe Seigh mentioned how CS on IBM systems would prevent live
>> lock by locking a bus or asserting a signal, which would ensure that a
>> compare-and-swap would never get into a death spiral of always failing.
>> Iirc, Microsoft has something like this in its lock-free stack SList
>> or something. Cannot remember exactly right now. Sorry.
>
>
> Wrt Microsoft's SList, I think it goes into the kernel to handle memory
> reclamation issues, aba, and such...

SList is basically identical to IBM's free pool manipulation, iirc.

>
>>
>> Joe Seigh mentioned internal docs, candy striped.
>>
>>>
>>>> Fwiw, here is some of my work:
>>>
>>>> https://youtu.be/HwIkk9zENcg
>>>
>>> Octopi in a box playing ball.
>>
>

Re: Memory dependency microbenchmark

Message-ID: <wTidnZoPsPF1gfv4nZ2dnZfqn_adnZ2d@supernews.com>
 by: aph...@littlepinkcloud.invalid - Tue, 28 Nov 2023 17:01 UTC

Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
> On 11/28/2023 2:11 AM, aph@littlepinkcloud.invalid wrote:
>> Paul A. Clayton <paaronclayton@gmail.com> wrote:
>>> Composing primitives provides more flexibility at a higher cost
>>> for hardware management and communication to the programmer/
>>> compiler of which special cases are optimized by hardware (and
>>> code density).
>>
>> Mmm, but the point of LSE, as the name suggests, is to facilitate
>> things like fire-and-forget counters on remote nodes without ping-
>> ponging cache lines.
>
> Keep split counters in mind as well...

Sure, I know (we all know?) that there are algorithms to do
distributed counting, but if hardware can do something easily and
cheaply, why not? Sure, it won't scale as well up to very large node
counts.

Andrew.

Re: Memory dependency microbenchmark

Message-ID: <uk604s$fosk$1@dont-email.me>
 by: Chris M. Thomasson - Wed, 29 Nov 2023 00:20 UTC

On 11/24/2023 10:32 AM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/23/2023 10:52 AM, EricP wrote:
>>> MitchAlsup wrote:
>>>>
>>> There are two kinds of barriers/fences (I don't know if there are
>>> official
>>> terms for them), which are local bypass barriers, and global completion
>>> barriers.
>>>
>>> Bypass barriers restrict which younger ops in the local load-store queue
>>> may bypass and start execution before older ops have made a value
>>> locally
>>> visible.
>>>
>>> Completion barriers block younger ops from starting execution before
>>> older ops have completed and read or written globally visible values.
>>>
>>> You appear to be referring to bypass barriers whereas I'm referring to
>>> completion barriers which require globally visible results.
>>>
>
>> Basically, how to map the various membar ops into an arch that can be
>> RMO. Assume the programmers have no problem with it... ;^o SPARC did
>> it, but, is it worth it now? Is my knowledge of dealing with relaxed
>> systems, threads/processes and membars obsoleted? shit man... ;^o
>
> If the page being mapped is properly identified in the PTE, then there is
> no reason to need any MemBars.
> Also Note:: MemBars are the WRONG abstraction--a MemBar is like a wall
> whereas what programmers want is a bridge. As long as you are on the
> bridge (inside an ATOMIC event) you want one memory model and when you
> leave the bridge you are free to use a more performant model. MemBars
> only demark the edges of the bridge, they don't cover the whole bridge.

Well, there are different types of membars...
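
For what it's worth, the classic membar flavors can be expressed portably
with C++ fences; a sketch with my own function names, and a compiler
targeting a stronger machine is free to emit less than what is shown:

#include <atomic>

void loads_before_later_accesses()    // ~ membar #LoadLoad | #LoadStore
{
    std::atomic_thread_fence(std::memory_order_acquire);
}

void accesses_before_later_stores()   // ~ membar #LoadStore | #StoreStore
{
    std::atomic_thread_fence(std::memory_order_release);
}

void stores_before_later_loads()      // ~ membar #StoreLoad, the costly one
{
    std::atomic_thread_fence(std::memory_order_seq_cst);
}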

Re: Memory dependency microbenchmark

Message-ID: <uk68pd$gsp3$1@dont-email.me>
 by: Paul A. Clayton - Wed, 29 Nov 2023 02:48 UTC

On 11/27/23 1:51 PM, MitchAlsup wrote:
> Paul A. Clayton wrote:
[snip]
>> My 66000's exotic synchronization mechanism has substantial
>> generality and requires both a initiating instruction (a special
>> load) and a terminating instruction (which could be just a control
>> flow instruction leaving the basic block) as well as any
>> computation and predication. While it is described as handling
>> failures with replay, "as if" can be applied, especially for
>> simple atomic operations.
>
> a) the terminator must be an outbound memory reference (not a branch
> because I specifically want to perform flow control within the event.

I am sorry for the misinformation. I should have just left off
that paragraph since I hurriedly looked at the My 66000
documentation (specifically the assembly for a simple lock
setting function). I thought that the store was not annotated
(even though the comment said "unlocked") and assumed that a
control flow statement (which I vaguely remembered was not
allowed inside such atomics, being restricted to predication —
assuming my memory for that is correct!) indicated an end.

Thank you for the quick correction.

> b) When the initiator is decoded, the recovery point is the
> initiator.
>   When the interrogator is decoded, the recovery point changes to
>   the branch target.
>>                           Three instruction fusion might be
>> challenging (and I suspect providing hints/directives that the
>> operation is of that kind might be helpful). I do not know how
>> Mitch Alsup implemented ESM for his scalar core or how he
>> envisions implementing it in a "maximum-effort" core.
>
> I use the miss buffer to watch for SNOOPs to participating cache
> lines, and I have a branch instruction which interrogates the miss
> buffer to see if any interference has transpired. Using
> something that is already present and already doing pretty
> much what is desired saves complexity.
>
> In the maximal effort design, I use the Memory Dependence Matrix
> to control memory order, each memory reference has a 2-bit state
> which determines its allowed memory order {no order, causal,
> sequential consistency, strongly ordered} and this controls the
> sequencing of the memory events.
>
> Any interrupt, exception, SysCall, will cause the ATOMIC event
> to fail to the recovery point. So, you cannot single step through
> events.

Re: Memory dependency microbenchmark

Message-ID: <ukgush$2nho8$3@dont-email.me>
 by: Chris M. Thomasson - Sun, 3 Dec 2023 04:06 UTC

On 11/17/2023 12:49 PM, MitchAlsup wrote:
> Chris M. Thomasson wrote:
>
>> On 11/17/2023 10:37 AM, MitchAlsup wrote:
>>> Stefan Monnier wrote:
>>>
>>>>> As long as you fully analyze your program, ensure all multithreaded
>>>>> accesses
>>>>> are only through atomic variables, and you label every access to an
>>>>> atomic variable properly (although my point is: exactly what should
>>>>> that
>>>>> be??), then there is no problem.
>>>
>>>> BTW, the above sounds daunting when writing in C because you have to do
>>>> that analysis yourself, but there are programming languages out there
>>>> which will do that analysis for you as part of type checking.
>>>> I'm thinking here of languages like Rust or the STM library of
>>>> Haskell.  This also solves the problem that memory accesses can be
>>>> reordered by the compiler, since in that case the compiler is fully
>>>> aware of which accesses can be reordered and which can't.
>>>
>>>
>>>>         Stefan
>>> <
>>> I created the Exotic Synchronization Method such that you could just
>>> write the code needed to do the work, and then decorate those accesses
>>> which are participating in the ATOMIC event. So, lets say you want to
>>> move an element from one doubly linked list to another place in some
>>> other doubly linked list:: you would write::
>>> <
>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>> {
>>>      fn = fr->next;
>>>      fp = fr->prev;
>>>      tn = to->next;
>>>
>>>
>>>
>>>      if( TRUE )
>>>      {
>>>               fp->next = fn;
>>>               fn->prev = fp;
>>>               to->next = fr;
>>>               tn->prev = fr;
>>>               fr->prev = to;
>>>               fr->next = tn;
>>>               return TRUE;
>>>      }
>>>      return FALSE;
>>> }
>>>
>>> In order to change this into a fully qualified ATOMIC event, the code
>>> is decorated as::
>>>
>>> BOOLEAN MoveElement( Element *fr, Element *to )
>>> {
>>>      esmLOCK( fn = fr->next );         // get data
>>>      esmLOCK( fp = fr->prev );
>>>      esmLOCK( tn = to->next );
>>>      esmLOCK( fn );                    // touch data
>>>      esmLOCK( fp );
>>>      esmLOCK( tn );
>>>      if( !esmINTERFERENCE() )
>>>      {
>>>               fp->next = fn;           // move the bits around
>>>               fn->prev = fp;
>>>               to->next = fr;
>>>               tn->prev = fr;
>>>               fr->prev = to;
>
>
>>>      esmLOCK( fr->next = tn );
>> ^^^^^^^^^^^^^^^^^^^^^^^^
>
>> Why perform an esmLOCK here? Please correct my confusion.
>
> esmLOCK on the last participating Store is what tells the HW that the
> ATOMIC
> event is finished. So, the first esmLOCK begins the ATOMIC event and the
> only esmLOCK on an outgoing (ST or PostPush) memory reference ends the
> event.
>
>>>               return TRUE;
>>>      }
>>>      return FALSE;
>>> }
>>>
>>> Having a multiplicity of containers participate in an ATOMIC event
>>> is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
>>> get the job(s) done.

For some damn reason, the more I look at your work wrt the
atomics in here, the more I think about a special deadlock-free locking
mechanism of mine called the multex. It would sort addresses and remove
duplicates, then lock them through a hashed mutex table. Removing the
duplicates removed the need for recursive locking, and the sorting made
the locking deadlock-free... Therefore the hash lock table was 100%
separated from the user logic. Is there any relation to this, or am I
off base here? Thanks.

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

(read all...)

Wait a minute, it's hard to remember, but I think you have already seen
the multex, right? For some reason I remember showing it to you a while
back.
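
For concreteness, a rough sketch of that multex idea (the class name, table
size, and hashing below are my own choices, not the original code): hash
each participating address to a slot in a fixed mutex table, sort and
deduplicate the slot indices, then lock in ascending order. The sort imposes
a global lock order, so acquisition is deadlock-free; deduplication avoids
self-deadlock on collisions, so no recursive mutexes are needed.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

class Multex {
    static constexpr std::size_t kSlots = 256;
    std::mutex table_[kSlots];

    static std::size_t slot(const void* p) {
        return std::hash<const void*>{}(p) % kSlots;
    }

public:
    // Lock the slots covering every address the caller intends to touch.
    std::vector<std::size_t> lock(const std::vector<const void*>& addrs) {
        std::vector<std::size_t> idx;
        for (const void* p : addrs) idx.push_back(slot(p));
        std::sort(idx.begin(), idx.end());
        idx.erase(std::unique(idx.begin(), idx.end()), idx.end());
        for (std::size_t i : idx) table_[i].lock();   // ascending order
        return idx;                                   // hand back for unlock
    }

    void unlock(const std::vector<std::size_t>& idx) {
        for (auto it = idx.rbegin(); it != idx.rend(); ++it)
            table_[*it].unlock();
    }
};

The caller collects the addresses of everything it wants to mutate, calls
lock(), does the work, then passes the returned index list back to unlock().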

Re: Memory dependency microbenchmark

Message-ID: <ukh3nq$2o29e$2@dont-email.me>
 by: Chris M. Thomasson - Sun, 3 Dec 2023 05:29 UTC

On 11/25/2023 2:44 AM, aph@littlepinkcloud.invalid wrote:
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>
>> Side note, XCHG all by itself automatically implies a LOCK prefix... :^)
>
>> Yeah, I remember reading that once, but I still see LOCK prefixes
> being used. I guess a LOCK doesn't hurt, and people want to be on the
> safe side, or something. Ah, programmers... :-)
>

To be prudent, LOCK XCHG

:^) Gets the point across.

sarcastic side note:

LOCK LOCK XCHG goes into LOCK XCHG, right? Implied vs syntax error?

;^) Just kidding. I need to look for some of my old 686 asm code I did
for an older project. It might be on the Wayback Machine.
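
For what it's worth, a minimal sketch of the point (x86-64, GCC/Clang
inline asm assumed): XCHG with a memory operand is locked with or without
the prefix, so writing "lock xchg" only spends an extra prefix byte.

#include <cstdint>

// XCHG with a memory operand is implicitly locked; no "lock" prefix needed.
static inline std::uint64_t atomic_swap(volatile std::uint64_t* p,
                                        std::uint64_t v)
{
    __asm__ __volatile__("xchgq %0, %1"
                         : "+r"(v), "+m"(*p)
                         :
                         : "memory");
    return v;   // previous contents of *p
}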

Re: Memory dependency microbenchmark

<ukh3r2$2o29e$3@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35289&group=comp.arch#35289
 by: Chris M. Thomasson - Sun, 3 Dec 2023 05:31 UTC

On 12/2/2023 9:29 PM, Chris M. Thomasson wrote:
> On 11/25/2023 2:44 AM, aph@littlepinkcloud.invalid wrote:
>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>
>>> Side note, XCHG all by itself automatically implies a LOCK prefix... :^)
>>
>> Yeah, I remember reading that once, but I still see LOCK prefixes
>> being used. I guess a LOCK doesn't hurt, and people want to be on the
>> safe side, or something. Ah, programmers...  :-)
>>
>
> To be prudent, LOCK XCHG
>
> :^) Gets the point across.
>
> sarcastic side note:
>
> LOCK LOCK XCHG goes into LOCK XCHG, right? Implied vs syntax error?
>
> ;^) Just kidding. I need to look for some of my old 686 asm code I did
> for an older project. It might be on the way back machine.
>

Found some from my old AppCore project:

http://web.archive.org/web/20060214112345/http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html

Re: Memory dependency microbenchmark

<40ydnfxegJn2bPD4nZ2dnZfqnPudnZ2d@supernews.com>

https://www.novabbs.com/devel/article-flat.php?id=35323&group=comp.arch#35323
 by: aph...@littlepinkcloud.invalid - Mon, 4 Dec 2023 15:34 UTC

Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>
> For some damn reason, the more and more I look at your work wrt the
> atomics in here, the more I think about a special deadlock free locking
> mechanism of mine called the multex. It would sort addresses and remove
> duplicates, then lock them using a mutex locking table via hash. Removal
> of duplicates removed the need of recursive locking, the sorting allowed
> for deadlock free locking... Therefore the hash lock table was 100%
> separated from the user logic. Are there any relations to this, or am
> off base here? Thanks.
>
> https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

This isn't a million miles from TL2,
http://people.csail.mit.edu/shanir/publications/Transactional_Locking.pdf.
It may not be immediately obvious, but it's essentially the same
solution.

Andrew.

Re: Memory dependency microbenchmark

<ukoort$jgsk$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35391&group=comp.arch#35391
 by: Chris M. Thomasson - Wed, 6 Dec 2023 03:13 UTC

On 12/4/2023 7:34 AM, aph@littlepinkcloud.invalid wrote:
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>
>> For some damn reason, the more and more I look at your work wrt the
>> atomics in here, the more I think about a special deadlock free locking
>> mechanism of mine called the multex. It would sort addresses and remove
>> duplicates, then lock them using a mutex locking table via hash. Removal
>> of duplicates removed the need of recursive locking, the sorting allowed
>> for deadlock free locking... Therefore the hash lock table was 100%
>> separated from the user logic. Are there any relations to this, or am
>> off base here? Thanks.
>>
>> https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
>
> This isn't a million miles from TL2,
> http://people.csail.mit.edu/shanir/publications/Transactional_Locking.pdf.
> It may not be immediately obvious, but it's essentially the same
> solution.

Humm... Interesting. I remember reading that paper a while back, 2007-ish
iirc. I also remember reading other STM variations that mentioned
explicitly handling SIGSEGVs to deal with memory reclamation issues; not
sure if they dealt with zombies as well... Iirc it was kind of similar,
akin to how the Microsoft SList handles it: Microsoft uses its structured
exception handling (SEH) to deal with memory reclamation in its SList,
which is basically a lock-free stack, ala the IBM free-list/pool
manipulation described in their Principles of Operation (in one of the
appendices, iirc...).

I know that keeping the locking table completely separated from the user
logic is key to dealing with certain interesting locking issues around
proper resource reclamation...
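
For reference, a rough sketch of the general shape (a version-counted
lock-free stack; this is not Microsoft's actual SList layout, and instead
of catching faults the way the SEH trick does, it assumes nodes come from
a type-stable pool so a concurrently popped node stays mapped and reusable):

#include <atomic>
#include <cstdint>

struct Node { Node* next; };

// Head pairs the top pointer with a version counter bumped on every pop,
// which defeats ABA.  Whether std::atomic<Head> is lock-free depends on
// the platform (x86-64 needs cmpxchg16b, e.g. -mcx16).
struct Head {
    Node*         top;
    std::uint64_t ver;
};

class VersionedStack {
    std::atomic<Head> head_{Head{nullptr, 0}};
public:
    void push(Node* n) {
        Head h = head_.load(std::memory_order_relaxed);
        for (;;) {
            n->next = h.top;
            Head nh{n, h.ver};
            if (head_.compare_exchange_weak(h, nh,
                    std::memory_order_release, std::memory_order_relaxed))
                return;
        }
    }
    Node* pop() {
        Head h = head_.load(std::memory_order_acquire);
        for (;;) {
            if (!h.top) return nullptr;
            // h.top may already have been popped and recycled by another
            // thread; the type-stable pool keeps this read merely stale,
            // never a fault, and the version check rejects the stale CAS.
            Head nh{h.top->next, h.ver + 1};
            if (head_.compare_exchange_weak(h, nh,
                    std::memory_order_acquire, std::memory_order_acquire))
                return h.top;
        }
    }
};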

Re: Memory dependency microbenchmark

<ul3n9r$2jhc0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35545&group=comp.arch#35545
 by: Chris M. Thomasson - Sun, 10 Dec 2023 06:53 UTC

On 12/4/2023 7:34 AM, aph@littlepinkcloud.invalid wrote:
> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>
>> For some damn reason, the more and more I look at your work wrt the
>> atomics in here, the more I think about a special deadlock free locking
>> mechanism of mine called the multex. It would sort addresses and remove
>> duplicates, then lock them using a mutex locking table via hash. Removal
>> of duplicates removed the need of recursive locking, the sorting allowed
>> for deadlock free locking... Therefore the hash lock table was 100%
>> separated from the user logic. Are there any relations to this, or am
>> off base here? Thanks.
>>
>> https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ
>
> This isn't a million miles from TL2,
> http://people.csail.mit.edu/shanir/publications/Transactional_Locking.pdf.
> It may not be immediately obvious, but it's essentially the same
> solution.
>
> Andrew.

Hey Andrew, just a shot in the dark: do you happen to remember a locking
scheme from IBM called PLO (Perform Locked Operation, iirc)? I cannot
remember it exactly right now. It seems to deal with locking issues...

Alder Lake results for the memory dependency microbenchmark

<2023Dec19.233617@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=35819&group=comp.arch#35819
 by: Anton Ertl - Tue, 19 Dec 2023 22:36 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I have written a microbenchmark for measuring how memory dependencies
>affect the performance of various microarchitectures. You can find it
>along with a description and results on
><http://www.complang.tuwien.ac.at/anton/memdep/>.

We now have a new toy, a NUC with a Core i3-1315U (which runs its cores
700MHz slower than the documented max turbo, but that does not hurt my
measurements). I have run memdep on it, which needs some
interesting twists:

taskset -c 2 make CYCLES=cpu_core/cycles:u/ #for Golden Cove results
taskset -c 5 make CYCLES=cpu_atom/cycles:u/ #for Gracemont results

The results are also interesting; I show here each of the Alder Lake
cores and its predecessor, and also Firestorm (Apple M1 performance
core) for good measure:

   X    Y    Z    A    B   B1   B2    C    D    E    F    G    H
0.55 0.55 0.55 0.52 0.84 0.84 0.61 1.68 3.32 3.29 5.00 4.94 6.36  FireStorm
1.21 1.00 1.00 1.00 1.53 1.55 1.19 1.49 3.00 3.00 4.50 4.50 6.11  Tremont
1.00 1.00 1.00 0.75 0.75 0.70 0.70 0.75 0.70 1.00 0.75 0.75 1.00  Gracemont
1.00 1.00 1.00 0.81 0.81 0.81 0.81 0.75 0.81 0.84 0.82 0.86 1.00  Tiger Lake
1.00 1.00 1.00 0.55 0.75 0.65 0.65 0.78 0.68 0.67 0.66 0.65 0.65  Golden Cove

So Gracemont (the small core) includes zero-cycle store-to-load
forwarding, a feature that even Firestorm does not have.
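
Roughly, the kind of dependence being measured (an illustrative sketch
only, not the actual memdep kernel; see the URL quoted at the top of this
post for the real thing): the loop-carried dependence runs through a store
and an immediately following load of the same location, so the
store-to-load forwarding latency sits on the critical path every iteration.

// Illustration only: a loop whose critical path is store -> load -> add.
// With zero-cycle store-to-load forwarding the chain costs about one cycle
// per iteration (the add); otherwise the forwarding latency is added each
// time around.  volatile keeps the compiler from caching *p in a register.
long chain_through_memory(volatile long* p, long n)
{
    long x = *p;
    for (long i = 0; i < n; i++) {
        *p = x + 1;   // store
        x  = *p;      // dependent reload of the same location
    }
    return x;
}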

Golden Cove is even more interesting. Apparently it can not only
perform zero-cycle store-to-load forwarding, it can even perform adds
that take at most 0.65 cycles (Column H). My guess is that it can
perform two back-to-back adds in one cycle, and the store-to-load
optimization is so good that it does not prevent this combination of
two adds.

But I have not read anything about Golden Cove performing back-to-back
adds in one cycle. Did I miss that, or are my measurements (or the
computations) wrong? It should be easy enough to test the
back-to-back loads, but I am too tired to do it now.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Zero-cycle constant adds (was: Alder Lake results ...)

<2023Dec21.182144@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=35880&group=comp.arch#35880
 by: Anton Ertl - Thu, 21 Dec 2023 17:21 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Golden Cove is even more interesting. Apparently it can not only
>perform zero-cycle store-to-load forwarding, it can even perform adds
>that take at most 0.65 cycles (Column H). My guess is that it can
>perform two back-to-back adds in one cycle, and the store-to-load
>optimization is so good that it does not prevent this combination of
>two adds.

It cannot do back-to-back adds in general, but it can perform 0-cycle
(or close to 0) constant adds, as I found out with the microbenchmark
at <http://www.complang.tuwien.ac.at/anton/additions/>.

You can find the text portion of this page below; there are a few
links on the page that are not reproduced below.

If you have information on any of the things I don't know, please let
me know about it.

- anton

Dependent Additions Microbenchmark

Most CPU cores have a latency of one cycle for an integer addition,
so a sequence like

addq $1, %rax
addq $1, %rax
addq $1, %rax
addq $1, %rax
addq $1, %rax

usually has 5 cycles of total latency. The P-Core of the Core
i3-1315U (a Raptor Cove with 1.25MB L2 cache (or does this mean it is
a Golden Cove?)), however, executes this sequence much faster. This
microbenchmark finds out performance characteristics about that.
There are 14 loops, most containing sequences like the one shown
above, but there are others:

a                 i                   j                 n                 k-m
addq %r8,%rax     leaq 1(%rax),%r8    addq $1,%rax      imul %rsi,%rax    imul %rsi,%rax
addq %rcx,%rax    leaq 1(%r8),%rcx    addq %rax,%r8     addq %rsi,%rax    addq $1,%rax
addq %rdx,%rax    leaq 1(%rcx),%rdx   addq $1,%rax                        addq $1,%rax
addq %rsi,%rax    leaq 1(%rdx),%rsi   addq %rax,%rcx                      ...
addq %rdi,%rax    leaq 1(%rsi),%rax   addq $1,%rax
cmpq %rax,%r9     cmpq %rax, %rdi     addq %rax,%rdx
jg 2b             jg 2b               addq $1,%rax
                                      addq %rax,%rsi
                                      addq $1,%rax
                                      addq %rax,%r9
                                      addq $1,%rax
                                      cmpq %rax,%rdi
                                      jg 2b

* Loop a tests if the speedup also holds if both operands to the
addition are registers.
* Loop i tests if it also holds if we use leaq instead of addq, and
if we use more than one register.
* Loop j tests if it also holds if every intermediate result is
actually being used.
* Loop n measures the latency of a multiply with a register (that
contains 1), and a register-register add, in preparation for:
* Loops k-m measure the pure latency of the addq $1, %rax sequences
by masking the ALU resource contention with the latency of an
imul instruction. I.e., these resources can be consumed for free
during the 3-cycle latency of imul.

The results we see are:

loop  cyc/it  Mops  dep.adds  remarks
a       5.00     6         5  addq reg, %rax
b       1.01     6         5  addq $1, %rax
f       1.23     7         6  addq $1, %rax
c       1.98    11        10  addq $1, %rax
d       2.00    12        11  addq $1, %rax
e       2.21    13        12  addq $1, %rax
g       3.01    18        17  addq $1, %rax
h       3.25    19        18  addq $1, %rax
i       1.00     6         5  leaq 1(reg1),reg2
j       2.00    12         6  addq $1, %rax; addq %rax, reg
n       4.00     3         1  imul %rsi, %rax; addq %rsi, %rax
k       3.00     3         1  imul %rsi, %rax; addq $1, %rax
l       3.01    12        10  imul %rsi, %rax; addq $1, %rax
m       3.20    17        15  imul %rsi, %rax; addq $1, %rax

The Mops column counts the addq, cmpq+jg, leaq, and imul as
Macro-ops.

* We see from loop a that a general add has a latency of one cycle.
* But adds with a constant that depend on each other run at 6
macro-ops per cycle (where cmpq+jg is a macro-op), at least if
the loop matches the constraints perfectly (loops b,d,g).
* For slight mismatches (loops f,c,e,h), there is a penalty beyond
just needing a resource for a cycle.
* Using leaq (loop i) works just as fast as using addq, and using
several registers does not slow execution down, either.
* If every intermediate result is used (loop j), the dependent adds
are not slowed down to 1 per cycle, but of course the additional
adds consume resources and that slows the whole loop down to 2
cycles/iteration.
* We see a latency of 4 cycles (3 for the imul, 1 for the addq) in
loop n.
* Interestingly, if we replace the addq %rsi, %rax with addq $1,
%rax, the latency goes down to 3 cycles (k). Additional adds
don't increase the cycles for a loop iteration up to and
including a sequence of 10 addq $1, %rax (l). After that it rises
slowly (m).

What is happening here? I can only guess:

I believe that the hardware optimizes sequences of constant adds at
the renaming stage of the microarchitecture. Not later, because
according to Chips and Cheese Golden Cove (and, by extension, Raptor
Cove) has only 5 ALUs, and we see >5 addqs per cycle. Not earlier,
because it works across the zero-cycle store-to-load-forwarding
(which probably happens not earlier than the renaming stage). The
renamer already performs move elimination, the constant-add
elimination could be a generalized form of that.
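
To make that guess concrete, one way such a rename-stage optimization
could be organized (purely speculative; nothing below is documented
Golden/Raptor Cove behaviour):

#include <cstdint>

// Speculative sketch: the register alias table maps each architectural
// register to a physical register plus a pending constant, generalizing
// move elimination.  "addq $imm, %reg" then issues no ALU uop at all; it
// just bumps the pending constant, and consumers of %reg fold it in as an
// immediate when they are renamed.
struct RatEntry {
    int          phys;    // physical register holding the base value
    std::int64_t offset;  // constant accumulated on top of that value
};

RatEntry rat[16];         // one entry per architectural integer register

void rename_add_imm(int reg, std::int64_t imm)
{
    rat[reg].offset += imm;   // the "zero-cycle" add
    // Real hardware would have to bound this offset (a later post in the
    // thread measures a -4096..4095 window) and, past the bound, emit a
    // normal add uop to reify the value in a physical register.
}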

I don't have a good explanation for the apparent latency of 0 for the
adds following the imul in loops k and l. I would think that the
additions accumulated in some state in the renamer have to be reified
in a physical register at some point, and I would expect that to cost
one cycle of latency. One theory I considered was that the
microarchitecture internally has a multiply-add unit, but one would
expect that to work also for loop n, but there we see an additional
cycle of latency from the addition.

How to run this yourself

Download this directory and do make or possibly something like
taskset -c 2 make on a Linux system with perf. The calls are scaled
for 1G iterations (which you can see in the number of branches shown
in the perf output), so dividing the number of cycles by 1G gives the
cycles/iteration.

--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Zero-cycle constant adds

<ad2d23fdffca8eba2de10572b51c6714@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35882&group=comp.arch#35882
 by: MitchAlsup - Thu, 21 Dec 2023 18:12 UTC

Anton Ertl wrote:

> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>Golden Cove is even more interesting. Apparently it can not only
>>perform zero-cycle store-to-load forwarding, it can even perform adds
>>that take at most 0.65 cycles (Column H). My guess is that it can
>>perform two back-to-back adds in one cycle, and the store-to-load
>>optimization is so good that it does not prevent this combination of
>>two adds.

> It cannot do back-to-back adds in general, but it can perform 0-cycle
> (or close to 0) constant adds, as I found out with the microbenchmark
> at <http://www.complang.tuwien.ac.at/anton/additions/>.

> You can find the text portion of this page below; there are a few
> links on the page that are not produced below.

> If you have information on any of the things I don't know, please let
> me know about it.

> - anton

> Dependent Additions Microbenchmark

> Most CPU cores have a latency of one cycle for an integer addition,
> so a sequence like

> addq $1, %rax
> addq $1, %rax
> addq $1, %rax
> addq $1, %rax
> addq $1, %rax

K9, my P4 lookalike, would perform peephole optimizations (mainly due to
stack pushes and pops) as it converted x86-64 code into µOps. The above
would end up as::

addq $5,%rax

in µOp cache (which we called a trace cache). We did this particular optimization
so that::

push Rax
push Rbx
push Rcx

would get changed from 3 serially dependent stores into 3 parallel stores

ST Rax,[SP]
ST Rbx,[SP+4]
ST Rcx,[SP+8]
ADD SP,SP,12

> usually has 5 cycles of total latency. The P-Core of the Core
> i3-1315U (a Raptor Cove with 1.25MB L2 cache (or does this mean it is
> a Golden Cove?)), however, executes this sequence much faster. This
> microbenchmark finds out performance characteristics about that.
> There are 14 loops, most containing sequences like the one shown
> above, but there are others:

> a                 i                   j                 n                 k-m
> addq %r8,%rax     leaq 1(%rax),%r8    addq $1,%rax      imul %rsi,%rax    imul %rsi,%rax
> addq %rcx,%rax    leaq 1(%r8),%rcx    addq %rax,%r8     addq %rsi,%rax    addq $1,%rax
> addq %rdx,%rax    leaq 1(%rcx),%rdx   addq $1,%rax                        addq $1,%rax
> addq %rsi,%rax    leaq 1(%rdx),%rsi   addq %rax,%rcx                      ...
> addq %rdi,%rax    leaq 1(%rsi),%rax   addq $1,%rax
> cmpq %rax,%r9     cmpq %rax, %rdi     addq %rax,%rdx
> jg 2b             jg 2b               addq $1,%rax
>                                       addq %rax,%rsi
>                                       addq $1,%rax
>                                       addq %rax,%r9
>                                       addq $1,%rax
>                                       cmpq %rax,%rdi
>                                       jg 2b

> * Loop a tests if the speedup also holds if both operands to the
> addition are registers.
> * Loop i tests if it also holds if we use leaq instead of addq, and
> if we use more than one register.
> * Loop j tests if it also holds if every intermediate result is
> actually being used.
> * Loop n measures the latency of a multiply with a register (that
> contains 1), and a register-register add, in preparation for:
> * Loops k-m measure the pure latency of the addq $1, %rax sequences
> by masking the ALU resource contention with the latency of an
> imul instruction. I.e., these resources can be consumed for free
> during the 3-cycle latency of imul.

> The results we see are:

> loop  cyc/it  Mops  dep.adds  remarks
> a       5.00     6         5  addq reg, %rax
> b       1.01     6         5  addq $1, %rax
> f       1.23     7         6  addq $1, %rax
> c       1.98    11        10  addq $1, %rax
> d       2.00    12        11  addq $1, %rax
> e       2.21    13        12  addq $1, %rax
> g       3.01    18        17  addq $1, %rax
> h       3.25    19        18  addq $1, %rax
> i       1.00     6         5  leaq 1(reg1),reg2
> j       2.00    12         6  addq $1, %rax; addq %rax, reg
> n       4.00     3         1  imul %rsi, %rax; addq %rsi, %rax
> k       3.00     3         1  imul %rsi, %rax; addq $1, %rax
> l       3.01    12        10  imul %rsi, %rax; addq $1, %rax
> m       3.20    17        15  imul %rsi, %rax; addq $1, %rax

> The Mops column counts the addq, cmpq+jg, leaq, and imul as
> Macro-ops.

> * We see from loop a that a general add has a latency of one cycle.
> * But adds with a constant that depend on each other run at 6
> macro-ops per cycle (where cmpq+jg is a macro-op), at least if
> the loop matches the constraints perfectly (loops b,d,g).
> * For slight mismatches (loops f,c,e,h), there is a penalty beyond
> just needing a resource for a cycle.
> * Using leaq (loop i) works just as fast as using addq, and using
> several registers does not slow execution down, either.
> * If every intermediate result is used (loop j), the dependent adds
> are not slowed down to 1 per cycle, but of course the additional
> adds consume resources and that slows the whole loop down to 2
> cycles/iteration.
> * We see a latency of 4 cycles (3 for the imul, 1 for the addq) in
> loop n.
> * Interestingly, if we replace the addq %rsi, %rax with addq $1,
> %rax, the latency goes down to 3 cycles (k). Additional adds
> don't increase the cycles for a loop iteration up to and
> including a sequence of 10 addq $1, %rax (l). After that it rises
> slowly (m).

> What is happening here? I can only guess:

Peephole optimizations as µOps are combined into groups.

> I believe that the hardware optimizes sequences of constant adds at
> the renaming stage of the microarchitecture. Not later, because
> according to Chips and Cheese Golden Cove (and, by extension, Raptor
> Cove) has only 5 ALUs, and we see >5 addqs per cycle. Not earlier,
> because it works across the zero-cycle store-to-load-forwarding
> (which probably happens not earlier than the renaming stage). The
> renamer already performs move elimination, the constant-add
> elimination could be a generalized form of that.

> I don't have a good explanation for the apparent latency of 0 for the
> adds following the imul in loops k and l. I would think that the
> additions accumulated in some state in the renamer have to be reified
> in a physical register at some point, and I would expect that to cost
> one cycle of latency. One theory I considered was that the
> microarchitecture internally has a multiply-add unit, but one would
> expect that to work also for loop n, but there we see an additional
> cycle of latency from the addition.

> How to run this yourself

> Download this directory and do make or possibly something like
> taskset -c 2 make on a Linux system with perf. The calls are scaled
> for 1G iterations (which you can see in the number of branches shown
> in the perf output), so dividing the number of cycles by 1G gives the
> cycles/iteration.

Re: Zero-cycle constant adds (was: Alder Lake results ...)

<2023Dec23.112405@mips.complang.tuwien.ac.at>

https://www.novabbs.com/devel/article-flat.php?id=35913&group=comp.arch#35913
 by: Anton Ertl - Sat, 23 Dec 2023 10:24 UTC

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
[Golden/Raptor Cove]
>It cannot do back-to-back adds in general, but it can perform 0-cycle
>(or close to 0) constant adds, as I found out with the microbenchmark
>at <http://www.complang.tuwien.ac.at/anton/additions/>.

I have now also tested the limits of this hardware optimization. The
intermediate sum of the increments must stay within the 13-bit range
-4096..4095, or you will see a slowdown.
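
A worked example of that limit, derived only from the numbers above (so
treat it as an illustration): with addq $512 increments, the running sum
after seven adds is 3584 and still inside the window, but the eighth
reaches 4096 and falls outside it, so a chain of eight such adds should be
where the slowdown first appears.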

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Re: Memory dependency microbenchmark

<umdljg$3chk0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35952&group=comp.arch#35952
 by: Chris M. Thomasson - Tue, 26 Dec 2023 04:42 UTC

On 11/17/2023 1:03 AM, aph@littlepinkcloud.invalid wrote:
> Kent Dickey <kegs@provalid.com> wrote:
>> In article <u5mcnayk1ZSP1s74nZ2dnZfqnPqdnZ2d@supernews.com>,
>> <aph@littlepinkcloud.invalid> wrote:
>>> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>>>> On 11/13/2023 3:49 AM, aph@littlepinkcloud.invalid wrote:
>>>>>
>>>>> I don't think that's really true. The reorderings we see in currently-
>>>>> produced hardware are, more or less, a subset of the same reorderings
>>>>> that C compilers perform. Therefore, if you see a confusing hardware
>>>>> reordering in a multi-threaded C program it may well be (probably is!)
>>>>> a bug (according to the C standard) *even on a TSO machine*. The only
>>>>> common counter-example to this is for volatile accesses.
>>>>
>>>> Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
>>>
>>> Maybe I wasn't clear enough. If you use std::atomic and
>>> std::memory_order_* in such a way that there are no data races, your
>>> concurrent program will be fine on both TSO and relaxed memory
>>> ordering. If you try to fix data races with volatile instead of
>>> std::atomic and std::memory_order_*, that'll mostly fix things on a
>>> TSO machine, but not on a machine with relaxed memory ordering.
>>
>> What you are saying is:
>>
>> As long as you fully analyze your program, ensure all multithreaded
>> accesses are only through atomic variables, and you label every
>> access to an atomic variable properly (although my point is: exactly
>> what should that be??), then there is no problem.
>
> Well, this is definitely true. But it's not exactly a plan: in
> practice, people use careful synchronization boundaries and immutable
> data structures.
>
>> What I'm arguing is: the CPU should behave as if
>> memory_order_seq_cst is set on all accesses with no special
>> trickery.
>
> What I'm saying is: that isn't sufficient if you are using an
> optimizing compiler. And if you are programming for an optimizing
> compiler you have to follow the rules anway. And the optimizing
> compiler can reorder stores and loads as much, if not more, than the
> hardware does.
>
>> This acquire/release nonsense is all weakly ordered brain
>> damage. The problem is on weakly ordered CPUs, performance
>> definitely does matter in terms of getting this stuff right, but
>> that's their problem. Being weakly ordered makes them slower when
>> they have to execute barriers for correctness, but it's the barriers
>> themselves that are the slow down, not ordering the requests
>> properly.
>
> How is that? All the barriers do is enforce the ordering.

I feel the need to make the point that compiler barriers are different
than memory barriers...

>
> Take the Apple M1 as an exaple. It has a TSO mode bit. It also has TSO
> stores and loads, intended for when TSO mode is turned off. Are you
> saying that the TSO stores and loads use a different mechanism to
> enforce ordering from the one used when TSO is on by default?
>
> Andrew.

Re: Memory dependency microbenchmark

<umdobu$3cmi3$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=35953&group=comp.arch#35953
 by: Chris M. Thomasson - Tue, 26 Dec 2023 05:29 UTC

On 11/25/2023 3:33 PM, Chris M. Thomasson wrote:
> On 11/25/2023 9:05 AM, Scott Lurndal wrote:
>> mitchalsup@aol.com (MitchAlsup) writes:
>>> Scott Lurndal wrote:
>>>
>>>> mitchalsup@aol.com (MitchAlsup) writes:
>>>>> Chris M. Thomasson wrote:
>>>>>
>>>>>> Afaict MFENCE is basically a #StoreLoad. A LOCK RMW is a membar
>>>>>> but in
>>>>>> an interesting way where there is a word(s) being updated. I say
>>>>>> word's
>>>>>> is because of cmpxchg8b on a 32 bit system. Or cmpxchg16b on a 64 bit
>>>>>> system. Basically a DWCAS, or double-word compare-and-swap where it
>>>>>> works with two contiguous words. This is different that a DCAS
>>>>>> that can
>>>>>> work with two non-contiguous words.
>>>>>
>>>>>
>>>>> The later is generally known as DCADS Double compare and Double swap.
>>>>> I did see some academic literature a decade ago in wanting TCADS
>>>>> triple compare and double swap.
>>>>>
>>>>> It is for stuff like this that I invented esm so SW writers can
>>>>> program up any number of compares and and width of swapping. This
>>>>> means ISA does not have to change in the future wrt synchronization
>>>>> capabilities.
>>>>>
>>>>> It ends up that one of the most important properties is also found
>>>>> in LL/SC--and that is the the LL denotes the beginning of an ATOMIC
>>>>> event and SC denotes the end. LL/SC provide a bridge model to SW
>>>>> while ATOMIC events ending with xCAxS only provide notion one is
>>>>> in a ATOMIC event at the last instruction of the event.
>>>>>
>>>>> LL/SC can perform interference detection based on address while;
>>>>> xCAxS can only perform the interference based on the data at that
>>>>> address.....
>>>
>>>> The fatal problem with ll/sc is scaling.  They scale poorly with large
>>>> processor counts.
>>>
>>>
>>> One can take the PoV that the fatal problem is that only 1 memory
>>> container
>>> is used !! And it does not matter a whole lot if the underpinnings
>>> are LL/SC
>>> or 1CA1S.
>>
>> Clearly 1000 processors beating on the same cache line is
>> something to be avoided.   There still seem to be a dearth
>> of programmers that understand multithreaded programming.
>
> 1000 processors requires a distributed setup wrt NUMA. For instance, an
> amortized multi-queue that knows about locality tends to work rather
> well. The queue is distributed out where its unlikely that two
> processors might contend on the same thing in the queue logic.

If a processor obtains some work from the "queue" API (FIFO is not
guaranteed here.... ;^o), it's highly likely that the work is fairly
local, close to the calling processor wrt the physical layout of the
arch...
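
A crude sketch of that sort of locality-aware, amortized multi-queue
(names and policy invented here; just the simplest shape that matches the
description): one locked deque per node or core group, local push/pop, and
a processor only reaches into other shards when its own is empty.

#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Illustrative locality-aware work "queue": one shard per node/core group.
// FIFO is only per-shard, not global, which matches the caveat above.
template <typename Work>
class ShardedQueue {
    struct Shard {
        std::mutex       m;
        std::deque<Work> q;
    };
    std::vector<Shard> shards_;

public:
    explicit ShardedQueue(std::size_t nshards) : shards_(nshards) {}

    void push(std::size_t home, Work w) {
        Shard& s = shards_[home % shards_.size()];
        std::lock_guard<std::mutex> g(s.m);
        s.q.push_back(std::move(w));
    }

    // Pop local work first; steal from other shards only if the home shard
    // is empty, so contention between distant processors stays rare.
    std::optional<Work> pop(std::size_t home) {
        std::size_t n = shards_.size();
        for (std::size_t i = 0; i < n; i++) {
            Shard& s = shards_[(home + i) % n];
            std::lock_guard<std::mutex> g(s.m);
            if (!s.q.empty()) {
                Work w = std::move(s.q.front());
                s.q.pop_front();
                return w;
            }
        }
        return std::nullopt;
    }
};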

Hummm, I need to start another thread on something related.

>
>
>
>>
>>>
>>> One can also take the PoV that all currently known CPU driven ATOMICs
>>> suffer at the BigO( level ) rather similarly, whereas Memory ATOMICs
>>> at least guarantee forward progress::: ADDtoMemory( addr, value )
>>> without scaling poorly.....
>>
>> The difference between an atomic and 'll' is that the atomic is
>> guaranteed to make forward progress, while the 'll' generally
>> isn't and the chances of a SC failure increase exponentially with
>> the processor count.
>>
>> Atomics are often supported directly by the shared LLC as well.
>

Re: Memory dependency microbenchmark

<HMKdnU1BhM-daRb4nZ2dnZfqn_idnZ2d@supernews.com>

https://www.novabbs.com/devel/article-flat.php?id=35960&group=comp.arch#35960
 by: aph...@littlepinkcloud.invalid - Wed, 27 Dec 2023 09:53 UTC

Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
> On 11/17/2023 1:03 AM, aph@littlepinkcloud.invalid wrote:
>> Kent Dickey <kegs@provalid.com> wrote:
>>> What I'm arguing is: the CPU should behave as if
>>> memory_order_seq_cst is set on all accesses with no special
>>> trickery.
>>
>> What I'm saying is: that isn't sufficient if you are using an
>> optimizing compiler. And if you are programming for an optimizing
>> compiler you have to follow the rules anway. And the optimizing
>> compiler can reorder stores and loads as much, if not more, than the
>> hardware does.
>>
>>> This acquire/release nonsense is all weakly ordered brain
>>> damage. The problem is on weakly ordered CPUs, performance
>>> definitely does matter in terms of getting this stuff right, but
>>> that's their problem. Being weakly ordered makes them slower when
>>> they have to execute barriers for correctness, but it's the barriers
>>> themselves that are the slow down, not ordering the requests
>>> properly.
>>
>> How is that? All the barriers do is enforce the ordering.
>
> I feel the need to make the point that compiler barriers are different
> than memory barriers...

Why do you feel the need to do that? Do you believe there is one
single person reading this who doesn't realize?

From the point of view of someone writing code in a high-level
language, for example (Standard) C, there is no difference. Even if
your machine was fully sequentially consistent, you'd still have to
insert barriers for the compiler in the same places.
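
To make that concrete, a small sketch in C++ terms (the commented-out asm
spelling is the GCC-specific one):

#include <atomic>

// Minimal illustration of the two kinds of "barrier" being discussed.
void barriers()
{
    // Compiler-only barrier: the compiler may not reorder memory accesses
    // across it, but no fence instruction is emitted.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    // (the traditional GCC spelling of the same constraint:)
    // asm volatile("" ::: "memory");

    // Memory barrier: constrains the compiler in the same way and also
    // emits whatever the target ISA needs (e.g. MFENCE on x86, DMB ISH on
    // ARM); on a sequentially consistent machine only the compiler part
    // would remain necessary.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}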

Andrew.

Re: Memory dependency microbenchmark

<4a473cbc2740f141f5a620f61bf40379@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35966&group=comp.arch#35966
 by: MitchAlsup - Wed, 27 Dec 2023 17:21 UTC

Chris M. Thomasson wrote:

> On 11/25/2023 3:33 PM, Chris M. Thomasson wrote:
>> On 11/25/2023 9:05 AM, Scott Lurndal wrote:
>>>
>>>
>>> Clearly 1000 processors beating on the same cache line is
>>> something to be avoided.   There still seem to be a dearth
>>> of programmers that understand multithreaded programming.
>>
>> 1000 processors requires a distributed setup wrt NUMA. For instance, an
>> amortized multi-queue that knows about locality tends to work rather
>> well. The queue is distributed out where its unlikely that two
>> processors might contend on the same thing in the queue logic.

> If a processor obtains some work from the "queue" API (FIFO is not
> guaranteed here.... ;^o), its highly likely that it is fairly local
> work, close to "it" wrt the calling processor, wrt the physical layout
> of the arch...

When you have 1000 processors, and the number of cycles it takes
to perform a synchronization is 100 cycles, you need the
synchronization to hand off enough work that the winner has enough
cycles of computation before its next synchronization to allow
all the other processors to obtain work.
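
As a rough worked example with those numbers: if the synchronization point
hands out work at a rate of at most one unit per ~100 cycles, then one
full round over 1000 processors takes on the order of 1000 x 100 = 100,000
cycles, so each handed-off unit has to represent at least roughly that
much computation, or the processors end up waiting at the synchronization
point instead of computing.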

> Hummm, I need to start another thread on something related.

Re: Memory dependency microbenchmark

<c72d72b7b70d2c4e049b450194d4836a@news.novabbs.com>

https://www.novabbs.com/devel/article-flat.php?id=35967&group=comp.arch#35967
 by: MitchAlsup - Wed, 27 Dec 2023 17:23 UTC

aph@littlepinkcloud.invalid wrote:

> Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
>> On 11/17/2023 1:03 AM, aph@littlepinkcloud.invalid wrote:
>>> Kent Dickey <kegs@provalid.com> wrote:
>>>> What I'm arguing is: the CPU should behave as if
>>>> memory_order_seq_cst is set on all accesses with no special
>>>> trickery.
>>>
>>> What I'm saying is: that isn't sufficient if you are using an
>>> optimizing compiler. And if you are programming for an optimizing
>>> compiler you have to follow the rules anway. And the optimizing
>>> compiler can reorder stores and loads as much, if not more, than the
>>> hardware does.
>>>
>>>> This acquire/release nonsense is all weakly ordered brain
>>>> damage. The problem is on weakly ordered CPUs, performance
>>>> definitely does matter in terms of getting this stuff right, but
>>>> that's their problem. Being weakly ordered makes them slower when
>>>> they have to execute barriers for correctness, but it's the barriers
>>>> themselves that are the slow down, not ordering the requests
>>>> properly.
>>>
>>> How is that? All the barriers do is enforce the ordering.
>>
>> I feel the need to make the point that compiler barriers are different
>> than memory barriers...

> Why do you feel the need to do that? Do you believe there is one
> single person reading this who doesn't realize?

There are often academicians who wander by now and then who are not
up to these fine points.

> From the point of view of someone writing code in a high-level
> language, for example (Standard) C, there is no difference. Even if
> your machine was fully sequentially consistent, you'd still have to
> insert barriers for the compiler in the same places.

> Andrew.

