comp.lang.c++ / Wow !

Subject                      Author
* Wow !                      Bonita Montero
+- Re: Wow !                 Bonita Montero
+* Re: Wow !                 Mr Flibble
|`* Re: Wow !                Bonita Montero
| `* Re: Wow !               Mr Flibble
|  `* Re: Wow !              Bonita Montero
|   `* Re: Wow !             Mr Flibble
|    `- Re: Wow !            Bonita Montero
`* Re: Wow !                 Bonita Montero
 `* Re: Wow !                Scott Lurndal
  `- Re: Wow !               Bonita Montero

Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Sun, 7 May 2023 19:25:32 +0200
Message-ID: <u38muc$3f1dc$1@dont-email.me>

This morning I thought that when a thread completely overwrites a block
of memory the size of a cache line, it should be unnecessary to remotely
load the previous contents of that line into the local cache. Instead,
it would be sufficient to simply send an invalidate message to all other
caches: writes covering the full cache line queue up in the write queue,
so there is no need to load any data from a remote cache. I had tackled
this issue before and wrote the test below, but on my former Zen2 3990X
Windows PC, which is now my Linux PC, clearing a cache line that other
cores are competing for with two AVX stores gave no performance
advantage. But my Windows 11 PC is now a 16-core Ryzen 9 7950X, so I ran
the benchmark below again. On this machine, partially overwriting a
cache line without first clearing the whole line with two AVX stores is
about 2.3 times slower. I think that's an impressive difference that
could lead to a new kind of optimization.
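
To make the comparison concrete, here is a minimal sketch of the two
per-line write patterns the benchmark below times (the helper names are
illustrative, not identifiers from the benchmark):

#include <immintrin.h>

// Partial write: touch only the first byte of the line. With ordinary
// stores the core must first read the old line contents for ownership
// from a remote cache or from memory before merging the byte.
inline void partial_write( char *line )
{
    line[0] = 0;
}

// Full overwrite: two 32-byte AVX stores cover the whole 64-byte line,
// so in principle the old contents are never needed and an
// invalidate-only transaction would suffice.
inline void full_line_write( char *line )
{
    __m256i zero = _mm256_setzero_si256();
    _mm256_storeu_si256( (__m256i *)line, zero );
    _mm256_storeu_si256( (__m256i *)line + 1, zero );
}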

#include <iostream>
#include <vector>
#include <memory>
#include <cstring>
#include <chrono>
#include <thread>
#include <functional>
#include <semaphore>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__)
#include <immintrin.h>
#endif

using namespace std;
using namespace chrono;

int main()
{
    constexpr size_t
#if defined(__cpp_lib_hardware_interference_size)
        CL_SIZE = hardware_destructive_interference_size,
#else
        CL_SIZE = 64,
#endif
        BLOCK_SIZE = 16ull * 1024,
#if defined(NDEBUG)
        ROUNDS = 0x10000;
#else
        ROUNDS = 1;
#endif
    atomic_int readyCountDown;
    counting_semaphore ready( 0 );
    atomic_int synch;
    unsigned hc = jthread::hardware_concurrency();
    vector<char> vc( BLOCK_SIZE + CL_SIZE - 1 );
    // align the start of the buffer to a cache-line boundary
    char const *aligned = to_address( vc.cbegin() );
    aligned = (char const *)(((size_t)aligned + CL_SIZE - 1) & -(ptrdiff_t)CL_SIZE);
    string_view avc( aligned, aligned + BLOCK_SIZE );
    atomic_int64_t sumNs;
    auto partitialWrite = [&]<typename Clear64>( Clear64 clear64 )
    {
        // gather all threads, then spin on synch so the timed loops start together
        if( readyCountDown.fetch_sub( 1, memory_order::relaxed ) > 1 )
            ready.acquire();
        else
            ready.release( hc - 1 );
        if( synch.fetch_sub( 1, memory_order::relaxed ) > 1 )
            while( synch.load( memory_order::relaxed ) );
        auto start = high_resolution_clock::now();
        for( size_t r = ROUNDS; r--; )
            for( auto it = avc.begin(), end = avc.end(); end - it >= CL_SIZE; it += CL_SIZE )
                // optionally clear the whole line, then touch its first byte
                clear64( to_address( it ) ),
                (char &)*it = 0;
        sumNs.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() );
    };
    // zero one cache line (64 bytes assumed): two 32-byte AVX stores under
    // MSVC, four 16-byte SSE stores under gcc/clang
    static auto clClear = []( void const *p )
    {
#if defined(_MSC_VER)
        __m256d
            zero = _mm256_setzero_pd(),
            *pMM = (__m256d *)p;
        pMM[0] = zero;
        pMM[1] = zero;
#elif defined(__GNUC__) || defined(__clang__)
        __m128d
            zero = _mm_setzero_pd(),
            *pMM = (__m128d *)p;
        pMM[0] = zero;
        pMM[1] = zero;
        pMM[2] = zero;
        pMM[3] = zero;
#endif
    };
    using fn_t = function<void ()> const;
    fn_t
        fnNoClear( [&]() { partitialWrite( []( void const *p ) {} ); } ),
        fnClear( [&]() { partitialWrite( clClear ); } );
    vector<jthread> threads;
    auto initiate = [&]( bool clear, char const *head )
    {
        readyCountDown.store( hc, memory_order::relaxed );
        synch.store( hc, memory_order::relaxed );
        sumNs.store( 0, memory_order::relaxed );
        for( unsigned t = hc; t--; )
            threads.emplace_back( []( fn_t &fn ) { fn(); }, cref( !clear ? fnNoClear : fnClear ) );
        threads.resize( 0 );   // jthread joins on destruction
        // nanoseconds per cache-line write, summed over all threads
        double ns = sumNs.load( memory_order::relaxed ) / ((double)BLOCK_SIZE / CL_SIZE * ROUNDS);
        cout << head << ns << endl;
    };
    initiate( false, "no-clear: " );
    initiate( true, "clear: " );
}
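
I'd expect something like g++ -std=c++20 -O2 -pthread to be enough to
build this: C++20 is needed for jthread, counting_semaphore and the
templated lambdas, and the gcc/clang path only uses SSE intrinsics, so
no extra -m flags should be required.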

Re: Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Sun, 7 May 2023 20:08:44 +0200
Message-ID: <u38pfc$3feb3$1@dont-email.me>
In-Reply-To: <u38muc$3f1dc$1@dont-email.me>

I further improved the code and found that even if I clear the
cache line with eight uint64_t stores I get the same speedup.
It would be nice to know whether other x86 CPUs also have this
kind of optimization.
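
A minimal sketch of that variant (the helper name is illustrative, not
an identifier from the benchmark):

#include <cstdint>
#include <cstddef>

// Zero one 64-byte cache line with eight plain 64-bit stores instead
// of AVX stores; per the observation above this gives the same speedup.
inline void clear_line_u64( void *p )
{
    std::uint64_t *q = (std::uint64_t *)p;
    for( std::size_t i = 0; i != 8; ++i )   // 8 * 8 bytes = 64 bytes
        q[i] = 0;
}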

Re: Wow !

From: flibb...@reddwarf.jmc.corp (Mr Flibble)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 05:01:44 +0100
Message-ID: <175d0f29847a6702$1$273595$faa1acb7@news.newsdemon.com>
In-Reply-To: <u38muc$3f1dc$1@dont-email.me>

On 07/05/2023 6:25 pm, Bonita Montero wrote:
> [original post and benchmark code snipped]

Your code is an unreadable mess made worse by the "using" directives.

/Flibble

Re: Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 06:44:55 +0200
Message-ID: <u39uo6$3pa9p$1@dont-email.me>
In-Reply-To: <175d0f29847a6702$1$273595$faa1acb7@news.newsdemon.com>

On 08.05.2023 at 06:01, Mr Flibble wrote:
> On 07/05/2023 6:25 pm, Bonita Montero wrote:
>> [original post and benchmark code snipped]
>
> Your code is an unreadable mess made worse by the "using" directives.

If that is your problem, you'd have a lot of other problems with the
code.

Re: Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 08:40:10 +0200
Message-ID: <u3a5ga$3q9bj$1@dont-email.me>
In-Reply-To: <u38muc$3f1dc$1@dont-email.me>

My previous benchmark measures the time it takes to write a cache line
that other cores are competing for. But I figured that writing a cache
line should also be faster when it doesn't have to be fetched from
another cache but comes directly from RAM: when loading from RAM, a
snoop must still be sent to the other cores, but if the CPU recognizes
that the cache line is being completely overwritten, only an invalidate
message has to be sent to the other caches, i.e. the snoop response
does not have to carry any data.
The benchmark below measures this by repeatedly scanning a one-gigabyte
block of memory. On my 7950X Zen4 system, wiping the cache line first
gives a speedup of about 16%. On my 3990X Zen2 system there's a
slowdown of about 2%.

#include <iostream>
#include <vector>
#include <new>
#include <chrono>
#include <cmath>
#include <cstdint>   // uint64_t for the generic fallback below
#include <utility>   // make_index_sequence for the generic fallback below
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <immintrin.h>
#endif

using namespace std;
using namespace chrono;

int main()
{
    constexpr size_t
#if defined(__cpp_lib_hardware_interference_size)
        CL_SIZE = hardware_destructive_interference_size,
#else
        CL_SIZE = 64,
#endif
        BLOCK_SIZE = 1ull << 30,
        ROUNDS = 30;
    vector<char> block( BLOCK_SIZE + CL_SIZE - 1 );
    // align the scanned range to a cache-line boundary
    char
        *align = (char *)((size_t)(to_address( block.begin() ) + CL_SIZE - 1) & -(ptrdiff_t)CL_SIZE),
        *alignEnd = align + BLOCK_SIZE;
    auto bench = [&]<typename Clear>( char const *head, Clear clear, double before ) -> double
    {
        auto start = high_resolution_clock::now();
        for( size_t r = ROUNDS; r--; )
            for( char *p = (char *)align; p != alignEnd; p += CL_SIZE )
                // optionally clear the whole line, then touch its first byte
                clear( p ),
                *p = 0;
        double ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count()
                    / ((double)ROUNDS * BLOCK_SIZE / CL_SIZE);
        cout << head << ns;
        if( before > 0.0 )
            cout << " (" << trunc( 1.0e3 * before / ns ) / 10.0 << "%)";
        cout << endl;
        return ns;
    };
    // zero one cache line: a 64-byte AVX-512 store or two 32-byte AVX stores
    // under MSVC, four 16-byte SSE stores under gcc/clang, and eight
    // uint64_t stores otherwise
    static auto clClear = []( void const *p )
    {
#if defined(_MSC_VER) && (defined(__AVX__) || defined(__AVX512VL__) || defined(__AVX512DQ__))
#if defined(__AVX512VL__) || defined(__AVX512DQ__)
        *(__m512d *)p = _mm512_setzero_pd();
#else
        __m256d
            zero = _mm256_setzero_pd(),
            *pMM = (__m256d *)p;
        pMM[0] = zero;
        pMM[1] = zero;
#endif
#elif defined(__GNUC__) || defined(__clang__)
        __m128d
            zero = _mm_setzero_pd(),
            *pMM = (__m128d *)p;
        pMM[0] = zero;
        pMM[1] = zero;
        pMM[2] = zero;
        pMM[3] = zero;
#else
        auto unroll = []<size_t ... Indices, typename Fn>( index_sequence<Indices ...>, Fn fn )
        {
            ((fn.template operator ()<Indices>()), ...);
        };
        unroll( make_index_sequence<CL_SIZE / 8>(), [&]<size_t I>() { ((uint64_t*)p)[I] = 0; } );
#endif
    };
    static auto clUntouched = []( auto ) {};
    double untouched = bench( "untouched: ", clUntouched, -1.0 );
    bench( "clear: ", clClear, untouched );
}

Re: Wow !

From: flibb...@reddwarf.jmc.corp (Mr Flibble)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 15:22:30 +0100
Message-ID: <175d31092941f7d5$108$4813$7aa12cbf@news.newsdemon.com>
In-Reply-To: <u39uo6$3pa9p$1@dont-email.me>

On 08/05/2023 5:44 am, Bonita Montero wrote:
> On 08.05.2023 at 06:01, Mr Flibble wrote:
>> On 07/05/2023 6:25 pm, Bonita Montero wrote:
>>> [original post and benchmark code snipped]
>>
>> Your code is an unreadable mess made worse by the "using" directives.
>
> If that is your problem you'd have a lot of other problems with the
> code.

Yes, I do:

Another problem I have with the code is the lack of whitespace (again
hinders readability).

Yet another problem I have with the code is short or meaningless
variable names; meaningful variable and function names almost make
code self-documenting.

And yet another problem I have with the code is the author.

/Flibble

Re: Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 17:19:31 +0200
Message-ID: <u3b3u3$3tlpt$1@dont-email.me>
In-Reply-To: <175d31092941f7d5$108$4813$7aa12cbf@news.newsdemon.com>

On 08.05.2023 at 16:22, Mr Flibble wrote:

> Yes, I do:
>
> Another problem I have with the code is the lack of whitespace (again
> hinders readability).
>
> Yet another problem I have with the code is short or meaningless
> variable names; meaningful variable and function names almost makes code
> self documenting.
>
> And yet another problem I have with the code is the author.
>
> /Flibble

That's just a matter of taste and IQ.

Re: Wow !

From: flibb...@reddwarf.jmc.corp (Mr Flibble)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 17:06:56 +0100
Message-ID: <175d36bc0e01b1d6$3$2383644$baa1ecb3@news.newsdemon.com>
In-Reply-To: <u3b3u3$3tlpt$1@dont-email.me>

On 08/05/2023 4:19 pm, Bonita Montero wrote:
> On 08.05.2023 at 16:22, Mr Flibble wrote:
>
>> [list of complaints snipped]
>
> That's just a matter of taste and IQ.

If the target audience of your code is just yourself, then you could
probably get away with writing a shite, incomprehensible mess; but you
are posting your code to a public forum in the hope, presumably, that
others will read it, so you should, as a bare minimum, attempt to make
your code comprehensible to others.

/Flibble

Re: Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Mon, 8 May 2023 18:12:25 +0200
Message-ID: <u3b718$3u015$1@dont-email.me>
In-Reply-To: <175d36bc0e01b1d6$3$2383644$baa1ecb3@news.newsdemon.com>

On 08.05.2023 at 18:06, Mr Flibble wrote:
> On 08/05/2023 4:19 pm, Bonita Montero wrote:
>> On 08.05.2023 at 16:22, Mr Flibble wrote:
>>
>>> [list of complaints snipped]
>>
>> That's just a matter of taste and IQ.
>
> If the target audience of your code is just yourself ...

No, it's people who have also understood what I wrote along with
the source.

Re: Wow !

From: sco...@slp53.sl.home (Scott Lurndal)
Newsgroups: comp.lang.c++
Date: Mon, 08 May 2023 16:59:05 GMT
Message-ID: <tT96M.2805260$9sn9.916293@fx17.iad>
References: <u38muc$3f1dc$1@dont-email.me> <u3a5ga$3q9bj$1@dont-email.me>

Bonita Montero <Bonita.Montero@gmail.com> writes:
>My previous benchmark measures the time it takes to write a cache line
>that other cores are competing for. But I figured that writing a cache
>line would be faster even if it didn't have to be fetched from another
>cache, but came directly from RAM. When loading from RAM, a snoop must
>be sent to the other cores, but if the CPU recognizes that the cache
>line is being completely overwritten, then only an invalidate message
>must be sent to the other caches, i.e. the snoop response does not
>have to contain any content.

Actually, the first store (e.g. an 8-byte, 64-bit store) that hits the
cache line requires the local cache to acquire exclusive access to the
entire line. At that point the coherency protocol will request that the
other cores either write a modified line back to memory (or to a higher
cache level, e.g. L3, depending on whether the cache levels are
exclusive or inclusive) or invalidate a shared or owned line before
responding to the local cache's exclusive-access request.

By the time the CPU knows that the entire line has been modified, it is
far too late to simply invalidate any remote lines.
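
Software can, however, state that intent up front with x86 non-temporal
streaming stores, which fill a write-combining buffer and invalidate
other copies without fetching the old line contents first. A minimal
sketch for illustration (it assumes AVX and a 32-byte-aligned pointer,
and is not code from this thread):

#include <immintrin.h>

// Overwrite one 64-byte line with two 32-byte non-temporal stores; a
// full line assembled in the write-combining buffer is written out
// without a read-for-ownership of the old data.
inline void stream_clear_line( void *line )
{
    __m256i zero = _mm256_setzero_si256();
    _mm256_stream_si256( (__m256i *)line, zero );
    _mm256_stream_si256( (__m256i *)line + 1, zero );
}

// A single _mm_sfence() after the streaming loop orders these stores
// with respect to later ordinary stores.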

Re: Wow !

From: Bonita.M...@gmail.com (Bonita Montero)
Newsgroups: comp.lang.c++
Date: Tue, 9 May 2023 03:05:23 +0200
Message-ID: <u3c68i$158j$1@dont-email.me>
In-Reply-To: <tT96M.2805260$9sn9.916293@fx17.iad>

On 08.05.2023 at 18:59, Scott Lurndal wrote:

> By the time the CPU knows that the entire line has been modified
> it is far too late to simply invalidate any remote lines.

As I've shown, there's a performance difference when the cache line is
written as a whole, so apparently only a remote invalidation is
performed.

