devel / comp.lang.forth / Re: Shared memory

Subject                           Author
* Shared memory                   Marcel Hendrix
+- Re: Shared memory              Hans Bezemer
+* Re: Shared memory              Anton Ertl
|`* Re: Shared memory             Marcel Hendrix
| `- Re: Shared memory            Anton Ertl
+* Re: Shared memory              minf...@arcor.de
|`- Re: Shared memory             minf...@arcor.de
`* Re: Shared memory              none
 `* Re: Shared memory             Marcel Hendrix
  +* Re: Shared memory            Marcel Hendrix
  |`- Re: Shared memory           Marcel Hendrix
  +- Re: Shared memory            Marcel Hendrix
  `* Re: Shared memory            Marcel Hendrix
   +* Re: Shared memory           Anton Ertl
   |+- Re: Shared memory          Marcel Hendrix
   |`* Re: Shared memory          Marcel Hendrix
   | `* Re: Shared memory         mhx
   |  +* Re: Shared memory        minforth
   |  |`- Re: Shared memory       mhx
   |  `* Re: Shared memory        albert
   |   `* Re: Shared memory       mhx
   |    `* Re: Shared memory      albert
   |     `* Re: Shared memory     mhx
   |      `* Re: Shared memory    albert
   |       `- Re: Shared memory   mhx
   `* Re: Shared memory           Paul Rubin
    `- Re: Shared memory          Marcel Hendrix

Re: Shared memory

Newsgroups: comp.lang.forth
Subject: Re: Shared memory
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com> <nnd$346dc707$39b68eb8@f0aeef389c7accd8> <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>
From: alb...@spenarnc.xs4all.nl
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: albert@cherry.(none) (albert)
Message-ID: <nnd$0ddc7cdc$118d16a9@0c11a6a8e2e845fe>
Organization: KPN B.V.
Date: Sun, 03 Mar 2024 12:19:13 +0100

In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:
>> I have lost context; can you tell more about the simple example?
>> (My provider purges old messages swiftly.)
>
>I was in the exploring/debugging phase and have only very recently
>completed the experiments.

>
>The final results are that with shared memory, on Windows
>11, it is possible to get an almost linear speedup with the
>number of cores in use. The way shared memory is implemented
>on Windows is with a memory-mapped file that uses the OS
>pagefile as backing store. The file is guaranteed not to be
>swapped out under reasonable conditions, and Windows keeps
>its management invisible to users.

Linear speedup? That must depend on the program.
Can I surmise that the context is that you are comparing your
version/clone iSPICE with LTspice?
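
For reference, the mechanism described above is Windows'
pagefile-backed file mapping. A minimal C sketch, with an
invented mapping name and size (CreateFileMapping on
INVALID_HANDLE_VALUE allocates from the pagefile; every process
that opens the same name sees the same memory):

    #include <windows.h>

    int main(void)
    {
        /* INVALID_HANDLE_VALUE: back the mapping with the system
           pagefile instead of a named file on disk. */
        HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                      PAGE_READWRITE, 0, 4096,
                                      "Local\\ispice_shm"); /* name invented */
        if (h == NULL) return 1;

        void *p = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 4096);
        if (p == NULL) { CloseHandle(h); return 1; }

        ((volatile long long *)p)[0] = 42; /* visible to other processes */

        UnmapViewOfFile(p);
        CloseHandle(h);
        return 0;
    }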
>
>I tried to make the file as small as possible. For this
>iForth benchmark it was 11 int64s (11 * 8 bytes) and 24
>extended floats (24 * 16 bytes), about 1/2 KByte. The file
>is touched very infrequently: just 24 result writes and
>then a loop over the 11 words to see if all CPUs finished
>(checked at 10 ms intervals). At the moment I have no idea
>what happens with very frequent reads/writes (that is not
>the intended type of use).
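
In C terms, the layout and checking loop described above might
look as follows; the struct and names are guesses from the sizes
given (long double stands in for iForth's 16-byte extended
floats, and is only 16 bytes wide on some compilers):

    #include <windows.h>

    /* 11 * 8 bytes of status plus 24 * 16 bytes of results, ~1/2 KByte. */
    typedef struct {
        volatile LONG64 done[11];    /* one "finished" flag per worker   */
        long double     result[24];  /* one slot per circuit combination */
    } shm_area;

    /* CPU #0 loops over the 11 status words at 10 ms intervals
       until every worker has raised its flag. */
    static void wait_for_workers(volatile shm_area *a, int nworkers)
    {
        for (;;) {
            int busy = 0;
            for (int i = 0; i < nworkers; i++)
                if (!a->done[i]) busy = 1;
            if (!busy) return;
            Sleep(10);
        }
    }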

>
>[During debugging I was lucky. When setting the number of
> working CPUs interactively, completely wrong results
> were obtained. This happened because #|cpus was defined
> as a VALUE in a configuration file. When changing #|cpus
> from the console, the value in sconfig.frt stayed the
> same (of course) while all the dynamically started cores
> used the on-disk value, not the value I typed in on
> CPU #0. Easy to understand in hindsight, but this type
> of 'black-hole' mistake can take hours to find in a 7000+
> line program. For some reason I just knew that it had to
> be #|cpus that was causing the problem.]

>
>The benchmark is a circuit file that defines a voltage
>source and a 2-resistor divider, all parameterized.
>These values were swept for a total of 24 different
>circuits. Calculating the result for one of the
>combinations takes 2.277 s on a single core with iSPICE,
>or 24 x that value, 54.648 s, for all 24 combinations.
>In the benchmark the 24 simulations are spread out over
>11 processes on an 8-core CPU:
>
>iSPICE> .ticker-info
>AMD Ryzen 7 5800X 8-Core Processor
> TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
> Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
>
>The aim is to get an 8 times speedup, or more if
>hyperthreading contributes anything, and do all combinations
>in less than 54.648 s / 8 = 6.831 seconds. The best I managed
>is 7.694 s, or about 7.67 "cores", which I consider not
>that bad. Here are the details (run 4 times):
>
>% cpus  time [s]  perf. ratio
>     1   49.874      1.46
>     2   25.314      2.39
>     3   17.391      3.23
>     4   13.335      4.11
>     5   10.565      5.17
>     6    9.468      5.71
>     7    8.712      6.22
>     8    7.694      7.67
>     9    7.260      7.37
>    10    7.874      6.72
>    11    7.856      6.73  ok
>
>For your information: Running the same 24 variations
>with LTspice 17.1.15, one of the fastest SPICE
>implementations currently available, takes 382.265
>seconds, almost exactly 7 times slower than the iSPICE
>single-core run. Using 8 cores (LTspice pretends to
>use 16 threads), that ratio becomes 62 times.

So LTspice becomes slower by using 8 cores,
going from 7 times slower to 62 times slower than iSPICE.
There must be a mistake here.
>
>In the above table the performance ratio for a single
>CPU is 1.46 (1.46 times faster than doing the 24
>simulations on a single core *without* shared memory),
>which might seem strange. I think the phenomenon is
>caused by the fact that a single combination takes
>only 2.277 s, which may be too short for the processor
>(or Windows) to ramp up the clock frequency. If the
>performance factor is normalized by the timing for
>1 CPU, the maximum speedup decreases to 5.25.
>We'll see what happens on an HPZ840.

You are going to run Windows 11 on the HP workstation?
I'm going to install a Linux version, for I want to
experiment with CUDA.

>
>-marcel
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -

Re: Shared memory

Date: Wed, 6 Mar 2024 10:54:55 +0000
Subject: Re: Shared memory
From: mhx...@iae.nl (mhx)
Newsgroups: comp.lang.forth
User-Agent: Rocksolid Light
References: <73c2da86-b581-4519-bdb0-0c17df4d646en@googlegroups.com> <c53be72b665c3d10796bfe67a7f02dcf@www.novabbs.com> <nnd$346dc707$39b68eb8@f0aeef389c7accd8> <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com> <nnd$0ddc7cdc$118d16a9@0c11a6a8e2e845fe>
Organization: novaBBS
Message-ID: <0678be0fc5470e4edb09427823d40717@www.novabbs.com>

albert@spenarnc.xs4all.nl wrote:

> In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
> mhx <mhx@iae.nl> wrote:
>>> I have lost context; can you tell more about the simple example?
[..]
>>The final results are that with shared memory, on Windows
>>11, it is possible to get an almost linear speedup with the
>>number of cores in use.
[..]
> Linear speedup? That must depend on the program.
> Can I surmise that the context is that you are comparing your
> version/clone iSPICE with LTspice?

The example is *not* about trying to speed up a program
by adding threads to work on the parts that can be parallelized.
A circuit simulator merely serves as the example here. Circuits
contain on average only about 30% of operations that can be done
in parallel, so by Amdahl's law a fine-grained threaded approach,
even with an unlimited number of threads, can give at most about
a 1.4x speedup.
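
To make that bound explicit, this is Amdahl's law with
parallelizable fraction p and thread count N (the 30% figure is
the only number taken from the post):

    S(N) = 1 / ((1 - p) + p/N),  so  S(N) -> 1/(1 - p)  as N -> inf
    p = 0.3:  1/(1 - 0.3) = 1/0.7 = 1.43 at most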

Most circuit simulation problems cannot be solved with a
single simulation. In almost every case one wants to re-run
a job with small variations on the original specification.
The variations can be in the circuit components themselves,
in environmental conditions such as temperature, humidity,
and noise, in the input sources or output loads, or even in
the parameters of the (digital) control algorithms.
Between 10 and many thousands of simulations may be
necessary. At the top level, this problem is trivial to solve:
edit the input netlist with the necessary changes, re-run
the simulation, and store the results in a database.
When all runs are done, the data is evaluated by querying it.

In practice, it is difficult to keep the bookkeeping
straight if the above is done by hand. What I am looking for
is a simple way to specify variations, create a list
of all the simulations needed, then distribute the tasks
to as many CPU cores as are available (locally, on the network,
or in the cloud), combine the results, and generate reports.
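
A minimal C sketch of the local-cores case, reusing the shared
area sketched earlier; the static job split and run_simulation
are invented for illustration:

    extern long double run_simulation(int job);  /* assumed to exist */

    /* Worker w of n takes jobs w, w+n, w+2n, ... and writes each
       result into its own slots, so no locking is needed; finally
       it raises its "done" flag for CPU #0 to poll. */
    void worker(volatile shm_area *a, int w, int n, int njobs)
    {
        for (int job = w; job < njobs; job += n)
            a->result[job] = run_simulation(job);
        a->done[w] = 1;
    }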

To do this in Forth, I found it useful to use either shared
memory or a shared file. The post is about experiments with
shared memory (useful when the number of cores is below 256
and the main memory requirement is less than 1 TByte).

The concrete example is to run N variations of a circuit on
an 8-core system with 32 GB of memory, with the features I
described above. The question was: is it possible to get
a speedup of 8 when the benchmark runs on an 8-core CPU?

>>iSPICE> .ticker-info
>>AMD Ryzen 7 5800X 8-Core Processor
>> TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
>> Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
>>
>>The aim is to get an 8 times speedup, or more if
>>hyperthreading contributes anything, and do all combinations
>>in less than 54.648 s / 8 = 6.831 seconds. The best I managed
>>is 7.694 s, or about 7.67 "cores", which I consider not
>>that bad. Here are the details (run 4 times):
>>
>>% cpus  time [s]  perf. ratio
>>     1   49.874      1.46
>>     2   25.314      2.39
>>     3   17.391      3.23
>>     4   13.335      4.11
>>     5   10.565      5.17
>>     6    9.468      5.71
>>     7    8.712      6.22
>>     8    7.694      7.67
>>     9    7.260      7.37
>>    10    7.874      6.72
>>    11    7.856      6.73  ok
>>
>>

>>For your information: Running the same 24 variations
>>with LTspice 17.1.15, one of the fastest SPICE
>>implementations currently available, takes 382.265
>>seconds, almost exactly 7 times slower than the iSPICE
>>single-core run. Using 8 cores (LTspice pretends to
>>use 16 threads), that ratio becomes 62 times.

I realize now that this comparison of iSPICE with LTspice
can confuse the reader. It does not matter at all for this
benchmark which SPICE simulator is used.

> So LTspice becomes slower by using 8 cores,
> going from 7 times slower to 62 times slower than iSPICE.
> There must be a mistake here.

There is no mistake. LTspice is 7 times slower than iSPICE for
the specific type of task used here. Although LTspice has
a mechanism to run multiple variations, and claims to use
8 cores / 16 threads, it does not appear to use them as
efficiently as iSPICE does using shared memory.

[..]
>>We'll see what happens on an HPZ840.

> You are going to run Windows 11 on the HP workstation?
> I'm going to install a Linux version, for I want to
> experiment with CUDA.

I certainly want to see what happens if I run iSPICE on
my 44-core HPZ840 :--) The fastest way to get there
should be to install Windows 10 or 11 on the HP. However,
if that proves problematic I have no problem using Linux.
I have not tried iSPICE on Linux/WSL2 yet and will probably
do that first.

I also want to experiment with CUDA (BTW, why not OpenCL,
did you already find arguments against that route?).
However, that would be to investigate a new way of circuit
simulation that does not use the standard SPICE algorithms.

-marcel

