Subject: Confused about avx and sse2 performance
From: Branimir Maksimovic
Newsgroups: comp.lang.asm.x86
Organization: usenet-news.net
Date: Thu, 27 May 2021 20:03 UTC
Can someone explain to me why this is faster:
https://github.com/bmaxa/shootout/blob/main/nbody/nbodysse2.asm
than this:
https://github.com/bmaxa/shootout/blob/main/nbody/nbody2.asm

Also, can someone run it to see whether this is a quirk of my CPU (Zen 1)?
You need fasm: assemble with fasm, then
link with gcc.
[code]
~/shootout/nbody >>> fasm nbody2.asm
flat assembler  version 1.73.27  (16384 kilobytes memory)
4 passes, 10432 bytes.
~/shootout/nbody >>> gcc nbody2.o -o nbody2 -no-pie
~/shootout/nbody >>>
[/code]

Thanks!
--
current job title: senior software engineer
skills: x86 assembler, c++, c, rust, go, nim, haskell...

press any key to continue or any other to quit...



Subject: Re: Confused about avx and sse2 performance
From: Branimir Maksimovic
Newsgroups: comp.lang.asm.x86
Organization: usenet-news.net
Date: Fri, 28 May 2021 16:43 UTC
References: 1
On 2021-05-27, Branimir Maksimovic <branimir.maksimovic@nospicedham.gmail.com> wrote:
> Can someone explain to me why this is faster:
> https://github.com/bmaxa/shootout/blob/main/nbody/nbodysse2.asm
> than this:
> https://github.com/bmaxa/shootout/blob/main/nbody/nbody2.asm
>
> Also, can someone run it to see whether this is a quirk of my CPU (Zen 1)?
> You need fasm: assemble with fasm, then link with gcc.
> [code]
> ~/shootout/nbody >>> fasm nbody2.asm
> flat assembler  version 1.73.27  (16384 kilobytes memory)
> 4 passes, 10432 bytes.
> ~/shootout/nbody >>> gcc nbody2.o -o nbody2 -no-pie
> ~/shootout/nbody >>>
> [/code]
>
> Thanks!

Look at this C entry (the fastest one). It is heavily optimized for AVX:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html
[code]
-0.169075164
-0.169059907

 Performance counter stats for './fastc 50000000':

          1,988.28 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                67      page-faults:u             #    0.034 K/sec
     8,059,303,822      cycles:u                  #    4.053 GHz                      (62.46%)
           200,129      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (62.46%)
     7,382,686,755      stalled-cycles-backend:u  #   91.60% backend cycles idle      (62.46%)
     6,589,592,669      instructions:u            #    0.82  insn per cycle
                                                  #    1.12  stalled cycles per insn  (62.46%)
        49,984,983      branches:u                #   25.140 M/sec                    (62.47%)
             1,753      branch-misses:u           #    0.00% of all branches          (62.57%)
     4,498,871,311      L1-dcache-loads:u         # 2262.693 M/sec                    (62.57%)
             8,279      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.55%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       1.990374103 seconds time elapsed

       1.982009000 seconds user
       0.000000000 seconds sys
[/code]
My SSE2 version:
[code]
-0.169075164
-0.169059907

 Performance counter stats for './nbodysse2 50000000':

          2,603.44 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                53      page-faults:u             #    0.020 K/sec
    10,290,993,689      cycles:u                  #    3.953 GHz                      (62.35%)
           545,799      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.37%)
     6,563,060,268      stalled-cycles-backend:u  #   63.77% backend cycles idle      (62.49%)
    32,393,147,312      instructions:u            #    3.15  insn per cycle
                                                  #    0.20  stalled cycles per insn  (62.61%)
     1,943,915,974      branches:u                #  746.673 M/sec                    (62.70%)
            12,391      branch-misses:u           #    0.00% of all branches          (62.61%)
    13,260,365,081      L1-dcache-loads:u         # 5093.407 M/sec                    (62.50%)
            11,338      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.38%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       2.606247643 seconds time elapsed

       2.585024000 seconds user
       0.003319000 seconds sys
[/code]
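
To put the two listings above side by side, here is a minimal sketch that recomputes the derived numbers from the raw counters. The counter values are copied from the perf output; the dictionary keys, the `derived` helper and the `STEPS` constant are names made up for this illustration, and treating the command-line argument 50000000 as the number of simulation steps is an assumption. The counters were multiplexed (about 62% coverage), so everything here is an estimate.

```python
# Counter values copied from the two `perf stat` listings above.
# Key names and helper names are illustrative only.
runs = {
    "fastc": {
        "msec": 1988.28,
        "cycles": 8_059_303_822,
        "instructions": 6_589_592_669,
        "branches": 49_984_983,
        "l1_loads": 4_498_871_311,
    },
    "nbodysse2": {
        "msec": 2603.44,
        "cycles": 10_290_993_689,
        "instructions": 32_393_147_312,
        "branches": 1_943_915_974,
        "l1_loads": 13_260_365_081,
    },
}

# Assumption: the argument 50000000 given to both programs is the step count.
STEPS = 50_000_000


def derived(c):
    """Recompute per-cycle and per-step figures from raw counters."""
    return {
        "ipc": c["instructions"] / c["cycles"],
        "ghz": c["cycles"] / (c["msec"] * 1e6),  # cycles per nanosecond
        "insn_per_step": c["instructions"] / STEPS,
        "branches_per_step": c["branches"] / STEPS,
        "loads_per_step": c["l1_loads"] / STEPS,
    }


metrics = {name: derived(c) for name, c in runs.items()}

for name, m in metrics.items():
    print(
        f"{name:10s} IPC={m['ipc']:.2f} GHz={m['ghz']:.3f} "
        f"insn/step={m['insn_per_step']:.1f} "
        f"branches/step={m['branches_per_step']:.2f} "
        f"loads/step={m['loads_per_step']:.1f}"
    )

# How much more work the SSE2 build retires, and how much longer it takes.
insn_ratio = runs["nbodysse2"]["instructions"] / runs["fastc"]["instructions"]
cycle_ratio = runs["nbodysse2"]["cycles"] / runs["fastc"]["cycles"]
print(f"instructions: {insn_ratio:.2f}x, cycles: {cycle_ratio:.2f}x")
```

The recomputed IPC values agree with the 0.82 and 3.15 that perf printed. The SSE2 build retires roughly 4.9 times as many instructions as the C entry but needs only about 1.28 times the cycles, so its high IPC does not mean it does less waiting per unit of useful work; it simply executes many more instructions, branches and loads per step.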
Then the version optimized for SIMD (arrays instead of structures):
[code]
-0.169075164
-0.169059907

 Performance counter stats for './nbody2 50000000':

          3,188.49 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
