Subject: Re: Confused about avx and sse2 performance
From: Branimir Maksimovic
Newsgroups: comp.lang.asm.x86
Organization: usenet-news.net
Date: Fri, 28 May 2021 16:43 UTC
Message-ID: <i59sI.1080$G11.369@fx01.iad>
References: <4WSrI.1471$jf1.216@fx37.iad>
On 2021-05-27, Branimir Maksimovic <branimir.maksimovic@nospicedham.gmail.com> wrote:
Can someone explain to me why this is faster:
https://github.com/bmaxa/shootout/blob/main/nbody/nbodysse2.asm
than this:
https://github.com/bmaxa/shootout/blob/main/nbody/nbody2.asm

Also, can someone run it to see whether this is a quirk of my CPU (Zen1)?
You need fasm: assemble with fasm, then link with gcc.
[code]
~/shootout/nbody >>> fasm nbody2.asm
flat assembler  version 1.73.27  (16384 kilobytes memory)
4 passes, 10432 bytes.
~/shootout/nbody >>> gcc nbody2.o -o nbody2 -no-pie
~/shootout/nbody >>>
[/code]

Thanks!
Look at this C entry (the fastest one), heavily optimized for AVX:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-gcc-9.html
[code]
-0.169075164
-0.169059907

 Performance counter stats for './fastc 50000000':

          1,988.28 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                67      page-faults:u             #    0.034 K/sec
     8,059,303,822      cycles:u                  #    4.053 GHz                      (62.46%)
           200,129      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (62.46%)
     7,382,686,755      stalled-cycles-backend:u  #   91.60% backend cycles idle      (62.46%)
     6,589,592,669      instructions:u            #    0.82  insn per cycle
                                                  #    1.12  stalled cycles per insn  (62.46%)
        49,984,983      branches:u                #   25.140 M/sec                    (62.47%)
             1,753      branch-misses:u           #    0.00% of all branches          (62.57%)
     4,498,871,311      L1-dcache-loads:u         # 2262.693 M/sec                    (62.57%)
             8,279      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.55%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       1.990374103 seconds time elapsed

       1.982009000 seconds user
       0.000000000 seconds sys
[/code]
My SSE2 version:
[code]
-0.169075164
-0.169059907

 Performance counter stats for './nbodysse2 50000000':

          2,603.44 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                53      page-faults:u             #    0.020 K/sec
    10,290,993,689      cycles:u                  #    3.953 GHz                      (62.35%)
           545,799      stalled-cycles-frontend:u #    0.01% frontend cycles idle     (62.37%)
     6,563,060,268      stalled-cycles-backend:u  #   63.77% backend cycles idle      (62.49%)
    32,393,147,312      instructions:u            #    3.15  insn per cycle
                                                  #    0.20  stalled cycles per insn  (62.61%)
     1,943,915,974      branches:u                #  746.673 M/sec                    (62.70%)
            12,391      branch-misses:u           #    0.00% of all branches          (62.61%)
    13,260,365,081      L1-dcache-loads:u         # 5093.407 M/sec                    (62.50%)
            11,338      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.38%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       2.606247643 seconds time elapsed

       2.585024000 seconds user
       0.003319000 seconds sys
[/code]
Then the version optimized for SIMD (arrays instead of structures):
[code]
-0.169075164
-0.169059907

 Performance counter stats for './nbody2 50000000':

          3,188.49 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                53      page-faults:u             #    0.017 K/sec
    12,961,532,262      cycles:u                  #    4.065 GHz                      (62.49%)
           267,650      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (62.49%)
    10,087,414,286      stalled-cycles-backend:u  #   77.83% backend cycles idle      (62.49%)
    22,690,576,224      instructions:u            #    1.75  insn per cycle
                                                  #    0.44  stalled cycles per insn  (62.49%)
     1,498,517,339      branches:u                #  469.976 M/sec                    (62.49%)
            10,791      branch-misses:u           #    0.00% of all branches          (62.52%)
    11,845,468,225      L1-dcache-loads:u         # 3715.066 M/sec                    (62.52%)
            10,205      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.52%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       3.191446269 seconds time elapsed

       3.178639000 seconds user
       0.000000000 seconds sys

[/code]

As you can see, the C version is almost branchless. The new AVX version has fewer branches and
instructions than the SSE2 version, yet it is slower. The main difference is the memory access pattern:
the SSE2 version accesses one element of a structure at a time, while the AVX version accesses memory
in vector fashion, since it uses arrays of elements instead of arrays of structures, so I attribute
the difference to that. But why are the arrays slower?
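To make the two layouts concrete, here is a minimal C sketch of "arrays of structures" versus "arrays of elements". It only illustrates the memory layouts and is not a translation of either .asm file. The type and function names (`struct body`, `struct bodies_soa`, `dist2_aos`, `dist2_soa`) are made up for this example, and the count of five bodies is taken from the benchmarks game n-body program.

```c
#include <assert.h>
#include <stddef.h>

#define NBODIES 5  /* assumption: five bodies, as in the benchmarks game program */

/* Array of structures: all fields of one body are adjacent in memory. */
struct body { double x, y, z, vx, vy, vz, mass; };

/* Structure of arrays: the same field of every body is adjacent,
   so a single vector load can fetch x for several bodies at once. */
struct bodies_soa {
    double x[NBODIES], y[NBODIES], z[NBODIES];
    double vx[NBODIES], vy[NBODIES], vz[NBODIES];
    double mass[NBODIES];
};

/* Squared distance between bodies i and j, array-of-structures layout. */
double dist2_aos(const struct body *b, size_t i, size_t j)
{
    assert(i < NBODIES && j < NBODIES);
    double dx = b[i].x - b[j].x;
    double dy = b[i].y - b[j].y;
    double dz = b[i].z - b[j].z;
    return dx * dx + dy * dy + dz * dz;
}

/* The same computation, structure-of-arrays layout. */
double dist2_soa(const struct bodies_soa *s, size_t i, size_t j)
{
    assert(i < NBODIES && j < NBODIES);
    double dx = s->x[i] - s->x[j];
    double dy = s->y[i] - s->y[j];
    double dz = s->z[i] - s->z[j];
    return dx * dx + dy * dy + dz * dz;
}
```

Both functions do the same arithmetic; only the addresses they touch differ. With so few bodies either layout fits comfortably in L1, which is consistent with the near-zero L1-dcache-load-misses in every run above.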
To make my point: I have an AVX version that uses the same memory access pattern as the SSE2 version,
and it is also faster than the new version, which I thought was optimized for SIMD...
[code]
-0.169075164
-0.169059907

 Performance counter stats for './nbodyavx 50000000':

          2,784.27 msec task-clock:u              #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                54      page-faults:u             #    0.019 K/sec
    10,806,896,379      cycles:u                  #    3.881 GHz                      (62.51%)
           276,219      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (62.53%)
     8,428,091,473      stalled-cycles-backend:u  #   77.99% backend cycles idle      (62.53%)
    21,319,940,647      instructions:u            #    1.97  insn per cycle
                                                  #    0.40  stalled cycles per insn  (62.53%)
     1,848,343,065      branches:u                #  663.853 M/sec                    (62.53%)
             7,706      branch-misses:u           #    0.00% of all branches          (62.47%)
    12,510,801,413      L1-dcache-loads:u         # 4493.394 M/sec                    (62.45%)
            14,623      L1-dcache-load-misses:u   #    0.00% of all L1-dcache accesses  (62.45%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u

       2.786929447 seconds time elapsed

       2.773920000 seconds user
       0.003332000 seconds sys

[/code]
So you see, the AVX versions execute fewer instructions but also fewer instructions per cycle.
Zen1 can execute up to 4 SSE instructions per cycle but only two AVX instructions, the same as Intel,
since both Intel and Zen have two 256-bit units. But Zen1 has four 128-bit units, which are
paired when executing AVX, so it is somewhat better than Intel with SSE2.
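The trade-off between instruction count and instructions per cycle can be checked with a little arithmetic on the counters above. This is a small C sketch; the `ipc` helper and the `runs` table are names made up for this example, and the numbers are copied from the perf stat outputs in this post.

```c
#include <assert.h>

/* Instructions per cycle from two raw perf counters. */
double ipc(unsigned long long instructions, unsigned long long cycles)
{
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}

/* instructions:u and cycles:u as reported by perf stat in this post. */
static const struct {
    const char *name;
    unsigned long long insn;
    unsigned long long cycles;
} runs[] = {
    { "fastc",      6589592669ULL,  8059303822ULL },
    { "nbodysse2", 32393147312ULL, 10290993689ULL },
    { "nbody2",    22690576224ULL, 12961532262ULL },
    { "nbodyavx",  21319940647ULL, 10806896379ULL },
};
```

For example, `ipc(runs[1].insn, runs[1].cycles)` comes out at about 3.15 for nbodysse2 and `ipc(runs[2].insn, runs[2].cycles)` at about 1.75 for nbody2, matching what perf prints. Run time follows cycles, that is, instructions divided by IPC: nbody2 executes roughly 30% fewer instructions than nbodysse2, but its IPC is roughly 44% lower, so it ends up needing about 26% more cycles.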
Also, I used plain old sqrtpd instead of the rsqrtps approximation: on AMD (Zen1) the approximation
is unnecessary, but it is left in, commented out, for Intel users.
[code]
;   cvtpd2ps xmm4,xmm3
;   rsqrtps xmm4,xmm4
    sqrtpd xmm7,xmm3    ; xmm7 = sqrt(xmm3)
    mulpd xmm3,xmm7     ; xmm3 = xmm3 * xmm7
    divpd xmm6,xmm3     ; xmm6 = xmm6 / xmm3
;   mulpd xmm3,dqword[L2]
;   cvtps2pd xmm4,xmm4
    ;--------------------

;   movapd xmm7, xmm4

;   movapd xmm8,xmm3
;   mulpd xmm8, xmm7
;   mulpd xmm8, xmm7
;   mulpd xmm8, xmm7

;   mulpd xmm7,dqword[L1]

;   subpd xmm7,xmm8

    ;------------------------

;   movapd xmm8,xmm3
;   mulpd xmm8, xmm7
;   mulpd xmm8, xmm7
;   mulpd xmm8, xmm7

;   mulpd xmm7,dqword[L1]
;   subpd xmm7,xmm8 ; distance -> xmm7

    ;--------------------------

;   mulpd xmm6,xmm7 ; mag -> xmm6
[/code]
Of course, I didn't bother to use this Intel optimization
in the new AVX version, so if you want, you can add it.
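For readers who do not want to decode the commented-out instructions: the mulpd/subpd groups have the shape of Newton-Raphson refinement steps for a reciprocal square root, applied to the low-precision rsqrtps estimate. Below is a scalar C sketch of that idea. It assumes the constants at `L1` and `L2` are 1.5 and 0.5, which the snippet above does not show, and the function names are made up for this example. The crude estimate `y0` is passed in by the caller and stands in for the rsqrtps result.

```c
#include <assert.h>

/* One Newton-Raphson step for y ~ 1/sqrt(x):
   y' = 1.5*y - (0.5*x)*y*y*y
   (assumption: 1.5 and 0.5 are the constants stored at L1 and L2). */
static double rsqrt_step(double x, double y)
{
    return y * 1.5 - (0.5 * x) * y * y * y;
}

/* Refine a crude estimate y0 of 1/sqrt(x) with two steps,
   matching the two repeated groups in the commented-out code. */
double rsqrt_refine(double x, double y0)
{
    assert(x > 0.0);
    double y = rsqrt_step(x, y0);
    return rsqrt_step(x, y);
}
```

Each step maps a relative error e to roughly 1.5*e*e, so an estimate that is good to about 12 bits, as rsqrtps is documented to be, should land at a relative error on the order of 1e-13 after two steps. That is close to, but not quite, full double precision.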


--
current job title: senior software engineer
skills: x86 assembler,c++,c,rust,go,nim,haskell...

press any key to continue or any other to quit...


