Subject: Bit reversal in AVX2
From: James Van Buskirk
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Sun, 22 May 2022 09:05 UTC
Message-ID: <t6cue2$9qp$1@dont-email.me>

Just for fun I thought I would try various strategies for permuting
an array of single-precision floating point numbers using AVX2.
bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
synthesis with vperm2i128 and has a hard limit of 24 clocks to
bit-reverse a 64-element array because all of these operations
use pipeline 5.
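
For reference, the permutation that all four routines are meant to produce can be sketched in a few lines of Python. This is only an illustrative model and not part of the original post; the helper names `bitrev` and `bit_reverse_permute` are made up here, and the 6-bit width is assumed from the 64-element array.

```python
def bitrev(i, bits=6):
    """Return i with its low `bits` bits in reversed order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(a):
    """Out-of-place bit-reversal permutation of a 64-element sequence."""
    assert len(a) == 64
    return [a[bitrev(i)] for i in range(64)]


# With a[i] = i, the result lists which source index ends up in each slot.
out = bit_reverse_permute(list(range(64)))
print(out[:8])   # first output row: [0, 32, 16, 48, 8, 40, 24, 56]
```

Applying the permutation twice restores the original order, which makes a convenient sanity check for any of the assembly versions.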

D:\gfortran\james\bitrev>type bitrev1.asm
format MS64 COFF

section '.text' code readable writeable executable
public bitrev1
bitrev1:
   sub rsp, 56
   vmovdqu [rsp+32], xmm6
   vmovdqu [rsp+16], xmm7
   vmovdqu [rsp], xmm8

   vmovdqu ymm0, [rcx]
   vmovdqu ymm1, [rcx+32]
   vmovdqu ymm2, [rcx+64]
   vmovdqu ymm3, [rcx+96]
   vmovdqu ymm4, [rcx+128]
   vmovdqu ymm5, [rcx+160]
   vmovdqu ymm6, [rcx+192]
   vmovdqu ymm8, [rcx+224]

   vperm2i128 ymm7, ymm0, ymm1, 32
   vperm2i128 ymm0, ymm0, ymm1, 49
   vperm2i128 ymm1, ymm2, ymm3, 32
   vperm2i128 ymm2, ymm2, ymm3, 49
   vperm2i128 ymm3, ymm4, ymm5, 32
   vperm2i128 ymm4, ymm4, ymm5, 49
   vperm2i128 ymm5, ymm6, ymm8, 32
   vperm2i128 ymm6, ymm6, ymm8, 49

   vpunpckldq ymm8, ymm7, ymm3
   vpunpckhdq ymm7, ymm7, ymm3
   vpunpckldq ymm3, ymm0, ymm4
   vpunpckhdq ymm0, ymm0, ymm4
   vpunpckldq ymm4, ymm1, ymm5
   vpunpckhdq ymm1, ymm1, ymm5
   vpunpckldq ymm5, ymm2, ymm6
   vpunpckhdq ymm6, ymm2, ymm6

   vpunpcklqdq ymm2, ymm8, ymm4
   vpunpckhqdq ymm8, ymm8, ymm4
   vpunpcklqdq ymm4, ymm7, ymm1
   vpunpckhqdq ymm7, ymm7, ymm1
   vpunpcklqdq ymm1, ymm3, ymm5
   vpunpckhqdq ymm3, ymm3, ymm5
   vpunpcklqdq ymm5, ymm0, ymm6
   vpunpckhqdq ymm0, ymm0, ymm6

   vmovdqu [rcx], ymm2
   vmovdqu [rcx+32], ymm1
   vmovdqu [rcx+64], ymm4
   vmovdqu [rcx+96], ymm5
   vmovdqu [rcx+128], ymm8
   vmovdqu [rcx+160], ymm3
   vmovdqu [rcx+192], ymm7
   vmovdqu [rcx+224], ymm0

epilog:
   vmovdqu xmm6, [rsp+32]
   vmovdqu xmm7, [rsp+16]
   vmovdqu xmm8, [rsp]
   add rsp, 56
   ret

D:\gfortran\james\bitrev>fasm bitrev1.asm
flat assembler  version 1.71.49  (1048576 kilobytes memory)
1 passes, 357 bytes.
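
The three shuffle stages of bitrev1.asm can be followed with a small Python model that treats each ymm register as a list of eight dwords. This is a sketch for tracing the data movement, not a substitute for the assembly; the functions model only the immediates used in the listing (the zeroing bits of vperm2i128 are ignored), and the variable names y0..y8 simply mirror the registers.

```python
def vperm2i128(a, b, imm):
    """Pick one 128-bit lane (4 dwords) of a or b for each half of the result."""
    lanes = [a[:4], a[4:], b[:4], b[4:]]
    return lanes[imm & 3] + lanes[(imm >> 4) & 3]

def vpunpckldq(a, b):
    """Interleave the two low dwords of each 128-bit lane."""
    return [a[0], b[0], a[1], b[1], a[4], b[4], a[5], b[5]]

def vpunpckhdq(a, b):
    """Interleave the two high dwords of each 128-bit lane."""
    return [a[2], b[2], a[3], b[3], a[6], b[6], a[7], b[7]]

def vpunpcklqdq(a, b):
    """Interleave the low qword (a dword pair) of each 128-bit lane."""
    return a[0:2] + b[0:2] + a[4:6] + b[4:6]

def vpunpckhqdq(a, b):
    """Interleave the high qword (a dword pair) of each 128-bit lane."""
    return a[2:4] + b[2:4] + a[6:8] + b[6:8]

# Loads: ymm0..ymm6 and ymm8 hold rows 0..7, with element indices as data.
y0, y1, y2, y3, y4, y5, y6, y8 = (list(range(8 * r, 8 * r + 8)) for r in range(8))

# Stage 1: regroup 128-bit lanes across pairs of rows.
y7 = vperm2i128(y0, y1, 32)
y0 = vperm2i128(y0, y1, 49)
y1 = vperm2i128(y2, y3, 32)
y2 = vperm2i128(y2, y3, 49)
y3 = vperm2i128(y4, y5, 32)
y4 = vperm2i128(y4, y5, 49)
y5 = vperm2i128(y6, y8, 32)
y6 = vperm2i128(y6, y8, 49)

# Stage 2: interleave dwords.
y8 = vpunpckldq(y7, y3)
y7 = vpunpckhdq(y7, y3)
y3 = vpunpckldq(y0, y4)
y0 = vpunpckhdq(y0, y4)
y4 = vpunpckldq(y1, y5)
y1 = vpunpckhdq(y1, y5)
y5 = vpunpckldq(y2, y6)
y6 = vpunpckhdq(y2, y6)

# Stage 3: interleave qwords.
y2 = vpunpcklqdq(y8, y4)
y8 = vpunpckhqdq(y8, y4)
y4 = vpunpcklqdq(y7, y1)
y7 = vpunpckhqdq(y7, y1)
y1 = vpunpcklqdq(y3, y5)
y3 = vpunpckhqdq(y3, y5)
y5 = vpunpcklqdq(y0, y6)
y0 = vpunpckhqdq(y0, y6)

# Stores, in the order used by the listing.
result = y2 + y1 + y4 + y5 + y8 + y3 + y7 + y0
expected = [int(format(i, "06b")[::-1], 2) for i in range(64)]
print(result == expected)
```

Each stage is eight instructions, which is where the count of 24 shuffles comes from.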

bitrev2.asm is the ugliest version, using permutations up front
and in back to shift the data into position and then vpblendd
to shift data between registers. It has the potential to be fastest
because only 14 operations require pipeline 5.
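
The two bounds quoted so far come from counting the instructions that can only issue on the shuffle pipeline. A rough tally, taking the post's premise of one such operation per clock as given; the dictionaries are just counts read off bitrev1.asm and bitrev2.asm, and `port5_bound` is a made-up helper.

```python
# Instructions per routine that the post says need pipeline (port) 5.
# vpblendd and the loads/stores are left out because they can issue elsewhere.
port5_ops = {
    "bitrev1": {"vperm2i128": 8, "vpunpck[lh]dq": 8, "vpunpck[lh]qdq": 8},
    "bitrev2": {"vpermd": 10, "vpermq": 4},
}


def port5_bound(counts, per_clock=1):
    """Lower bound in clocks if only `per_clock` such ops issue each cycle."""
    total = sum(counts.values())
    return -(-total // per_clock)  # ceiling division


for name, counts in port5_ops.items():
    print(name, port5_bound(counts))
# bitrev1 24
# bitrev2 14
```

This is only a throughput floor for one execution port; it says nothing about latency chains or the other ports.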

D:\gfortran\james\bitrev>type bitrev2.asm
format MS64 COFF

section '.text' code readable writeable executable
public bitrev2
bitrev2:
   sub rsp, 56
   vmovdqu [rsp+32], xmm6
   vmovdqu [rsp+16], xmm7
   vmovdqu [rsp], xmm8

   vmovdqu ymm0, [rcx]
   vmovdqu ymm4, [rcx+32]
   vmovdqu ymm2, [rcx+64]
   vmovdqu ymm6, [rcx+96]
   vmovdqu ymm1, [rcx+128]
   vmovdqu ymm5, [rcx+160]
   vmovdqu ymm3, [rcx+192]
   vmovdqu ymm8, [rcx+224]

   vmovdqu ymm7, yword [perm0]
   vpermd ymm1, ymm7, ymm1
   vpermq ymm2, ymm2, 177
   vmovdqu ymm7, yword [perm1]
   vpermd ymm3, ymm7, ymm3
   vpermq ymm4, ymm4, 78
   vmovdqu ymm7, yword [perm0+16]
   vpermd ymm5, ymm7, ymm5
   vpermq ymm6, ymm6, 27
   vmovdqu ymm7, yword [perm1+16]
   vpermd ymm8, ymm7, ymm8

   vpblendd ymm7, ymm0, ymm4, 240
   vpblendd ymm0, ymm0, ymm4, 15
   vpblendd ymm4, ymm2, ymm6, 240
   vpblendd ymm2, ymm2, ymm6, 15
   vpblendd ymm6, ymm1, ymm5, 240
   vpblendd ymm1, ymm1, ymm5, 15
   vpblendd ymm5, ymm3, ymm8, 240
   vpblendd ymm3, ymm3, ymm8, 15

   vpblendd ymm8, ymm7, ymm4, 204
   vpblendd ymm7, ymm7, ymm4, 51
   vpblendd ymm4, ymm0, ymm2, 204
   vpblendd ymm0, ymm0, ymm2, 51
   vpblendd ymm2, ymm6, ymm5, 204
   vpblendd ymm6, ymm6, ymm5, 51
   vpblendd ymm5, ymm1, ymm3, 204
   vpblendd ymm1, ymm1, ymm3, 51

   vpblendd ymm3, ymm8, ymm2, 170
   vpblendd ymm8, ymm8, ymm2, 85
   vpblendd ymm2, ymm7, ymm1, 170
   vpblendd ymm7, ymm7, ymm1, 85
   vpblendd ymm1, ymm4, ymm5, 170
   vpblendd ymm4, ymm4, ymm5, 85
   vpblendd ymm5, ymm0, ymm6, 170
   vpblendd ymm0, ymm0, ymm6, 85

   vmovdqu ymm6, yword [perm1+8]
   vpermd  ymm8, ymm6, ymm8
   vmovdqu ymm6, yword [perm2]
   vpermd  ymm2, ymm6, ymm2
   vmovdqu ymm6, yword [perm3]
   vpermd  ymm7, ymm6, ymm7
   vpermq  ymm1, ymm1, 78
   vmovdqu ymm6, yword [perm1+24]
   vpermd  ymm4, ymm6, ymm4
   vmovdqu ymm6, yword [perm2+16]
   vpermd  ymm5, ymm6, ymm5
   vmovdqu ymm6, yword [perm3+16]
   vpermd  ymm0, ymm6, ymm0

   vmovdqu [rcx], ymm3
   vmovdqu [rcx+32], ymm1
   vmovdqu [rcx+64], ymm2
   vmovdqu [rcx+96], ymm5
   vmovdqu [rcx+128], ymm8
   vmovdqu [rcx+160], ymm4
   vmovdqu [rcx+192], ymm7
   vmovdqu [rcx+224], ymm0

.epilog:
   vmovdqu xmm6, [rsp+32]
   vmovdqu xmm7, [rsp+16]
   vmovdqu xmm8, [rsp]
   add rsp, 56
   ret

section '.data' data readable writeable align 32
   align 32
   perm0 dd 1,0,7,6,5,4,3,2,1,0,7,6
   perm1 dd 7,6,1,0,3,2,5,4,7,6,1,0,3,2
   perm2 dd 2,7,0,5,6,3,4,1,2,7,0,5
   perm3 dd 3,6,1,4,7,2,5,0,3,6,1,4

D:\gfortran\james\bitrev>fasm bitrev2.asm
flat assembler  version 1.71.49  (1048576 kilobytes memory)
3 passes, 901 bytes.

bitrev3.asm uses vpgatherdd, which is a really slow instruction.
It was the easiest to write, though.

D:\gfortran\james\bitrev>type bitrev3.asm
format MS64 COFF

section '.text' code readable writeable executable
public bitrev3
bitrev3:
   sub rsp, 72
   vmovdqu [rsp+48], xmm6
   vmovdqu [rsp+32], xmm7
   vmovdqu [rsp+16], xmm8
   vmovdqu [rsp], xmm9

   vmovdqu ymm6, yword [table]

   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm0, [rcx+ymm6], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm1, [rcx+ymm6+16], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm2, [rcx+ymm6+8], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm3, [rcx+ymm6+24], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm4, [rcx+ymm6+4], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm5, [rcx+ymm6+20], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm8, [rcx+ymm6+12], ymm7
   vpcmpeqd ymm7, ymm7, ymm7
   vpgatherdd ymm9, [rcx+ymm6+28], ymm7

   vmovdqu [rcx], ymm0
   vmovdqu [rcx+32], ymm1
   vmovdqu [rcx+64], ymm2
   vmovdqu [rcx+96], ymm3
   vmovdqu [rcx+128], ymm4
   vmovdqu [rcx+160], ymm5
   vmovdqu [rcx+192], ymm8
   vmovdqu [rcx+224], ymm9

epilog:
   vmovdqu xmm6, [rsp+48]
   vmovdqu xmm7, [rsp+32]
   vmovdqu xmm8, [rsp+16]
   vmovdqu xmm9, [rsp]
   add rsp, 72
   ret

section '.data' data readable writeable align 32
   align 32
   table dd 0,128,64,192,32,160,96,224

D:\gfortran\james\bitrev>fasm bitrev3.asm
flat assembler  version 1.71.49  (1048576 kilobytes memory)
3 passes, 401 bytes.
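
The index table and the displacements in bitrev3.asm follow a simple pattern, which the Python sketch below reconstructs. It is a model of what the gathers fetch, assuming scale 1 and byte offsets as in the listing; `bitrev` and `gather_row` are illustrative names, not anything from the post.

```python
def bitrev(i, bits):
    """Reverse the low `bits` bits of i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)


# Byte offsets held in the index register (ymm6): 4 * bitrev6(j) for j = 0..7.
table = [4 * bitrev(j, 6) for j in range(8)]

# Displacement added for output row r: 4 * bitrev3(r) bytes.
disps = [4 * bitrev(r, 3) for r in range(8)]


def gather_row(mem, disp):
    """Model vpgatherdd with scale 1: fetch the dword at each byte offset."""
    return [mem[(disp + off) // 4] for off in table]


mem = list(range(64))                      # a[i] = i
out = [x for d in disps for x in gather_row(mem, d)]
print(table)   # [0, 128, 64, 192, 32, 160, 96, 224]
print(disps)   # [0, 16, 8, 24, 4, 20, 12, 28]
```

The split works because reversing the six bits of 8*r + j moves the reversed bits of j to the top and the reversed bits of r to the bottom, so the two parts can be added independently.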

bitrev4.asm just reads addresses to swap from a lookup table. It
seems to have a hard limit of 56 clocks because my Haswell appears
to have only one store pipeline.
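
The figure of 56 matches the number of stores the routine performs. A short Python sketch, assuming each swap pair is listed once with the smaller index first (which is how the `table` bytes in bitrev4.asm are laid out); `bitrev6` is an illustrative helper name.

```python
def bitrev6(i):
    """Reverse the six low bits of i."""
    return int(format(i, "06b")[::-1], 2)


# Keep each swap once (i < j) and skip the 8 indices that map to themselves.
pairs = [(i, bitrev6(i)) for i in range(64) if i < bitrev6(i)]
table = [b for pair in pairs for b in pair]

print(len(pairs), len(table))   # 28 56
print(table[:8])                # [1, 32, 2, 16, 3, 48, 4, 8]

stores = 2 * len(pairs)         # each swap writes two dwords
print(stores)                   # 56

# Applying the swaps in order to a[i] = i yields the bit-reversed array.
a = list(range(64))
for i, j in pairs:
    a[i], a[j] = a[j], a[i]
```

With one store per clock, 28 swaps of two stores each give the 56-clock floor mentioned above.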

D:\gfortran\james\bitrev>type bitrev4.asm
format MS64 COFF

section '.text' code readable writeable executable
public bitrev4
bitrev4:
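   push rbx
   lea r8, [table+56]         ; r8 -> end of the 56-byte swap table
   mov r9, -56                ; negative offset, counts up to zero
.outer:
   mov rax, qword [r8+r9]     ; fetch four (i, j) index pairs at once
.inner:
   movzx edx, ah              ; j = second index of the pair
   movzx r11d, al             ; i = first index of the pair
   mov r10d, [rcx+4*rdx]      ; load a[j]
   mov ebx, [rcx+4*r11]       ; load a[i]
   mov [rcx+4*r11], r10d      ; a[i] = old a[j]
   mov [rcx+4*rdx], ebx       ; a[j] = old a[i]
   shr rax, 16                ; advance to the next pair
   jnz .inner                 ; no pair is (0,0), so zero means all four are done
   add r9, 8
   jnz .outer
   pop rbx
   ret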

section '.data' data readable writeable align 32
   align 32
   table db 1,32,2,16,3,48,4,8,5,40,6,24,7,56,9,36,10,20,11,52,13
         db 44,14,28,15,60,17,34,19,50,21,42,22,26,23,58,25,38,27
         db 54,29,46,31,62,35,49,37,41,39,57,43,53,47,61,55,59
D:\gfortran\james\bitrev>fasm bitrev4.asm

[...]
Subject: Re: Bit reversal in AVX2
From: Terje Mathisen
Newsgroups: comp.lang.asm.x86
Organization: Aioe.org NNTP Server
Date: Mon, 23 May 2022 06:36 UTC
References: 1
Message-ID: <t6fa22$1lpt$1@gioia.aioe.org>

James Van Buskirk wrote:
> Just for fun I thought I would try various strategies for permuting
> an array of single-precision floating point numbers using AVX2.
> bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
> synthesis with vperm2i128 and has a hard limit of 24 clocks to
> bit-reverse a 64-element array because all of these operations
> use pipeline 5.
> [snip]
> To make a long story short these tests seemed to show that
> bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
> 93, and bitrev4 about 72 clocks.

Interesting stuff, thanks for posting!

> Oh well, maybe I should have gone out and played in the snow
> instead today.

Snow today, in late May? Are you in some New Zealand southern island mountains/ditto South America/Antarctica?

Or just high up in the Rockies (US/Canada) and the snow is old stuff?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"



Subject: Re: Bit reversal in AVX2
From: James Van Buskirk
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Mon, 23 May 2022 22:43 UTC
References: 1 2
Message-ID: <t6h2nk$20sev$1@dont-email.me>

"Terje Mathisen"  wrote in message news:t6fa22$1lpt$1@gioia.aioe.org...

> James Van Buskirk wrote:
>> Just for fun I thought I would try various strategies for permuting
>> an array of single-precision floating point numbers using AVX2.
>> bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
>> synthesis with vperm2i128 and has a hard limit of 24 clocks to
>> bit-reverse a 64-element array because all of these operations
>> use pipeline 5.
>> [snip]
>> To make a long story short these tests seemed to show that
>> bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
>> 93, and bitrev4 about 72 clocks.
>
> Interesting stuff, thanks for posting!

I have an update coming for bitrev2.

>> Oh well, maybe I should have gone out and played in the snow
>> instead today.
>
> Snow today, in late May? Are you in some New Zealand southern island mountains/ditto South America/Antarctica?
>
> Or just high up in the Rockies (US/Canada) and the snow is old stuff?

https://kdvr.com/weather/weather-forecast/near-record-90-degree-heat-thursday-snowstorm-friday/

I did manage to play in the snow today. It was a little slippery in spots
but nice and cool, perfect for hiking.

https://bouldercolorado.gov/media/2530/download?inline=



Subject: Re: Bit reversal in AVX2
From: Terje Mathisen
Newsgroups: comp.lang.asm.x86
Organization: Aioe.org NNTP Server
Date: Tue, 24 May 2022 06:26 UTC
References: 1 2 3
Message-ID: <t6htq1$ujc$1@gioia.aioe.org>

James Van Buskirk wrote:
> "Terje Mathisen" wrote in message news:t6fa22$1lpt$1@gioia.aioe.org...
>
>> James Van Buskirk wrote:
>>> Just for fun I thought I would try various strategies for permuting
>>> an array of single-precision floating point numbers using AVX2.
>>> bitrev1.asm just does vpunpckl/hdq/qdq or its lane-crossing
>>> synthesis with vperm2i128 and has a hard limit of 24 clocks to
>>> bit-reverse a 64-element array because all of these operations
>>> use pipeline 5.
>>> [snip]
>>> To make a long story short these tests seemed to show that
>>> bitrev1 took about 27 clocks, bitrev2 about 23, bitrev3 about
>>> 93, and bitrev4 about 72 clocks.
>>
>> Interesting stuff, thanks for posting!
>
> I have an update coming for bitrev2.

Nice. :-)

>>> Oh well, maybe I should have gone out and played in the snow
>>> instead today.
>>
>> Snow today, in late May? Are you in some New Zealand southern island mountains/ditto South America/Antarctica?
>>
>> Or just high up in the Rockies (US/Canada) and the snow is old stuff?
>
> https://kdvr.com/weather/weather-forecast/near-record-90-degree-heat-thursday-snowstorm-friday/

That link is location-limited, but it did confirm that we are talking about the Colorado side of the Rockies.

> I did manage to play in the snow today. It was a little slippery in spots
> but nice and cool, perfect for hiking.
>
> https://bouldercolorado.gov/media/2530/download?inline=

Seems similar to my last trip to Boulder in Nov 2019, before Covid: Hiked/ran to the top of the Flatirons in 2-5 cm of fresh snow: A bit slippery in a few places with just rocks/slabs for footing.

https://photos.app.goo.gl/v2bRgcqtL1hy5Wtm8

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"



Subject: Re: Bit reversal in AVX2
From: James Van Buskirk
Newsgroups: comp.lang.asm.x86
Organization: A noiseless patient Spider
Date: Tue, 24 May 2022 06:59 UTC
References: 1 2 3 4
Message-ID: <t6hvnn$24idb$1@dont-email.me>

"Terje Mathisen"  wrote in message news:t6htq1$ujc$1@gioia.aioe.org...

> James Van Buskirk wrote:
>
>> I have an update coming for bitrev2.
>
> Nice. :-)

Here it is. I have actually added some comments which show the
contents of the register written, either a working register (when
the data are the indices in the original array of the components)
or a permutation register required for vpermd (a pity that AVX2
doesn't have a lane-crossing version of palignr or a rotate
instruction.)

Didn't show any improvement in performance, but it certainly is
cleaner code.

D:\gfortran\james\bitrev>type bitrev2a.asm
format MS64 COFF

section '.text' code readable writeable executable
public bitrev2
bitrev2:
   sub rsp, 56
   vmovdqu [rsp+32], xmm6
   vmovdqu [rsp+16], xmm7
   vmovdqu [rsp], xmm8

   vmovdqu ymm0, [rcx]                 ;  0  1  2  3  4  5  6  7
   vmovdqu ymm4, [rcx+32]              ;  8  9 10 11 12 13 14 15
   vmovdqu ymm2, [rcx+64]              ; 16 17 18 19 20 21 22 23
   vmovdqu ymm6, [rcx+96]              ; 24 25 26 27 28 29 30 31
   vmovdqu ymm1, [rcx+128]             ; 32 33 34 35 36 37 38 39
   vmovdqu ymm5, [rcx+160]             ; 40 41 42 43 44 45 46 47
   vmovdqu ymm3, [rcx+192]             ; 48 49 50 51 52 53 54 55
   vmovdqu ymm8, [rcx+224]             ; 56 57 58 59 60 61 62 63

   vmovdqu ymm7, yword [perm0+24]      ;  7  0  1  2  3  4  5  6
   vpermd ymm1, ymm7, ymm1             ; 39 32 33 34 35 36 37 38
   vpermq ymm2, ymm2, 147              ; 22 23 16 17 18 19 20 21
   vmovdqu ymm7, yword [perm0+16]      ;  5  6  7  0  1  2  3  4
   vpermd ymm3, ymm7, ymm3             ; 53 54 55 48 49 50 51 52
   vpermq ymm4, ymm4, 78               ; 12 13 14 15  8  9 10 11
   vmovdqu ymm7, yword [perm0+8]       ;  3  4  5  6  7  0  1  2
   vpermd ymm5, ymm7, ymm5             ; 43 44 45 46 47 40 41 42
   vpermq ymm6, ymm6, 57               ; 26 27 28 29 30 31 24 25
   vmovdqu ymm7, yword [perm0]         ;  1  2  3  4  5  6  7  0
   vpermd ymm8, ymm7, ymm8             ; 57 58 59 60 61 62 63 56

   vpblendd ymm7, ymm0, ymm1, 170      ;  0 32  2 34  4 36  6 38
   vpblendd ymm0, ymm0, ymm1, 85       ; 39  1 33  3 35  5 37  7
   vpblendd ymm1, ymm2, ymm3, 170      ; 22 54 16 48 18 50 20 52
   vpblendd ymm2, ymm2, ymm3, 85       ; 53 23 55 17 49 19 51 21
   vpblendd ymm3, ymm4, ymm5, 170      ; 12 44 14 46  8 40 10 42
   vpblendd ymm4, ymm4, ymm5, 85       ; 43 13 45 15 47  9 41 11
   vpblendd ymm5, ymm6, ymm8, 170      ; 26 58 28 60 30 62 24 56
   vpblendd ymm6, ymm6, ymm8, 85       ; 57 27 59 29 61 31 63 25

   vpblendd ymm8, ymm7, ymm1, 204      ;  0 32 16 48  4 36 20 52
   vpblendd ymm7, ymm7, ymm1, 51       ; 22 54  2 34 18 50  6 38
   vpblendd ymm1, ymm0, ymm2, 102      ; 39 23 55  3 35 19 51  7
   vpblendd ymm0, ymm0, ymm2, 153      ; 53  1 33 17 49  5 37 21
   vpblendd ymm2, ymm3, ymm5, 204      ; 12 44 28 60  8 40 24 56
   vpblendd ymm3, ymm3, ymm5, 51       ; 26 58 14 46 30 62 10 42
   vpblendd ymm5, ymm4, ymm6, 102      ; 43 27 59 15 47 31 63 11
   vpblendd ymm4, ymm4, ymm6, 153      ; 57 13 45 29 61  9 41 25

   vpblendd ymm6, ymm8, ymm2, 240      ;  0 32 16 48  8 40 24 56
   vpblendd ymm8, ymm8, ymm2, 15       ; 12 44 28 60  4 36 20 52
   vpblendd ymm2, ymm7, ymm3, 60       ; 22 54 14 46 30 62  6 38
   vpblendd ymm7, ymm7, ymm3, 195      ; 26 58  2 34 18 50 10 42
   vpblendd ymm3, ymm1, ymm5, 120      ; 39 23 55 15 47 31 63  7
   vpblendd ymm1, ymm1, ymm5, 135      ; 43 27 59  3 35 19 51 11
   vpblendd ymm5, ymm0, ymm4, 30       ; 53 13 45 29 61  5 37 21
   vpblendd ymm0, ymm0, ymm4, 225      ; 57  1 33 17 49  9 41 25

   vpermq  ymm8, ymm8, 78              ;  4 36 20 52 12 44 28 60
   vpermq  ymm2, ymm2, 147             ;  6 38 22 54 14 46 30 62
   vpermq  ymm7, ymm7, 57              ;  2 34 18 50 10 42 26 58
   vmovdqu ymm4, yword [perm0+24]      ;  7  0  1  2  3  4  5  6
   vpermd  ymm3, ymm4, ymm3            ;  7 39 23 55 15 47 31 63
   vmovdqu ymm4, yword [perm0+8]       ;  3  4  5  6  7  0  1  2
   vpermd  ymm1, ymm4, ymm1            ;  3 35 19 51 11 43 27 59
   vmovdqu ymm4, yword [perm0+16]      ;  5  6  7  0  1  2  3  4
   vpermd  ymm5, ymm4, ymm5            ;  5 37 21 53 13 45 29 61
   vmovdqu ymm4, yword [perm0]         ;  1  2  3  4  5  6  7  0
   vpermd  ymm0, ymm4, ymm0            ;  1 33 17 49  9 41 25 57

   vmovdqu [rcx], ymm6
   vmovdqu [rcx+32], ymm8
   vmovdqu [rcx+64], ymm7
   vmovdqu [rcx+96], ymm2
   vmovdqu [rcx+128], ymm0
   vmovdqu [rcx+160], ymm5
   vmovdqu [rcx+192], ymm1
   vmovdqu [rcx+224], ymm3

.epilog:
   vmovdqu xmm6, [rsp+32]
   vmovdqu xmm7, [rsp+16]
   vmovdqu xmm8, [rsp]
   add rsp, 56
   ret

section '.data' data readable writeable align 32
   align 32
   perm0 dd 1,2,3,4,5,6,7,0,1,2,3,4,5,6

D:\gfortran\james\bitrev>fasm bitrev2a.asm
flat assembler  version 1.71.49  (1048576 kilobytes memory)
3 passes, 723 bytes.
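
The register-content comments in bitrev2a.asm can be reproduced with a small Python model of the three instructions involved. This is a sketch for checking the first few commented lines by hand, not a full simulation of the routine; `yword` is a made-up helper that mimics reading eight dwords at a byte offset into `perm0`.

```python
def vpermd(idx, src):
    """Lane-crossing dword permute: result[i] = src[idx[i] & 7]."""
    return [src[i & 7] for i in idx]


def vpermq(src, imm):
    """Permute the four qwords (dword pairs) by the 2-bit fields of imm."""
    out = []
    for k in range(4):
        q = (imm >> (2 * k)) & 3
        out += src[2 * q:2 * q + 2]
    return out


def vpblendd(a, b, imm):
    """Per-dword select: bit i of imm set takes b[i], otherwise a[i]."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]


perm0 = [1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6]   # the dd table from the listing


def yword(byte_offset):
    """Eight dwords of perm0 starting at a byte offset, as [perm0+N] does."""
    start = byte_offset // 4
    return perm0[start:start + 8]


row0 = list(range(0, 8))
row2 = list(range(16, 24))
row4 = list(range(32, 40))

r4 = vpermd(yword(24), row4)       # rotate row 4 right by one dword
r2 = vpermq(row2, 147)             # rotate row 2 right by two dwords
print(r4)                          # [39, 32, 33, 34, 35, 36, 37, 38]
print(r2)                          # [22, 23, 16, 17, 18, 19, 20, 21]
print(vpblendd(row0, r4, 170))     # [0, 32, 2, 34, 4, 36, 6, 38]
print(vpblendd(row0, r4, 85))      # [39, 1, 33, 3, 35, 5, 37, 7]
```

Reading overlapping windows of one 14-entry table is what lets a single `perm0` supply all the odd rotations that vpermq cannot express.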

D:\gfortran\james\bitrev>gfortran -O3 timer.f90 bitrev1.obj bitrev2a.obj bitrev3.obj bitrev4.obj rdtscp.obj -otimer

D:\gfortran\james\bitrev>timer
[...]
bitrev2: check = 0.00000000
22872
22703
22721
22724
22752
22672
22709
22740
22736
22721
22706
22727
22709
22721
22706
22718
22712
22718
22706
22727
22712
22718
22706
[...]


